Welcome to ReDATA’s documentation!

Overview

This ReadTheDocs landing page provides general documentation for software pertaining to ReDATA, the University of Arizona Research Data Repository. ReDATA is a Figshare for Institutions instance that is managed by Figshare, our third-party Software-as-a-Service (SaaS) vendor.

The GitHub repository is available here.

All ReDATA-related repositories are under the GitHub organization (UAL-RE) of Research Engagement, University of Arizona Libraries.

Unless otherwise indicated, all software is released under the MIT License.

Repositories Overview

Repositories purposes

Our codebases fall into one of six categories:
  1. Common/general software used throughout ReDATA codebases

  2. Documentation

  3. Identity and access management (IAM)

  4. Data curation

  5. Data preservation

  6. Infrastructure as Code (IaC)

================  ======================  ====================================================================================================
Software name     Category                Purpose
================  ======================  ====================================================================================================
LD-Cool-P         Curation                Python command-line API for data curation
ReBACH            Preservation            Software to support data preservation with Dart and other tools
ReQUIAM           IAM                     Python command-line API for IAM
ReQUIAM_csv       IAM                     Python command-line API and database of groups for IAM
figshare          Curation, Preservation  A forked copy of cognoma’s repository used to gather public/private data from the Figshare API
ldcoolp-figshare  Curation                Python backend API for access to the Figshare API for Figshare for Institutions instances
redata-commons    General                 A set of common modules, code, and external libraries used throughout ReDATA codebases
redata-docs       Documentation           The repository hosting the current pages you are viewing on Read The Docs
redata-iac        IaC                     Repository containing Infrastructure as Code (IaC) and scripts used on the operational side of ReDATA
================  ======================  ====================================================================================================

Repositories details

More details about each repository:

================  ================  =========  =============  ===========  ================
Software name     Tag version       Changelog  Documentation  Main branch  PyPI
================  ================  =========  =============  ===========  ================
LD-Cool-P         GitHub tag badge  CHANGELOG  README         master       TBD
ReBACH            N/A               TBC        README         main         TBD
ReQUIAM           GitHub tag badge  CHANGELOG  README         master       N/A
ReQUIAM_csv       GitHub tag badge  TBC        RTD            master       N/A
figshare          GitHub tag badge  N/A        N/A            master       N/A
ldcoolp-figshare  GitHub tag badge  CHANGELOG  RTD            main         ldcoolp-figshare
redata-commons    GitHub tag badge  CHANGELOG  RTD            main         redata
redata-docs       GitHub tag badge  N/A        RTD            main         N/A
redata-iac        GitHub tag badge  TBC        N/A            master       N/A
================  ================  =========  =============  ===========  ================

Repositories status

The table below summarizes open and closed issues and pull requests for each repository.

================  ============================  =========================
Software name     Open and closed issues        Pull requests
================  ============================  =========================
LD-Cool-P         open and closed issue badges  open and closed PR badges
ReBACH            open and closed issue badges  open and closed PR badges
ReQUIAM           open and closed issue badges  open and closed PR badges
ReQUIAM_csv       open and closed issue badges  open and closed PR badges
figshare          N/A                           N/A
ldcoolp-figshare  open and closed issue badges  open and closed PR badges
redata-commons    open and closed issue badges  open and closed PR badges
redata-docs       open and closed issue badges  open and closed PR badges
redata-iac        open and closed issue badges  open and closed PR badges
================  ============================  =========================

Project Management

An Overview

For software development purposes, we utilize git and GitHub extensively for version control and project management. This is crucial since we must keep track of hundreds of bugs, improvements, and changes for several repositories.

We use GitHub tools to track and implement changes to the software. First, we use GitHub issues to identify and track bugs, issues, and features, and GitHub pull requests (“PRs”) so that a developer can suggest a set of changes to be merged into the master/main branch. Within issues and PRs, we use labels to indicate what the changes or problems pertain to. Each repository has its own set of labels. Labels help convey scope and impact and make issues and PRs easier to find with GitHub search. To understand the scope of any work, we use GitHub milestone tracking. Finally, we use GitHub project boards to illustrate and manage issues and PRs. Each repository has its own project board. These are kanban-style boards with several columns/lists.

DevOps workflow

The general workflow when starting any improvement is as follows:

  1. Create a new GitHub issue if one does not exist. Begin tracking it in the project board

  2. Create a new branch locally

  3. Commit changes to branch and push them to the new branch on the remote repository (i.e. GitHub)

  4. Create a PR within the repository to merge the new branch into the master/main branch

  5. A team member reviews the PR (if enough developers are on staff). Self-review is acceptable if staff is limited.

  6. The changes are merged into the master/main branch and any associated tags are pushed to the remote repository

  7. The software is manually deployed

Branching

It is strongly recommended to use git branches for software development. At any point, multiple features and bugs are being addressed, and changes pushed directly to the main branch could break the software if they are untested or have not been reviewed. Branching is a common Development and Operations (“DevOps”) best practice. To create a new git branch, use the following git commands:

$ git checkout master  # Or main, depending on the repository
$ git pull
$ git checkout -b <name_of_branch>

To check out an existing branch:

$ git branch  # To see existing branches
$ git checkout <name_of_branch>

Branch names should be clear and concise. We strongly recommend including:

  1. The GitHub issue number

  2. Whether it is a feature/enhancement or a bug fix

  3. A short description

Following this convention makes branches easier for the software development team to understand. Examples include:

  1. feature/235_preserve_prep for LD-Cool-P#235

  2. hotfix/229_400_error for LD-Cool-P#229
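The naming convention above can be sketched as a small helper. This is only an illustration, not part of any ReDATA codebase: the function name `make_branch_name` is hypothetical, and the set of allowed prefixes is an assumption limited to the two prefixes that appear in the examples.

```python
import re

# Assumption: only the two prefixes shown in the examples above are accepted.
ALLOWED_KINDS = ("feature", "hotfix")


def make_branch_name(kind: str, issue_number: int, description: str) -> str:
    """Build a branch name of the form <kind>/<issue>_<short_description>."""
    if kind not in ALLOWED_KINDS:
        raise ValueError(f"kind must be one of {ALLOWED_KINDS}, got {kind!r}")
    if issue_number <= 0:
        raise ValueError("issue_number must be a positive integer")
    # Lowercase the description and collapse every run of characters that is
    # not a letter or digit into a single underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", description.lower()).strip("_")
    if not slug:
        raise ValueError("description must contain at least one letter or digit")
    return f"{kind}/{issue_number}_{slug}"


print(make_branch_name("feature", 235, "preserve prep"))  # feature/235_preserve_prep
print(make_branch_name("hotfix", 229, "400 error"))       # hotfix/229_400_error
```

The output can be passed directly to `git checkout -b`.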

Note: Our branching model initially followed a git-flow workflow with features, hotfixes, and releases; however, we later moved away from that model and now use a GitHub flow workflow where all changes are merged into the master/main branch after review and testing.

Versioning and tagging

We tag versions in all of our software. Each new version refers to a change to the codebase that is to be deployed. We loosely follow Semantic Versioning (SemVer), which denotes changes as MAJOR, MINOR, and PATCH. Our versioning differs from SemVer in two ways:

  1. We use the PATCH denotation for both hotfixes and small enhancements to software.

  2. We use MINOR denotation for large/larger enhancements (e.g. a completely new feature rather than an improvement to an existing feature).

MAJOR keeps its SemVer meaning and is reserved for incompatible API changes, which we try to avoid as much as possible.
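As a rough sketch of the rules above, the following function computes the next tag from the current one. The function name `bump_version` and the change labels (`hotfix`, `small_enhancement`, `new_feature`, `incompatible_api`) are hypothetical names chosen for this example. Resetting the lower fields to zero follows standard SemVer practice and is assumed here rather than stated in this document.

```python
import re

# Tags are assumed to look like vX.Y.Z, matching the tagging command in this section.
_TAG_PATTERN = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def bump_version(tag: str, change: str) -> str:
    """Return the next version tag for a given kind of change."""
    match = _TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"tag must look like vX.Y.Z, got {tag!r}")
    major, minor, patch = (int(part) for part in match.groups())

    if change == "incompatible_api":
        # MAJOR: incompatible API changes (avoided where possible).
        major, minor, patch = major + 1, 0, 0
    elif change == "new_feature":
        # MINOR: large enhancements such as a completely new feature.
        minor, patch = minor + 1, 0
    elif change in ("hotfix", "small_enhancement"):
        # PATCH: both hotfixes and small enhancements.
        patch += 1
    else:
        raise ValueError(f"unknown change type: {change!r}")
    return f"v{major}.{minor}.{patch}"


print(bump_version("v1.2.3", "small_enhancement"))  # v1.2.4
print(bump_version("v1.2.3", "new_feature"))        # v1.3.0
```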

While some open-source software teams may not use version tagging, it has many advantages. First, tagging supports continuous delivery of our software. Second, some of our software is automatically deployed to PyPI, the Python Package Index, a package repository that allows for easy installation of the software. Finally, our logging tools record version information for each piece of software, which allows the team to trace an issue back to a specific PR. To tag the current commit:

$ git tag -a vX.Y.Z

Your configured editor (vim by default on many systems) will open so you can provide a message for the tag. Often a short message referring to the GitHub issue number will suffice. You will then push the tag via:

$ git push --tags

Merging code

TBD on using git over GitHub merge tool.

Milestone tracking

More details needed here.

Status of GitHub repositories

See Repositories status

Identity and Access Management

IAM Overview

Our Figshare for Institutions instance has a couple of features to maintain identity and access management (IAM) settings and to assist in data repository administration.

First, we have the ability to set a quota of available space for each user. Our default quotas, applicable to most ReDATA users, are:

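=================  ===========================================
Classification     Quota
=================  ===========================================
Undergraduates     0 (initially), 100 MB after they contact us
Graduates          0.5 GB
Faculty/Staff/DCC  2 GB
=================  ===========================================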
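Because quotas are ultimately expressed in bytes in Grouper, the default quotas above can be written as a small lookup. This is a sketch only: the dictionary keys, the function name `default_quota`, and the use of decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes) are assumptions made for illustration and may not match the values configured in ReDATA.

```python
# Assumption: decimal units. The actual byte values used by ReDATA may differ.
MB = 10**6
GB = 10**9

# Hypothetical classification keys mirroring the default quotas listed above.
DEFAULT_QUOTA_BYTES = {
    "undergraduate": 0,
    "graduate": GB // 2,           # 0.5 GB
    "faculty_staff_dcc": 2 * GB,   # 2 GB
}

# Undergraduates receive 100 MB after they contact the ReDATA team.
UNDERGRADUATE_QUOTA_ON_REQUEST = 100 * MB


def default_quota(classification: str, contacted_team: bool = False) -> int:
    """Return the default quota in bytes for a user classification."""
    if classification not in DEFAULT_QUOTA_BYTES:
        raise ValueError(f"unknown classification: {classification!r}")
    if classification == "undergraduate" and contacted_team:
        return UNDERGRADUATE_QUOTA_ON_REQUEST
    return DEFAULT_QUOTA_BYTES[classification]


print(default_quota("graduate"))                            # 500000000
print(default_quota("undergraduate", contacted_team=True))  # 100000000
```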

Second, we have the ability to assign each user to a group on Figshare (a.k.a. a “portal”). This allows for easy exploration of data through these portals. For our deployment, we chose to define the groups by common research themes at our University. To identify a researcher’s discipline, we use their primary affiliation at the University.

Software/Services Overview

We use several software tools and services for IAM:

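==================================  =============  ====================================================================================================================================
Software/Services                   Maintainer(s)  Purpose
==================================  =============  ====================================================================================================================================
Enterprise Directory Service (EDS)  UITS           UArizona’s LDAP directory, used to gather metadata about users from a central UA datastore in order to make authorization decisions
Grouper                             UITS           UArizona’s tool to create groups for UA organizations. These groups are populated into EDS and Shibboleth
Shibboleth / WebAuth                UITS           UArizona’s SAML-based access to UA IAM information
ReQUIAM                             ReDATA team    Python command-line API for IAM
ReQUIAM_csv                         ReDATA team    Python command-line API and database of groups for IAM
==================================  =============  ====================================================================================================================================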

Services

First, we utilize three services provided and administered by University Information Technology Services (UITS):

  1. EDS

  2. Shibboleth

  3. Grouper

Users log in to ReDATA with their NetID credentials (WebAuth). A user who is no longer part of the University will not have a NetID and thus will not be able to log in.

Software

The two codebases that the ReDATA team develops and maintains are ReQUIAM and ReQUIAM_csv. The former is the primary software that manages all ReDATA IAM, with a daily cron job that sets research theme associations (“portals”) and quotas through the Grouper API. That information is then propagated into EDS and Shibboleth and applied when users log in. ReQUIAM also has a command-line API that lets the ReDATA team make other manual IAM changes, such as setting a quota higher than the default (see IAM Overview).

The ReQUIAM_csv software contains the mapping between the groups on ReDATA’s Figshare for Institutions instance and UArizona organizational codes. The spreadsheet is available through Google Docs.

The Grouper-to-Figshare-group mapping is provided as a CSV file consumed by ReQUIAM. The file is publicly available on GitHub in two forms:

  1. Raw version

  2. Rendered version
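To illustrate how a CSV mapping of this kind could be consumed, here is a minimal sketch using Python's standard `csv` module. The column names (`org_code`, `portal`) and the sample rows are placeholders invented for this example. They do not reflect the actual layout or contents of the ReQUIAM_csv file, and `load_portal_mapping` is a hypothetical helper, not a ReQUIAM function.

```python
import csv
import io
from typing import Dict

# Placeholder data: column names and rows are invented for illustration only.
SAMPLE_CSV = """org_code,portal
0001,example_theme_a
0002,example_theme_a
0003,example_theme_b
"""


def load_portal_mapping(csv_text: str) -> Dict[str, str]:
    """Map each organizational code to its Figshare group/portal name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Codes are kept as strings so that leading zeros are preserved.
    return {row["org_code"]: row["portal"] for row in reader}


mapping = load_portal_mapping(SAMPLE_CSV)
print(mapping.get("0002"))  # example_theme_a
print(mapping.get("9999"))  # None (no portal mapped for this code)
```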

Grouper settings

To control IAM, we update Grouper group memberships. These memberships are metadata that is passed into EDS and ultimately Shibboleth, and is consumed by our Figshare for Institutions instance for account creation (at first login) and for updates when users log in again. This metadata record is called ismemberof.

The three ismemberof settings that ensure proper IAM are:

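==========  =====  ===================================================================================================================
ismemberof  Type   Purpose
==========  =====  ===================================================================================================================
active      Group  Enables login to ReDATA. Non-membership means the individual is no longer an active member per Libraries privileges
portal      Stem   Folder containing the Grouper groups for the various research themes
quota       Stem   Folder containing the Grouper groups for quotas, in bytes
==========  =====  ===================================================================================================================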

The Grouper stem prefix for the above is arizona.edu:Dept:LBRY:figshare.

ReQUIAM maintains direct membership for the portal and quota groups. Membership in the active group is indirect: it is derived from other Grouper groups set by the University Libraries patron software, patron-groups.

Our Figshare instance maps the portal and quota settings accordingly such that:

  1. A quota is set to ensure that a user has enough space for small deposits, which is most often the case. The user can request more space, which a ReDATA administrator would need to approve. This approval step allows the ReDATA team to understand the user’s needs and to identify large deposits that require more assistance.

  2. A researcher’s data deposits are placed in a proper Figshare group/portal.

If a user does not have a portal set, their data publication will not appear in any group/portal but will instead be part of the University-wide group. If a quota is not set (for example, for undergraduates logging in for the first time), the quota is set to zero.
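The behavior described in this section can be sketched as a function that interprets a user's ismemberof values. This is not how Figshare or ReQUIAM is implemented. In particular, the group naming below (`<prefix>:portal:<theme>` and `<prefix>:quota:<bytes>`) is an assumption made for this example, and `FigshareSettings` and `resolve_settings` are hypothetical names. Only the stem prefix and the fallback rules come from the text above.

```python
from typing import Iterable, NamedTuple, Optional

# Stem prefix as given in this section.
STEM_PREFIX = "arizona.edu:Dept:LBRY:figshare"


class FigshareSettings(NamedTuple):
    active: bool            # member of the "active" group, so login is enabled
    portal: Optional[str]   # None means the University-wide group only
    quota_bytes: int        # 0 when no quota group is set


def resolve_settings(ismemberof: Iterable[str]) -> FigshareSettings:
    """Derive login, portal, and quota settings from ismemberof values."""
    active = False
    portal = None
    quota_bytes = 0  # no quota group means a quota of zero
    for group in ismemberof:
        if not group.startswith(STEM_PREFIX + ":"):
            continue  # unrelated Grouper group
        parts = group[len(STEM_PREFIX) + 1:].split(":")
        if parts == ["active"]:
            active = True
        elif len(parts) == 2 and parts[0] == "portal":
            # Assumed naming: <prefix>:portal:<theme>
            portal = parts[1]
        elif len(parts) == 2 and parts[0] == "quota" and parts[1].isdigit():
            # Assumed naming: <prefix>:quota:<bytes>
            quota_bytes = int(parts[1])
    return FigshareSettings(active, portal, quota_bytes)


first_login_undergrad = resolve_settings([STEM_PREFIX + ":active"])
print(first_login_undergrad)
```

For a user who is only in the active group, the result has no portal and a quota of zero, matching the fallback described above.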
