GitLab Tutorial 2. Create and Update a Data Index
A workflow to use GitLab Pages as an open data index
Why is this useful?
Researchers need to find and share their data and to showcase it to internal and external collaborators. Creating copies or transferring files can be cumbersome and can propagate errors or spread outdated file versions.
With GitLab and R Quarto we can easily integrate all metadata and documentation stored and maintained in our repository and render it into a website. This website, which can be private or public, becomes a safe ‘data index’: a central platform from which researchers and collaborators can navigate research outputs, find the location of data files in specialized repositories or preview thumbnails of some of the data.
Overview
Here we propose to use GitLab Pages to host an easy-to-build website where we can share documentation, metadata and some data (e.g., preview images) of our projects. The website can be either public or private; a private site requires users to log in. We believe this approach can improve the research workflow in many labs and projects, irrespective of whether they have a strong programming focus.
The different processes required for this workflow are explained in the tutorials on this website. They include step-by-step guides, templates, video demos and links intended for non-coders, as well as other tutorials for more advanced users.
The main requirements for this workflow are:
- (Required) A GitLab repository and basic familiarity with it
- (Required) A way of creating HTML pages: we provide R Quarto markdown templates
- (Recommended) A GitLab continuous integration script to allow updating the website directly from the browser without having to clone the repo and push changes. We provide examples and templates.
Basic workflow
(dashed line = manual or semi-manual; solid line = automatic process)
Hub content
In this workflow we propose that the owner of the data (e.g., a lab or a researcher) creates and maintains a main GitLab repository containing:
- Metadata tables e.g., CSV or Excel tables
- Data, if applicable, e.g., thumbnail images
- Scripts to render an HTML landing site, i.e., the Data Index, which provides access for collaborators. This can be done without web development skills using R Markdown or R Quarto markdown.
- Documentation associated with those data, collected in the Wiki of the main repository. The advantage of using the Wiki is that it can be edited directly in the browser by anyone with permissions. If needed, the markdown files with the Wiki pages can be downloaded (they are stored in a separate repository).
GitLab can display .csv tables and allows editing them from the browser. Excel tables must be downloaded for editing, or edited locally and then pushed to the repository. Note that this refers to editing source files in the repository, not the website, where tables are visible (but not editable) irrespective of the source table format: the R Quarto scripts read the source tables and render them as HTML for interoperability and interactivity.
The files will have three URL locations, which can all be accessed through the main landing site (Data Index):
- The landing page shows interactive tables to navigate data, metadata and documentation, and has links to the main repository and the Wiki.
- The main repository gives access to the source metadata tables, data files and the code necessary to generate the Hub.
- The Wiki gives access to all relevant documentation needed to understand how the data were collected and preprocessed, such as protocols and standard operating procedures (SOPs). It should also have information about filenames, folder structures and how metadata are organized. It is also accessible through the main repository.
Required Files
The main input files are:
- A metadata table (preferably in .csv format for interoperability). For images, it should contain only the filename of each picture that will be displayed.
- Image files. Note: GitLab is not meant for data storage. Images should be compressed to avoid exceeding the volume allowed per repo (usually 5 GB in free versions).
- R Quarto markdown scripts that contain the code to render the tables as well as some text with instructions.
- A .gitlab-ci.yml file, the Continuous Integration file that makes it possible to run the Rmd file and render the HTML again from GitLab.
The outputs for the website
- The html files that make up the website.
Note: index.html and the images folder are expected to be saved in the public folder. This does not mean they are publicly accessible (you need to log in to access this site).
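If the site is built as a Quarto project, one way to make the rendered files land in the public folder is to set the output directory in the project configuration. The following is a minimal sketch, not the configuration shipped with this workflow: it assumes a Quarto website project with a `_quarto.yml` file at the repository root, and the title is a placeholder.

```yaml
# _quarto.yml (hypothetical project configuration, assuming a Quarto website project)
project:
  type: website
  # GitLab Pages publishes whatever the pipeline leaves in "public"
  output-dir: public

website:
  title: "Data Index"   # placeholder title

format:
  html:
    toc: true
```

With a configuration like this, rendering the project writes index.html into public. Images that are referenced only from inside a rendered table may still need to be copied into public as a separate step.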
Privacy
This workflow can be used with either a private or a public repository. If we want to keep access restricted we can set the repository as private. Then we need:
- Accounts. A SWITCH edu-ID account is required to access the hub website and the repository with the code to generate it.
- Members. As owners of a repository we can set up different roles for the members to whom we grant access. Any member with the role of maintainer can add new members (go to Manage/Members). Members must also have the maintainer role to access the full-size images after clicking on a table thumbnail. Members enrolled as guests can see the thumbnails but cannot access the full-size images.
After having acquired permissions to access the repository, any user/collaborator can simply go to the main website URL and log in with their SWITCH edu-ID credentials.
Access to the website using SWITCH edu-ID credentials works with GitLab Pages because it is supported by UZH (Gitlab.uzh). If you are using repositories at GitLab.com, your collaborators will need a GitLab account. If you use GitHub (with a free account) instead of GitLab, you will not be able to make a private website, because GitHub Pages will then only work with public repositories.
Required actions (for data owners)
Once the Continuous Integration (CI) file and R Quarto scripts are set up, the only action required of the owner is to edit the metadata tables and/or some of the content of the R Quarto files in the repository. Any change in the repository will trigger the CI pipeline, and the website will be rendered again.
1. ACTION: edit metadata
The owner can do this through:
- Browser (RECOMMENDED FOR MINOR CHANGES). Make changes from GitLab using the browser, e.g., upload or edit scripts.
- Clone repository (RECOMMENDED FOR MAJOR UPDATES). Clone the repo, work locally, then commit and push the changes to GitLab (e.g., using GitHub Desktop). The local metadata file should always be in sync with the one in the remote repository.
This workflow treats the GitLab repository as the main source for retrieving your projects’ metadata. Thus, we recommend editing and maintaining the tables in this repository and avoiding copies of those tables, to prevent losing control over their versions.
2. AUTO: changes in the repository trigger the Continuous Integration pipeline
It may take around 5 minutes to update the page HTML. Any member with at least maintainer status can click on Build/Pipelines or Jobs (sidebar in GitLab) to see which pipeline or job (within a pipeline) is running, and whether there are any errors.
3. AUTO: the continuous integration pipeline runs the script
The .gitlab-ci.yml file defines this pipeline. It uses a Docker image with R, GitLab Pages to produce the website, and GitLab Runner to run the Docker container in which our R Markdown script renders the HTML from the table. The owner can edit the .gitlab-ci.yml file and the R Markdown script if this flow needs to change. The Docker image itself does not need to be edited.
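To illustrate how these pieces fit together, a minimal `.gitlab-ci.yml` might look like the sketch below. It is not the template provided with this tutorial. The image `rocker/verse` (a public image that ships R, rmarkdown and pandoc), the script name `index.Rmd` and the location of the `images` folder are all assumptions to adapt to your own repository.

```yaml
# .gitlab-ci.yml (illustrative sketch; the names below are assumptions)
pages:                          # GitLab Pages job
  image: rocker/verse:latest    # assumed Docker image with R, rmarkdown and pandoc
  script:
    # Render the (hypothetical) index.Rmd into the public folder
    - Rscript -e 'rmarkdown::render("index.Rmd", output_dir = "public")'
    # Copy the thumbnails next to the rendered page
    - cp -r images public/
  artifacts:
    paths:
      - public                  # folder that GitLab Pages serves
  rules:
    # Run only for commits to the default branch
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
```

In this sketch the job is restricted to the default branch so that work on other branches does not overwrite the published site. Dropping the `rules` block would run it on every push.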
See our tutorial on customizing GitLab Continuous Integration for more advanced details.