Metadata: tables

General tips on making better human and machine-readable tabular metadata

Basic
Tips
Author
Affiliation

G.Fraga Gonzalez

Center for Reproducible Science, UZH

Published

December 16, 2024

Tables are a form of structured metadata with essential information to interpret the data. A common scenario in research is to have a table describing records, cases or subjects, with one row per record and columns describing different features. Then, there are often multiple larger tables with data associated with each record or subject (e.g., physiological measurements, images, test results, etc). These tables may content entries that are automatically generated and others that have to be manually entered by researchers. The tables can be in different formats, although .csv or .tsv are recommended for the sake of interoperability. Spreadsheet software formats like Microsoft Excel or Google Sheets are powerful and widely used by many researchers, as they have more human-friendly features.

Publication review: six tips for better spreadsheets

We distill here the main points of the highly recommendable two-pager from the technology editor at Nature (Perkel 2022):

1. Keep raw data raw

Recommendation: make the original spreadsheet a read-only document and work on copies. Beware of the dangers of autoformatting functions of your spreadsheet software, and on how easy it is to loose track of the user-friendly manipulations done to the file.

2. Make data machine-readable

Do not encode any information with formatting (e.g., in color-coding schemes), create additional columns that can be used as filter if necessary. Formatting can be easily lost during routine table maintenance. Researchers may also struggle to remember what the formatting represents. When read from a machine, the information encoded by formatting may be lost.

3. Be consistent

Data-analysis tools expect one specific format: one row of column titles. No merged cells and one table per page. Ideally all cells are filled (blank cells may contain hidden characters) and contain one piece of data (e.g. a cell with ‘3 red beetles’, better use two columns: one for count and one for insect type). Do not use to subdivided a table into different groups.

4. Document your work

Create a sheet (or tab) with a codebook that documents abbreviations, units of measurement, missing values, etc.

5. Cross-check your data

If possible, protect parts of the spreadsheet to prevent undesired manipulations, apply data validation checks to prevent data entry errors. Finally, double-check your work and analysis workflow.

6. Think ahead

Plan ahead what data formats and transformations you may need, and what variables and covariates you may use. Consulting your collaborators will increase the chances of having a table that useful to other researchers.

Table operations

Researchers have to manipulate tabular metadata and this is better done by code so that any operation on the raw data can be reproduced. Visit our Templates section for some code snippets and templates showing frequent use cases of these operations.

Back to top

References

Perkel, Jeffrey M. 2022. “Six Tips for Better Spreadsheets.” Nature 608 (7921): 229–30. https://doi.org/10.1038/d41586-022-02076-1.