
The main principles we follow in storing data are summarized in chapter 5 of Code and Data for the Social Sciences. The key points are:

  • Store all data in tables with unique, non-missing keys (a short key check is sketched after this list)
  • Keep data normalized as far into the code pipeline as possible
  • Eliminate redundancy: a given set of data cleaning/building steps should only be executed once
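
As an illustration of the first point, here is a minimal sketch of a key check, assuming a pandas table loaded from a hypothetical data/customers.csv with key column customer_id (both names are placeholders):

```python
import pandas as pd

# Hypothetical table and key column: adapt the path and column name.
df = pd.read_csv("data/customers.csv")
key = "customer_id"

# The key must be non-missing and unique before the table is stored or merged on.
assert df[key].notna().all(), f"{key} has missing values"
assert df[key].is_unique, f"{key} has duplicate values"
```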

Raw Data

Raw data files must be stored in a raw data directory that follows specific rules. This is normally the /raw/ directory of a GitHub repository or an analogous directory on Dropbox.

Every raw directory must have a detailed readme.md file which includes the source of the data, when and how it was obtained, and any other information necessary to understand the provenance and meaning of the data.

Codebooks, data use agreements, and other documentation should be placed in a /docs/ subdirectory.

We aim to store enough documentation in readme.md and /docs/ that if we lost access to the original data source we would have everything we need to understand the data, reference it in a paper, and make sure we are adhering to the terms of our agreements.

Raw directories can contain code to perform preprocessing steps necessary to produce files ready to be used downstream (e.g., file conversions or appending files together). In this case, the data in its original form should be stored in an /orig/ subdirectory and the preprocessed data should be stored in an /output/ or /data/ subdirectory.
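
As a hedged example of such a preprocessing step, the sketch below appends the original files in /orig/ into a single file in /output/; the directory and file names are placeholders:

```python
from pathlib import Path
import pandas as pd

# Placeholder paths: a raw directory with original files in /orig/
# and preprocessed output in /output/.
orig_dir = Path("raw/survey/orig")
output_dir = Path("raw/survey/output")
output_dir.mkdir(parents=True, exist_ok=True)

# Append the original CSVs into one file ready for downstream use.
frames = [pd.read_csv(path) for path in sorted(orig_dir.glob("*.csv"))]
combined = pd.concat(frames, ignore_index=True)
combined.to_csv(output_dir / "survey_combined.csv", index=False)
```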

Storage

We store small- to medium-sized data files that relate to a single project in our repositories, either directly or using Git LFS.

  • Diffable files (.txt, .csv, .R, .do, etc.) that are under 5 MB can be stored directly in Git
  • Non-diffable files (binaries such as .pdf, .dta, .rds, etc.) and diffable files over 5 MB should be stored using Git LFS (a rough pre-commit check is sketched below)

The total size of the files within the regular GitHub storage should not exceed 1 GB per repository (including version history). The total size of a repository, including all files in LFS and their version history, should not exceed 5 GB. If a repository uses input or produces output that is larger than this, the large input or output files should be stored separately.
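
One rough way to apply these rules before committing is sketched below: it flags files that are non-diffable or larger than 5 MB as candidates for Git LFS. The extension list is illustrative, not exhaustive:

```python
from pathlib import Path

# Illustrative set of diffable extensions and the 5 MB threshold from the rules above.
DIFFABLE = {".txt", ".csv", ".R", ".do", ".md"}
SIZE_LIMIT = 5 * 1024 * 1024  # 5 MB

for path in Path(".").rglob("*"):
    if not path.is_file() or ".git" in path.parts:
        continue
    too_large = path.stat().st_size > SIZE_LIMIT
    non_diffable = path.suffix not in DIFFABLE
    if too_large or non_diffable:
        print(f"Candidate for Git LFS: {path}")
```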

It is important to remember that, because Git keeps every committed version in the repository history, once a file is committed its impact on repository size is permanent. Think very carefully before committing large files, especially large binaries; this is one of the few mistakes that cannot easily be undone in GitHub.

We store large data files and data that need to be shared across multiple projects on Research Drive when possible, or occasionally in other large-scale storage locations.

When tasks are completed, all related files should either be completed (all files clean, documented, and stored in shared locations, ready for others to use) or abandoned (all files deleted), but never left indefinitely in a half-finished state. At any given time, we should be able to wipe clean the storage on our local machines, scratch spaces, etc. with little or no substantive loss.

Download data programmatically

Use rclone to selectively access, download, upload, move, and delete data on Research Drive. This saves disk space: you do not need to download the entire project directory to your local disk.
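
Here is a minimal sketch of a selective download, assuming rclone is installed and a remote named researchdrive is already configured; the project paths are hypothetical:

```python
import subprocess

# List what is available on the remote before pulling anything down.
subprocess.run(["rclone", "ls", "researchdrive:projects/my_project/data"], check=True)

# Copy only the file needed for the current task into a local data/ folder.
subprocess.run(
    ["rclone", "copy", "researchdrive:projects/my_project/data/analysis_sample.csv", "data/"],
    check=True,
)
```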

Learn other ways to download data programmatically with our guide on Tilburg Science Hub.

Documentation

Ideally, your data description answers the detailed questions outlined in Datasheets for Datasets by Gebru et al. (2018). We strongly recommend reading the original paper, which explains in detail the seven key ingredients of proper dataset documentation.

We have reproduced these questions, and we recommend including them as a readme.txt alongside your datasets. You can find our template here.
