code_and_data

Griffin Chure edited this page Nov 21, 2019 · 2 revisions

Adding links to data, code, and figures

Making your code, data, and figures easily accessible is central to reproducible research. To make this less painful, I have included three data structures: data_files.yaml, code.yaml, and figures.yaml. The entries in these files are used to populate page02_code_data.md.

Path Navigation

All of your code and "small" data should be stored in this repository or, if you are building the website in your research repository, on the gh-pages branch. In this template, there are two specific folders for the code scripts and data files, named software and datasets, respectively.

code.yaml

This is a simple YAML data structure containing an arbitrary number of script fields, each with associated properties. An example of a script field is given below.

- script:
  name: script1.R # The file name of the script
  desc: >
    A one or two sentence description of what the code actually does.

The name field is the script file name that is stored in the software/ subfolder of the root directory.
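Since every name in code.yaml must match a file in software/, a quick consistency check can catch broken links before the site is built. The following is a minimal sketch, not part of the template: it assumes the file has already been parsed into a list of dictionaries (for instance with a YAML loader such as PyYAML's `yaml.safe_load`), and the helper name `missing_scripts` is hypothetical.

```python
from pathlib import Path

# Assumed shape of code.yaml after parsing: one dict per "- script:" entry.
# The "script" key itself carries no value, so it parses to None.
scripts = [
    {
        "script": None,
        "name": "script1.R",
        "desc": "A one or two sentence description of what the code actually does.\n",
    },
]


def missing_scripts(entries, software_dir="software"):
    """Return the script names that have no matching file in software_dir.

    `missing_scripts` is a hypothetical helper for illustration only.
    """
    root = Path(software_dir)
    return [entry["name"] for entry in entries if not (root / entry["name"]).is_file()]


if __name__ == "__main__":
    # Run from the repository root so that "software" resolves correctly.
    for name in missing_scripts(scripts):
        print(f"Listed in code.yaml but not found in software/: {name}")
```

An empty result means every listed script is present in the folder.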

datasets.yaml

This file is similar to code.yaml in that it encodes dataset fields with associated properties. Because data sets can sometimes be very large, this structure requires you to state whether the data is stored locally (in the repository) or hosted on a cloud storage service like DataDryad, CaltechDATA, or Zenodo. An example of a dataset field is shown below for a data set that is stored remotely.

- dataset:
  storage: "remote"
  name: "A short description of the data set."
  filetype: "The type of file (for example, csv, hdf5, or tiff)"
  filesize: 24 MB # An approximate data size 
  link: "https://data.caltech.edu/" # Link to the remote storage OR the filename of the local file.
  DOI: "10.1.1/journal.0000" # DOI of the remotely stored data. If not remote, this field is ignored. 

If the data set is stored remotely, the link field should be a URL to the precise data set. As these URLs can change over time, you should also provide a DOI. If the data set is stored locally, the DOI field will be ignored and the filename of the data set should be in the link: field.
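The local/remote rule above can be sketched in a few lines of Python. This is only an illustration of the logic, not how the template renders its pages: it assumes the entry has already been parsed into a dictionary, that local files live in the datasets folder, and that a DOI is turned into a link through the standard https://doi.org/ resolver. The function name `dataset_href` is hypothetical.

```python
from pathlib import PurePosixPath


def dataset_href(entry, datasets_dir="datasets"):
    """Return (link, doi_url) for one parsed dataset entry.

    `dataset_href` is a hypothetical helper for illustration only.
    """
    if entry["storage"] == "remote":
        # Remote data: the link is used as given, and the DOI (if present)
        # provides a stable fallback should the URL change.
        doi = entry.get("DOI")
        doi_url = f"https://doi.org/{doi}" if doi else None
        return entry["link"], doi_url
    # Local data: the link is a file name inside the datasets folder,
    # and any DOI field is ignored.
    return str(PurePosixPath(datasets_dir) / entry["link"]), None


remote_entry = {
    "storage": "remote",
    "link": "https://data.caltech.edu/",
    "DOI": "10.1.1/journal.0000",
}
local_entry = {"storage": "local", "link": "dataset1.csv", "DOI": "10.1.1/journal.0000"}

print(dataset_href(remote_entry))
print(dataset_href(local_entry))
```

Returning the DOI link separately keeps the two cases explicit: a remote entry yields both a URL and a DOI link, while a local entry yields only a repository path.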

figures.yaml

Every figure in the main text or supplementary information of your paper that presents data should be easily reproducible. If you are following the reproducible research repository, you will be generating a single script for every figure that is produced. The figures.yaml file encodes information about each figure, including the data sets and script needed to reproduce it. An example fig field is given below.

- fig:
  title: "A descriptive title of the figure, including the figure number."
  filename: fig1.py # The file name of the code used to generate the figure. 
  desc: "A one or two sentence description of the figure."
  pic: "A thumbnail image of the figure to be reproduced."
  req: # Begins the set of required data sets
   - ds: 
     storage: "local"
     title: "Title of the data set"
     link: dataset1.csv
   - ds:
     # ... (fields for further required data sets, as above)

This is the most complicated data structure defined in this template and includes a few nested fields. The first field, fig, defines a figure to be presented as its own object on the website. The filename field is the file name of the code used to generate the figure. The desc field describes the figure in broad strokes and should be enough to jog someone's memory about what the figure was presenting. The pic field is the file name of a thumbnail image of the figure, stored in the assets/img/ subfolder of the root directory.

The req field lists the data sets required to reproduce the figure. Beneath req, each required data set is introduced with - ds:, followed by information about that particular data set. The storage field defines whether the link points to a locally stored data set or to external storage. Finally, the link: field is the file name of, or the link to, the required data set.
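To make the nesting concrete, here is a minimal sketch that walks one parsed fig entry and collects everything needed to reproduce the figure. Several things here are assumptions rather than part of the template: the entry is taken to be already parsed into a dictionary, the figure script is assumed to live in software/ alongside the other scripts, the thumbnail name fig1_thumb.png is made up, and `figure_requirements` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def figure_requirements(fig, software_dir="software",
                        datasets_dir="datasets", img_dir="assets/img"):
    """Return (local_paths, remote_links) needed for one parsed fig entry.

    `figure_requirements` is a hypothetical helper for illustration only.
    """
    local_paths = [
        # Assumption: figure scripts are stored with the other code.
        str(PurePosixPath(software_dir) / fig["filename"]),
        str(PurePosixPath(img_dir) / fig["pic"]),
    ]
    remote_links = []
    # "req" may be absent or empty when a figure needs no data sets.
    for ds in fig.get("req") or []:
        if ds["storage"] == "local":
            local_paths.append(str(PurePosixPath(datasets_dir) / ds["link"]))
        else:
            remote_links.append(ds["link"])
    return local_paths, remote_links


# Assumed shape of one entry of figures.yaml after parsing.
example_fig = {
    "fig": None,
    "title": "A descriptive title of the figure, including the figure number.",
    "filename": "fig1.py",
    "desc": "A one or two sentence description of the figure.",
    "pic": "fig1_thumb.png",  # hypothetical thumbnail file name
    "req": [
        {"ds": None, "storage": "local", "title": "Title of the data set",
         "link": "dataset1.csv"},
        {"ds": None, "storage": "remote", "title": "Title of a remote data set",
         "link": "https://data.caltech.edu/"},
    ],
}

local_paths, remote_links = figure_requirements(example_fig)
print(local_paths)
print(remote_links)
```

Separating local paths from remote links mirrors the storage field: local paths can be checked against the repository, while remote links have to be fetched from the hosting service.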