[ENH] introduce GeneratedBy to "core" BIDS
#487 (and originally #439) is a `WIP ENH` to introduce standardized provenance
capture/expression for BIDS datasets.  This PR just follows the idea of #371
(small atomic ENHs) and is based on the current state of the specification, where
we have `GeneratedBy` to describe how a BIDS derivative dataset came into
existence.

## Rationale

As I have previously stated in many (face-to-face, when it was still
possible ;)) conversations, in my view any BIDS dataset is a derivative
dataset.  Even if it contains "raw" data, it is never given by gods, but is the
result of some process (let's call it a pipeline for consistency) which produced
it out of some other data. That is why there is 1) `sourcedata/` to hold
such original data ("raw" in terms of processing, but "raw"er in
terms of its relation to the actual data acquired by the equipment), and 2) `code/` to
hold the scripts used to produce or "tune" the dataset.  Typically
"sourcedata" is either a collection of DICOMs or a collection of data in some
other format (e.g. NIfTI) which is then either converted or just renamed into
the BIDS layout. ATM, encountering a new BIDS dataset requires forensics
and/or data archaeology to discover how it came about, e.g. to
figure out the source of the buggy (meta)data it contains.

At the level of individual files, some tools already add ad-hoc fields
to the sidecar `.json` files they produce during conversion,

<details>
<summary>e.g. dcm2niix adds ConversionSoftware and ConversionSoftwareVersion</summary>

```shell
(git-annex)lena:~/datalad/dbic/QA[master]git
$> git grep ConversionSoftware | head -n 2
sub-amit/ses-20180508/anat/sub-amit_ses-20180508_acq-MPRAGE_T1w.json:  "ConversionSoftware": "dcm2niix",
sub-amit/ses-20180508/anat/sub-amit_ses-20180508_acq-MPRAGE_T1w.json:  "ConversionSoftwareVersion": "v1.0.20170923 (OpenJPEG build) GCC6.3.0",

```
</details>

ATM I need to add such metadata to datasets produced by heudiconv to make
sure that in case of incremental conversions there is no switch in versions of
the software.
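
A minimal sketch of such a check, assuming the converter records itself in the `GeneratedBy` list of `dataset_description.json`. The helper name `check_no_version_switch` and its arguments are hypothetical; this is not heudiconv's API:

```python
import json
from pathlib import Path


def check_no_version_switch(dataset_root, name, version):
    """Raise if the dataset was previously generated by a different version.

    Hypothetical helper: assumes provenance is recorded under the
    "GeneratedBy" key of dataset_description.json as a list of objects
    with "Name" and (optionally) "Version" keys.
    """
    desc_path = Path(dataset_root) / "dataset_description.json"
    if not desc_path.exists():
        return  # first conversion, nothing to compare against
    description = json.loads(desc_path.read_text())
    for entry in description.get("GeneratedBy", []):
        recorded = entry.get("Version")
        # Only entries for the same tool with a recorded version can conflict.
        if entry.get("Name") == name and recorded not in (None, version):
            raise RuntimeError(
                f"{name} version switched: dataset was generated by "
                f"{recorded}, current run uses {version}"
            )
```

A converter could call this before each incremental run and refuse to proceed (or ask for confirmation) when it raises.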
yarikoptic committed Oct 11, 2021
1 parent 94112b6 commit 2a638c6
Showing 2 changed files with 38 additions and 22 deletions.
58 changes: 37 additions & 21 deletions src/03-modality-agnostic-files.md
@@ -28,9 +28,22 @@ Every dataset MUST include this file with the following fields:
"EthicsApprovals": "OPTIONAL",
"ReferencesAndLinks": "OPTIONAL",
"DatasetDOI": "OPTIONAL",
"GeneratedBy": "RECOMMENDED",
"SourceDatasets": "RECOMMENDED",
}
) }}

Each object in the `GeneratedBy` list includes the following REQUIRED, RECOMMENDED
and OPTIONAL keys:

| **Key name** | **Requirement level** | **Data type** | **Description** |
|--------------|-----------------------|---------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Name | REQUIRED | [string][] | Name of the pipeline or process that generated the outputs. Use `"Manual"` to indicate the derivatives were generated by hand, or adjusted manually after an initial run of an automated pipeline. |
| Version | RECOMMENDED | [string][] | Version of the pipeline. |
| Description | OPTIONAL | [string][] | Plain-text description of the pipeline or process that generated the outputs. RECOMMENDED if `Name` is `"Manual"`. |
| CodeURL | OPTIONAL | [string][] | URL where the code used to generate the dataset may be found. |
| Container | OPTIONAL | [object][] | Used to specify the location and relevant attributes of software container image used to produce the dataset. Valid keys in this object include `Type`, `Tag` and [`URI`][uri] with [string][] values. |
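
The requirements in the table above can be sketched as a small check. This is an illustration only, not the official BIDS validator: the function name `validate_generated_by` is hypothetical, the non-empty-list rule follows the `minItems: 1` in the schema touched by this commit, and only the three `Container` keys named in the table are inspected.

```python
def validate_generated_by(generated_by):
    """Return a list of problems found in a GeneratedBy value.

    Illustrative only: mirrors the table above, not the BIDS validator.
    """
    if not isinstance(generated_by, list) or not generated_by:
        return ["GeneratedBy must be a non-empty list of objects"]
    problems = []
    string_keys = ("Name", "Version", "Description", "CodeURL")
    container_keys = ("Type", "Tag", "URI")
    for i, entry in enumerate(generated_by):
        if not isinstance(entry, dict):
            problems.append(f"[{i}] is not an object")
            continue
        if "Name" not in entry:
            problems.append(f"[{i}] is missing the REQUIRED key Name")
        for key in string_keys:
            if key in entry and not isinstance(entry[key], str):
                problems.append(f"[{i}].{key} must be a string")
        container = entry.get("Container")
        if container is None:
            continue  # Container is OPTIONAL
        if not isinstance(container, dict):
            problems.append(f"[{i}].Container must be an object")
            continue
        for key in container_keys:
            if key in container and not isinstance(container[key], str):
                problems.append(f"[{i}].Container.{key} must be a string")
    return problems
```

An empty return value means no problems were found; RECOMMENDED keys such as `Version` are deliberately not reported when absent.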

Example:

```JSON
@@ -57,7 +70,23 @@ Example:
"Alzheimer A., & Kraepelin, E. (2015). Neural correlates of presenile dementia in humans. Journal of Neuroscientific Data, 2, 234001. doi:1920.8/jndata.2015.7"
],
"DatasetDOI": "doi:10.0.2.3/dfjj.10",
"HEDVersion": "7.1.1"
"HEDVersion": "7.1.1",
"GeneratedBy": [
{
"Name": "reproin",
"Version": "0.6.0",
"Container": {
"Type": "docker",
"Tag": "repronim/reproin:0.6.0"
}
}
],
"SourceDatasets": [
{
"URL": "s3://dicoms/studies/correlates",
"Version": "April 11 2011"
}
]
}
```

@@ -67,27 +96,19 @@ As for any BIDS dataset, a `dataset_description.json` file MUST be found at the
top level of every derived dataset:
`<dataset>/derivatives/<pipeline_name>/dataset_description.json`.

In addition to the keys for raw BIDS datasets,
derived BIDS datasets include the following REQUIRED and RECOMMENDED
`dataset_description.json` keys:
In contrast to raw BIDS datasets, derived BIDS datasets MUST include
`GeneratedBy` key:

{{ MACROS___make_metadata_table(
{
"GeneratedBy": "REQUIRED",
"SourceDatasets": "RECOMMENDED",
"GeneratedBy": "REQUIRED"
}
) }}

Each object in the `GeneratedBy` list includes the following REQUIRED, RECOMMENDED
and OPTIONAL keys:

| **Key name** | **Requirement level** | **Data type** | **Description** |
|--------------|-----------------------|---------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Name | REQUIRED | [string][] | Name of the pipeline or process that generated the outputs. Use `"Manual"` to indicate the derivatives were generated by hand, or adjusted manually after an initial run of an automated pipeline. |
| Version | RECOMMENDED | [string][] | Version of the pipeline. |
| Description | OPTIONAL | [string][] | Plain-text description of the pipeline or process that generated the outputs. RECOMMENDED if `Name` is `"Manual"`. |
| CodeURL | OPTIONAL | [string][] | URL where the code used to generate the derivatives may be found. |
| Container | OPTIONAL | [object][] | Used to specify the location and relevant attributes of software container image used to produce the derivative. Valid keys in this object include `Type`, `Tag` and [`URI`][uri] with [string][] values. |
If a derived dataset is stored as a subfolder of the raw dataset, then the `Name` field
of the first `GeneratedBy` object MUST be a substring of the derived dataset folder name.
That is, in a directory `<dataset>/derivatives/<pipeline>[-<variant>]/`, the first
`GeneratedBy` object should have a `Name` of `<pipeline>`.
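
The naming rule above could be checked along these lines. This is a sketch under stated assumptions: the helper name `first_name_matches_folder` is hypothetical, and the derivative directory is assumed to contain a `dataset_description.json` with a non-empty `GeneratedBy` list.

```python
import json
from pathlib import Path


def first_name_matches_folder(derivative_dir):
    """Check that the first GeneratedBy Name is a substring of the folder name.

    Assumes `derivative_dir` points at
    <dataset>/derivatives/<pipeline>[-<variant>]/ and holds a
    dataset_description.json with a non-empty GeneratedBy list.
    """
    derivative_dir = Path(derivative_dir).resolve()
    description = json.loads(
        (derivative_dir / "dataset_description.json").read_text()
    )
    first_name = description["GeneratedBy"][0]["Name"]
    # Plain substring test, as worded in the rule above.
    return first_name in derivative_dir.name
```

Only the first `GeneratedBy` object is inspected, since the rule does not constrain later entries.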

Example:

@@ -120,11 +141,6 @@ Example:
}
```

If a derived dataset is stored as a subfolder of the raw dataset, then the `Name` field
of the first `GeneratedBy` object MUST be a substring of the derived dataset folder name.
That is, in a directory `<dataset>/derivatives/<pipeline>[-<variant>]/`, the first
`GeneratedBy` object should have a `Name` of `<pipeline>`.

### `README`

Every BIDS dataset SHOULD come with a free form text file (`README`) describing the dataset in more detail.
2 changes: 1 addition & 1 deletion src/schema/objects/metadata.yaml
@@ -866,7 +866,7 @@ Funding:
GeneratedBy:
name: GeneratedBy
description: |
Used to specify provenance of the derived dataset.
Used to specify provenance of the dataset.
See table below for contents of each object.
type: array
minItems: 1