
[ENH] introduce GeneratedBy to "core" BIDS #440

Merged

Conversation

@yarikoptic (Collaborator) commented Mar 25, 2020

#439 is a WIP ENH to introduce standardized provenance capture/expression for BIDS datasets. This PR just follows the idea of #371 (small atomic ENHs), and is based on the current state (as of v1.2.2-189-g4550458) of #265 (common derivatives), which introduced PipelineDescription to describe how a BIDS derivative dataset came into existence.

Rationale

As I had previously stated in many (face-to-face, when that was still possible ;)) conversations, in my view any BIDS dataset is a derivative dataset. Even if it contains "raw" data, it is never given by gods, but is the result of some process (let's call it a pipeline for consistency) which produced it out of some other data. That is why there is 1) `sourcedata/` to provide a place for such original data ("raw" in terms of processing, but "raw"-er in terms of its relation to the actual data acquired by the equipment), and 2) `code/` to provide a place for scripts used to produce or "tune" the dataset. Typically "sourcedata" is either a collection of DICOMs or a collection of data in some other format (e.g. NIfTI) which is then either converted or just renamed into the BIDS layout. ATM, when encountering a new BIDS dataset, it takes forensics and/or data archaeology to discover how the dataset came about, e.g. to figure out the source of any buggy (meta)data it contains.

At the level of individual files, some tools already add ad-hoc fields to the sidecar .json files they produce during conversion.

For example, dcm2niix adds ConversionSoftware and ConversionSoftwareVersion:

```shell
(git-annex)lena:~/datalad/dbic/QA[master]git
$> git grep ConversionSoftware | head -n 2
sub-amit/ses-20180508/anat/sub-amit_ses-20180508_acq-MPRAGE_T1w.json:  "ConversionSoftware": "dcm2niix",
sub-amit/ses-20180508/anat/sub-amit_ses-20180508_acq-MPRAGE_T1w.json:  "ConversionSoftwareVersion": "v1.0.20170923 (OpenJPEG build) GCC6.3.0",
```

ATM I need to add such metadata to datasets produced by heudiconv to make sure that, in case of incremental conversions, there is no switch in the versions of the software.

TODOs:

  • consider/discuss replacing PipelineDescription with Pipeline (or ProducedBy)
  • consider/discuss making it not a single dictionary but a list (order is irrelevant) to account for possibly multiple tools (or multiple versions of the same tool) being used to produce the dataset (a rough sketch follows this list)
  • decide on how to point to included scripts (under code/) where applicable (I think this is out of scope for this PR)
  • decide if/how to annotate "manual conversion", where files literally were manually renamed
  • contribute to bids-validator to recommend adding this field to dataset_description.json (a separate validator issue will be filed once this is merged)
  • adjust the common derivatives PR to account for this addition (it is already part of the spec)
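
For orientation, a rough sketch of the kind of record being proposed within dataset_description.json. This is a hypothetical example: the field name and entry keys are placeholders at this stage of the discussion, and the tools/versions shown are purely illustrative.

```json
{
  "Name": "Example raw dataset",
  "BIDSVersion": "1.2.2",
  "PipelineDescription": [
    { "Name": "dcm2niix", "Version": "v1.0.20170923" },
    { "Name": "heudiconv", "Version": "0.8.0" }
  ]
}
```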

@effigies (Collaborator) commented Mar 27, 2020

I think this is a good idea.

One issue I can foresee is that we may need to easily identify raw vs derived datasets, where the rules differ (I believe derivatives describes itself as a "subclass", with some rules overriding raw rules). I had been thinking of the "PipelineDescription" field as an easy discriminating feature. I think your argument is good, and my vague plans don't reach the level of a reasonable objection. But we should think about how to manage that.

  • consider/discuss replacing PipelineDescription with Pipeline (or ProducedBy)

I think "PipelineDescription" might be too derivatives oriented. We would probably want something that can reasonably describe manual curation, a single tool, or a series of tools. "ProducedBy", "GeneratedBy" seem reasonable.

  • consider/discuss making it not a single dictionary but a list (order is irrelevant) to account for possibly multiple tools being used to produce dataset

I would be okay with this.

  • decide on how to point to included scripts (under code/) where applicable

I think we currently use stim_file entries in events.tsv files as paths relative to <bids-root>/stimuli. Perhaps relative paths are to be interpreted relative to <bids-root>/code?

  • decide if/how to annotate "manual conversion", where files literally were manually renamed

Even if a tool generates my dataset, there's nothing stopping me from making changes after the fact without adding an entry to PipelineDescription.

If we list multiple tools, it's probably not a bad idea to define an explicit manual curation entry to enable a curator to distinguish an automatically produced dataset from one that is additionally curated, but I also think it would be unwise for a downstream tool to interpret its absence as meaningful.
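
For concreteness, a manual curation entry in such a list might look something like this hypothetical sketch (the "Manual" name and the keys used here are illustrative, not settled spelling):

```json
{
  "Name": "Manual",
  "Description": "Files renamed and sidecar metadata corrected by hand after conversion"
}
```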

@sappelhoff (Member)

Generally, I am in favor of this addition. Thanks for the proposal @yarikoptic

consider/discuss replacing PipelineDescription with Pipeline (or ProducedBy)

I like ProducedBy better.

consider/discuss making it not a single dictionary but a list (order is irrelevant) to account for possibly multiple tools (or multiple versions of the same tool) being used to produce dataset

Yes, accounting for multiple tools INCLUDING manual (really manual, not even a script) curation would be better. I think many dataset curation processes require more than a single tool.

We should also add an explicit request that IF truly manual fiddling with the data has been done for conversion, this should be described in a specific README file, possibly placed in /code -- so to speak, a "poor man's script".

decide on how to point to included scripts (under code/) where applicable

Yes, I find this important, and it would enhance the present implementation of the /code directory.

decide if/how to annotate "manual conversion", where files literally were manually renamed

see what I wrote above

decide if/how to annotate "manual conversion", where files literally were manually renamed

yes, and preferably also editing an example dataset under bids-standard/bids-examples for illustration and testing purposes.

@yarikoptic (Collaborator, Author)

Thank you @effigies and @sappelhoff for the feedback! Very much appreciated. It seems that we are very much "in line" ;)
I (or someone who beats me to it; it is unlikely we would overlap in time) will adjust this PR accordingly in the upcoming days.

@satra (Collaborator) commented Apr 1, 2020

ATM I need to add such metadata to datasets produced by heudiconv to make sure that in case of incremental conversions there is no switch in versions of the software.

i would expect for long running studies the details may vary over time, unless the group decides to stick with a particular containerized release. i think there should be a distinction between best practices and what actually happened.

i would recommend aligning with #439 as much as possible; otherwise the keys you introduce are going to overlap with the ones there (it already has "generatedBy"), and perhaps, given that the goals are similar, it would be better to focus on a common discussion there.

further different parts of a dataset may be generated by different pipelines. for example, for "raw" datasets:

  • MRI conversion (e.g., dcm2niix followed by gradient nonlinearity correction, or e.g., some pieces of bep001 are generated by postprocessing software)
  • stimulus information conversion (e.g. uses both eprime and psychopy)
  • assessment information (e.g., from redcap, nih toolbox, pavlovia, etc.)

i agree that at a basic level it would be useful to add "wasGeneratedBy" (or some such key) to each json file, but the value of this key could potentially be a list of the things that directly edited/transformed the file (for example, the gradient nonlinearity corrected nifti files).
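
A purely hypothetical sketch of what such a per-file record could look like in a sidecar .json (the "wasGeneratedBy" key and the entries shown are illustrative only; nothing like this is in the specification at this point):

```json
{
  "ConversionSoftware": "dcm2niix",
  "ConversionSoftwareVersion": "v1.0.20170923",
  "wasGeneratedBy": [
    { "Name": "dcm2niix", "Version": "v1.0.20170923" },
    { "Name": "gradunwarp", "Version": "1.2.0", "Description": "gradient nonlinearity correction" }
  ]
}
```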

@yarikoptic (Collaborator, Author) commented Apr 1, 2020

Thank you @satra !

i would expect for long running studies the details may vary over time, unless the group decides to stick with a particular containerized release. i think there should be a distinction between best practices and what actually happened.

I think it should be STRONGLY ( ;-) ) RECOMMENDED to use a particular release, with containers being ATM the best way to make that happen. Tools could provide easy ways to "reconvert" should a change be desired in the interim.

i would recommend aligning with #439 as much as possible

yes -- aligning with #439 should be kept in mind, so I will need your reviews/fixes ;)

further different parts of a dataset may be generated by different pipelines...

That is what we might eventually arrive at here: dataset_description.json would contain a "summary" over the detailed provenance descriptions that #439 arrives with. Meanwhile, I will aim at a somewhat "high level" description of the dataset producer... maybe later we should indeed allow a similar record in any sidecar .json file to describe particulars of the associated data file (.nii.gz, .tsv, etc).

@yarikoptic yarikoptic mentioned this pull request May 27, 2020
@francopestilli (Collaborator)

@yarikoptic this is a good proposal. I am wondering how PipelineDescription (or Pipeline, or ProducedBy) will handle products that are generated by multiple pipelines. Another way to say this: is this meant to only track the last pipeline, but nothing further back beyond the latest processing step?

@yarikoptic (Collaborator, Author)

I would leave proper provenance/graphs to the PROV BEP, and here just keep a list, treated as a set (so no particular order), of the tools which produced anything in this dataset.

@effigies (Collaborator) commented Jun 5, 2020

Notes from BEP 003:

  • PipelineDescription => GeneratedBy
  • Value is always a list.

See https://bids-specification.readthedocs.io/en/common-derivatives/03-modality-agnostic-files.html#derived-dataset-and-pipeline-description.
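
Following the linked derived-dataset section, the resulting dataset_description.json entry would look along these lines (a sketch; the tool name, version, and URL here are illustrative values, with field names as in the linked section):

```json
{
  "Name": "Example raw dataset",
  "BIDSVersion": "1.6.0",
  "GeneratedBy": [
    {
      "Name": "heudiconv",
      "Version": "0.9.0",
      "CodeURL": "https://github.com/nipy/heudiconv"
    }
  ]
}
```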

@effigies changed the title from '[WIP ENH] introduce PipelineDescription to "core" BIDS' to '[WIP ENH] introduce GeneratedBy to "core" BIDS' on Nov 19, 2020
@yarikoptic (Collaborator, Author)

After a long time of "doing nothing", I have updated this PR to reflect the current state of the specification.
The only peculiar thing is the dichotomy between RECOMMENDED (for "raw" BIDS) and REQUIRED (for "derived").
Since "manual" is a viable name for the activity, I would even be OK making it REQUIRED for raw datasets, but that would mean "breakage" and thus I guess should not be done. WDYT on how this aspect could/should be mitigated?

@yarikoptic force-pushed the enh-pipeline-description branch 2 times, most recently from 00de6ff to 0d095e3, October 11, 2021 19:38
yarikoptic added a commit to dbic/heudiconv that referenced this pull request Oct 11, 2021
Unfortunately there is no convention yet in BIDS on storing such information in
a standardized way.

bids-standard/bids-specification#440
 proposes to add GeneratedBy (within dataset_description.json)
 which could provide detailed high-level information, which should then be
 consistent throughout the dataset (so we would need to add safeguards)

bids-standard/bids-specification#487
 is WIP to introduce PROV into the BIDS standard, which would allow establishing
 _prov.json with all the needed gory details.

For now, since fields in sidecar .json files are not strictly regulated,
I think it would be beneficial to the user to have the heudiconv version stored there
along with other "Version" fields, such as

	$> grep -e Version -e dcm2ni fmap/sub-phantom1sid1_ses-localizer_acq-3mm_phasediff.json
	  "ConversionSoftware": "dcm2niix",
	  "ConversionSoftwareVersion": "v1.0.20211006",
	  "SoftwareVersions": "syngo MR E11",

and although strictly speaking Heudiconv is a "conversion software", since
dcm2niix decided to use that pair, I have decided to leave it alone and just
come up with yet another descriptive key

  "HeudiconvVersion": "0.10.0",
@yarikoptic (Collaborator, Author)

ping @effigies for guidance with this PR -- need feedback.

@effigies (Collaborator)

@yarikoptic Overall this LGTM. I'll try to submit a detailed review in the near future, but I think we can do this. I suspect with general review, the only validatable things (as opposed to wording adjustments) that might change would be requirement levels, so it seems safe to go ahead and implement this for tools like heudiconv (if you haven't already) as it will simply be ignored by the validator in the meantime.

cc @bids-standard/maintainers for possible objections...

@tsalo (Member) previously requested changes Oct 27, 2021

I have no objections, just a couple of requests for wording changes.

Review comments on src/schema/objects/metadata.yaml and src/03-modality-agnostic-files.md (outdated; resolved).
@Remi-Gau (Collaborator)

No objections from me. I agree with @tsalo's requested changes.

@sappelhoff (Member) left a comment

no objections from my side either, I agree that this could be useful.

And the validator schema has these fields already:

https://github.com/bids-standard/bids-validator/blob/8befd13c5d18efca79e4f5913f009b5811a7608b/bids-validator/validators/json/schemas/dataset_description.json#L55-L85

(required only for derivatives)

@yarikoptic (Collaborator, Author)

And the validator schema has these fields already:

https://github.com/bids-standard/bids-validator/blob/8befd13c5d18efca79e4f5913f009b5811a7608b/bids-validator/validators/json/schemas/dataset_description.json#L55-L85

(required only for derivatives)

my worry is about making it RECOMMENDED for non-derivatives and REQUIRED for derivatives:

  • the validator must validate formatting for any dataset, but enforce REQUIRED only when it somehow (how?) determines that it is a derivative dataset
  • such a difference would probably not be (easily) representable in the schema

"Easy" way out is to make it REQUIRED for all but then it would introduce backward incompatible change to the specification so can't be done for 1.x series. Making it always RECOMMENDED is also suboptimal.

@effigies (Collaborator)

my worry is about making it RECOMMENDED for non-derivatives and REQUIRED for derivatives:

  • the validator must validate formatting for any dataset, but enforce REQUIRED only when it somehow (how?) determines that it is a derivative dataset

The DatasetType field can be derivative.

  • such a difference would probably not be (easily) representable in the schema

There are already rules that certain fields are required in certain cases (e.g., if PET data is present, all MRI data must define NonlinearGradientCorrection). We need to solve this problem one way or another.

"Easy" way out is to make it REQUIRED for all but then it would introduce backward incompatible change to the specification so can't be done for 1.x series. Making it always RECOMMENDED is also suboptimal.

Agree. We're kind of stuck here.
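
For illustration only (this is not the actual bids-validator schema): a JSON Schema sketch, assuming a draft-07-style validator, of how "REQUIRED only for derivatives" could be expressed as a conditional rule while leaving it merely recommended for raw datasets:

```json
{
  "type": "object",
  "properties": {
    "DatasetType": { "enum": ["raw", "derivative"] },
    "GeneratedBy": { "type": "array", "minItems": 1 }
  },
  "if": {
    "properties": { "DatasetType": { "const": "derivative" } },
    "required": ["DatasetType"]
  },
  "then": {
    "required": ["GeneratedBy"]
  }
}
```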

@yarikoptic (Collaborator, Author)

ok, am I reading it right that there are no changes to make for this PR?

Re DatasetType, we have

```yaml
DatasetType:
  name: DatasetType
  description: |
    The interpretation of the dataset.
    MUST be one of `"raw"` or `"derivative"`.
    For backwards compatibility, the default value is `"raw"`.
```

I wonder -- why is it "for backwards compatibility" only? If it were not, then we could assume that specifying it is mandatory for "derivative" datasets. With such wording it kinda suggests I should not even bother specifying it for derivatives? Or is my reading incorrect?

@yarikoptic (Collaborator, Author)

@effigies - ping on the above

A collaborator commented on lines +31 to +32:

```json
"GeneratedBy": "RECOMMENDED",
"SourceDatasets": "RECOMMENDED",
```

RECOMMENDED will generally come with a validator warning on absence. That's fine with me, but want to make sure that's your intent. Also, this table is sorted by requirement level, so if these are RECOMMENDED they should come before the OPTIONAL fields.

@effigies (Collaborator)

ok, am I reading it right that there are no changes to make for this PR?

Re DatasetType, we have

```yaml
DatasetType:
  name: DatasetType
  description: |
    The interpretation of the dataset.
    MUST be one of `"raw"` or `"derivative"`.
    For backwards compatibility, the default value is `"raw"`.
```

I wonder -- why is it "for backwards compatibility" only? If it were not, then we could assume that specifying it is mandatory for "derivative" datasets. With such wording it kinda suggests I should not even bother specifying it for derivatives? Or is my reading incorrect?

A derivative dataset that does not declare it will be interpreted as raw. In effect it's optional for raw and mandatory for derivative, but we can make it recommended so people know to add it and be explicit.
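
To make that concrete (a sketch with illustrative names and versions): a derivative dataset would then declare both its type and its provenance explicitly in dataset_description.json, while a raw dataset omitting DatasetType defaults to "raw":

```json
{
  "Name": "fMRIPrep outputs",
  "BIDSVersion": "1.6.0",
  "DatasetType": "derivative",
  "GeneratedBy": [
    { "Name": "fMRIPrep", "Version": "20.2.0" }
  ]
}
```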

@Remi-Gau (Collaborator) left a comment

LGTM

@Remi-Gau (Collaborator)

@tsalo I think your change requests have been addressed.

@effigies effigies dismissed tsalo’s stale review January 18, 2022 18:05

Comments addressed. Please re-review.


@effigies changed the title from '[WIP ENH] introduce GeneratedBy to "core" BIDS' to '[ENH] introduce GeneratedBy to "core" BIDS' on Feb 1, 2022
@yarikoptic (Collaborator, Author)

I think it is blessed for a merge? ;)

@sappelhoff (Member) left a comment

Fine with me to get this into 1.7. Validator support seems to be in place as well.

@sappelhoff sappelhoff merged commit 04268fb into bids-standard:master Feb 4, 2022
@sappelhoff (Member)

Thanks @yarikoptic
