
[ENH] introduce GeneratedBy to "core" BIDS #440

Merged

Conversation

@yarikoptic (Collaborator) commented Mar 25, 2020

#439 is a WIP ENH to introduce standardized provenance capture/expression for BIDS datasets. This PR just follows the idea of #371 (small atomic ENHs), and is based on the current state (as of v1.2.2-189-g4550458) of #265 (common derivatives), which introduced PipelineDescription to describe how a BIDS derivative dataset came into existence.

Rationale

As I had previously stated in many (face-to-face, when that was still possible ;)) conversations, in my view any BIDS dataset is a derivative dataset. Even if it contains "raw" data, it is never given by gods, but is the result of some process (let's call it a pipeline for consistency) which produced it out of some other data. That is why there is 1) `sourcedata/` to provide a place for such original data ("raw" in terms of processing, but "raw"-er in terms of its relation to the actual data acquired by the equipment), and 2) `code/` to provide a place for scripts used to produce or "tune" the dataset. Typically "sourcedata" is either a collection of DICOMs or a collection of data in some other format (e.g. NIfTI) which is then either converted or just renamed into the BIDS layout. ATM, when encountering a new BIDS dataset, it takes forensics and/or data archaeology to discover how the dataset came about, e.g. to figure out the source of any buggy (meta)data it contains.

At the level of individual files, some tools already add ad-hoc fields to the sidecar .json files they produce during conversion.

For example, dcm2niix adds ConversionSoftware and ConversionSoftwareVersion:

```shell
(git-annex)lena:~/datalad/dbic/QA[master]git
$> git grep ConversionSoftware | head -n 2
sub-amit/ses-20180508/anat/sub-amit_ses-20180508_acq-MPRAGE_T1w.json:  "ConversionSoftware": "dcm2niix",
sub-amit/ses-20180508/anat/sub-amit_ses-20180508_acq-MPRAGE_T1w.json:  "ConversionSoftwareVersion": "v1.0.20170923 (OpenJPEG build) GCC6.3.0",
```

ATM I need to add such metadata to datasets produced by heudiconv to make sure that, in case of incremental conversions, there is no switch in the versions of the software.

TODOs:

  • consider/discuss replacing PipelineDescription with Pipeline (or ProducedBy)
  • consider/discuss making it not a single dictionary but a list (order is irrelevant) to account for possibly multiple tools (or multiple versions of the same tool) being used to produce the dataset (a rough sketch follows this list)
  • decide on how to point to included scripts (under code/) where applicable (I think this is out of scope for this PR)
  • decide if/how to annotate "manual conversion", where files literally were manually renamed
  • contribute to bids-validator to recommend adding this field to dataset_description.json (a separate validator issue will be filed once this is merged)
  • adjust the common derivatives PR to account for this addition (it is already part of the spec)
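
For orientation, a rough sketch of the kind of record being proposed within dataset_description.json. This is a hypothetical example: the field name and entry keys are placeholders at this stage of the discussion, and the tools/versions shown are purely illustrative.

```json
{
  "Name": "Example raw dataset",
  "BIDSVersion": "1.2.2",
  "PipelineDescription": [
    { "Name": "dcm2niix", "Version": "v1.0.20170923" },
    { "Name": "heudiconv", "Version": "0.8.0" }
  ]
}
```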

@effigies (Collaborator) commented Mar 27, 2020

I think this is a good idea.

One issue I can foresee is that we may need to easily identify raw vs derived datasets, where the rules differ (I believe derivatives describes itself as a "subclass", with some rules overriding raw rules). I had been thinking of the "PipelineDescription" field as an easy discriminating feature. I think your argument is good, and my vague plans don't reach the level of a reasonable objection. But we should think about how to manage that.

  • consider/discuss replacing PipelineDescription with Pipeline (or ProducedBy)

I think "PipelineDescription" might be too derivatives oriented. We would probably want something that can reasonably describe manual curation, a single tool, or a series of tools. "ProducedBy", "GeneratedBy" seem reasonable.

  • consider/discuss making it not a single dictionary but a list (order is irrelevant) to account for possibly multiple tools being used to produce dataset

I would be okay with this.

  • decide on how to point to included scripts (under code/) where applicable

I think we currently use stim_file entries in events.tsv files as paths relative to <bids-root>/stimuli. Perhaps relative paths are to be interpreted relative to <bids-root>/code?

  • decide if/how to annotate "manual conversion", where files literally were manually renamed

Even if a tool generates my dataset, there's nothing stopping me from making changes after the fact without adding an entry to PipelineDescription.

If we list multiple tools, it's probably not a bad idea to define an explicit manual curation entry to enable a curator to distinguish an automatically produced dataset from one that is additionally curated, but I also think it would be unwise for a downstream tool to interpret its absence as meaningful.
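
For concreteness, a manual curation entry in such a list might look something like this hypothetical sketch (the "Manual" name and the keys used here are illustrative, not settled spelling):

```json
{
  "Name": "Manual",
  "Description": "Files renamed and sidecar metadata corrected by hand after conversion"
}
```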

@sappelhoff (Member)

Generally, I am in favor of this addition. Thanks for the proposal @yarikoptic

consider/discuss replacing PipelineDescription with Pipeline (or ProducedBy)

I like ProducedBy better.

consider/discuss making it not a single dictionary but a list (order is irrelevant) to account for possibly multiple tools (or multiple versions of the same tool) being used to produce dataset

Yes, accounting for multiple tools INCLUDING manual (really manual, not even a script) curation would be better. I think many dataset curation processes require more than a single tool.

We should also add an explicit request that IF truly manual fiddling with the data has been done for conversion, this should be described in a specific README file, possibly placed in /code -- so to speak, a "poor man's script".

decide on how to point to included scripts (under code/) where applicable

Yes, I find this important, and it would enhance the present implementation of the /code directory.

decide if/how to annotate "manual conversion", where files literally were manually renamed

see what I wrote above

decide if/how to annotate "manual conversion", where files literally were manually renamed

yes, and preferably also editing an example dataset under bids-standard/bids-examples for illustration and testing purposes.

@yarikoptic (Collaborator, Author)

Thank you @effigies and @sappelhoff for the feedback! Very much appreciated. It seems that we are very much "in line" ;)
I (or someone who beats me to it; it is unlikely we would overlap in time) will adjust this PR accordingly in the upcoming days.

@satra (Collaborator) commented Apr 1, 2020

ATM I need to add such metadata to datasets produced by heudiconv to make sure that in case of incremental conversions there is no switch in versions of the software.

i would expect for long running studies the details may vary over time, unless the group decides to stick with a particular containerized release. i think there should be a distinction between best practices and what actually happened.

i would recommend aligning with #439 as much as possible; otherwise the keys you introduce are going to overlap with the ones there (it already has "generatedBy"), and perhaps, given that the goals are similar, it would be better to focus on a common discussion there.

further different parts of a dataset may be generated by different pipelines. for example, for "raw" datasets:

  • MRI conversion (e.g., dcm2niix followed by gradient nonlinearity correction, or e.g., some pieces of bep001 are generated by postprocessing software)
  • stimulus information conversion (e.g. uses both eprime and psychopy)
  • assessment information (e.g., from redcap, nih toolbox, pavlovia, etc.)

i agree that at a basic level it would be useful to add "wasGeneratedBy" (or some such key) to each json file, but the value of this key could potentially be a list of the things that directly edited/transformed the file (for example, the gradient nonlinearity corrected nifti files).
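
A purely hypothetical sketch of what such a per-file record could look like in a sidecar .json (the "wasGeneratedBy" key and the entries shown are illustrative only; nothing like this is in the specification at this point):

```json
{
  "ConversionSoftware": "dcm2niix",
  "ConversionSoftwareVersion": "v1.0.20170923",
  "wasGeneratedBy": [
    { "Name": "dcm2niix", "Version": "v1.0.20170923" },
    { "Name": "gradunwarp", "Version": "1.2.0", "Description": "gradient nonlinearity correction" }
  ]
}
```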

@yarikoptic (Collaborator, Author) commented Apr 1, 2020

Thank you @satra !

i would expect for long running studies the details may vary over time, unless the group decides to stick with a particular containerized release. i think there should be a distinction between best practices and what actually happened.

I think it should be STRONGLY ( ;-) ) RECOMMENDED to use a particular release, with containers being ATM the best way to make that happen. Tools could provide easy ways to "reconvert" should a change be desired in the interim.

i would recommend aligning with #439 as much as possible

yes -- aligning with #439 should be kept in mind, so I will need your reviews/fixes ;)

further different parts of a dataset may be generated by different pipelines...

That is what we might eventually arrive at here: dataset_description.json would contain a "summary" over the detailed provenance descriptions that #439 arrives with. Meanwhile, I will aim at a somewhat "high level" description of the dataset producer... maybe later we should indeed allow a similar record in any sidecar .json file to describe particulars of the associated data file (.nii.gz, .tsv, etc).

@yarikoptic yarikoptic mentioned this pull request May 27, 2020
@francopestilli (Collaborator)

@yarikoptic this is a good proposal. I am wondering how PipelineDescription (or Pipeline, or ProducedBy) will handle products that are generated by multiple pipelines. Another way to say this: is this meant to only track the last pipeline, but nothing further back beyond the latest processing step?

@yarikoptic (Collaborator, Author)

I would leave proper provenance/graphs to the PROV BEP, and here just keep a list, treated as a set (so no particular order), of the tools which produced anything in this dataset.

@effigies (Collaborator) commented Jun 5, 2020

Notes from BEP 003:

  • PipelineDescription => GeneratedBy
  • Value is always a list.

See https://bids-specification.readthedocs.io/en/common-derivatives/03-modality-agnostic-files.html#derived-dataset-and-pipeline-description.
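
Following the linked derived-dataset section, the resulting dataset_description.json entry would look along these lines (a sketch; the tool name, version, and URL here are illustrative values, with field names as in the linked section):

```json
{
  "Name": "Example raw dataset",
  "BIDSVersion": "1.6.0",
  "GeneratedBy": [
    {
      "Name": "heudiconv",
      "Version": "0.9.0",
      "CodeURL": "https://github.com/nipy/heudiconv"
    }
  ]
}
```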

@effigies changed the title from '[WIP ENH] introduce PipelineDescription to "core" BIDS' to '[WIP ENH] introduce GeneratedBy to "core" BIDS' on Nov 19, 2020
@yarikoptic (Collaborator, Author)

After a long time of "doing nothing", I have updated this PR to reflect the current state of the specification.
The only peculiar thing is the dichotomy between RECOMMENDED (for "raw" BIDS) and REQUIRED (for "derived").
Since "manual" is a viable name for the activity, I would even be OK making it REQUIRED for raw datasets, but that would mean "breakage" and thus I guess should not be done. WDYT on how this aspect could/should be mitigated?

@yarikoptic force-pushed the enh-pipeline-description branch 2 times, most recently from 00de6ff to 0d095e3, October 11, 2021 19:38
yarikoptic added a commit to dbic/heudiconv that referenced this pull request Oct 11, 2021
Unfortunately there is no convention yet in BIDS on storing such information in
a standardized way.

bids-standard/bids-specification#440
 proposes to add GeneratedBy (within dataset_description.json)
 which could provide detailed high-level information, which should then be
 consistent throughout the dataset (so we would need to add safeguards)

bids-standard/bids-specification#487
 is WIP to introduce PROV into the BIDS standard, which would allow establishing
 _prov.json with all the needed gory details.

For now, since fields in sidecar .json files are not strictly regulated,
I think it would be beneficial to the user to have the heudiconv version stored there
along with other "Version" fields, such as

	$> grep -e Version -e dcm2ni fmap/sub-phantom1sid1_ses-localizer_acq-3mm_phasediff.json
	  "ConversionSoftware": "dcm2niix",
	  "ConversionSoftwareVersion": "v1.0.20211006",
	  "SoftwareVersions": "syngo MR E11",

and although strictly speaking Heudiconv is a "conversion software", since
dcm2niix decided to use that pair, I have decided to leave it alone and just
come up with yet another descriptive key

  "HeudiconvVersion": "0.10.0",
@yarikoptic (Collaborator, Author)

ping @effigies for guidance with this PR -- need feedback.

@effigies (Collaborator)

@yarikoptic Overall this LGTM. I'll try to submit a detailed review in the near future, but I think we can do this. I suspect with general review, the only validatable things (as opposed to wording adjustments) that might change would be requirement levels, so it seems safe to go ahead and implement this for tools like heudiconv (if you haven't already) as it will simply be ignored by the validator in the meantime.

cc @bids-standard/maintainers for possible objections...

@tsalo (Member) previously requested changes Oct 27, 2021

I have no objections, just a couple of requests for wording changes.

Review comments on src/schema/objects/metadata.yaml and src/03-modality-agnostic-files.md (outdated; resolved).
@Remi-Gau (Collaborator)

No objections from me. I agree with @tsalo's requested changes.

@sappelhoff (Member) left a comment

no objections from my side either, I agree that this could be useful.

And the validator schema has these fields already:

https://github.com/bids-standard/bids-validator/blob/8befd13c5d18efca79e4f5913f009b5811a7608b/bids-validator/validators/json/schemas/dataset_description.json#L55-L85

(required only for derivatives)

@yarikoptic (Collaborator, Author)

And the validator schema has these fields already:

https://github.com/bids-standard/bids-validator/blob/8befd13c5d18efca79e4f5913f009b5811a7608b/bids-validator/validators/json/schemas/dataset_description.json#L55-L85

(required only for derivatives)

my worry is about making it RECOMMENDED for non-derivatives and REQUIRED for derivatives:

  • the validator must validate formatting for any dataset, but enforce REQUIRED only when it somehow (how?) determines that it is a derivative dataset
  • such a difference would probably not be (easily) representable in the schema

"Easy" way out is to make it REQUIRED for all but then it would introduce backward incompatible change to the specification so can't be done for 1.x series. Making it always RECOMMENDED is also suboptimal.

@effigies (Collaborator)

my worry is about making it RECOMMENDED for non-derivatives and REQUIRED for derivatives:

  • the validator must validate formatting for any dataset, but enforce REQUIRED only when it somehow (how?) determines that it is a derivative dataset

The DatasetType field can be derivative.

  • such a difference would probably not be (easily) representable in the schema

There are already rules that certain fields are required in certain cases (e.g., if PET data is present, all MRI data must define NonlinearGradientCorrection). We need to solve this problem one way or another.

"Easy" way out is to make it REQUIRED for all but then it would introduce backward incompatible change to the specification so can't be done for 1.x series. Making it always RECOMMENDED is also suboptimal.

Agree. We're kind of stuck here.
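
For illustration only (this is not the actual bids-validator schema): a JSON Schema sketch, assuming a draft-07-style validator, of how "REQUIRED only for derivatives" could be expressed as a conditional rule while leaving it merely recommended for raw datasets:

```json
{
  "type": "object",
  "properties": {
    "DatasetType": { "enum": ["raw", "derivative"] },
    "GeneratedBy": { "type": "array", "minItems": 1 }
  },
  "if": {
    "properties": { "DatasetType": { "const": "derivative" } },
    "required": ["DatasetType"]
  },
  "then": {
    "required": ["GeneratedBy"]
  }
}
```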

@yarikoptic (Collaborator, Author)

ok, am I reading it right that there are no changes to make for this PR?

Re DatasetType, we have

```yaml
DatasetType:
  name: DatasetType
  description: |
    The interpretation of the dataset.
    MUST be one of `"raw"` or `"derivative"`.
    For backwards compatibility, the default value is `"raw"`.
```

I wonder -- why is it "for backwards compatibility" only? If it were not, then we could assume that specifying it is mandatory for "derivative" datasets. With such wording it kinda suggests I should not even bother specifying it for derivatives? Or is my reading incorrect?

@yarikoptic (Collaborator, Author)

@effigies - ping on the above

A collaborator commented on lines +31 to +32:

```json
"GeneratedBy": "RECOMMENDED",
"SourceDatasets": "RECOMMENDED",
```

RECOMMENDED will generally come with a validator warning on absence. That's fine with me, but want to make sure that's your intent. Also, this table is sorted by requirement level, so if these are RECOMMENDED they should come before the OPTIONAL fields.

@effigies (Collaborator)

ok, am I reading it right that there are no changes to make for this PR?

Re DatasetType, we have

```yaml
DatasetType:
  name: DatasetType
  description: |
    The interpretation of the dataset.
    MUST be one of `"raw"` or `"derivative"`.
    For backwards compatibility, the default value is `"raw"`.
```

I wonder -- why is it "for backwards compatibility" only? If it were not, then we could assume that specifying it is mandatory for "derivative" datasets. With such wording it kinda suggests I should not even bother specifying it for derivatives? Or is my reading incorrect?

A derivative dataset that does not declare it will be interpreted as raw. In effect it's optional for raw and mandatory for derivative, but we can make it recommended so people know to add it and be explicit.
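
To make that concrete (a sketch with illustrative names and versions): a derivative dataset would then declare both its type and its provenance explicitly in dataset_description.json, while a raw dataset omitting DatasetType defaults to "raw":

```json
{
  "Name": "fMRIPrep outputs",
  "BIDSVersion": "1.6.0",
  "DatasetType": "derivative",
  "GeneratedBy": [
    { "Name": "fMRIPrep", "Version": "20.2.0" }
  ]
}
```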

@Remi-Gau (Collaborator) left a comment

LGTM

@Remi-Gau (Collaborator)

@tsalo I think your change requests have been addressed.

@effigies effigies dismissed tsalo’s stale review January 18, 2022 18:05

Comments addressed. Please re-review.


@effigies changed the title from '[WIP ENH] introduce GeneratedBy to "core" BIDS' to '[ENH] introduce GeneratedBy to "core" BIDS' on Feb 1, 2022
@yarikoptic (Collaborator, Author)

I think it is blessed for a merge? ;)

@sappelhoff (Member) left a comment

Fine with me to get this into 1.7. Validator support seems to be in place as well.

@sappelhoff sappelhoff merged commit 04268fb into bids-standard:master Feb 4, 2022
@sappelhoff (Member)

Thanks @yarikoptic
