[ENH] introduce GeneratedBy to "core" BIDS #440
Conversation
I think this is a good idea. One issue I can foresee is that we may need to easily identify raw vs derived datasets, where the rules differ (I believe derivatives describes itself as a "subclass", with some rules overriding raw rules). I had been thinking of the "PipelineDescription" field as an easy discriminating feature. I think your argument is good, and my vague plans don't reach the level of a reasonable objection. But we should think about how to manage that.
I think "PipelineDescription" might be too derivatives oriented. We would probably want something that can reasonably describe manual curation, a single tool, or a series of tools. "ProducedBy", "GeneratedBy" seem reasonable.
I would be okay with this.
I think we currently use
Even if a tool generates my dataset, there's nothing stopping me from making changes after the fact without adding an entry to it. If we list multiple tools, it's probably not a bad idea to define an explicit manual curation entry to enable a curator to distinguish an automatically produced dataset from one that is additionally curated, but I also think it would be unwise for a downstream tool to interpret its absence as meaningful.
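As a purely illustrative sketch of the idea being discussed (field names follow the `GeneratedBy` structure defined for derivatives; the tool names, versions, and the explicit `"Manual"` entry are assumptions, not part of this comment):

```json
{
  "Name": "Example raw dataset",
  "BIDSVersion": "1.6.0",
  "GeneratedBy": [
    {
      "Name": "heudiconv",
      "Version": "0.9.0",
      "Description": "Automated DICOM to BIDS conversion"
    },
    {
      "Name": "Manual",
      "Description": "Hand-edited events.tsv files after conversion"
    }
  ]
}
```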
Generally, I am in favor of this addition. Thanks for the proposal @yarikoptic
I like
Yes, accounting for multiple tools INCLUDING manual (really manual, not even a script) curation would be better. I think many dataset curation processes require more than a single tool. We should also add an explicit request that IF truly manual fiddling with the data has been done for conversion, this should be described in a specific README file, possibly placed in
yes, I find this important and it would enhance the present implementation of the
see what I wrote above
yes, and preferably also editing an example dataset under bids-standard/bids-examples for illustration and testing purposes.
Thank you @effigies and @sappelhoff for the feedback! Very much appreciated. It seems that we are very much "in line" ;)
i would expect for long running studies the details may vary over time, unless the group decides to stick with a particular containerized release. i think there should be a distinction between best practices and what actually happened. i would recommend aligning with #439 as much as possible, otherwise the keys you introduce are going to overlap with that (already "generatedBy"), and perhaps, given the goals are similar, focus on a common discussion there. further, different parts of a dataset may be generated by different pipelines. for example, for "raw" datasets:
i agree that at a basic level it would be useful to add "wasGeneratedBy" (or some such key) to each json file, but the value of this key could potentially be a list of the things directly editing/transforming the file (for example the gradient nonlinearity corrected nifti files).
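A hypothetical sidecar entry along these lines (the `wasGeneratedBy` key and its contents are illustrative only; nothing like this is defined in the spec at this point):

```json
{
  "RepetitionTime": 2.0,
  "wasGeneratedBy": [
    {
      "Name": "dcm2niix",
      "Version": "v1.0.20211006"
    },
    {
      "Name": "gradunwarp",
      "Description": "Gradient nonlinearity correction applied to the NIfTI output"
    }
  ]
}
```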
Thank you @satra !
I think it should be STRONGLY ( ;-) ) RECOMMENDED to use a particular release, and containers as the best way ATM to make that happen. Tools could provide easy ways to "reconvert" whenever a change is desired in the interim.
yes -- aligning with #439 should be kept in mind, so I will need your reviews/fixes ;)
that is what eventually we might arrive at here: that dataset_description.json would contain a "summary" over the detailed provenance descriptions #439 arrives with. Meanwhile, I will aim at a somewhat "high level" description of the dataset producer... maybe later we should indeed allow a similar record in any side car .json file to describe particulars of the associated data (.nii.gz, .tsv, etc) file.
@yarikoptic this is a good proposal. I am wondering how PipelineDescription with Pipeline (or ProducedBy) will handle products that are generated by multiple Pipelines. Another way to say this: is this meant to only track the 1-back pipeline, but not further beyond the latest processing step?
I would leave proper provenance/graph for the PROV BEP, and here just keep a list as a set (so no particular order) of tools which had produced anything in this dataset.
Notes from BEP 003:
After a long time of "doing nothing", I have updated this PR to reflect the current state of the specification.
bids-standard#487 (and originally bids-standard#439) is a `WIP ENH` to introduce standardized provenance capture/expression for BIDS datasets. This PR just follows the idea of bids-standard#371 (small atomic ENHs), and is based on current state of the specification where we have GeneratedBy to describe how a BIDS derivative dataset came to its existence.

## Rationale

As I had previously stated in many (face-to-face when it was still possible ;)) conversations, in my view, any BIDS dataset is a derivative dataset. Even if it contains "raw" data, it is never given by gods, but is a result of some process (let's call it pipeline for consistency) which produced it out of some other data. That is why there is 1) `sourcedata/` to provide placement for such original (as "raw" in terms of processing, but "raw"er in terms of its relation to actual data acquired by equipment), and 2) `code/` to provide placement for scripts used to produce or "tune" the dataset. Typically "sourcedata" is either a collection of DICOMs or a collection of data in some other formats (e.g. nifti) which is then either converted or just renamed into BIDS layout.

When encountering a new BIDS dataset ATM it requires forensics and/or data archaeology to discover how this BIDS dataset came about, to e.g. possibly figure out the source of the buggy (meta)data it contains. At the level of individual files, some tools already add ad-hoc fields during conversion into side car .json files they produce,

<details>
<summary>e.g. dcm2niix adds ConversionSoftware and ConversionSoftwareVersion</summary>

```shell
(git-annex)lena:~/datalad/dbic/QA[master]git
$> git grep ConversionSoftware | head -n 2
sub-amit/ses-20180508/anat/sub-amit_ses-20180508_acq-MPRAGE_T1w.json:  "ConversionSoftware": "dcm2niix",
sub-amit/ses-20180508/anat/sub-amit_ses-20180508_acq-MPRAGE_T1w.json:  "ConversionSoftwareVersion": "v1.0.20170923 (OpenJPEG build) GCC6.3.0",
```

</details>

ATM I need to add such metadata to datasets produced by heudiconv to make sure that in case of incremental conversions there is no switch in versions of the software.
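For concreteness, a sketch (with assumed names and versions, not taken from this PR) of what a converted "raw" dataset's `dataset_description.json` could carry under this proposal, including a `Container` entry to pin a particular containerized release as suggested earlier in the thread:

```json
{
  "Name": "dbic QA",
  "BIDSVersion": "1.6.0",
  "DatasetType": "raw",
  "GeneratedBy": [
    {
      "Name": "heudiconv",
      "Version": "0.10.0",
      "CodeURL": "https://github.com/nipy/heudiconv",
      "Container": {
        "Type": "docker",
        "Tag": "nipy/heudiconv:0.10.0"
      }
    },
    {
      "Name": "dcm2niix",
      "Version": "v1.0.20211006"
    }
  ]
}
```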
Unfortunately there is no convention yet in BIDS on storing such information in a standardized way. bids-standard/bids-specification#440 proposes to add GeneratedBy (within dataset_description.json) which could provide detailed high level information which should then be consistent throughout the dataset (so we would need to add safeguards). bids-standard/bids-specification#487 is WiP to introduce PROV into the BIDS standard, which would allow establishing _prov.json with all needed gory details.

For now, since fields in side car .json files are not strictly regulated, I think it would be beneficial to the user to have the heudiconv version stored there along with other "Version" fields, such as

    $> grep -e Version -e dcm2ni fmap/sub-phantom1sid1_ses-localizer_acq-3mm_phasediff.json
      "ConversionSoftware": "dcm2niix",
      "ConversionSoftwareVersion": "v1.0.20211006",
      "SoftwareVersions": "syngo MR E11",

and although strictly speaking heudiconv is a "conversion software", since dcm2niix decided to use that pair, I have decided to leave it alone and just come up with yet another descriptive key `"HeudiconvVersion": "0.10.0",`.
ping @effigies for guidance with this PR -- need feedback.
@yarikoptic Overall this LGTM. I'll try to submit a detailed review in the near future, but I think we can do this. I suspect with general review, the only validatable things (as opposed to wording adjustments) that might change would be requirement levels, so it seems safe to go ahead and implement this for tools like heudiconv (if you haven't already) as it will simply be ignored by the validator in the meantime. cc @bids-standard/maintainers for possible objections...
I have no objections, just a couple of requests for wording changes.
No objections from me. Agree with @tsalo's requested changes.
no objections from my side either, I agree that this could be useful.
And the validator schema has these fields already:
(required only for derivatives)
my worry is about making it RECOMMENDED for non-derivatives and REQUIRED for derivatives:
The "easy" way out is to make it REQUIRED for all, but then it would introduce a backward incompatible change to the specification, so it can't be done for the 1.x series. Making it always RECOMMENDED is also suboptimal.
The
There are already rules that certain fields are required under certain cases (e.g., if PET data is present, all MRI data must define
Agree. We're kind of stuck here.
Co-authored-by: Stefan Appelhoff <[email protected]> Co-authored-by: Taylor Salo <[email protected]>
ok, am I re-reading it right that there are no changes to do for this PR? Re DatasetType, we have
I wonder -- why is it "for backwards compatibility" only? If it was not, then we could assume that it is mandatory for
@effigies - ping on the above
"GeneratedBy": "RECOMMENDED", | ||
"SourceDatasets": "RECOMMENDED", |
RECOMMENDED will generally come with a validator warning on absence. That's fine with me, but want to make sure that's your intent. Also, this table is sorted by requirement level, so if these are RECOMMENDED they should come before the OPTIONAL fields.
A derivative dataset that does not declare it will be interpreted as raw. In effect it's optional for raw and mandatory for derivative, but we can make it recommended so people know to add it and be explicit.
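A minimal sketch of the derivative side of that distinction (values are illustrative): with `"DatasetType": "derivative"` declared, `GeneratedBy` would be required, whereas a dataset omitting `DatasetType` is read as raw and would at most get a warning for a missing `GeneratedBy`.

```json
{
  "Name": "fMRIPrep derivatives",
  "BIDSVersion": "1.6.0",
  "DatasetType": "derivative",
  "GeneratedBy": [
    {
      "Name": "fMRIPrep",
      "Version": "20.2.3",
      "CodeURL": "https://github.com/nipreps/fmriprep"
    }
  ]
}
```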
LGTM
@tsalo I think your change requests have been addressed.
Comments addressed. Please re-review.
Codecov Report

    @@            Coverage Diff            @@
    ##             master     #440   +/-   ##
    =========================================
      Coverage          ?   36.16%
    =========================================
      Files             ?        8
      Lines             ?      788
      Branches          ?        0
    =========================================
      Hits              ?      285
      Misses            ?      503
      Partials          ?        0
I think it is blessed for a merge? ;)
Fine with me to get this into 1.7. Validator support seems to be in place as well.
Thanks @yarikoptic
#439 is a `WIP ENH` to introduce standardized provenance capture/expression for BIDS datasets. This PR just follows the idea of #371 (small atomic ENHs), and is based on current state (as of v1.2.2-189-g4550458) of #265 (common derivatives), which introduced `PipelineDescription` to describe how a BIDS derivative dataset came to its existence.
TODOs:

- replace `PipelineDescription` with `Pipeline` (or `ProducedBy`)
- … `code/` … where applicable (I think it is OOS for this PR)
- … `dataset_description.json` (will file separate issue with validator whenever merged)
- adjust common derivatives PR to account for this addition (it is part of the spec already)