Symbolic linking within datasets #526

tsalo · 2020-07-08T19:23:30Z

#508 proposes a number of new suffixes meant for qMRI workflows. Each of these suffixes requires multiple files, and in some cases one of those files may be equivalent to a file with an existing suffix. For example, one file from a multi-parametric mapping (MPM) scheme may be the same as a T1w scan, and if the dataset curator knows this, they could identify it as such.

#508 also introduces the idea of symbolically linking dataset files to derivatives, in cases where the scanner automatically generates what would typically be considered a derivative (e.g., a T2map).

Would it be reasonable for the curator to symbolically link files within a dataset?

So, for example, we could have the following, where the T1w file is a symbolic link to one of the MPM files:

sub-X/
    anat/
        sub-X_fa-1_mt-on_MPM.nii.gz ---
        sub-X_fa-1_mt-off_MPM.nii.gz  |
        sub-X_fa-2_mt-on_MPM.nii.gz   |  symbolic link
        sub-X_fa-2_mt-off_MPM.nii.gz  | 
        sub-X_T1w.nii.gz  <------------

Tagging @yarikoptic and @adswa to get Datalad-related thoughts, as well as @agahkarakuzu and @emdupre because they were involved in the initial conversation that spawned this issue.

This issue is related to #508 and #512.


effigies · 2020-07-08T19:32:41Z

Symlinks and deduplication seem like problems for the filesystem or a storage system like datalad, and should not be part of the specification. Not all filesystems support symlinks, so I think it would be unwise for us to recommend or require them in the spec.

Do we currently have a principle in which we say files must not be duplicated?

tsalo · 2020-07-11T16:15:57Z

I haven't seen anything about duplication in the spec, but I could have missed it. Are your concerns specifically about symlinks, or about duplicate files in general? I don't think having copies of files would be a problem for Datalad, but then I think there'd be a need for unique identifiers stored in the sidecars. Perhaps this could just be reflected in the scans file, which, at least with heudiconv, generally has some random string that is unique to each file?

Since symlinks are a part of BEP001, should they be replaced with file duplication?
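If files were duplicated rather than symlinked, the unique-identifier idea above could be approximated with content checksums. The following is only a rough sketch, assuming that duplicates are plain byte-for-byte copies; `sha256_of` and `find_duplicates` are hypothetical helper names, not part of BIDS, Datalad, or heudiconv.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(dataset_root: Path) -> Dict[str, List[Path]]:
    """Group .nii.gz files under dataset_root by checksum.

    Hypothetical helper: only groups with more than one file are returned.
    Assumes copies are byte-identical; re-compressed images would not match.
    """
    by_digest = defaultdict(list)
    for path in sorted(dataset_root.rglob("*.nii.gz")):
        if path.is_file():
            by_digest[sha256_of(path)].append(path)
    return {d: paths for d, paths in by_digest.items() if len(paths) > 1}
```

Calling `find_duplicates(Path("ds-example"))` would return a mapping from checksum to the files that share it, which a curator could then record wherever the identifiers end up living.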

tsalo · 2020-08-17T16:54:10Z

Per bids-standard/bids-2-devel#43 (comment), @satra agrees that symlinking would not be compatible with common storage systems.

Does anyone have any ideas for a good alternative that will work well with scanner-generated "derivatives"?

effigies · 2020-08-17T16:58:53Z

My suggestion would be to generate your dataset as a compliant derivatives dataset, and stick derivatives side-by-side with raw files. I'm not sure if this is anything like a consensus position, but given that derivatives datasets may contain raw filenames IFF they are raw files, I think it's a kind of nice way to handle the case. If it becomes common behavior, it drives us toward the end state where we acknowledge that all datasets are derivative.

tsalo · 2020-08-19T15:05:43Z

To tie it back to #508, the BEP001 team has proposed the following format for a dataset with scanner-generated derivatives and sufficient provenance (with minor adjustments to add functional data):

ds-example/
 ├── derivatives/
 |   └── qMRI-software/
 |       └── sub-01/
 |           └── anat/
 |               ├── sub-01_T1map.nii.gz ─────────┐ L
 |               ├── sub-01_T1map.json   ───────┐ | I
 |               ├── sub-01_MTsat.nii.gz ─────┐ | | N
 |               └── sub-01_MTsat.json   ───┐ | | | K
 └── sub-01/                                | | | |
     ├── anat/                              | | | |
     |   ├── sub-01_fa-1_mt-on_MTS.nii.gz   | | | | T
     |   ├── sub-01_fa-1_mt-on_MTS.json     | | | | O
     |   ├── sub-01_fa-1_mt-off_MTS.nii.gz  | | | |
     |   ├── sub-01_fa-1_mt-off_MTS.json    | | | | A
     |   ├── sub-01_fa-2_mt-off_MTS.nii.gz  | | | | N
     |   ├── sub-01_fa-2_mt-off_MTS.json    | | | | A
     |   ├── sub-01_T1map.nii.gz <──────────├─├─├─┘ T
     |   ├── sub-01_T1map.json   <──────────├─├─┘
     |   ├── sub-01_MTsat.nii.gz <──────────├─┘
     |   └── sub-01_MTsat.json   <──────────┘
     └── func/
         ├── sub-01_task-rest_bold.nii.gz
         └── sub-01_task-rest_bold.json

In the case of scanner-generated derivatives without provenance, I believe that their proposal is to simply have the data in the raw data folder:

ds-example/
 └── sub-01/
     ├── anat/
     |   ├── sub-01_fa-1_mt-on_MTS.nii.gz
     |   ├── sub-01_fa-1_mt-on_MTS.json
     |   ├── sub-01_fa-1_mt-off_MTS.nii.gz
     |   ├── sub-01_fa-1_mt-off_MTS.json
     |   ├── sub-01_fa-2_mt-off_MTS.nii.gz
     |   ├── sub-01_fa-2_mt-off_MTS.json
     |   ├── sub-01_T1map.nii.gz
     |   ├── sub-01_T1map.json
     |   ├── sub-01_MTsat.nii.gz
     |   └── sub-01_MTsat.json
     └── func/
         ├── sub-01_task-rest_bold.nii.gz
         └── sub-01_task-rest_bold.json

If I understand correctly, you're proposing that folks do almost the opposite: put everything in the derivatives folder? Like this:

ds-example/
 ├── derivatives/
 |   └── qMRI-software/
 |       └── sub-01/
 |           └── anat/
 |               ├── sub-01_fa-1_mt-on_MTS.nii.gz
 |               ├── sub-01_fa-1_mt-on_MTS.json
 |               ├── sub-01_fa-1_mt-off_MTS.nii.gz
 |               ├── sub-01_fa-1_mt-off_MTS.json
 |               ├── sub-01_fa-2_mt-off_MTS.nii.gz
 |               ├── sub-01_fa-2_mt-off_MTS.json
 |               ├── sub-01_T1map.nii.gz
 |               ├── sub-01_T1map.json
 |               ├── sub-01_MTsat.nii.gz
 |               └── sub-01_MTsat.json
 └── sub-01/
     └── func/
         ├── sub-01_task-rest_bold.nii.gz
         └── sub-01_task-rest_bold.json

effigies · 2020-08-19T15:15:06Z

No, I'm proposing:

ds-example/
 └── sub-01/
     ├── anat/
     |   ├── sub-01_fa-1_mt-on_MTS.nii.gz
     |   ├── sub-01_fa-1_mt-on_MTS.json
     |   ├── sub-01_fa-1_mt-off_MTS.nii.gz
     |   ├── sub-01_fa-1_mt-off_MTS.json
     |   ├── sub-01_fa-2_mt-off_MTS.nii.gz
     |   ├── sub-01_fa-2_mt-off_MTS.json
     |   ├── sub-01_T1map.nii.gz
     |   ├── sub-01_T1map.json
     |   ├── sub-01_MTsat.nii.gz
     |   └── sub-01_MTsat.json
     └── func/
         ├── sub-01_task-rest_bold.nii.gz
         └── sub-01_task-rest_bold.json

With dataset_description.json:

{
  ...
  "DatasetType": "derivatives",
  "GeneratedBy": [...]
}
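For anyone scripting this, here is a minimal sketch of writing such a file. `Name` and `BIDSVersion` are the other required fields in `dataset_description.json`; every value passed in below is a placeholder, and `write_derivatives_description` is a hypothetical helper rather than an existing tool.

```python
import json
from pathlib import Path


def write_derivatives_description(
    dataset_root: Path,
    name: str,
    bids_version: str,
    generated_by_name: str,
) -> Path:
    """Write a dataset_description.json that marks the dataset as derivatives.

    Hypothetical helper for illustration; all argument values are supplied
    by the curator and nothing here is prescribed by the thread.
    """
    description = {
        "Name": name,
        "BIDSVersion": bids_version,
        "DatasetType": "derivatives",
        # GeneratedBy is a list of objects; each needs at least a "Name".
        "GeneratedBy": [{"Name": generated_by_name}],
    }
    dataset_root.mkdir(parents=True, exist_ok=True)
    out_path = dataset_root / "dataset_description.json"
    out_path.write_text(json.dumps(description, indent=2) + "\n")
    return out_path


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    write_derivatives_description(
        Path("ds-example"),
        name="Example dataset",
        bids_version="1.4.0",
        generated_by_name="scanner-software",
    )
```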

tsalo · 2020-08-19T15:33:06Z

Ohhhh okay. Thanks! Now that there's a symlink-less solution on the table, I'll feed it back into the BEP001 review.

tsalo · 2020-08-24T21:12:37Z

I commented on the BEP001 PR with the proposed solution, so I'm going to close this.

tsalo mentioned this issue Aug 19, 2020

[ENH] BEP001 - Quantitative MRI #508

Closed

tsalo closed this as completed Aug 24, 2020
