Symbolic linking within datasets #526

Closed
tsalo opened this issue Jul 8, 2020 · 8 comments

@tsalo
Member

tsalo commented Jul 8, 2020

#508 proposes a number of new suffixes meant for qMRI workflows. These suffixes all require multiple files, and in some cases some of those files may be equivalent to existing suffixes. For example, one file from a multi-parametric mapping (MPM) scheme may be the same as a T1w scan, and if the dataset curator knows this, they could identify it as such.

#508 also introduces the idea of symbolically linking dataset files to derivatives, in cases where the scanner automatically generates what would typically be considered a derivative (e.g., a T2map).

Would it be reasonable for the curator to symbolically link files within a dataset?

So, for example, we could have the following:

sub-X/
    anat/
        sub-X_fa-1_mt-on_MPM.nii.gz ---
        sub-X_fa-1_mt-off_MPM.nii.gz  |
        sub-X_fa-2_mt-on_MPM.nii.gz   |  symbolic link
        sub-X_fa-2_mt-off_MPM.nii.gz  | 
        sub-X_T1w.nii.gz  <------------

Tagging @yarikoptic and @adswa to get Datalad-related thoughts, as well as @agahkarakuzu and @emdupre because they were involved in the initial conversation that spawned this issue.

This issue is related to #508 and #512.

@effigies
Collaborator

effigies commented Jul 8, 2020

Symlinks and deduplication seem like problems for the filesystem or a storage system like datalad, and should not be part of the specification. Not all filesystems support symlinks, so I think it would be unwise for us to recommend or require them in the spec.
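To make the portability concern concrete, here is a minimal sketch of how a curator might check whether a dataset tree contains any symbolic links before sharing it. The function name `find_symlinks` is hypothetical and not part of any BIDS tool; it relies only on the Python standard library.

```python
import os
from pathlib import Path


def find_symlinks(dataset_root):
    """Return the sorted, dataset-relative paths of every symlink under dataset_root.

    Hypothetical helper for illustration; not part of the BIDS validator.
    """
    root = Path(dataset_root)
    found = []
    # followlinks=False: symlinked directories are reported but not descended into.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in dirnames + filenames:
            candidate = Path(dirpath) / name
            if candidate.is_symlink():
                found.append(candidate.relative_to(root).as_posix())
    return sorted(found)
```

An empty result would mean the tree can be copied to a filesystem without symlink support and keep the same layout.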

Do we currently have a principle in which we say files must not be duplicated?

@tsalo
Member Author

tsalo commented Jul 11, 2020

I haven't seen anything about duplication in the spec, but I could have missed it. Are your concerns specifically about symlinks, or about duplicate files in general? I don't think having copies of files would be a problem for Datalad, but then I think there'd be a need for unique identifiers stored in the sidecars. Perhaps this could just be reflected in the scans file, which, at least with heudiconv, generally has some random string that is unique to each file?
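As a rough illustration of what identifying copies could look like in the absence of symlinks, the sketch below groups files by a SHA-256 content hash. This is an assumption about one possible approach, not something the spec or heudiconv defines; `sha256_of` and `find_duplicates` are hypothetical names.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large NIfTI files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(dataset_root):
    """Return lists of dataset-relative paths whose file contents are byte-identical."""
    root = Path(dataset_root)
    by_hash = defaultdict(list)
    for path in sorted(root.rglob("*")):
        # Skip directories and symlinks; only regular files are compared.
        if path.is_file() and not path.is_symlink():
            by_hash[sha256_of(path)].append(path.relative_to(root).as_posix())
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Note that this compares raw bytes, so two compressed files holding the same image but written with different gzip settings would not be reported as duplicates.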

Since symlinks are a part of BEP001, should they be replaced with file duplication?

@tsalo
Member Author

tsalo commented Aug 17, 2020

Per bids-standard/bids-2-devel#43 (comment), @satra agrees that symlinking would not be compatible with common storage systems.

Does anyone have any ideas for a good alternative that will work well with scanner-generated "derivatives"?

@effigies
Collaborator

My suggestion would be to generate your dataset as a compliant derivatives dataset, and stick derivatives side-by-side with raw files. I'm not sure if this is anything like a consensus position, but given that derivatives datasets may contain raw filenames IFF they are raw files, I think it's a kind of nice way to handle the case. If it becomes common behavior, it drives us toward the end state where we acknowledge that all datasets are derivative.

@tsalo
Member Author

tsalo commented Aug 19, 2020

To tie it back to #508, the BEP001 team has proposed the following format for a dataset with scanner-generated derivatives and sufficient provenance (with minor adjustments to add functional data):

ds-example/
 ├── derivatives/
 |   └── qMRI-software/
 |       └── sub-01/
 |           └── anat/
 |               ├── sub-01_T1map.nii.gz ─────────┐ L
 |               ├── sub-01_T1map.json   ───────┐ | I
 |               ├── sub-01_MTsat.nii.gz ─────┐ | | N
 |               └── sub-01_MTsat.json   ───┐ | | | K
 └── sub-01/                                | | | |
     ├── anat/                              | | | |
     |   ├── sub-01_fa-1_mt-on_MTS.nii.gz   | | | | T
     |   ├── sub-01_fa-1_mt-on_MTS.json     | | | | O
     |   ├── sub-01_fa-1_mt-off_MTS.nii.gz  | | | |
     |   ├── sub-01_fa-1_mt-off_MTS.json    | | | | A
     |   ├── sub-01_fa-2_mt-off_MTS.nii.gz  | | | | N
     |   ├── sub-01_fa-2_mt-off_MTS.json    | | | | A
     |   ├── sub-01_T1map.nii.gz <──────────├─├─├─┘ T
     |   ├── sub-01_T1map.json   <──────────├─├─┘
     |   ├── sub-01_MTsat.nii.gz <──────────├─┘
     |   └── sub-01_MTsat.json   <──────────┘
     └── func/
         ├── sub-01_task-rest_bold.nii.gz
         └── sub-01_task-rest_bold.json

In the case of scanner-generated derivatives without provenance, I believe that their proposal is to simply have the data in the raw data folder:

ds-example/
 └── sub-01/
     ├── anat/
     |   ├── sub-01_fa-1_mt-on_MTS.nii.gz
     |   ├── sub-01_fa-1_mt-on_MTS.json
     |   ├── sub-01_fa-1_mt-off_MTS.nii.gz
     |   ├── sub-01_fa-1_mt-off_MTS.json
     |   ├── sub-01_fa-2_mt-off_MTS.nii.gz
     |   ├── sub-01_fa-2_mt-off_MTS.json
     |   ├── sub-01_T1map.nii.gz
     |   ├── sub-01_T1map.json
     |   ├── sub-01_MTsat.nii.gz
     |   └── sub-01_MTsat.json
     └── func/
         ├── sub-01_task-rest_bold.nii.gz
         └── sub-01_task-rest_bold.json

If I understand correctly, you're proposing that folks do almost the opposite: put everything in the derivatives folder? Like this:

ds-example/
 ├── derivatives/
 |   └── qMRI-software/
 |       └── sub-01/
 |           └── anat/
 |               ├── sub-01_fa-1_mt-on_MTS.nii.gz
 |               ├── sub-01_fa-1_mt-on_MTS.json
 |               ├── sub-01_fa-1_mt-off_MTS.nii.gz
 |               ├── sub-01_fa-1_mt-off_MTS.json
 |               ├── sub-01_fa-2_mt-off_MTS.nii.gz
 |               ├── sub-01_fa-2_mt-off_MTS.json
 |               ├── sub-01_T1map.nii.gz
 |               ├── sub-01_T1map.json
 |               ├── sub-01_MTsat.nii.gz
 |               └── sub-01_MTsat.json
 └── sub-01/
     └── func/
         ├── sub-01_task-rest_bold.nii.gz
         └── sub-01_task-rest_bold.json

@effigies
Collaborator

No, I'm proposing:

ds-example/
 └── sub-01/
     ├── anat/
     |   ├── sub-01_fa-1_mt-on_MTS.nii.gz
     |   ├── sub-01_fa-1_mt-on_MTS.json
     |   ├── sub-01_fa-1_mt-off_MTS.nii.gz
     |   ├── sub-01_fa-1_mt-off_MTS.json
     |   ├── sub-01_fa-2_mt-off_MTS.nii.gz
     |   ├── sub-01_fa-2_mt-off_MTS.json
     |   ├── sub-01_T1map.nii.gz
     |   ├── sub-01_T1map.json
     |   ├── sub-01_MTsat.nii.gz
     |   └── sub-01_MTsat.json
     └── func/
         ├── sub-01_task-rest_bold.nii.gz
         └── sub-01_task-rest_bold.json

With dataset_description.json:

{
  ...
  "DatasetType": "derivative",
  "GeneratedBy": [...]
}
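A minimal sketch of generating such a file, assuming the layout above. The helper name `write_derivatives_description` and all argument values are hypothetical; the BIDS version and the `GeneratedBy` entries are left to the caller because the thread does not specify them. As far as I know, the BIDS specification spells the `DatasetType` value as `derivative` and expects each `GeneratedBy` entry to be an object with at least a `Name`.

```python
import json
from pathlib import Path


def write_derivatives_description(dataset_root, name, bids_version, generated_by):
    """Write a dataset_description.json that marks the dataset as a derivative.

    All arguments are supplied by the curator; nothing here is taken from the thread
    beyond the DatasetType and GeneratedBy keys.
    """
    description = {
        "Name": name,
        "BIDSVersion": bids_version,
        "DatasetType": "derivative",
        "GeneratedBy": generated_by,  # list of objects, each with at least "Name"
    }
    path = Path(dataset_root) / "dataset_description.json"
    path.write_text(json.dumps(description, indent=2) + "\n", encoding="utf-8")
    return path
```

For example, a curator might call it with `generated_by=[{"Name": "qMRI-software"}]`, where `qMRI-software` is just the placeholder name used in the directory trees earlier in this thread.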

@tsalo
Member Author

tsalo commented Aug 19, 2020

Ohhhh okay. Thanks! Now that there's a symlink-less solution on the table, I'll feed it back into the BEP001 review.

@tsalo
Member Author

tsalo commented Aug 24, 2020

I commented on the BEP001 PR with the proposed solution, so I'm going to close this.

@tsalo tsalo closed this as completed Aug 24, 2020