
Add option to use (metadata of) assets already in the archive for the dandiset #1169

Open
yarikoptic opened this issue Dec 5, 2022 · 5 comments


@yarikoptic
Member

I think this is the use case @satra has in mind which resulted in #1163: we need to organize just some subset of locally present assets and have no other assets available locally. @satra envisions that dandi (during organize and/or upload) would somehow (an exact algorithmic procedure has not yet been formulated) use asset metadata from the archive to ensure "correct" disambiguation.

Having (only briefly) thought about it, I am not yet sure about the possibility of a clean algorithmic procedure which would satisfy us. I think that we might "alternatively" make it more explicit, and allow the user to explicitly state, for a given dandiset, which metadata entities to use, and thus make any partial dandiset organize consistent with the case where it was done on the entire dandiset across all files. WDYT @satra ?
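
For reference, a minimal sketch of how the archive-side metadata could be pulled in, using the existing dandi Python API (the dandiset ID is the one from the report below; the disambiguation logic itself is the part that remains to be designed):

  from dandi.dandiapi import DandiAPIClient

  # Connect to the main DANDI instance and list the paths of assets
  # already in the (draft) dandiset; organize could consult these
  # paths and their metadata when disambiguating a partial local set.
  with DandiAPIClient.for_dandi_instance("dandi") as client:
      dandiset = client.get_dandiset("000037", "draft")
      for asset in dandiset.get_assets():
          print(asset.path)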

@satra
Member

satra commented Dec 5, 2022

@yarikoptic - at least as it stands, there is a need for disambiguation in partial-upload settings that takes the data on the archive into account. indeed, addressing #69 as part of this will help. it is a bit more complicated, as we may need to rename files of assets in the archive as a result. but having a joint, integrated view of the dandiset would be helpful.

indeed, the algorithm is not there yet, but this offers a chance to think through how distributed syncing could work, both in the near term and eventually.

@jeromelecoq

I stumbled upon this issue with the following steps (roughly the command sequence sketched after this list):

  • I had a previously uploaded dataset: https://dandiarchive.org/dandiset/000037?pos=1
  • We wanted to upload an update to those files. Initially I created all the updated NWB files locally in an "updated" folder and ran dandi organize.
  • I also deleted all NWB files in the "local copy of 000037" (large, in the 100s of GB range).
  • One of the files had validation errors, so running dandi upload pushed all but that one file.
  • Then we fixed that one file. I edited the NWB file in the local "updated" folder containing all NWB files.
  • I ran dandi organize followed by dandi upload, but this would not upload any file due to duplication.
  • So I deleted all local files in the "updated" folder and ran dandi organize again, followed by dandi upload.
  • This upload worked, but this one file was uploaded without the session part of the filename, as shown here: https://dandiarchive.org/dandiset/000037/draft/files?location=sub-411400
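
A condensed sketch of that sequence (the directory names are assumptions for illustration):

❯ cd 000037                    # local copy of the dandiset, large NWB files deleted
❯ dandi organize ../updated/   # organize the updated NWB files into the dandiset tree
❯ dandi upload                 # pushed all but the one file failing validation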

@yarikoptic
Member Author

Thank you @jeromelecoq for sharing the use case in detail. Note that you can always use the regular mv (rename) command to rename files as you see needed (e.g. to add a _ses- entity); see the example below. My question though -- why didn't you rename the files directly to the correct filenames, now that you know how they should be named?
dandi organize is just a helper to organize a pool of already existing data. But the best workflow, whenever you either already have a tree of files in the DANDI layout (e.g. whenever you have already uploaded to DANDI) or know how files should be named, is to just name them appropriately to start with.
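
For example, to add the _ses- entity to a file (the unsuffixed source filename here is an assumption for illustration; the target matches one of the paths listed below):

❯ mv sub-411400/sub-411400_behavior+ophys.nwb \
     sub-411400/sub-411400_ses-20181002T173740_behavior+ophys.nwb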

You mention "duplication" -- what exactly was duplicated?

In other words: organize is just a helper and not a mandatory step in the workflow to prepare data for upload to DANDI.

What I see as likely needed for our DANDI (or even BIDS) layout validation (so dandi validate) is to be able to say --mode=incremental (analogous to the one envisioned in #47), but to then also acquire an --instance option, so that current files are considered along with the ones known to the archive during validation. It would be of benefit only after we gain some validation rules which validate consistency across files (e.g. all of them either having or not having the _ses- entity).
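
A sketch of what such an invocation might look like; neither option exists yet, this is purely the envisioned interface:

❯ dandi validate --mode=incremental --instance=dandi .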

all of them seem to have _ses-:

❯ dandi ls https://dandiarchive.org/dandiset/000037/draft/files?location=sub-411400/ | grep path:
2022-12-07 12:00:36,077 [    INFO] Logs saved in /home/yoh/.cache/dandi-cli/log/20221207170035Z-656130.log
  path: sub-411400/sub-411400_ses-20181015T173410_behavior+ophys.nwb
  path: sub-411400/sub-411400_ses-20181011T174057_behavior+ophys.nwb
  path: sub-411400/sub-411400_ses-20181003T180253_behavior+ophys.nwb
  path: sub-411400/sub-411400_ses-20181002T173740_behavior+ophys.nwb
  path: sub-411400/sub-411400_ses-20181009T175037_behavior+ophys.nwb
  path: sub-411400/sub-411400_ses-20181001T180256_behavior+ophys.nwb
  path: sub-411400/sub-411400_ses-20181003T180253_behavior+image+ophys.nwb
  path: sub-411400/sub-411400_ses-20181001T180256_behavior+image+ophys.nwb
  path: sub-411400/sub-411400_ses-20181009T175037_behavior+image+ophys.nwb
  path: sub-411400/sub-411400_ses-20181011T174057_behavior+image+ophys.nwb
  path: sub-411400/sub-411400_ses-20181015T173410_behavior+image+ophys.nwb
  path: sub-411400/sub-411400_ses-20181002T173740_behavior+image+ophys.nwb

@yarikoptic
Member Author

re duplication -- I guess you meant files with and without +image? You can use dandi delete to delete the "older" ones. Once again -- dandi organize is a helper -- it would never be able to read minds (even with ChatGPT ;)).
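
For example, assuming the dandi:// resource identifier form accepted by dandi-cli (substitute whichever of the listed assets is actually the stale one):

❯ dandi delete dandi://dandi/000037@draft/sub-411400/sub-411400_ses-20181002T173740_behavior+ophys.nwb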

@jeromelecoq

  1. Sorry, I ended up deleting one file from DANDI directly through the website, since I ended up with two uploads of the same file under different filenames. This might explain why you don't see it anymore.
  2. Regarding using 'mv': I was not entirely clear on what was happening in the background between dandi organize, dandi validate, and dandi upload. I followed the documentation, and I assumed that if I messed with the files in between, it would not function properly. Perhaps it is stated somewhere in the documentation and I missed it? What I understood was that dandi organize made the local copy and naming of files before upload, dandi validate checked the content of the files, and dandi upload did the upload itself, checking that nothing is uploaded twice. Does this help clarify my point of view?
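
For reference, the overall workflow those three commands belong to, as a sketch using standard dandi-cli invocations (the dandiset ID is the one from this thread; the organize step only applies to files not already named per the DANDI layout):

❯ dandi download https://dandiarchive.org/dandiset/000037/draft   # fetch the existing layout and metadata
❯ cd 000037
❯ dandi organize /path/to/updated/nwb/files                       # optional renaming helper
❯ dandi validate .
❯ dandi upload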
