
Add option to use (metadata of) assets already in the archive for the dandiset #1169

Open
yarikoptic opened this issue Dec 5, 2022 · 5 comments


@yarikoptic
Member

I think this is the use case @satra has in mind which resulted in #1163: we need to organize just some subset of locally present assets and have no other assets available locally. @satra envisions that dandi (during organize and/or upload) would somehow (an exact algorithmic procedure has not yet been formulated) use asset metadata from the archive to ensure "correct" disambiguation.

Having (only briefly) thought about it, I am not yet sure about the possibility of a clean algorithmic procedure which would satisfy us. I think that we might "alternatively" make it more explicit, and allow the user to explicitly state, for a given dandiset, which metadata entities to use, and thus make any partial dandiset organize consistent with the case where it was done on the entire dandiset across all files. WDYT @satra ?
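
For reference, a minimal sketch of how the archive-side metadata could be pulled in, using the existing dandi Python API (the dandiset ID is the one from the report below; the disambiguation logic itself is the part that remains to be designed):

  from dandi.dandiapi import DandiAPIClient

  # Connect to the main DANDI instance and list the paths of assets
  # already in the (draft) dandiset; organize could consult these
  # paths and their metadata when disambiguating a partial local set.
  with DandiAPIClient.for_dandi_instance("dandi") as client:
      dandiset = client.get_dandiset("000037", "draft")
      for asset in dandiset.get_assets():
          print(asset.path)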

@satra
Member

satra commented Dec 5, 2022

@yarikoptic - at least as it stands, there is a need for disambiguation in partial-upload settings that takes the data on the archive into account. indeed, addressing #69 as part of this will help. it is a bit more complicated, as we may need to rename files of assets in the archive as a result. but having a joint, integrated view of the dandiset would be helpful.

indeed, the algorithm is not there yet, but this offers a chance to think through how distributed syncing could work, both in the near term and eventually.

@jeromelecoq

I stumbled upon this issue with the following steps (roughly the command sequence sketched after this list):

  • I had a previously uploaded dataset: https://dandiarchive.org/dandiset/000037?pos=1
  • We wanted to upload an update to those files. Initially I created all the updated NWB files locally in an "updated" folder and ran dandi organize.
  • I also deleted all NWB files in the "local copy of 000037" (large, in the 100s of GB range).
  • One of the files had validation errors, so running dandi upload pushed all but that one file.
  • Then we fixed that one file. I edited the NWB file in the local "updated" folder containing all NWB files.
  • I ran dandi organize followed by dandi upload, but this would not upload any file due to duplication.
  • So I deleted all local files in the "updated" folder and ran dandi organize again, followed by dandi upload.
  • This upload worked, but this one file was uploaded without the session part of the filename, as shown here: https://dandiarchive.org/dandiset/000037/draft/files?location=sub-411400
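
A condensed sketch of that sequence (the directory names are assumptions for illustration):

❯ cd 000037                    # local copy of the dandiset, large NWB files deleted
❯ dandi organize ../updated/   # organize the updated NWB files into the dandiset tree
❯ dandi upload                 # pushed all but the one file failing validation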

@yarikoptic
Member Author

Thank you @jeromelecoq for sharing the use case in detail. Note that you can always use the regular mv (rename) command to rename files as you see needed (e.g. to add a _ses- entity); see the example below. My question though -- why didn't you rename the files directly to the correct filenames, now that you know how they should be named?
dandi organize is just a helper to organize a pool of already existing data. But the best workflow, whenever you either already have a tree of files in the DANDI layout (e.g. whenever you have already uploaded to DANDI) or know how files should be named, is to just name them appropriately to start with.
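
For example, to add the _ses- entity to a file (the unsuffixed source filename here is an assumption for illustration; the target matches one of the paths listed below):

❯ mv sub-411400/sub-411400_behavior+ophys.nwb \
     sub-411400/sub-411400_ses-20181002T173740_behavior+ophys.nwb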

You mention "duplication" -- what exactly was duplicated?

In other words: organize is just a helper and not a mandatory step in the workflow to prepare data for upload to DANDI.

What I see as likely needed for our DANDI (or even BIDS) layout validation (so dandi validate) is to be able to say --mode=incremental (analogous to the one envisioned in #47), but to then also acquire an --instance option, so that current files are considered along with the ones known to the archive during validation. It would be of benefit only after we gain some validation rules which validate consistency across files (e.g. all of them either having or not having the _ses- entity).
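
A sketch of what such an invocation might look like; neither option exists yet, this is purely the envisioned interface:

❯ dandi validate --mode=incremental --instance=dandi .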

all of them seem to have _ses-:

❯ dandi ls https://dandiarchive.org/dandiset/000037/draft/files?location=sub-411400/ | grep path:
2022-12-07 12:00:36,077 [    INFO] Logs saved in /home/yoh/.cache/dandi-cli/log/20221207170035Z-656130.log
  path: sub-411400/sub-411400_ses-20181015T173410_behavior+ophys.nwb
  path: sub-411400/sub-411400_ses-20181011T174057_behavior+ophys.nwb
  path: sub-411400/sub-411400_ses-20181003T180253_behavior+ophys.nwb
  path: sub-411400/sub-411400_ses-20181002T173740_behavior+ophys.nwb
  path: sub-411400/sub-411400_ses-20181009T175037_behavior+ophys.nwb
  path: sub-411400/sub-411400_ses-20181001T180256_behavior+ophys.nwb
  path: sub-411400/sub-411400_ses-20181003T180253_behavior+image+ophys.nwb
  path: sub-411400/sub-411400_ses-20181001T180256_behavior+image+ophys.nwb
  path: sub-411400/sub-411400_ses-20181009T175037_behavior+image+ophys.nwb
  path: sub-411400/sub-411400_ses-20181011T174057_behavior+image+ophys.nwb
  path: sub-411400/sub-411400_ses-20181015T173410_behavior+image+ophys.nwb
  path: sub-411400/sub-411400_ses-20181002T173740_behavior+image+ophys.nwb

@yarikoptic
Member Author

re duplication -- I guess you meant files with and without +image? You can use dandi delete to delete the "older" ones. Once again -- dandi organize is a helper -- it would never be able to read minds (even with ChatGPT ;)).
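
For example, assuming the dandi:// resource identifier form accepted by dandi-cli (substitute whichever of the listed assets is actually the stale one):

❯ dandi delete dandi://dandi/000037@draft/sub-411400/sub-411400_ses-20181002T173740_behavior+ophys.nwb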

@jeromelecoq

  1. Sorry, I ended up deleting one file from DANDI directly through the website, since I ended up with two uploads of the same file under different filenames. This might explain why you don't see it anymore.
  2. Regarding using 'mv': I was not entirely clear on what was happening in the background between dandi organize, dandi validate, and dandi upload. I followed the documentation, and I assumed that if I messed with the files in between, it would not function properly. Perhaps it is stated somewhere in the documentation and I missed it? What I understood was that dandi organize made the local copy and naming of files before upload, dandi validate checked the content of the files, and dandi upload did the upload itself, checking that nothing is uploaded twice. Does this help clarify my point of view?
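
For reference, the overall workflow those three commands belong to, as a sketch using standard dandi-cli invocations (the dandiset ID is the one from this thread; the organize step only applies to files not already named per the DANDI layout):

❯ dandi download https://dandiarchive.org/dandiset/000037/draft   # fetch the existing layout and metadata
❯ cd 000037
❯ dandi organize /path/to/updated/nwb/files                       # optional renaming helper
❯ dandi validate .
❯ dandi upload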
