Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Allow dandi organize to force usage of session_id (NWB Assets) #1000

Closed
CodyCBakerPhD opened this issue May 4, 2022 · 8 comments
Closed

Comments

@CodyCBakerPhD
Copy link
Contributor

CodyCBakerPhD commented May 4, 2022

Hey all, I'd like to be able to use dandi organize to force usage of both subject ID and session ID when generating a unique filename.

E.g., I want to get sub-subj1_ses-foobar_ecephys.nwb but all I get is after standard dandi organize is sub-subj1_ecephys.nwb

This came about because I am iteratively uploading only a single NWBFile at a time to a dandiset that is being downloaded lazily (dandi download ... --download dandiset.yaml) in order to reduce the amount of time my very large files spend on the processing server.

@yarikoptic
Copy link
Member

Yet to check in detail, but I think it is at large duplicate of #69

@CodyCBakerPhD
Copy link
Contributor Author

@yarikoptic Specifically the last comment by @satra in that thread (#69 (comment), the part about 'partial access' to data) kind of touches on a part of why I'm requesting this...

But the issue would still exist for the very first session of each subject I upload, which acts as if the subject ID alone is sufficient to be 'unique', but really it needs the full session ID added on for that. But I won't have two files present at time of organize to force the automated usage of the session ID, so a manual control for that would be nice.

@waxlamp
Copy link
Member

waxlamp commented Nov 3, 2022

@AlmightyYakob and I (with @CodyCBakerPhD's help) discovered that this may be leading to bad behavior in general: apparently, repeated upload of a disambiguated filename of the kind Cody is talking about here led to multiple assets being stored in the dandiset at the exact same pathname. This resulted in two assets in the dandiset only one of which was visible in the file browser.

We have manually renamed the later version of the duplicate pathname and communicated with Cory about it, and we expect that under the new AssetPath work in the archive that this shouldn't happen again, but it may instead result in an error about the target file already existing in the dandiset. We may find out soon as Cody's CI runs and uploads files to that folder again.

@yarikoptic
Copy link
Member

@waxlamp I am not certain on what is this and what would be the bad behavior in

this may be leading to bad behavior in general

since

led to multiple assets being stored in the dandiset at the exact same pathname

sounds just as a bug in dandi-archive backend to not replace an asset with the same pathname upon an upload of a new asset. In other words -- it is native to upload different version for the same path. Could you please elaborate a little more and what in particular would you like to be changed in dandi-cli?

@satra
Copy link
Member

satra commented Nov 3, 2022

led to multiple assets being stored in the dandiset at the exact same pathname.

i agree with @yarikoptic . this should never happen according to our design. if it's happening, something broke. path is a unique key in the asset space. there can be only one.

perhaps there was a regression at some point.

@satra
Copy link
Member

satra commented Nov 3, 2022

the place where this gets tricky is partial uploads and perhaps where @CodyCBakerPhD is running into an issue.

dandi organize currently assumes all your data are local for path disambiguation. there is an open issue i think to support partial dataset disambiguation, i.e. take into account both local and remote to do disambiguation and also change paths of remote assets as a result. this has not been implemented.

ping @jwodder and @yarikoptic about the partial disambiguation step. also see #69 and #47

@waxlamp
Copy link
Member

waxlamp commented Nov 3, 2022

I don't know specifically what caused multiple assets to be coexisting in the database with identical pathnames. Perhaps it was a dandi-archive bug, or perhaps it was intended behavior, with a future GC capability intended to remove orphaned versions. This came up with @AlmightyYakob was applying the new AssetPath checks and Cody's dandiset was the one place it came up. If the duplicate asset path doesn't appear again during regular use of Cody's CI, then I believe we can chalk it up to something that was buggy and no longer is; if we run into difficulties, we can decide at that point what the proper behavior of the archive ought to be, and adjust the logic to account for that.

That is to say, no action needed at the moment nor any bugs to address, and we're just waiting on more data from the CI outcomes.

@yarikoptic
Copy link
Member

I stay with opinion that it is at large a duplicate of #69 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants