Add classes for BIDS assets #1076
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1076      +/-   ##
==========================================
+ Coverage   88.53%   88.84%   +0.31%
==========================================
  Files          73       78       +5
  Lines        9295     9459     +164
==========================================
+ Hits         8229     8404     +175
+ Misses       1066     1055      -11
@yarikoptic The test
Suggestions?
@jwodder - this might be a good time to make `dandi ls` not use flatdata but the asset structure, rendered as flat (since it is a dictionary, that should still work, and it could come with a default set of filters). That should be a separate PR though. Out of curiosity, since `dandi ls` can work on a remote asset, how does it do flat metadata for that?
Thanks @jwodder - in that case it may be time to converge on a single model (the schemadata one).
I guess disable
BTW, I liked the terminology, so maybe we could document it somewhere.
Later we might need to make some
Initial review/comments.
Overall I think it is a good direction. I did leave my desires for some code (e.g. in zarr) which was merely moved but apparently is not unittested at all. If not adding tests here, please create and self-assign dedicated issues for those pieces. I have initiated a separate #1096 where I would like to decide on the `get_metadata` aspect. Just wanted to submit this portion of the review first.
dandi/files/__init__.py (outdated)
"""
.. versionadded:: 0.36.0

This module defines functionality for working with local files & directories
- This module defines functionality for working with local files & directories
+ This package defines functionality for working with local files & directories
to be more precise, in that `files/` is now upgraded to a package from a single-file module?
See 8e15843.
try:
    df = dandi_file(p, dandiset_path, bids_dataset_description=bidsdd)
except UnknownAssetError:
    if (p / BIDS_DATASET_DESCRIPTION).exists():
Please add a comment explaining when such an `UnknownAssetError` is expected to happen.
Also, this code block is identical (but not test-covered) to the one above within the prior `elif` condition for when it is "the top of the dandiset", if I get it right... But then (because it is not clear when the exception is to happen), maybe that condition above could be adjusted so that, e.g., we first condition on having `BIDS_DATASET_DESCRIPTION`, and then something happens "differently" depending on whether it is the top of the dandiset or not?
Also, maybe it would allow us to catch/report a case unsupported ATM (AFAIK?) when a BIDS dataset is embedded somewhere within a DANDI dataset but there is no `BIDS_DATASET_DESCRIPTION` on top. ATM with this code we are more lenient and seem to be happily associating with nested BIDS datasets (correctly) even if not at the top of the dandiset.
> maybe that condition above could be adjusted so that, e.g., we first condition on having BIDS_DATASET_DESCRIPTION, and then something happens "differently" depending on whether it is the top of the dandiset or not?

That would lead to wrong behavior if a Zarr contained a `dataset_description.json`.
> Also, maybe it would allow us to catch/report a case unsupported ATM (AFAIK?) when a BIDS dataset is embedded somewhere within a DANDI dataset but there is no BIDS_DATASET_DESCRIPTION on top.

If there's no `dataset_description.json`, how would we know there's a BIDS dataset there?
Comment added: a654c8b.
> > Also, maybe it would allow us to catch/report a case unsupported ATM (AFAIK?) when a BIDS dataset is embedded somewhere within a DANDI dataset but there is no BIDS_DATASET_DESCRIPTION on top.
>
> If there's no `dataset_description.json`, how would we know there's a BIDS dataset there?

I was not clear. In "there is no BIDS_DATASET_DESCRIPTION on top" I meant that it is "on top of the dandiset". I.e., when we have `dandiset.yaml` and `rawdata/dataset_description.json` but no top-level `dataset_description.json`, i.e. a BIDS dataset is embedded without the entire dandiset being a BIDS dataset.
In a case like that, the contents of `rawdata/` should still be recognized as part of a BIDS dataset.
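The association logic being discussed can be pictured as a walk up the directory tree. This is a hypothetical sketch (the function name and structure are invented for illustration, not dandi-cli's actual implementation): a path is matched to the nearest enclosing directory containing `dataset_description.json`, stopping the search at the Dandiset root, so a `rawdata/` BIDS dataset is found even when the Dandiset itself is not one.

```python
import tempfile
from pathlib import Path
from typing import Optional

def find_bids_root(path: Path, dandiset_path: Path) -> Optional[Path]:
    """Return the closest directory at or above `path` (within the Dandiset)
    containing dataset_description.json, or None if there is none."""
    for candidate in [path, *path.parents]:
        if (candidate / "dataset_description.json").is_file():
            return candidate
        if candidate == dandiset_path:
            # Don't search above the Dandiset root
            break
    return None

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "rawdata" / "sub-01").mkdir(parents=True)
    (root / "rawdata" / "dataset_description.json").write_text("{}")
    # A path inside rawdata/ is associated with the embedded BIDS dataset:
    print(find_bids_root(root / "rawdata" / "sub-01", root) == root / "rawdata")
    # The Dandiset root itself is not a BIDS dataset here:
    print(find_bids_root(root, root) is None)
```

Both checks print `True`; a Zarr directory would need to be excluded before this walk, per the earlier objection about a Zarr containing a `dataset_description.json`.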
entries within it.
"""
if self.is_dir():
    return sum(p.size for p in self.iterdir())
So this is just the immediate size, i.e., not recursively adding the sizes?!
Please add a unittest for this case with some nested folder.
It adds up the `.size` properties of the `LocalZarrEntry`s inside, so it does end up being recursive.
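A toy illustration of why summing only the immediate entries is still recursive (the `Entry` class and byte counts here are invented, not dandi-cli's actual `LocalZarrEntry`): each directory child's own `size` property sums *its* children in turn, so the recursion happens through the property lookups.

```python
class Entry:
    """Minimal stand-in for a directory-tree entry with a recursive size."""

    def __init__(self, size=0, children=None):
        self._size = size
        self.children = children or []

    def is_dir(self):
        return bool(self.children)

    @property
    def size(self):
        if self.is_dir():
            # Summing immediate entries recurses via each child's own .size
            return sum(child.size for child in self.children)
        return self._size

# A 10-byte file plus a nested folder holding 5- and 7-byte files:
tree = Entry(children=[Entry(10), Entry(children=[Entry(5), Entry(7)])])
print(tree.size)  # → 22
```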
Tested: b43612d
else:
    return Digest(
        algorithm=DigestType.md5, value=get_digest(self.filepath, "md5")
    )
I think we have unittests for digesting Zarrs; please unittest this as well.
Tested: b43612d
    files=files,
)

return dirstat(self.filetree)
Seems to need unittesting.
Tested: b43612d
See commit 3c26f38.
@yarikoptic This PR eliminates the need for the special handling of BIDS assets during upload introduced in #1011 and #1080. The latter PR causes an entire upload to fail early if it contains any invalid BIDS datasets, whereas this PR would simply cause invalid BIDS datasets to be treated as normal assets failing validation (i.e., they're not uploaded, and they're marked as "ERROR" in pyout). What behavior should upload have going forward?
I think such behavior (of this PR) is consistent with how we treat other assets, so all good with me!
Convert "flatdata" [1]_ for an asset into raw [2]_ "schemadata" [3]_

.. [1] a flat `dict` mapping strings to strings & other primitive types;
       returned by `get_metadata()`
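To make the "flatdata" vs. "schemadata" distinction concrete, here is a purely illustrative sketch. The field names and the converter are invented for this example, not dandi's actual metadata keys or conversion code: flatdata is a single flat dict of primitives, while schemadata nests the same information into structured objects.

```python
# Invented example keys, not dandi's real metadata fields:
flatdata = {
    "subject_id": "sub-01",
    "species": "House mouse",
    "session_id": "ses-01",
}

def flat_to_nested(flat: dict) -> dict:
    """Toy converter grouping related flat keys into a nested, schema-like
    structure (shape loosely inspired by, but not identical to, dandischema)."""
    return {
        "wasAttributedTo": [
            {
                "identifier": flat["subject_id"],
                "species": {"name": flat["species"]},
            }
        ],
        "wasGeneratedBy": [{"identifier": flat["session_id"]}],
    }

nested = flat_to_nested(flatdata)
print(nested["wasAttributedTo"][0]["species"]["name"])  # → House mouse
```

The real conversion is considerably more involved (type coercion, controlled vocabularies, validation), but the flat-to-nested reshaping is the essence of what the docstring describes.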
I wondered if, were we to unify, we should then just rename `get_metadata` into `get_flatdata` and deprecate `get_metadata` to make it all consistent... but then I realized that `get_flatdata`, lacking the context of "metadata", is somewhat misleading, and `get_flatdata_metadata` is odd. So maybe we should make it `flatmetadata` and `get_flatmetadata`, or better yet `adhocmetadata` and `get_adhocmetadata`... Let's sleep on that; for now it's good, and we could do such a massive RF later.
  iter_upload_spy.assert_not_called()
  # Does validation ignoring work?
- bids_dandiset_invalid.upload(existing="forced", validation="ignore")
+ bids_dandiset_invalid.upload(existing="force", validation="ignore")
How come all of that worked with a wrong value (`forced` instead of `force`)? I would have assumed it was important to specify forcing in this test. Or does it not matter?
I believe `existing` is only checked if there are any local assets being uploaded with the same path as a remote asset. The BIDS sample Dandiset fixtures don't upload anything as part of their setup, so that code was never triggered.
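The explanation above can be sketched as a tiny decision function. Everything here is invented for illustration (dandi-cli's real upload logic is far more involved): the point is that `existing` is only consulted on a path collision, so a bogus value goes unnoticed when nothing collides.

```python
def decide_upload(local_path: str, remote_paths: set, existing: str) -> str:
    """Toy model of per-asset upload decisions; not dandi-cli's real code."""
    if local_path not in remote_paths:
        return "upload"  # no collision, so `existing` is never inspected
    if existing == "force":
        return "reupload"
    if existing == "skip":
        return "skip"
    raise ValueError(f"invalid existing={existing!r}")

# No remote assets exist, so the typo "forced" is never validated:
print(decide_upload("sub-01/a.nii.gz", set(), "forced"))  # → upload
# With a collision, the same typo would raise ValueError:
# decide_upload("sub-01/a.nii.gz", {"sub-01/a.nii.gz"}, "forced")
```

This matches the observed behavior in the test: since the fixture uploaded nothing beforehand, the misspelled `existing="forced"` was silently irrelevant.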
I have a remaining confusion/question above, but overall I would say: let's proceed! Thank you @jwodder!
Closes #1044.
To do:
- `dandi upload` (#1011) and "User notification if datasets are invalid" (#1080)