Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add classes for BIDS assets #1076

Merged
merged 17 commits into from
Aug 9, 2022
Merged

Add classes for BIDS assets #1076

merged 17 commits into from
Aug 9, 2022

Conversation

jwodder
Copy link
Member

@jwodder jwodder commented Jul 19, 2022

Closes #1044.

To do:

@jwodder jwodder added the BIDS label Jul 19, 2022
@codecov
Copy link

codecov bot commented Jul 19, 2022

Codecov Report

Attention: Patch coverage is 85.07282% with 123 lines in your changes missing coverage. Please review.

Project coverage is 88.84%. Comparing base (84f3ead) to head (e66abfd).
Report is 956 commits behind head on master.

Files Patch % Lines
dandi/files/bases.py 77.95% 54 Missing ⚠️
dandi/files/zarr.py 85.38% 44 Missing ⚠️
dandi/files/__init__.py 78.48% 17 Missing ⚠️
dandi/files/bids.py 93.81% 6 Missing ⚠️
dandi/files/_private.py 95.83% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1076      +/-   ##
==========================================
+ Coverage   88.53%   88.84%   +0.31%     
==========================================
  Files          73       78       +5     
  Lines        9295     9459     +164     
==========================================
+ Hits         8229     8404     +175     
+ Misses       1066     1055      -11     
Flag Coverage Δ
unittests 88.84% <85.07%> (+0.31%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jwodder
Copy link
Member Author

jwodder commented Aug 4, 2022

@yarikoptic The test dandi/cli/tests/test_ls.py::test_ls_bids_file is currently failing due to how I moved some of the code around, and I do not want to move the code back in order to make it pass.

  • First, note that the code currently works with metadata in two forms:
    1. A flat dict mapping strings to strings & other primitive types; let us call this "flatdata". This is the format returned by the dandi.metadata.get_metadata() function, and I believe it was the format used for asset data on the Archive prior to the introduction of dandischema
    2. A dandischema object or a dict of the fields thereof (containing nested objects and/or dicts); let us call this "schemadata". This is the format returned by dandi.metadata.nwb2asset().
  • "Flatdata" is converted into "schemadata" by the prepare_metadata() function (called metadata2asset() prior to Factor out common fields in nwb2asset() and get_default_metadata() #1088).
  • When uploading, metadata for NWB files is generated using nwb2asset(), which calls get_metadata() and passes the results to prepare_metadata(). Non-NWB files get their metadata from get_default_metadata(), which returns "schemadata" directly and does not call get_metadata().
  • When Horea added preliminary support for BIDS metadata, he simply added a piece of code in get_metadata() that checks whether the file belongs to a BIDS dataset and, if so, runs the BIDS validation routine on just that one file, returning some BIDS-specific "flatdata" instead of the NWB-specific "flatdata".
  • When dandi ls displays metadata for a file, it uses nwb2asset() if the --schema option was given or get_metadata() if it was not.
  • The docstrings in dandi ls imply that the only files it should operate on are NWB files, but if a path to a non-NWB file is supplied directly, dandi ls will treat it like an NWB file. This is what the failing test does.
  • This PR currently moves the handling of BIDS metadata extraction to a set of classes in dandi.files. In this new setup, files are only recognized as BIDS files if they are next to or underneath a dataset_description.json file found while traversing directories in search of files to upload. BIDS asset classes then determine their metadata by calling the BIDS validation routine on all files in a BIDS dataset once per BIDS dataset, saving the results, and combining the relevant parts per asset with the asset metadata determined through other means.
  • As part of this PR, I deleted Horea's code in get_metadata() for extracting BIDS metadata, as that is now handled by the BIDS asset classes. However, this causes the abovementioned test to fail, as it checks that running dandi ls on an asset in a BIDS dataset outputs certain BIDS metadata fields.
  • I can't simply make dandi ls use the dandi.files classes for fetching metadata, as those only produce "schemadata", yet dandi ls produces "flatdata" by default.
  • I can't add Horea's code back to get_metadata(), as then NWBs in BIDS datasets would lack NWB metadata, and even if I fixed that, we'd be running the BIDS validation on every BIDS asset twice.

Suggestions?

@satra
Copy link
Member

satra commented Aug 4, 2022

@jwodder - this might be a good time to make dandi ls not use flatdata but the asset structure, rendered as flat (since it is a dictionary, that should still work, and could come with a default set of filters). that should be a separate PR though. out of curiosity, since dandi ls can work on a remote asset, how does it do flat metadata for that?

@jwodder
Copy link
Member Author

jwodder commented Aug 4, 2022

@satra

out of curiosity, since dandi ls can work on a remote asset, how does it do flat metadata for that?

dandi ls on a remote asset just shows the "schemadata" from the server.

@satra
Copy link
Member

satra commented Aug 4, 2022

thanks @jwodder - in that case it may be time to converge on a single model (the schemadata one).

@yarikoptic
Copy link
Member

Suggestions?

I guess disable ls test for now if needed. Or is there any other specific question ATM? I am yet to review to provide more meaningful overall feedback.

@yarikoptic
Copy link
Member

BTW I liked terminology, so we could may be DOC it somewhere

  • flatdata - a flat dict of ad-hoc key: value pairs
  • raw_metadata (as we have *_raw_metadata) - dict representation of dandischema metadata
  • metadata -- dandischema object

Later we might need to make some flat_metadata for ls (or elsewhere) which would be flat dict with flattened keys like "digest.dandi:dandi-etag" or alike.

Copy link
Member

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial review/comments.

Overall I think it is a good direction. I did leave my desires for some code (eg. in zarr) which was merely moved but apparently is not unittested at all. If not to add tests here, please create and self-assign dedicated issues for those pieces. I have initiated a separate #1096 where I would like to decide on get_metadata aspect . Just wanted to submit this portion of the review first.

"""
.. versionadded:: 0.36.0

This module defines functionality for working with local files & directories
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
This module defines functionality for working with local files & directories
This package defines functionality for working with local files & directories

to be more precise in that files/ is now upgraded to a package from a single file module?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See 8e15843.

try:
df = dandi_file(p, dandiset_path, bids_dataset_description=bidsdd)
except UnknownAssetError:
if (p / BIDS_DATASET_DESCRIPTION).exists():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add a comment explaining when such UnknownAssetError is expected to happen.
Also this code block is identical (but not test covered) to the one above within prior elif condition for when it is "the top of the dandiset" if I get it right... but then (because not clear when exception is to happen) may be just that condition above could be adjusted that e.g. we first condition on having BIDS_DATASET_DESCRIPTION and then something happens "differently" depending on either it is a top of dandiset or not?

Also may be it would allow to catch/report an unsupported ATM (AFAIK?) case when BIDS dataset is embedded somewhere within DANDI dataset but there is no BIDS_DATASET_DESCRIPTION on top. ATM with this code we are more lenient and seems to be happily associating with nested bids datasets (correctly) even if not on top of dandiset

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may be just that condition above could be adjusted that e.g. we first condition on having BIDS_DATASET_DESCRIPTION and then something happens "differently" depending on either it is a top of dandiset or not?

That would lead to wrong behavior if a Zarr contained a dataset_description.json.

Also may be it would allow to catch/report an unsupported ATM (AFAIK?) case when BIDS dataset is embedded somewhere within DANDI dataset but there is no BIDS_DATASET_DESCRIPTION on top.

If there's no dataset_description.json, how would we know there's a BIDS dataset there?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment added: a654c8b.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also may be it would allow to catch/report an unsupported ATM (AFAIK?) case when BIDS dataset is embedded somewhere within DANDI dataset but there is no BIDS_DATASET_DESCRIPTION on top.

If there's no dataset_description.json, how would we know there's a BIDS dataset there?

I was not clear. In " there is no BIDS_DATASET_DESCRIPTION on top" I meant that it is "on top of the dandiset". I.e. when we have dandiset.yaml and rawdata/dataset_description.json but no dataset_description.json, i.e. BIDS dataset is embedded without entire dandiset being a BIDS dataset.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a case like that, the contents of rawdata/ should still be recognized as part of a BIDS dataset.

dandi/files/__init__.py Show resolved Hide resolved
dandi/files/__init__.py Outdated Show resolved Hide resolved
dandi/files/__init__.py Show resolved Hide resolved
dandi/tests/test_files.py Show resolved Hide resolved
entries within it.
"""
if self.is_dir():
return sum(p.size for p in self.iterdir())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this is just an immediate size , ie. nor recursively adding the sizes?!

please add a unittest for this case with some nested folder

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It adds up the .size properties of LocalZarrEntrys inside, so it does end up being recursive.

Copy link
Member Author

@jwodder jwodder Aug 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested: b43612d

else:
return Digest(
algorithm=DigestType.md5, value=get_digest(self.filepath, "md5")
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have unittests for digesting zarrs, please unittest this as well.

Copy link
Member Author

@jwodder jwodder Aug 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested: b43612d

files=files,
)

return dirstat(self.filetree)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems needs unittesting

Copy link
Member Author

@jwodder jwodder Aug 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested: b43612d

dandi/metadata.py Show resolved Hide resolved
@jwodder
Copy link
Member Author

jwodder commented Aug 5, 2022

@yarikoptic

BTW I liked terminology, so we could may be DOC it somewhere

See commit 3c26f38.

@jwodder jwodder added the minor Increment the minor version when merged label Aug 5, 2022
@jwodder
Copy link
Member Author

jwodder commented Aug 5, 2022

@yarikoptic This PR eliminates the need for the special handling of BIDS assets during upload introduced in #1011 and #1080. The latter PR causes an entire upload to fail early if it contains any invalid BIDS datasets, whereas this PR would simply cause invalid BIDS datasets to be treated as normal assets failing validation (i.e., they're not uploaded, and they're marked as "ERROR" in pyout). What behavior should upload have going forwards?

@yarikoptic
Copy link
Member

whereas this PR would simply cause invalid BIDS datasets to be treated as normal assets failing validation (i.e., they're not uploaded, and they're marked as "ERROR" in pyout). What behavior should upload have going forwards?

I think such behavior (of this PR) is consistent with how we treat other assets, so all good with me!

Convert "flatdata" [1]_ for an asset into raw [2]_ "schemadata" [3]_

.. [1] a flat `dict` mapping strings to strings & other primitive types;
returned by `get_metadata()`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wondered if we are to unify should we just then rename the get_metadata into get_flatdata and deprecate get_metadata to make it all consistent... but then I realized that get_flatdata excluding the context of "metadata" is somewhat misleading and get_flatdata_metadata is odd. So may be we should make it flatmetadata and get_flatmetadata or better even adhocmetadata and get_adhocmetadata.... let's sleep on that... for now good and we could do such massive RF later ...

@jwodder jwodder marked this pull request as ready for review August 8, 2022 14:45
iter_upload_spy.assert_not_called()
# Does validation ignoring work?
bids_dandiset_invalid.upload(existing="forced", validation="ignore")
bids_dandiset_invalid.upload(existing="force", validation="ignore")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come all of that worked with a wrong value (forced instead of force) ? I would have assumed it was important to specify forcing in this test. Or it doesn't matter?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe existing is only checked if there are any local assets being uploaded with the same path as a remote asset. The BIDS sample Dandiset fixtures don't upload anything as part of their setup, so that code was never triggered.

@yarikoptic
Copy link
Member

I have remaining confusion question above, but overall I would say -- let's proceed! Thank you @jwodder !

@yarikoptic yarikoptic merged commit 7c42ed9 into master Aug 9, 2022
@yarikoptic yarikoptic deleted the gh-1044 branch August 9, 2022 16:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BIDS minor Increment the minor version when merged Python API
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RF "files.py" validation/metadata-loading to support BIDS
3 participants