Add support for .zarr "files" upload #852

Closed
yarikoptic opened this issue Dec 15, 2021 · 72 comments · Fixed by #853


jwodder commented Dec 16, 2021

@dchiquito Questions on the zarr upload API:

  • The docs for POST /api/zarr/{zarr_id}/upload/ say "The number of files being uploaded must be less than some experimentally defined limit". What is the current/recommended limit?
  • Number 3 in "Requirements" states "The CLI uses some kind of tree hashing scheme to compute a checksum for the entire zarr archive," but the "Upload flow" section makes no mention of such a checksum. Does the client have to compute a checksum or not, and, if it does, where is it used?
  • The document says "The asset metadata will contain the information required to determine if an asset is a normal file blob or a zarr file." Exactly what information is that, and is it added by the client or the server?
  • For POST /zarr/, what exactly should the "name" field in the request body be set to?
  • ".checksum files format" says "Each zarr file and directory in the archive has a path and a checksum (md5). For files, this is simply the ETag." So is the checksum for a file an md5 or a DANDI e-tag? If the latter, why are we using different digests for files on their own versus in directories?
  • "Upload flow" states "The client calculates the MD5 of each file." and "The client sends the paths+MD5s to ...", but the field name in the request body for POST /zarr/{zarr_id}/upload/ is "etag", not "md5". Which digest is supposed to be used?
    • Similarly, the docs for POST /api/zarr/{zarr_id}/upload/ say, "Requires a list of file paths and ETags (md5 checksums)". E-tags and MD5 checksums are not the same thing.
  • Unlike POST /uploads/initialize/, POST /zarr/{zarr_id}/upload/ does not return information on different parts of files. Does this mean that each file in a zarr directory is to be uploaded in a single request? What if a file exceeds S3's part size limit of 5 GB?
  • The Swagger docs don't describe the request body format for DELETE /api/zarr/{zarr_id}/files/.
  • How do you download a Zarr?


satra commented Dec 16, 2021

use the zarr python library to open the nested directory store (i think to start with we will only support this backend). it will check consistency. files will not necessarily have the .zarr extension; in fact ngff uses the .ngff extension. also, an ngff validator is not in place at the moment, so zarr is the closest. i've posted this issue seeking additional clarity: zarr-developers/zarr-python#912


jwodder commented Dec 16, 2021

@dchiquito I'm trying to upload a trivial test zarr with the following code:

#!/usr/bin/env python3
import json
import os
from pathlib import Path
import sys
from dandi.dandiapi import DandiAPIClient, RESTFullAPIClient
from dandi.support.digests import get_dandietag
from dandi.utils import find_files

dandiset_id = sys.argv[1]
zarrdir = Path(sys.argv[2])
if zarrdir.suffix != ".zarr" or not zarrdir.is_dir():
    sys.exit(f"{zarrdir} is not a zarr directory")

with DandiAPIClient.for_dandi_instance(
    "dandi-staging", token=os.environ["DANDI_API_KEY"]
) as client:
    r = client.post("/zarr/", json={"name": zarrdir.name})
    zarr_id = r["zarr_id"]
    zfiles = list(map(Path, find_files(r".*", str(zarrdir))))
    upload_body = []
    for zf in zfiles:
        upload_body.append({
            "path": zf.relative_to(zarrdir).as_posix(),
            "etag": get_dandietag(zf).as_str(),
        })
    r = client.post(f"/zarr/{zarr_id}/upload/", json=upload_body)
    with RESTFullAPIClient("http://nil.nil") as storage:
        for upspec in r:
            with (zarrdir / upspec["path"]).open("rb") as fp:
                storage.put(upspec["upload_url"], data=fp, json_resp=False)
    r = client.post(f"/zarr/{zarr_id}/upload/complete/")
    #print(json.dumps(r, indent=4))
    d = client.get_dandiset(dandiset_id, "draft", lazy=False)
    r = client.post(
        f"{d.version_api_path}assets/",
        json={"metadata": {"path": zarrdir.name}, "zarr_id": zarr_id},
    )
    print(json.dumps(r, indent=4))

but I'm getting a 403 response when PUTting the files to S3:

<Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>Y16M1QN13AM6769S</RequestId><HostId>4uYw6UzqgIdTUIhqRXqW/PDmD6lXmIv53YYPHJFYv/u2rKi62895bV6jSCqrJCKF3qhJZJI1RCw=</HostId></Error>


jwodder commented Dec 16, 2021

@satra

files will not necessarily have zarr extension

Then how do we tell whether something is a zarr directory or not?


satra commented Dec 16, 2021

Then how do we tell whether something is a zarr directory or not?

open it as a NestedDirectoryStore in read mode with the zarr-python library. it should be able to give you information about groups and shapes, and metadata. we would essentially validate based on those and for the moment write our own requirements for the ngff files till the ome-zarr-py library implements those. but since this is generic zarr to start with, i would use the xarray or zarr python libraries to read the dataset.
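For concreteness, a minimal sketch of that check (assuming zarr v2.x and its NestedDirectoryStore; the path is hypothetical):

import zarr

# hypothetical path; open in read mode so nothing is created on disk
store = zarr.NestedDirectoryStore("sub-01_sample-178_spim.ngff")
try:
    root = zarr.open(store, mode="r")
    print(root.info)          # groups/arrays, shapes, chunking
    print(dict(root.attrs))   # user-level metadata
except Exception as exc:
    print(f"not a readable zarr: {exc}")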


jwodder commented Dec 16, 2021

@satra So we just try to open every single directory with the zarr library and see if it succeeds?


yarikoptic commented Dec 16, 2021

@satra:

use the zarr python library to open the nested directory store (i think to start with we will only support this backend). it will check consistency. files will not necessarily have the .zarr extension; in fact ngff uses the .ngff extension. also, an ngff validator is not in place at the moment, so zarr is the closest. i've posted this issue seeking additional clarity: zarr-developers/zarr-python#912

@jwodder

So we just try to open every single directory with the zarr library and see if it succeeds?

  1. I think we should formalize "supported zarr directory name extensions" to a limited set one way or another
  2. We can't rely on "sensing" each directory alone, since that by itself would probably not tell us (if it errors out) whether a directory contains a (broken) zarr archive or just a collection of files
  3. But we must not allow upload of "heavy trees", as any zarr is likely to be -- we would flood the archive with thousands of assets which would only need to be removed

Thus I think we should

  1. formalize "supported zarr directory name extensions", e.g. to a set of .zarr and .ngff (see the sketch after this list)
  2. for any of those, test whether they are "ok" by trying to open them. If not -- error/skip
  3. if we detect a file tree deeper than some N directories down (5?) -- error out. This might need to become an option somewhere. In the current dandi organize'd scheme (sub-*/{files}) we have only 1 level down. In BIDS (sub-*/ses-*/{modality}/{files}/potentially-onemore) we have 4.
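A rough sketch of checks 1 and 3 (the names and the depth limit here are illustrative, not actual dandi-cli code):

from pathlib import Path

ZARR_EXTENSIONS = {".zarr", ".ngff"}   # per point 1
MAX_DEPTH = 5                          # per point 3; would likely become an option

def looks_like_zarr_dir(path: Path) -> bool:
    return path.is_dir() and path.suffix in ZARR_EXTENSIONS

def check_tree_depth(root: Path, limit: int = MAX_DEPTH) -> None:
    for p in root.rglob("*"):
        if len(p.relative_to(root).parts) > limit:
            raise ValueError(f"{root}: tree is nested deeper than {limit} levels")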

edit 1: I think it is ok to add zarr as a dependency

It adds only 2 dependencies -- asciitree and numcodecs (this one seems a bit heavy though) -- and zarr is in conda-forge:
$> pip install zarr
Collecting zarr
  Using cached zarr-2.10.3-py3-none-any.whl (146 kB)
Requirement already satisfied: numpy>=1.7 in ./venvs/dev3/lib/python3.9/site-packages (from zarr) (1.21.4)
Requirement already satisfied: fasteners in ./venvs/dev3/lib/python3.9/site-packages (from zarr) (0.16.3)
Collecting asciitree
  Using cached asciitree-0.3.3-py3-none-any.whl
Collecting numcodecs>=0.6.4
  Using cached numcodecs-0.9.1-cp39-cp39-manylinux2010_x86_64.whl (6.4 MB)
Requirement already satisfied: six in ./venvs/dev3/lib/python3.9/site-packages (from fasteners->zarr) (1.16.0)
Installing collected packages: numcodecs, asciitree, zarr
Successfully installed asciitree-0.3.3 numcodecs-0.9.1 zarr-2.10.3


satra commented Dec 16, 2021

do formalize "supported zarr directory name extensions" e.g. to a set of .zarr and .ngff

i think for the moment this would be fine.


satra commented Dec 16, 2021

zarr will also require compression codecs to be added.


jwodder commented Dec 16, 2021

@yarikoptic @satra Note that zarr seems to ignore files with invalid/unexpected names, and a *.zarr directory containing only such files with no .zarray or .zgroup is treated as an empty group. How should directories like this be handled?


satra commented Dec 16, 2021

a *.zarr directory containing only such files with no .zarray or .zgroup is treated as an empty group. How should directories like this be handled?

for the moment, let's say no empty groups allowed.


jwodder commented Dec 16, 2021

@yarikoptic

  • When uploading a Zarr directory, should all files within it be uploaded, even "extra"/unrecognized ones? Note that Zarr does not seem to provide a way to only list recognized files.
  • Should --allow-any-path have any effect on the treatment of Zarr directories, particularly invalid ones? If not, should there be a way to force processing of an invalid Zarr directory?

@yarikoptic

a *.zarr directory containing only such files with no .zarray or .zgroup is treated as an empty group. How should directories like this be handled?

for the moment, let's say no empty groups allowed.

FWIW, what we pretty much need is a zarr_validate (to complement pynwb_validate) which would do all those checks. It would then be interfaced in validate and upload.

Should --allow-any-path have any effect on the treatment of Zarr directories, particularly invalid ones? If not, should there be a way to force processing of an invalid Zarr directory?

if, per the above, we just add zarr_validate, then since we do allow uploading without validation, the uniform way would be to "allow" users to upload invalid zarrs if they say so via --validation [skip|ignore]


jwodder commented Dec 16, 2021

@yarikoptic So, in the absence of or prior to validation, would we just treat any directory with a .zarr or .ngff extension as a Zarr directory?

Also, how should metadata be determined for Zarr assets? (cc @satra)

@yarikoptic

@yarikoptic So, in the absence of or prior to validation, would we just treat any directory with a .zarr or .ngff extension as a Zarr directory?

Yes


satra commented Dec 16, 2021

for the moment i would limit metadata extraction similar to the bids data, so based on names rather than internal metadata. in the future once we get better ngff metadata we will write additional extractors. i can help with the metadata extraction beyond basic metadata (size, encoding type, subject id). for dandi this would be part of bids datasets, so we will have additional info available for sticking into the asset metadata from participants.tsv and samples.tsv files.


jwodder commented Dec 16, 2021

@yarikoptic FYI: I'm implementing the metadata & validation for Zarr by giving NWB, Zarr, and generic files their own classes with metadata and validation methods; however, fscacher doesn't currently support caching of instance methods, so some caching is going to have to be disabled for now.

@yarikoptic

hm, do you see how fscacher could gain support for bound methods? if not, I wonder if we shouldn't just concentrate logic in @staticmethods of such classes which would be explicitly passed a path instead of an instance?

jwodder mentioned this issue Dec 16, 2021

jwodder commented Dec 16, 2021

@yarikoptic We would have to add a variant of memoize_path() that gets the affected path from a given attribute of the decorated method's self parameter.
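A minimal sketch of such a variant (hypothetical, not the actual fscacher API; a real implementation would also fingerprint the file at that path so the cache invalidates when it changes):

import functools

def memoize_path_method(path_attr):
    """Cache a method's result keyed on the path stored in `self.<path_attr>`."""
    def decorator(method):
        cache = {}
        @functools.wraps(method)
        def wrapper(self, *args, **kwargs):
            key = (getattr(self, path_attr), args, tuple(sorted(kwargs.items())))
            if key not in cache:
                cache[key] = method(self, *args, **kwargs)
            return cache[key]
        return wrapper
    return decorator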

@yarikoptic

right -- sounds good! somehow it didn't occur to me ;)

@dchiquito

The docs for POST /api/zarr/{zarr_id}/upload/ say "The number of files being uploaded must be less than some experimentally defined limit". What is the current/recommended limit?

Currently there is no limit. Now that we have the code up on a real server I need to do some experimentation to determine that value. The limit is mostly there to enforce that no request takes longer than 30 seconds, since Heroku will forcibly cancel any requests that exceed that timeout.

Number 3 in "Requirements" states "The CLI uses some kind of tree hashing scheme to compute a checksum for the entire zarr archive," but the "Upload flow" section makes no mention of such a checksum. Does the client have to compute a checksum or not, and, if it does, where is it used?

The client would need to calculate the checksum to do integrity checking of the zarr upload. I wrote https://github.com/dandi/dandi-api/blob/master/dandiapi/api/zarr_checksums.py to handle all of that logic, and I put it in dandi-api for now for quicker iteration. It should be moved to a common location soon so that the CLI can use it as well.
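As a generic illustration of a tree-hashing scheme over (path, md5) pairs -- not necessarily the exact serialization used in zarr_checksums.py:

from hashlib import md5
from pathlib import Path

def file_md5(path: Path) -> str:
    h = md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def tree_checksum(root: Path) -> str:
    # digest the sorted (relative path, md5) listing of all files under the zarr root
    entries = sorted(
        (p.relative_to(root).as_posix(), file_md5(p))
        for p in root.rglob("*") if p.is_file()
    )
    return md5(repr(entries).encode()).hexdigest()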

The document says "The asset metadata will contain the information required to determine if an asset is a normal file blob or a zarr file." Exactly what information is that, and is it added by the client or the server?

That seems ambiguous to me at the moment. The server is not adding that metadata right now. @satra any preference on where/how asset metadata is updated?

For POST /zarr/, what exactly should the "name" field in the request body be set to?

I would assume that would be the name of the directory containing the zarr data, unless zarr metadata contains a more descriptive name.

".checksum files format" says "Each zarr file and directory in the archive has a path and a checksum (md5). For files, this is simply the ETag." So is the checksum for a file an md5 or a DANDI e-tag? If the latter, why are we using different digests for files on their own versus in directories?
"Upload flow" states "The client calculates the MD5 of each file." and "The client sends the paths+MD5s to ...", but the field name in the request body for POST /zarr/{zarr_id}/upload/ is "etag", not "md5". Which digest is supposed to be used?
Similarly, the docs for POST /api/zarr/{zarr_id}/upload/ say, "Requires a list of file paths and ETags (md5 checksums)". E-tags and MD5 checksums are not the same thing.

I would say we are using simple MD5, and that for these purposes MD5 and ETag are the same thing. Files in the zarr archive are limited to 5GB so that they can use the simple upload endpoint, and for files uploaded simply (as opposed to multipart), the ETag is the MD5 of the file. DANDI ETags are defined to be the ETags of multipart uploads, which are the MD5 of the part MD5s with a suffix giving the number of parts.
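To illustrate the distinction (simplified; real S3 multipart parts must be at least 5 MB except the last, the sizes here are just for illustration):

from hashlib import md5

data = b"example bytes" * 1000

# ETag of a simple (non-multipart) S3 upload is the plain MD5 of the object
simple_etag = md5(data).hexdigest()

# ETag of a multipart upload is the MD5 of the concatenated binary part digests,
# with "-<number of parts>" appended; the dandi-etag is of this multipart form
parts = [data[:6000], data[6000:]]
multipart_etag = (
    md5(b"".join(md5(p).digest() for p in parts)).hexdigest() + f"-{len(parts)}"
)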

Unlike POST /uploads/initialize/, POST /zarr/{zarr_id}/upload/ does not return information on different parts of files. Does this mean that each file in a zarr directory is to be uploaded in a single request? What if a file exceeds S3's part size limit of 5 GB?

Correct. Multipart upload requires substantially more engineering, so the executive decision was made to cap zarr component files at 5GB. A zarr is relatively easy to rechunk into smaller pieces if that limit is exceeded.

The Swagger docs don't describe the request body format for DELETE /api/zarr/{zarr_id}/files/.

Sorry, my bad. dandi/dandi-archive#648

How do you download a Zarr?

That is done directly from S3 using the zarr library. You should be able to pass the S3 path of the zarr directory and zarr will read that directly from S3.
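For example, something along these lines should work (the bucket/prefix are placeholders; assumes s3fs and zarr are installed):

import s3fs
import zarr

fs = s3fs.S3FileSystem(anon=True)
store = s3fs.S3Map(root="dandi-bucket/zarr/<zarr_id>", s3=fs, check=False)
root = zarr.open(store, mode="r")   # read the zarr hierarchy straight from S3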


satra commented Dec 23, 2021

That seems ambigious to me at the moment. The server is not adding that metadata right now. @satra any preference on where/how asset metadata is updated?

this should be done as is done currently by the CLI for other types of files. the exact details of encodingformat can be worked out in the relevant CLI PR.

@dchiquito

@jwodder I confirm that I am encountering the same error with your script. I will look closer next week.

@dchiquito

@jwodder I got the script working against dandi-api-local-docker-tests by changing the ETag from get_dandietag(zf).as_str() to md5(blob).hexdigest(). I'm still getting the 403 errors in staging though, so I am still investigating.


dchiquito commented Dec 27, 2021

@jwodder I missed a spot with AWS permissions, the staging API has been updated. You need to specify X-Amz-ACL: bucket-owner-full-control as a request header for the upload.

My adaptation of your example looks like this:
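(This assumes the same setup and imports as the script above, plus from hashlib import md5.)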

with DandiAPIClient.for_dandi_instance(
    "dandi-staging", token=os.environ["DANDI_API_KEY"]
) as client:
    r = client.post("/zarr/", json={"name": zarrdir.name})
    zarr_id = r["zarr_id"]
    zfiles = list(map(Path, find_files(r".*", str(zarrdir))))
    upload_body = []
    for zf in zfiles:
        with open(zf, "rb") as f:
            blob = f.read()
        upload_body.append({
            "path": zf.relative_to(zarrdir).as_posix(),
            "etag": md5(blob).hexdigest(), # Simple MD5 instead of dandi-etag
        })
    r = client.post(f"/zarr/{zarr_id}/upload/", json=upload_body)
    with RESTFullAPIClient("http://nil.nil") as storage:
        for upspec in r:
            with (zarrdir / upspec["path"]).open("rb") as fp:
                storage.put(upspec["upload_url"], data=fp, json_resp=False, headers={"X-Amz-ACL": "bucket-owner-full-control"}) # X-Amz-ACL header is required
    r = client.post(f"/zarr/{zarr_id}/upload/complete/")


jwodder commented Jan 3, 2022

@dchiquito

The client would need to calculate the checksum to do integrity checking of the zarr upload.

Could you elaborate on how this "integrity checking" fits into the upload flow? Exactly what does the client compare its computed value against?

@yarikoptic

@dchiquito could you please clarify the above question from @jwodder? In the example you posted above I do not see zarr "file" integrity checking anywhere. Is providing it optional?


jwodder commented Jan 12, 2022

@yarikoptic So when the client is asked to download a Zarr, should it, instead of downloading a monolithic asset, effectively treat the Zarr as a tree of independent blobs, each one of which is separately compared against whatever's on-disk at the final location? What exactly should pyout display when downloading a multifile Zarr?


yarikoptic commented Jan 12, 2022

@yarikoptic So when the client is asked to download a Zarr, should it, instead of downloading a monolithic asset, effectively treat the Zarr as a tree of independent blobs, each one of which is separately compared against whatever's on-disk at the final location?

correct

edit: with an overall checksum comparison across the tree... if it is possible to make it "on the fly", as we do with the digest for an individual file, while traversing in the same order as desired for computing the zarr checksum, that would be awesome!

What exactly should pyout display when downloading a multifile Zarr?

I think for user reporting (pyout) we could consider a zarr file (well -- directory) to be a single asset/file, so % progress would be based on the "total" for that directory. Eventually we might want to return/expose progress in terms of the # of files within the zarr, but I think there is no immediate need for that.


jwodder commented Jan 13, 2022

@yarikoptic @dchiquito What should the Zarr upload do if a Zarr contains an empty directory? There doesn't seem to be a way to inform the API of such a directory's existence, yet the directory is still used in calculating the checksum, and so there will be a checksum mismatch after uploading.

@dchiquito

I don't think well-formed zarr files can contain empty directories. I would assume it's not a thing, but perhaps we should appeal to someone with more knowledge than I.

S3 has no concept of "directories", so they can't be empty. My vote is that the checksum calculator should just ignore empty directories and the empty directory will not be considered a part of the zarr.


jwodder commented Jan 13, 2022

@yarikoptic In order to show only a single entry in pyout when downloading a Zarr, I think the code would need to call _download_file() in a separate thread for each file in the Zarr and then interleave & combine the results. Assuming I can figure out a good way to do that, the biggest problem I can think of would be how we would limit the maximum number of concurrent download threads, which is currently managed by pyout. Unless there's some way to get pyout to do the entry-combination for us?

@yarikoptic

Can _download_file(), in a pyout'ed thread, when handling a Zarr, just start its own pool of threads for _download_file()-ing the individual files, and then report aggregate progress back to pyout?
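Roughly, as a sketch only (here _download_one stands in for the existing _download_file()):

import shutil
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed

def _download_one(url, path):
    # stand-in for dandi's _download_file(): stream url into path
    with urllib.request.urlopen(url) as resp, open(path, "wb") as out:
        shutil.copyfileobj(resp, out)

def download_zarr(entries, downloads_per_zarr=4):
    """entries: iterable of (url, local_path, size); yields aggregate progress dicts."""
    entries = list(entries)
    total = sum(size for _, _, size in entries)
    done = 0
    with ThreadPoolExecutor(max_workers=downloads_per_zarr) as pool:
        futures = {pool.submit(_download_one, url, path): size
                   for url, path, size in entries}
        for fut in as_completed(futures):
            fut.result()            # re-raise any download error
            done += futures[fut]
            yield {"done": done, "done%": 100 * done / total if total else 0}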

@yarikoptic

Attn @satra on the above #852 (comment) question about empty directories. Smells like we should ignore them and exclude them from the checksum computation.


satra commented Jan 13, 2022

Smells like we should ignore them and exclude from checksum compute

i would agree with that.

the only thing is that we should perhaps test what an empty array (https://zarr.readthedocs.io/en/stable/api/creation.html#zarr.creation.empty) looks like in a nested directory store, and whether zarr's checksum of a zarr dataset changes or not.


jwodder commented Jan 14, 2022

@yarikoptic

Can _download_file() in pyout'ed thread, when handling a Zarr just start its own pool of threads for _download_file() ing individual files, and then report back to pyout an aggregate progress?

That's the basic plan, but the problem is determining the size of the thread pool. Normally, on a quad-core computer, pyout only runs 8 threads at once, so there are only at most 8 concurrent downloads. If we set the size of the Zarr thread pools to some n, we could end up with 8n concurrent downloads — and if we try to avoid that by only downloading one file in a Zarr at a time, we may end up taking longer than normal to download everything if there's ever a point where fewer than 8 downloads are running in total.


satra commented Jan 14, 2022

@jwodder - for downloads and uploads, isn't the bottleneck disk and network bandwidth much more than the number of cores? could something be done to auto-adjust the number of processes?

also, when dealing with lee's data i found pyout too constraining by default, especially because the number of jobs it runs is limited by the display of the table. it would be nice if there were an option to skip the tabular display and show just a summary display. it does seem that decoupling download processes from display constraints would be a good step, and may provide a greater ability to optimize.


jwodder commented Jan 14, 2022

@satra It is indeed possible to configure the maximum number of workers pyout uses when instantiating a table. How exactly do you recommend we determine this value?


jwodder commented Jan 14, 2022

@satra Also, regarding the number of jobs in pyout being limited by the display of the table: It seems that, by default, pyout waits for the top three jobs to complete before adding a new row that would cause the topmost to go off-screen, though this can be configured or disabled.


satra commented Jan 14, 2022

@jwodder - i don't have a good answer on how to optimize, but perhaps a test download would tell you how many threads optimize bandwidth given disk constraints (keep increasing the number of threads/processes till it saturates). on clusters where disks may be on different interfaces, they could even be competing with download bandwidth. it may be worthwhile looking up whether someone has implemented an algorithm for this.

regarding pyout, the issue i ran into was that when one of the rows near the top was uploading (say a large file) and everything else was complete, it would not add any more jobs. if the bandwidth was being fully utilized i don't care too much, but in some cases this results in less efficient uploads. the other use case is when a lot of small files are being uploaded: the job limit was 10 and could not be increased even though bandwidth wasn't an issue and devices weren't being saturated. hence the notion of removing the tabular display altogether and simply providing the summary display.

between optimizing number of threads and removing tabular display, i would prioritize the latter first so that the number of uploads/downloads do not have to be limited to 10.

@yarikoptic

re jobs -- I think we should adopt what we have for upload already:

  -J, --jobs N[:M]                Number of files to upload in parallel and,
                                  optionally, number of upload threads per
                                  file

so the user could explicitly control how many threads per file would be used (where possible, as in the case of zarr). Default to e.g. 4 and be "done".

In the longer run, we still have to RF (refactor) all the interfacing, so that pyout would no longer take care of parallelizing and would be used only for reporting/UI.
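For illustration, parsing such an N[:M] spec could look like this (names made up):

def parse_jobs(spec, default_per_file=4):
    """Parse 'N' or 'N:M' into (files in parallel, threads per file)."""
    if ":" in spec:
        n, m = spec.split(":", 1)
        return int(n), int(m)
    return int(spec), default_per_file

parse_jobs("8")     # -> (8, 4)
parse_jobs("8:2")   # -> (8, 2)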


jwodder commented Jan 14, 2022

@yarikoptic So you're suggesting adding a --jobs option to dandi download for separately setting the number of "main" download threads and the number of threads per Zarr asset, then?

@yarikoptic

yes, that is what I am thinking about. In principle we could even thread downloads of regular files via multiple range requests, but I do not see much need for that, so we could just document that [:M] is in effect only for assets containing multiple files (such as .zarr directories).


jwodder commented Jan 17, 2022

@yarikoptic I've written a library for running multiple iterators in threads and collecting the results. Now comes the bikeshedding part: deciding what combined status dicts should be emitted for a given set of per-file status dicts.

A refresher: The columns shown in the download pyout table are "path", "size", "done", "done%", "checksum", "status", and "message". The status dicts yielded for a given path while it's downloading take the following forms, yielded roughly in the order shown:

  • {"status": "skipped", "message": "<MESSAGE>"}
  • {"size": <int>}
  • {"status": "downloading"}
  • {"done": <bytes downloaded>, "done%": <percentage downloaded, from 0 to 100>}
  • {"status": "error", "message": "<MESSAGE>"}
  • {"checksum": "differs", "status": "error", "message": "<MESSAGE>"}
  • {"checksum": "ok"}
  • {"status": "setting mtime"}
  • {"status": "done"}

Hence the following questions & thoughts:

  • Should the "message" for a combined download always be in the form "X downloading, Y errored, Z skipped"? If so, what should happen to the error messages for individual files?
  • When at least one file is downloading and at least one file has errored, should the combined status be "downloading" or "error"? Likewise when skipped files are added to the mix.
  • When a download errors, should its contribution to the combined download progress be removed, left where it was, or treated like 100%?
  • {"status": "setting mtime"} status dict should probably just be discarded on input and not reported in the combined output.
  • It'll be simplest if the combined "size" and "done%" only reflect files seen so far rather than reporting the total Zarr size from the start.
  • Should the combined "checksum" field be "ok" as long as at least one downloaded file has its checksum verified, or should that only be shown once all checksums are verified?

@yarikoptic

  • Should the "message" for a combined download always be in the form "X downloading, Y errored, Z skipped"? If so, what should happen to the error messages for individual files?

Let's make it "X done, Y errored, Z skipped" with any of those omitted whenever it is just 0.

  • When at least one file is downloading and at least one file has errored, should the combined status be "downloading" or "error"? Likewise when skipped files are added to the mix.

I would keep it "downloading" until we are done with all files, and then if some failed, yield "error" instead of "done".
(note: if any errored, no need to do the checksum I guess)

  • When a download errors, should its contribution to the combined download progress be removed, left where it was, or treated like 100%?

my vote would be for "removed" so user would get adequate "done%" as for how much of zarr was downloaded.

  • {"status": "setting mtime"} status dict should probably just be discarded on input and not reported in the combined output.

we will be setting it for every file, right? then indeed not worth yielding, unless maybe for the .zarr/ directory at the end of the process.

  • It'll be simplest if the combined "size" and "done%" only reflect files seen so far rather than reporting the total Zarr size from the start.

"size" should report total zarr (as reported by the server); "done" and "done%" - seen so far.

  • Should the combined "checksum" field be "ok" as long as at least one downloaded file has its checksum verified, or should that only be shown once all checksums are verified?

I think checksum should be reported once for the entire zarr at the end of the download process.
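A sketch of aggregating per-file status dicts along these lines (illustrative only, not the actual implementation):

def combine_statuses(file_stats, total_size):
    """Combine per-file status dicts into one status dict for the whole zarr."""
    done_bytes = sum(s.get("done", 0) for s in file_stats if s.get("status") != "error")
    n_done = sum(1 for s in file_stats if s.get("status") == "done")
    n_err = sum(1 for s in file_stats if s.get("status") == "error")
    n_skip = sum(1 for s in file_stats if s.get("status") == "skipped")
    parts = [f"{n} {label}" for n, label in
             [(n_done, "done"), (n_err, "errored"), (n_skip, "skipped")] if n]
    finished = all(s.get("status") in ("done", "error", "skipped") for s in file_stats)
    status = ("error" if n_err else "done") if finished else "downloading"
    return {
        "size": total_size,      # total zarr size as reported by the server
        "done": done_bytes,      # errored files' contribution removed
        "done%": 100 * done_bytes / total_size if total_size else 0,
        "status": status,
        "message": ", ".join(parts),
    }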


jwodder commented Jan 19, 2022

@yarikoptic I don't think the code should calculate a checksum for the whole Zarr after downloading; digests are already checked for the individual files, and that should be enough. Moreover, checksumming a Zarr would fail if downloading into a pre-existing local Zarr containing files that don't exist in the Zarr being downloaded.

@yarikoptic

Moreover, checksumming a Zarr would fail if downloading into a pre-existing local Zarr containing files that don't exist in the Zarr being downloaded.

I think it is actually important to guarantee that the zarr as downloaded is consistent with what is on the server, to avoid confusion and e.g. re-upload of a zarr with some garbage in it. So, as the last step of the zarr download, it would be good to delete files which are not listed/present in the remote (on-server) copy. Then the checksum should match. I guess we can rely on fscacher to cache the checksumming of individual files to make it reasonably fast.
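A minimal sketch of that pruning step (assuming the remote listing provides POSIX-style paths relative to the zarr root):

from pathlib import Path

def prune_extraneous(local_zarr: Path, remote_paths: set) -> None:
    """Delete local files not present in the remote (on-server) listing."""
    for p in sorted(local_zarr.rglob("*"), reverse=True):   # children before parents
        if p.is_file() and p.relative_to(local_zarr).as_posix() not in remote_paths:
            p.unlink()
        elif p.is_dir() and not any(p.iterdir()):
            p.rmdir()   # drop directories left empty by the pruning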


jwodder commented Jan 20, 2022

@yarikoptic Hashing of downloaded files is performed piecemeal as they're being downloaded without using fscacher, so checksumming a downloaded Zarr will have to start from scratch.

@yarikoptic

@yarikoptic Hashing of downloaded files is performed piecemeal as they're being downloaded without using fscacher, so checksumming a downloaded Zarr will have to start from scratch.

so for files downloaded in the current session we would have md5s from the "on the fly" digestion, right? for the others found in the folder we would need to re-digest them from disk. And that is where I thought fscacher would kick in. Or where am I wrong?


jwodder commented Jan 20, 2022

@yarikoptic

so for downloaded in current session we would have md5s from "on the fly" digestion right?

The digests aren't returned from the download manager; they're just discarded after they're checked.

@yarikoptic

The digests aren't returned from the download manager; they're just discarded after they're checked.

that's unfortunate. I didn't check the code, but maybe you see a reasonably easy way to still make it possible to get their digests for the overall zarr digest computation "on the fly"? if not -- we will be doomed to re-digest the entire zarr tree in full upon completion.


jwodder commented Jan 20, 2022

@yarikoptic I figured out a way.


satra commented Jan 21, 2022

i will copy the directory over and retry, but perhaps symbolic links should be supported.

$ dandi upload -i dandi-staging sub-MITU01_run-1_sample-178_stain-LEC_chunk-1_spim.ngff
2022-01-21 08:20:05,787 [ WARNING] A newer version (0.34.1) of dandi/dandi-cli is available. You are using 0.34.0+50.gc12f72b
2022-01-21 08:20:33,114 [ WARNING] /mnt/beegfs/satra/zarrhdf/101233/sub-MITU01/ses-20210521h17m17s06/microscopy/sub-MITU01_run-1_sample-178_stain-LEC_chunk-1_spim.ngff: Ignoring unsupported symbolic link to directory
2022-01-21 08:20:33,114 [    INFO] Found 0 files to consider
2022-01-21 08:20:33,126 [    INFO] Logs saved in /home/satra/.cache/dandi-cli/log/20220121132004Z-1470910.log

yarikoptic added a commit that referenced this issue Jan 25, 2022: Support Zarr directories
jwodder added the zarr label Jan 25, 2022