Add support for .zarr "files" upload #852

Closed
yarikoptic opened this issue Dec 15, 2021 · 72 comments · Fixed by #853


jwodder commented Dec 16, 2021

@dchiquito Questions on the zarr upload API:

  • The docs for POST /api/zarr/{zarr_id}/upload/ say "The number of files being uploaded must be less than some experimentally defined limit". What is the current/recommended limit?
  • Number 3 in "Requirements" states "The CLI uses some kind of tree hashing scheme to compute a checksum for the entire zarr archive," but the "Upload flow" section makes no mention of such a checksum. Does the client have to compute a checksum or not, and, if it does, where is it used?
  • The document says "The asset metadata will contain the information required to determine if an asset is a normal file blob or a zarr file." Exactly what information is that, and is it added by the client or the server?
  • For POST /zarr/, what exactly should the "name" field in the request body be set to?
  • ".checksum files format" says "Each zarr file and directory in the archive has a path and a checksum (md5). For files, this is simply the ETag." So is the checksum for a file an md5 or a DANDI e-tag? If the latter, why are we using different digests for files on their own versus in directories?
  • "Upload flow" states "The client calculates the MD5 of each file." and "The client sends the paths+MD5s to ...", but the field name in the request body for POST /zarr/{zarr_id}/upload/ is "etag", not "md5". Which digest is supposed to be used?
    • Similarly, the docs for POST /api/zarr/{zarr_id}/upload/ say, "Requires a list of file paths and ETags (md5 checksums)". E-tags and MD5 checksums are not the same thing.
  • Unlike POST /uploads/initialize/, POST /zarr/{zarr_id}/upload/ does not return information on different parts of files. Does this mean that each file in a zarr directory is to be uploaded in a single request? What if a file exceeds S3's part size limit of 5 GB?
  • The Swagger docs don't describe the request body format for DELETE /api/zarr/{zarr_id}/files/.
  • How do you download a Zarr?


satra commented Dec 16, 2021

use the zarr python library to open the nested directory store (i think to start with we will only support this backend). it will check consistency. files will not necessarily have the .zarr extension; in fact ngff uses the .ngff extension. also, an ngff validator is not in place at the moment, so zarr is the closest. i've posted this issue seeking additional clarity: zarr-developers/zarr-python#912


jwodder commented Dec 16, 2021

@dchiquito I'm trying to upload a trivial test zarr with the following code:

#!/usr/bin/env python3
import json
import os
from pathlib import Path
import sys
from dandi.dandiapi import DandiAPIClient, RESTFullAPIClient
from dandi.support.digests import get_dandietag
from dandi.utils import find_files

dandiset_id = sys.argv[1]
zarrdir = Path(sys.argv[2])
if zarrdir.suffix != ".zarr" or not zarrdir.is_dir():
    sys.exit(f"{zarrdir} is not a zarr directory")

with DandiAPIClient.for_dandi_instance(
    "dandi-staging", token=os.environ["DANDI_API_KEY"]
) as client:
    r = client.post("/zarr/", json={"name": zarrdir.name})
    zarr_id = r["zarr_id"]
    zfiles = list(map(Path, find_files(r".*", str(zarrdir))))
    upload_body = []
    for zf in zfiles:
        upload_body.append({
            "path": zf.relative_to(zarrdir).as_posix(),
            "etag": get_dandietag(zf).as_str(),
        })
    r = client.post(f"/zarr/{zarr_id}/upload/", json=upload_body)
    with RESTFullAPIClient("http://nil.nil") as storage:
        for upspec in r:
            with (zarrdir / upspec["path"]).open("rb") as fp:
                storage.put(upspec["upload_url"], data=fp, json_resp=False)
    r = client.post(f"/zarr/{zarr_id}/upload/complete/")
    #print(json.dumps(r, indent=4))
    d = client.get_dandiset(dandiset_id, "draft", lazy=False)
    r = client.post(
        f"{d.version_api_path}assets/",
        json={"metadata": {"path": zarrdir.name}, "zarr_id": zarr_id},
    )
    print(json.dumps(r, indent=4))

but I'm getting a 403 response when PUTting the files to S3:

<Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>Y16M1QN13AM6769S</RequestId><HostId>4uYw6UzqgIdTUIhqRXqW/PDmD6lXmIv53YYPHJFYv/u2rKi62895bV6jSCqrJCKF3qhJZJI1RCw=</HostId></Error>


jwodder commented Dec 16, 2021

@satra

files will not necessarily have zarr extension

Then how do we tell whether something is a zarr directory or not?


satra commented Dec 16, 2021

Then how do we tell whether something is a zarr directory or not?

open it as a NestedDirectoryStore in read mode with the zarr-python library. it should be able to give you information about groups and shapes, and metadata. we would essentially validate based on those and for the moment write our own requirements for the ngff files till the ome-zarr-py library implements those. but since this is generic zarr to start with, i would use the xarray or zarr python libraries to read the dataset.
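For concreteness, a minimal sketch of that check (assuming zarr v2.x and its NestedDirectoryStore; the path is hypothetical):

import zarr

# hypothetical path; open in read mode so nothing is created on disk
store = zarr.NestedDirectoryStore("sub-01_sample-178_spim.ngff")
try:
    root = zarr.open(store, mode="r")
    print(root.info)          # groups/arrays, shapes, chunking
    print(dict(root.attrs))   # user-level metadata
except Exception as exc:
    print(f"not a readable zarr: {exc}")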


jwodder commented Dec 16, 2021

@satra So we just try to open every single directory with the zarr library and see if it succeeds?


yarikoptic commented Dec 16, 2021

@satra:

use the zarr python library to open the nested directory store (i think to start with we will only support this backend). it will check consistency. files will not necessarily have the .zarr extension; in fact ngff uses the .ngff extension. also, an ngff validator is not in place at the moment, so zarr is the closest. i've posted this issue seeking additional clarity: zarr-developers/zarr-python#912

@jwodder

So we just try to open every single directory with the zarr library and see if it succeeds?

  1. I think we should formalize "supported zarr directory name extensions" to a limited set one way or another
  2. We can't rely on "sensing" each directory alone, since that by itself would probably not tell us (if it errors out) whether a directory contains a (broken) zarr archive or just a collection of files
  3. But we must not allow upload of "heavy trees", as any zarr is likely to be -- we would flood the archive with thousands of assets which would only need to be removed

Thus I think we should

  1. formalize "supported zarr directory name extensions", e.g. to a set of .zarr and .ngff (see the sketch after this list)
  2. for any of those, test whether they are "ok" by trying to open them. If not -- error/skip
  3. if we detect a file tree deeper than some N directories down (5?) -- error out. This might need to become an option somewhere. In the current dandi organize'd scheme (sub-*/{files}) we have only 1 level down. In BIDS (sub-*/ses-*/{modality}/{files}/potentially-onemore) we have 4.
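A rough sketch of checks 1 and 3 (the names and the depth limit here are illustrative, not actual dandi-cli code):

from pathlib import Path

ZARR_EXTENSIONS = {".zarr", ".ngff"}   # per point 1
MAX_DEPTH = 5                          # per point 3; would likely become an option

def looks_like_zarr_dir(path: Path) -> bool:
    return path.is_dir() and path.suffix in ZARR_EXTENSIONS

def check_tree_depth(root: Path, limit: int = MAX_DEPTH) -> None:
    for p in root.rglob("*"):
        if len(p.relative_to(root).parts) > limit:
            raise ValueError(f"{root}: tree is nested deeper than {limit} levels")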

edit 1: I think it is ok to add zarr as a dependency

It adds only 2 dependencies -- asciitree and numcodecs (this one seems a bit heavy though) -- and zarr is in conda-forge:
$> pip install zarr
Collecting zarr
  Using cached zarr-2.10.3-py3-none-any.whl (146 kB)
Requirement already satisfied: numpy>=1.7 in ./venvs/dev3/lib/python3.9/site-packages (from zarr) (1.21.4)
Requirement already satisfied: fasteners in ./venvs/dev3/lib/python3.9/site-packages (from zarr) (0.16.3)
Collecting asciitree
  Using cached asciitree-0.3.3-py3-none-any.whl
Collecting numcodecs>=0.6.4
  Using cached numcodecs-0.9.1-cp39-cp39-manylinux2010_x86_64.whl (6.4 MB)
Requirement already satisfied: six in ./venvs/dev3/lib/python3.9/site-packages (from fasteners->zarr) (1.16.0)
Installing collected packages: numcodecs, asciitree, zarr
Successfully installed asciitree-0.3.3 numcodecs-0.9.1 zarr-2.10.3


satra commented Dec 16, 2021

do formalize "supported zarr directory name extensions" e.g. to a set of .zarr and .ngff

i think for the moment this would be fine.


satra commented Dec 16, 2021

zarr will also require compression codecs to be added.


jwodder commented Dec 16, 2021

@yarikoptic @satra Note that zarr seems to ignore files with invalid/unexpected names, and a *.zarr directory containing only such files with no .zarray or .zgroup is treated as an empty group. How should directories like this be handled?


satra commented Dec 16, 2021

a *.zarr directory containing only such files with no .zarray or .zgroup is treated as an empty group. How should directories like this be handled?

for the moment, let's say no empty groups allowed.


jwodder commented Dec 16, 2021

@yarikoptic

  • When uploading a Zarr directory, should all files within it be uploaded, even "extra"/unrecognized ones? Note that Zarr does not seem to provide a way to only list recognized files.
  • Should --allow-any-path have any effect on the treatment of Zarr directories, particularly invalid ones? If not, should there be a way to force processing of an invalid Zarr directory?

@yarikoptic

a *.zarr directory containing only such files with no .zarray or .zgroup is treated as an empty group. How should directories like this be handled?

for the moment, let's say no empty groups allowed.

FWIW, what we pretty much need is a zarr_validate (to complement pynwb_validate) which would do all those checks. It would then be interfaced in validate and upload.

Should --allow-any-path have any effect on the treatment of Zarr directories, particularly invalid ones? If not, should there be a way to force processing of an invalid Zarr directory?

if, per the above, we just add zarr_validate, then since we do allow uploading without validation, the uniform way would be to "allow" users to upload invalid zarrs if they say so via --validation [skip|ignore]


jwodder commented Dec 16, 2021

@yarikoptic So, in the absence of or prior to validation, would we just treat any directory with a .zarr or .ngff extension as a Zarr directory?

Also, how should metadata be determined for Zarr assets? (cc @satra)

@yarikoptic

@yarikoptic So, in the absence of or prior to validation, would we just treat any directory with a .zarr or .ngff extension as a Zarr directory?

Yes


satra commented Dec 16, 2021

for the moment i would limit metadata extraction similar to the bids data, so based on names rather than internal metadata. in the future once we get better ngff metadata we will write additional extractors. i can help with the metadata extraction beyond basic metadata (size, encoding type, subject id). for dandi this would be part of bids datasets, so we will have additional info available for sticking into the asset metadata from participants.tsv and samples.tsv files.


jwodder commented Dec 16, 2021

@yarikoptic FYI: I'm implementing the metadata & validation for Zarr by giving NWB, Zarr, and generic files their own classes with metadata and validation methods; however, fscacher doesn't currently support caching of instance methods, so some caching is going to have to be disabled for now.

@yarikoptic

hm, do you see how fscacher could gain support for bound methods? if not, I wonder if we shouldn't just concentrate logic in @staticmethods of such classes which would be explicitly passed a path instead of an instance?

jwodder mentioned this issue Dec 16, 2021

jwodder commented Dec 16, 2021

@yarikoptic We would have to add a variant of memoize_path() that gets the affected path from a given attribute of the decorated method's self parameter.
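A minimal sketch of such a variant (hypothetical, not the actual fscacher API; a real implementation would also fingerprint the file at that path so the cache invalidates when it changes):

import functools

def memoize_path_method(path_attr):
    """Cache a method's result keyed on the path stored in `self.<path_attr>`."""
    def decorator(method):
        cache = {}
        @functools.wraps(method)
        def wrapper(self, *args, **kwargs):
            key = (getattr(self, path_attr), args, tuple(sorted(kwargs.items())))
            if key not in cache:
                cache[key] = method(self, *args, **kwargs)
            return cache[key]
        return wrapper
    return decorator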

@yarikoptic

right -- sounds good! somehow it didn't occur to me ;)

@dchiquito

The docs for POST /api/zarr/{zarr_id}/upload/ say "The number of files being uploaded must be less than some experimentally defined limit". What is the current/recommended limit?

Currently there is no limit. Now that we have the code up on a real server I need to do some experimentation to determine that value. The limit is mostly there to enforce that no request takes longer than 30 seconds, since Heroku will forcibly cancel any requests that exceed that timeout.

Number 3 in "Requirements" states "The CLI uses some kind of tree hashing scheme to compute a checksum for the entire zarr archive," but the "Upload flow" section makes no mention of such a checksum. Does the client have to compute a checksum or not, and, if it does, where is it used?

The client would need to calculate the checksum to do integrity checking of the zarr upload. I wrote https://github.com/dandi/dandi-api/blob/master/dandiapi/api/zarr_checksums.py to handle all of that logic, and I put it in dandi-api for now for quicker iteration. It should be moved to a common location soon so that the CLI can use it as well.
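As a generic illustration of a tree-hashing scheme over (path, md5) pairs -- not necessarily the exact serialization used in zarr_checksums.py:

from hashlib import md5
from pathlib import Path

def file_md5(path: Path) -> str:
    h = md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def tree_checksum(root: Path) -> str:
    # digest the sorted (relative path, md5) listing of all files under the zarr root
    entries = sorted(
        (p.relative_to(root).as_posix(), file_md5(p))
        for p in root.rglob("*") if p.is_file()
    )
    return md5(repr(entries).encode()).hexdigest()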

The document says "The asset metadata will contain the information required to determine if an asset is a normal file blob or a zarr file." Exactly what information is that, and is it added by the client or the server?

That seems ambiguous to me at the moment. The server is not adding that metadata right now. @satra any preference on where/how asset metadata is updated?

For POST /zarr/, what exactly should the "name" field in the request body be set to?

I would assume that would be the name of the directory containing the zarr data, unless zarr metadata contains a more descriptive name.

".checksum files format" says "Each zarr file and directory in the archive has a path and a checksum (md5). For files, this is simply the ETag." So is the checksum for a file an md5 or a DANDI e-tag? If the latter, why are we using different digests for files on their own versus in directories?
"Upload flow" states "The client calculates the MD5 of each file." and "The client sends the paths+MD5s to ...", but the field name in the request body for POST /zarr/{zarr_id}/upload/ is "etag", not "md5". Which digest is supposed to be used?
Similarly, the docs for POST /api/zarr/{zarr_id}/upload/ say, "Requires a list of file paths and ETags (md5 checksums)". E-tags and MD5 checksums are not the same thing.

I would say we are using simple MD5, and that for these purposes MD5 and ETag are the same thing. Files in the zarr archive are limited to 5GB so that they can use the simple upload endpoint, and for files uploaded simply (as opposed to multipart), the ETag is the MD5 of the file. DANDI ETags are defined to be the ETags of multipart uploads, which are the MD5 of the part MD5s with a suffix giving the number of parts.
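To illustrate the distinction (simplified; real S3 multipart parts must be at least 5 MB except the last, the sizes here are just for illustration):

from hashlib import md5

data = b"example bytes" * 1000

# ETag of a simple (non-multipart) S3 upload is the plain MD5 of the object
simple_etag = md5(data).hexdigest()

# ETag of a multipart upload is the MD5 of the concatenated binary part digests,
# with "-<number of parts>" appended; the dandi-etag is of this multipart form
parts = [data[:6000], data[6000:]]
multipart_etag = (
    md5(b"".join(md5(p).digest() for p in parts)).hexdigest() + f"-{len(parts)}"
)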

Unlike POST /uploads/initialize/, POST /zarr/{zarr_id}/upload/ does not return information on different parts of files. Does this mean that each file in a zarr directory is to be uploaded in a single request? What if a file exceeds S3's part size limit of 5 GB?

Correct. Multipart upload requires substantially more engineering, so the executive decision was made to cap zarr component files at 5GB. A zarr is relatively easy to rechunk into smaller pieces if that limit is exceeded.

The Swagger docs don't describe the request body format for DELETE /api/zarr/{zarr_id}/files/.

Sorry, my bad. dandi/dandi-archive#648

How do you download a Zarr?

That is done directly from S3 using the zarr library. You should be able to pass the S3 path of the zarr directory and zarr will read that directly from S3.
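For example, something along these lines should work (the bucket/prefix are placeholders; assumes s3fs and zarr are installed):

import s3fs
import zarr

fs = s3fs.S3FileSystem(anon=True)
store = s3fs.S3Map(root="dandi-bucket/zarr/<zarr_id>", s3=fs, check=False)
root = zarr.open(store, mode="r")   # read the zarr hierarchy straight from S3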


satra commented Dec 23, 2021

That seems ambigious to me at the moment. The server is not adding that metadata right now. @satra any preference on where/how asset metadata is updated?

this should be done as is done currently by the CLI for other types of files. the exact details of encodingformat can be worked out in the relevant CLI PR.

@dchiquito

@jwodder I confirm that I am encountering the same error with your script. I will look closer next week.

@dchiquito

@jwodder I got the script working against dandi-api-local-docker-tests by changing the ETag from get_dandietag(zf).as_str() to md5(blob).hexdigest(). I'm still getting the 403 errors in staging though, so I am still investigating.


dchiquito commented Dec 27, 2021

@jwodder I missed a spot with AWS permissions, the staging API has been updated. You need to specify X-Amz-ACL: bucket-owner-full-control as a request header for the upload.

My adaptation of your example looks like this:
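(This assumes the same setup and imports as the script above, plus from hashlib import md5.)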

with DandiAPIClient.for_dandi_instance(
    "dandi-staging", token=os.environ["DANDI_API_KEY"]
) as client:
    r = client.post("/zarr/", json={"name": zarrdir.name})
    zarr_id = r["zarr_id"]
    zfiles = list(map(Path, find_files(r".*", str(zarrdir))))
    upload_body = []
    for zf in zfiles:
        with open(zf, "rb") as f:
            blob = f.read()
        upload_body.append({
            "path": zf.relative_to(zarrdir).as_posix(),
            "etag": md5(blob).hexdigest(), # Simple MD5 instead of dandi-etag
        })
    r = client.post(f"/zarr/{zarr_id}/upload/", json=upload_body)
    with RESTFullAPIClient("http://nil.nil") as storage:
        for upspec in r:
            with (zarrdir / upspec["path"]).open("rb") as fp:
                storage.put(upspec["upload_url"], data=fp, json_resp=False, headers={"X-Amz-ACL": "bucket-owner-full-control"}) # X-Amz-ACL header is required
    r = client.post(f"/zarr/{zarr_id}/upload/complete/")


jwodder commented Jan 3, 2022

@dchiquito

The client would need to calculate the checksum to do integrity checking of the zarr upload.

Could you elaborate on how this "integrity checking" fits into the upload flow? Exactly what does the client compare its computed value against?

@yarikoptic

@dchiquito could you please clarify the above question from @jwodder? In the example you posted above I do not see zarr "file" integrity checking anywhere. Is providing it optional?


jwodder commented Jan 12, 2022

@yarikoptic So when the client is asked to download a Zarr, should it, instead of downloading a monolithic asset, effectively treat the Zarr as a tree of independent blobs, each one of which is separately compared against whatever's on-disk at the final location? What exactly should pyout display when downloading a multifile Zarr?


yarikoptic commented Jan 12, 2022

@yarikoptic So when the client is asked to download a Zarr, should it, instead of downloading a monolithic asset, effectively treat the Zarr as a tree of independent blobs, each one of which is separately compared against whatever's on-disk at the final location?

correct

edit: with an overall checksum comparison across the tree... if it is possible to make it "on the fly", as we do with the digest for an individual file, while traversing in the same order as desired for computing the zarr checksum, that would be awesome!

What exactly should pyout display when downloading a multifile Zarr?

I think for user reporting (pyout) we could consider a zarr file (well -- directory) to be a single asset/file, so % progress would be based on the "total" for that directory. Eventually we might want to return/expose progress in terms of the # of files within the zarr, but I think there is no immediate need for that.


jwodder commented Jan 13, 2022

@yarikoptic @dchiquito What should the Zarr upload do if a Zarr contains an empty directory? There doesn't seem to be a way to inform the API of such a directory's existence, yet the directory is still used in calculating the checksum, and so there will be a checksum mismatch after uploading.

@dchiquito

I don't think well-formed zarr files can contain empty directories. I would assume it's not a thing, but perhaps we should appeal to someone with more knowledge than I.

S3 has no concept of "directories", so they can't be empty. My vote is that the checksum calculator should just ignore empty directories and the empty directory will not be considered a part of the zarr.


jwodder commented Jan 13, 2022

@yarikoptic In order to show only a single entry in pyout when downloading a Zarr, I think the code would need to call _download_file() in a separate thread for each file in the Zarr and then interleave & combine the results. Assuming I can figure out a good way to do that, the biggest problem I can think of would be how we would limit the maximum number of concurrent download threads, which is currently managed by pyout. Unless there's some way to get pyout to do the entry-combination for us?

@yarikoptic

Can _download_file(), in a pyout'ed thread, when handling a Zarr, just start its own pool of threads for _download_file()-ing the individual files, and then report aggregate progress back to pyout?
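Roughly, as a sketch only (here _download_one stands in for the existing _download_file()):

import shutil
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed

def _download_one(url, path):
    # stand-in for dandi's _download_file(): stream url into path
    with urllib.request.urlopen(url) as resp, open(path, "wb") as out:
        shutil.copyfileobj(resp, out)

def download_zarr(entries, downloads_per_zarr=4):
    """entries: iterable of (url, local_path, size); yields aggregate progress dicts."""
    entries = list(entries)
    total = sum(size for _, _, size in entries)
    done = 0
    with ThreadPoolExecutor(max_workers=downloads_per_zarr) as pool:
        futures = {pool.submit(_download_one, url, path): size
                   for url, path, size in entries}
        for fut in as_completed(futures):
            fut.result()            # re-raise any download error
            done += futures[fut]
            yield {"done": done, "done%": 100 * done / total if total else 0}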

@yarikoptic

Attn @satra on the above #852 (comment) question about empty directories. Smells like we should ignore them and exclude them from the checksum computation.


satra commented Jan 13, 2022

Smells like we should ignore them and exclude from checksum compute

i would agree with that.

the only thing is that we should perhaps test what an empty array (https://zarr.readthedocs.io/en/stable/api/creation.html#zarr.creation.empty) looks like in a nested directory store, and whether zarr's checksum of a zarr dataset changes or not.


jwodder commented Jan 14, 2022

@yarikoptic

Can _download_file() in pyout'ed thread, when handling a Zarr just start its own pool of threads for _download_file() ing individual files, and then report back to pyout an aggregate progress?

That's the basic plan, but the problem is determining the size of the thread pool. Normally, on a quad-core computer, pyout only runs 8 threads at once, so there are only at most 8 concurrent downloads. If we set the size of the Zarr thread pools to some n, we could end up with 8n concurrent downloads — and if we try to avoid that by only downloading one file in a Zarr at a time, we may end up taking longer than normal to download everything if there's ever a point where fewer than 8 downloads are running in total.


satra commented Jan 14, 2022

@jwodder - for downloads and uploads, isn't the bottleneck disk and network bandwidth much more than the number of cores? could something be done to auto-adjust the number of processes?

also, when dealing with lee's data i found pyout too constraining by default, especially because the number of jobs it runs is limited by the display of the table. it would be nice if there were an option to skip the tabular display and show just a summary display. it does seem that decoupling download processes from display constraints would be a good step, and may provide a greater ability to optimize.


jwodder commented Jan 14, 2022

@satra It is indeed possible to configure the maximum number of workers pyout uses when instantiating a table. How exactly do you recommend we determine this value?


jwodder commented Jan 14, 2022

@satra Also, regarding the number of jobs in pyout being limited by the display of the table: It seems that, by default, pyout waits for the top three jobs to complete before adding a new row that would cause the topmost to go off-screen, though this can be configured or disabled.


satra commented Jan 14, 2022

@jwodder - i don't have a good answer on how to optimize, but perhaps a test download would tell you how many threads optimize bandwidth given disk constraints (keep increasing the number of threads/processes till it saturates). on clusters where disks may be on different interfaces, they could even be competing with download bandwidth. it may be worthwhile looking up whether someone has implemented an algorithm for this.

regarding pyout, the issue i ran into was that when one of the rows near the top was uploading (say a large file) and everything else was complete, it would not add any more jobs. if the bandwidth was being fully utilized i don't care too much, but in some cases this results in less efficient uploads. the other use case is when a lot of small files are being uploaded: the job limit was 10 and could not be increased even though bandwidth wasn't an issue and devices weren't being saturated. hence the notion of removing the tabular display altogether and simply providing the summary display.

between optimizing number of threads and removing tabular display, i would prioritize the latter first so that the number of uploads/downloads do not have to be limited to 10.

@yarikoptic

re jobs -- I think we should adopt what we have for upload already:

  -J, --jobs N[:M]                Number of files to upload in parallel and,
                                  optionally, number of upload threads per
                                  file

so the user could explicitly control how many threads per file would be used (where possible, as in the case of zarr). Default to e.g. 4 and be "done".

In the longer run, we still have to RF (refactor) all the interfacing, so that pyout would no longer take care of parallelizing and would be used only for reporting/UI.
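For illustration, parsing such an N[:M] spec could look like this (names made up):

def parse_jobs(spec, default_per_file=4):
    """Parse 'N' or 'N:M' into (files in parallel, threads per file)."""
    if ":" in spec:
        n, m = spec.split(":", 1)
        return int(n), int(m)
    return int(spec), default_per_file

parse_jobs("8")     # -> (8, 4)
parse_jobs("8:2")   # -> (8, 2)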


jwodder commented Jan 14, 2022

@yarikoptic So you're suggesting adding a --jobs option to dandi download for separately setting the number of "main" download threads and the number of threads per Zarr asset, then?

@yarikoptic

yes, that is what I am thinking about. In principle we could even thread downloads of regular files via multiple range requests, but I do not see much need for that, so we could just document that [:M] is in effect only for assets containing multiple files (such as .zarr directories).


jwodder commented Jan 17, 2022

@yarikoptic I've written a library for running multiple iterators in threads and collecting the results. Now comes the bikeshedding part: deciding what combined status dicts should be emitted for a given set of per-file status dicts.

A refresher: The columns shown in the download pyout table are "path", "size", "done", "done%", "checksum", "status", and "message". The status dicts yielded for a given path while it's downloading take the following forms, yielded roughly in the order shown:

  • {"status": "skipped", "message": "<MESSAGE>"}
  • {"size": <int>}
  • {"status": "downloading"}
  • {"done": <bytes downloaded>, "done%": <percentage downloaded, from 0 to 100>}
  • {"status": "error", "message": "<MESSAGE>"}
  • {"checksum": "differs", "status": "error", "message": "<MESSAGE>"}
  • {"checksum": "ok"}
  • {"status": "setting mtime"}
  • {"status": "done"}

Hence the following questions & thoughts:

  • Should the "message" for a combined download always be in the form "X downloading, Y errored, Z skipped"? If so, what should happen to the error messages for individual files?
  • When at least one file is downloading and at least one file has errored, should the combined status be "downloading" or "error"? Likewise when skipped files are added to the mix.
  • When a download errors, should its contribution to the combined download progress be removed, left where it was, or treated like 100%?
  • {"status": "setting mtime"} status dict should probably just be discarded on input and not reported in the combined output.
  • It'll be simplest if the combined "size" and "done%" only reflect files seen so far rather than reporting the total Zarr size from the start.
  • Should the combined "checksum" field be "ok" as long as at least one downloaded file has its checksum verified, or should that only be shown once all checksums are verified?

@yarikoptic

  • Should the "message" for a combined download always be in the form "X downloading, Y errored, Z skipped"? If so, what should happen to the error messages for individual files?

Let's make it "X done, Y errored, Z skipped" with any of those omitted whenever it is just 0.

  • When at least one file is downloading and at least one file has errored, should the combined status be "downloading" or "error"? Likewise when skipped files are added to the mix.

I would keep it "downloading" until we are done with all files, and then if some failed, yield "error" instead of "done".
(note: if any errored, no need to do the checksum I guess)

  • When a download errors, should its contribution to the combined download progress be removed, left where it was, or treated like 100%?

my vote would be for "removed" so user would get adequate "done%" as for how much of zarr was downloaded.

  • {"status": "setting mtime"} status dict should probably just be discarded on input and not reported in the combined output.

we will be setting it for every file, right? then indeed not worth yielding, unless maybe for the .zarr/ directory at the end of the process.

  • It'll be simplest if the combined "size" and "done%" only reflect files seen so far rather than reporting the total Zarr size from the start.

"size" should report total zarr (as reported by the server); "done" and "done%" - seen so far.

  • Should the combined "checksum" field be "ok" as long as at least one downloaded file has its checksum verified, or should that only be shown once all checksums are verified?

I think checksum should be reported once for the entire zarr at the end of the download process.
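A sketch of aggregating per-file status dicts along these lines (illustrative only, not the actual implementation):

def combine_statuses(file_stats, total_size):
    """Combine per-file status dicts into one status dict for the whole zarr."""
    done_bytes = sum(s.get("done", 0) for s in file_stats if s.get("status") != "error")
    n_done = sum(1 for s in file_stats if s.get("status") == "done")
    n_err = sum(1 for s in file_stats if s.get("status") == "error")
    n_skip = sum(1 for s in file_stats if s.get("status") == "skipped")
    parts = [f"{n} {label}" for n, label in
             [(n_done, "done"), (n_err, "errored"), (n_skip, "skipped")] if n]
    finished = all(s.get("status") in ("done", "error", "skipped") for s in file_stats)
    status = ("error" if n_err else "done") if finished else "downloading"
    return {
        "size": total_size,      # total zarr size as reported by the server
        "done": done_bytes,      # errored files' contribution removed
        "done%": 100 * done_bytes / total_size if total_size else 0,
        "status": status,
        "message": ", ".join(parts),
    }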


jwodder commented Jan 19, 2022

@yarikoptic I don't think the code should calculate a checksum for the whole Zarr after downloading; digests are already checked for the individual files, and that should be enough. Moreover, checksumming a Zarr would fail if downloading into a pre-existing local Zarr containing files that don't exist in the Zarr being downloaded.

@yarikoptic

Moreover, checksumming a Zarr would fail if downloading into a pre-existing local Zarr containing files that don't exist in the Zarr being downloaded.

I think it is actually important to guarantee that the zarr as downloaded is consistent with what is on the server, to avoid confusion and e.g. re-upload of a zarr with some garbage in it. So, as the last step of the zarr download, it would be good to delete files which are not listed/present in the remote (on-server) copy. Then the checksum should match. I guess we can rely on fscacher to cache the checksumming of individual files to make it reasonably fast.
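A minimal sketch of that pruning step (assuming the remote listing provides POSIX-style paths relative to the zarr root):

from pathlib import Path

def prune_extraneous(local_zarr: Path, remote_paths: set) -> None:
    """Delete local files not present in the remote (on-server) listing."""
    for p in sorted(local_zarr.rglob("*"), reverse=True):   # children before parents
        if p.is_file() and p.relative_to(local_zarr).as_posix() not in remote_paths:
            p.unlink()
        elif p.is_dir() and not any(p.iterdir()):
            p.rmdir()   # drop directories left empty by the pruning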


jwodder commented Jan 20, 2022

@yarikoptic Hashing of downloaded files is performed piecemeal as they're being downloaded without using fscacher, so checksumming a downloaded Zarr will have to start from scratch.

@yarikoptic

@yarikoptic Hashing of downloaded files is performed piecemeal as they're being downloaded without using fscacher, so checksumming a downloaded Zarr will have to start from scratch.

so for files downloaded in the current session we would have md5s from the "on the fly" digestion, right? for the others found in the folder we would need to re-digest them from disk. And that is where I thought fscacher would kick in. Or where am I wrong?


jwodder commented Jan 20, 2022

@yarikoptic

so for downloaded in current session we would have md5s from "on the fly" digestion right?

The digests aren't returned from the download manager; they're just discarded after they're checked.

@yarikoptic

The digests aren't returned from the download manager; they're just discarded after they're checked.

that's unfortunate. I didn't check the code, but maybe you see a reasonably easy way to still make it possible to get their digests for the overall zarr digest computation "on the fly"? if not -- we will be doomed to re-digest the entire zarr tree in full upon completion.


jwodder commented Jan 20, 2022

@yarikoptic I figured out a way.


satra commented Jan 21, 2022

i will copy the directory over and retry, but perhaps symbolic links should be supported.

$ dandi upload -i dandi-staging sub-MITU01_run-1_sample-178_stain-LEC_chunk-1_spim.ngff
2022-01-21 08:20:05,787 [ WARNING] A newer version (0.34.1) of dandi/dandi-cli is available. You are using 0.34.0+50.gc12f72b
2022-01-21 08:20:33,114 [ WARNING] /mnt/beegfs/satra/zarrhdf/101233/sub-MITU01/ses-20210521h17m17s06/microscopy/sub-MITU01_run-1_sample-178_stain-LEC_chunk-1_spim.ngff: Ignoring unsupported symbolic link to directory
2022-01-21 08:20:33,114 [    INFO] Found 0 files to consider
2022-01-21 08:20:33,126 [    INFO] Logs saved in /home/satra/.cache/dandi-cli/log/20220121132004Z-1470910.log

yarikoptic added a commit that referenced this issue Jan 25, 2022: Support Zarr directories
jwodder added the zarr label Jan 25, 2022