Process killed while trying to access Zarr dataset on S3 #2594

Open
abhibaruah opened this issue Jan 24, 2023 · 6 comments

@abhibaruah

NetCDF version: v4.9.0
HDF5 version: 1.10.8
OS: Linux

I am trying to access the zarr datasets here (https://power-analysis-ready-datastore.s3.us-west-2.amazonaws.com/index.html) and here (https://hrrrzarr.s3.amazonaws.com/index.html) using netcdf.open. I am seeing two issues here:

  1. If I only append '#mode=zarr' or '#mode=nczarr,zarr' without specifying the driver to be used, the process stalls for a bit and then gets killed. This happens even if I build netCDF without NCZarr support.

  2. The process also aborts if I try to open the dataset here ("https://hrrrzarr.s3.us-west-1.amazonaws.com/prs/20180712/20180712_12z_anl.zarr#mode=zarr,s3")
    The error message is as follows:
    "netcdf/libnczarr/zclose.c:228: zclose_type: Assertion `type && type->format_type_info != NULL' failed.
    Abort
    "
    Please find my repro code below:

#include <stdio.h>
#include <string.h>
#include <netcdf.h>

// ISSUE 1
#define FILENAME "https://power-analysis-ready-datastore.s3.us-west-2.amazonaws.com/power_901_annual_meteorology_utc.zarr/#mode=zarr"
// ISSUE 2
//#define FILENAME "https://hrrrzarr.s3.us-west-1.amazonaws.com/grid/HRRR_chunk_index.zarr#mode=zarr,s3"


int main()
{
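    int ncid1 = -1;  // stays -1 if nc_open fails
    int status;

    // Open the dataset read-only
    status = nc_open(FILENAME, NC_NOWRITE, &ncid1);
    printf("status code after open = %d (%s)\n", status, nc_strerror(status));

    // Close the dataset; this returns an error code if the open above failed
    status = nc_close(ncid1);
    printf("status code after close = %d (%s)\n", status, nc_strerror(status));

    printf("End of test.\n\n");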

    return 0;
}
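
The difference between the two FILENAME values above is whether the `mode=` list in the URL fragment names a storage driver. A minimal Python sketch of that check follows. It only inspects the URL string and does not call netCDF. The driver names in `STORAGE_DRIVERS` and the helper names `parse_modes` and `has_storage_driver` are assumptions made for illustration, not part of the netCDF-C API.

```python
from urllib.parse import urlsplit

# Assumption: names that select a storage driver in the "mode=" list.
STORAGE_DRIVERS = {"file", "zip", "s3"}


def parse_modes(url):
    """Return the comma-separated values of the 'mode' key in the URL fragment."""
    fragment = urlsplit(url).fragment
    for pair in fragment.split("&"):
        key, _, value = pair.partition("=")
        if key == "mode":
            return [m for m in value.split(",") if m]
    return []


def has_storage_driver(url):
    """True if the fragment names one of the assumed storage drivers."""
    return any(mode in STORAGE_DRIVERS for mode in parse_modes(url))


# The two URLs from the repro code above.
issue1 = "https://power-analysis-ready-datastore.s3.us-west-2.amazonaws.com/power_901_annual_meteorology_utc.zarr/#mode=zarr"
issue2 = "https://hrrrzarr.s3.us-west-1.amazonaws.com/grid/HRRR_chunk_index.zarr#mode=zarr,s3"

print(parse_modes(issue1), has_storage_driver(issue1))
print(parse_modes(issue2), has_storage_driver(issue2))
```

A check like this could run before calling nc_open, so that a URL without an explicit driver is rejected up front on library versions that stall on it.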

@WardF
Member

WardF commented Jan 24, 2023

Pinging @DennisHeimbigner

@DennisHeimbigner
Collaborator

  1. I have a branch that attempts to infer the storage type for existing files.
    I have not yet had the time to test it and publish the PR.
    But in any case, a missing format should provide a better error report.
  2. This is either an error on my part or a malformed Zarr file. I will investigate.

@DennisHeimbigner
Collaborator

OK, I have some information.
Problem #1 was fixed at some point before 4.9.1, in the sense that an error is now reported instead of the process being killed.
Problem #2 is a bit more complex. The chunk_id array, for example, uses the dtype "|O",
where O means that the variable's values are pickled (in the Python sense), i.e. serialized Python objects.
We currently do not support such values, and they are not part of the Zarr specification, so technically
they are illegal.
As with #1, this now reports an error rather than aborting. It might be better to mark such
variables as unreadable so that at least the readable parts of the file can be accessed.
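
One way to picture the "mark as unreadable" idea is a pass over the Zarr V2 array metadata that separates object-dtype arrays from the rest. The Python sketch below is illustrative only and is not how netCDF-C implements anything. The in-memory `store` dictionary, the `temperature` array, the shapes, and the function name `classify_arrays` are all hypothetical. Only the `.zarray` key name and its `dtype` field come from the Zarr V2 layout.

```python
import json

# Hypothetical stand-in for a Zarr V2 store: key -> JSON text.
# Array names other than chunk_id, and all shapes, are invented.
store = {
    ".zgroup": json.dumps({"zarr_format": 2}),
    "chunk_id/.zarray": json.dumps({"zarr_format": 2, "dtype": "|O", "shape": [96, 96]}),
    "temperature/.zarray": json.dumps({"zarr_format": 2, "dtype": "<f4", "shape": [96, 96]}),
}


def classify_arrays(store):
    """Split array names into (readable, unreadable) based on their dtype."""
    readable, unreadable = [], []
    for key in sorted(store):
        if not key.endswith("/.zarray"):
            continue  # skip group metadata and chunk keys
        name = key[: -len("/.zarray")]
        dtype = json.loads(store[key]).get("dtype")
        # Structured dtypes are lists; only plain strings are checked here.
        if isinstance(dtype, str) and dtype.lstrip("|<>=") == "O":
            unreadable.append(name)  # object dtype, e.g. pickled values
        else:
            readable.append(name)
    return readable, unreadable


readable, unreadable = classify_arrays(store)
print("readable:", readable)
print("unreadable:", unreadable)
```

With this split, a reader could expose the arrays in `readable` and report the ones in `unreadable` by name instead of failing the whole open.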

@abhibaruah
Author

Thank you for taking a look.
Is there any way to identify the fix for (1) so that we can patch the library on our end?
When is v4.9.1 of netCDF-C scheduled to be released?

For (2), I tried reading the same file using Zarr-Python, and it seemed to work fine.

import s3fs
import zarr
url = "s3://hrrrzarr/prs/20180712/20180712_12z_anl.zarr"
fs = s3fs.S3FileSystem(anon=True)
store = zarr.open(s3fs.S3Map(url, s3=fs))
print(store.info_items())

"Perhaps better would be to mark such
variables as unreadable so that at least the readable parts of the file can be accessed"
Was this comment in reference to (2)?
Is this something that would be implemented for v4.9.1 of netCDF-C?

@DennisHeimbigner
Collaborator

4.9.1 should be out within a week [Ward?].
My comment should have been marked as "note to self", but it does refer to #2.
The file is readable by at least some Python Zarr implementations because
they can interpret the non-standard "O" dtype. But again, this is not legal under the Zarr V2 specification.

@WardF
Member

WardF commented Jan 25, 2023

I'm working on v4.9.1 as we speak; as is always the case, I wanted it out weeks ago, I would like it to be out today or tomorrow, and I'm hopeful it will be by the end of the week. We'll see what other little things crop up in the meantime.
