CF flag_values check #792

Closed
daltonkell opened this issue Apr 8, 2020 · 2 comments

Comments

@daltonkell
Contributor

There seems to be a discrepancy in how the type of a variable's flag_values is reported when using xarray versus netCDF4-python. With xarray, I can set the attribute to a NumPy array with dtype |S1:

In [26]: ds2["depthflag"].attrs["flag_values"] = np.array([b'S', b'S'], dtype='|S1')

In [27]: ds2["depthflag"]
Out[27]: 
<xarray.DataArray 'depthflag' (station: 2)>
array([b'S', b'S'], dtype='|S1')
Coordinates:
  * station    (station) int32 39 41
    time       (station) datetime64[ns] ...
    longitude  (station) float32 ...
    latitude   (station) float32 ...
Attributes:
    long_name:      Bottom Depth Flag
    flag_values:    [b'S' b'S']
    flag_meanings:  Measured_at_Station Estimated_from_GTOPO30_Bathymetric_Da...

but when tested with the Compliance Checker, I get an error:

§3.5 Flags
* depthflag's flag_values must be an array of values not <class 'list'>

netCDF4-python is the API used to load the NetCDF file being tested, and when it reads this attribute back, the value appears to be converted to a plain list:

In [4]: ds.variables["depthflag"]                                                                                                                        
Out[4]: 
<class 'netCDF4._netCDF4.Variable'>
|S1 depthflag(station, string1)
    long_name: Bottom Depth Flag
    flag_values: ['S', 'G']
    flag_meanings: Measured_at_Station Estimated_from_GTOPO30_Bathymetric_Database
    coordinates: time latitude longitude
unlimited dimensions: 
current shape = (2, 1)
filling on, default _FillValue of  used

In [5]: getattr(ds.variables["depthflag"], "flag_values")                                                                                                
Out[5]: ['S', 'G']

In [6]: type(getattr(ds.variables["depthflag"], "flag_values"))                                                                                          
Out[6]: list

More investigation is needed into how the |S1 type is represented when it is read back, and a workaround may need to be implemented.
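To make the mismatch concrete without needing the file, here is a minimal sketch in plain NumPy. The values are copied from the netCDF4-python output above; `describe` is a hypothetical helper written for this comment, not part of the Compliance Checker or either library.

```python
import numpy as np

# What was assigned on the xarray side: a NumPy byte-string array.
as_written = np.array([b"S", b"G"], dtype="|S1")

# What netCDF4-python returned above: a plain list of str.
as_read = ["S", "G"]


def describe(value):
    """Hypothetical helper: report the Python type, NumPy dtype kind, and shape."""
    arr = np.asarray(value)
    return type(value).__name__, arr.dtype.kind, arr.shape


print(describe(as_written))  # ('ndarray', 'S', (2,))
print(describe(as_read))     # ('list', 'U', (2,))
```

Both values carry the same two flags, but a check that only accepts `np.ndarray` rejects the second one.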

Pinging @benjwadams, you might find this interesting.

@benjwadams
Contributor

Maybe not a compliance checker issue since we don't have any control over how data is generated, but interesting nonetheless.

@daltonkell
Contributor Author

EDIT

After a bit more investigation, I found this in the netCDF4-python code:

https://github.com/Unidata/netcdf4-python/blob/06e58422204cc77946fa21effd31ffb9421bd139/netCDF4/_netCDF4.pyx#L1560-L1577

Lines:

    if value_arr.dtype.char in ['S','U']:
        # force array of strings if array has multiple elements (issue #770)
        N = value_arr.size
        if N > 1: force_ncstring=True
        if not is_netcdf3 and force_ncstring and N > 1:
            string_ptrs = <char**>PyMem_Malloc(N * sizeof(char*))
            if not string_ptrs:
                raise MemoryError()
            try:
                strings = [_strencode(s) for s in value_arr.flat]
                for j in range(N):
                    if len(strings[j]) == 0:
                        strings[j] = _strencode('\x00')
                    string_ptrs[j] = strings[j]
                issue485_workaround(grp._grpid, varid, attname)
                ierr = nc_put_att_string(grp._grpid, varid, attname, N, string_ptrs)
            finally:
                PyMem_Free(string_ptrs)

Note the list comprehension, which builds a list of encoded strings from the attribute values, and the for-loop that assigns each one to the newly allocated string_ptrs array of char pointers. Next, nc_put_att_string() writes the attribute to the variable. That's defined here:

https://github.com/Unidata/netcdf-c/blob/e4003be502b196fe0e2a5a40140f0187cbffc2c6/libdispatch/dattput.c#L75-L83

Lines:

nc_put_att_string(int ncid, int varid, const char *name,
		  size_t len, const char** value)
{
    NC* ncp;
    int stat = NC_check_id(ncid, &ncp);
    if(stat != NC_NOERR) return stat;
    return ncp->dispatch->put_att(ncid, varid, name, NC_STRING,
				  len, (void*)value, NC_STRING);
}

The attribute is thus written with type NC_STRING and, as shown above, comes back as a Python list rather than a NumPy array with |S1 dtype, even though NumPy arrays are used to represent most other data types. We'll have to develop a workaround for the type checks where this comes up.
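One possible shape for that workaround, as a rough sketch only: normalize whatever comes back into a 1-D NumPy array before checking it. `normalize_flag_values` is a hypothetical name, not an existing Compliance Checker function, and decoding byte strings as UTF-8 is an assumption.

```python
import numpy as np


def normalize_flag_values(flag_values):
    """Hypothetical helper: coerce flag_values to a 1-D NumPy array.

    Accepts a NumPy array, a plain list (as netCDF4-python returns for
    multi-element string attributes), or a scalar.
    """
    arr = np.atleast_1d(np.asarray(flag_values))
    if arr.dtype.kind == "S":
        # Assumption: byte strings are UTF-8; decode so both forms compare equal.
        arr = np.char.decode(arr, "utf-8")
    return arr


from_xarray = normalize_flag_values(np.array([b"S", b"G"], dtype="|S1"))
from_netcdf4 = normalize_flag_values(["S", "G"])
print(from_xarray.tolist() == from_netcdf4.tolist())  # True
```

With this in front of the type check, the check can keep requiring an array while no longer depending on which container the reading library happened to return.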
