
Allow .chunk for datasets with duplicated dimension names, e.g. Sentinel-3 OLCI files #8579

Closed · fwfichtner opened this issue Jan 2, 2024 · 6 comments · Fixed by #9099
Labels: contrib-help-wanted, topic-chunked-arrays (Managing different chunked backends, e.g. dask)


fwfichtner commented Jan 2, 2024

What is your issue?

Sentinel-3 OLCI files (e.g. taken from the Copernicus Data Space Ecosystem) come with duplicate dimensions, which causes xarray 2023.12.0 to raise an error after #8491. Specifically, instrument_data.nc can no longer be opened:

import xarray as xr

dataset = xr.open_dataset("instrument_data.nc", decode_cf=True, mask_and_scale=True, chunks="auto")

This results in the now-expected ValueError:

ValueError: This function cannot handle duplicate dimensions, but dimensions {'bands'} appear more than once on this object's dims: ('bands', 'bands')
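
For reference, the error can be reproduced without the OLCI file. A minimal sketch (the Dataset construction only warns, while chunking raises):

import numpy as np
import xarray as xr

# stand-in for relative_spectral_covariance(bands, bands); construction
# emits a duplicate-dimension warning but succeeds
ds = xr.Dataset({"cov": (("bands", "bands"), np.eye(21))})
ds.chunk("auto")  # raises the same ValueError on xarray 2023.12.0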

ncdump -h prints:

netcdf instrument_data {
dimensions:
	bands = 21 ;
	columns = 4865 ;
	detectors = 3700 ;
	rows = 1953 ;
variables:
	float FWHM(bands, detectors) ;
		FWHM:_FillValue = -1.f ;
		FWHM:ancillary_variables = "detector_index lambda0" ;
		FWHM:long_name = "OLCI bandwidth (Full Widths at Half Maximum)" ;
		FWHM:units = "nm" ;
		FWHM:valid_max = 650.f ;
		FWHM:valid_min = 0.f ;
	short detector_index(rows, columns) ;
		detector_index:_FillValue = -1s ;
		detector_index:coordinates = "time_stamp altitude latitude longitude" ;
		detector_index:long_name = "Detector index" ;
		detector_index:valid_max = 3699s ;
		detector_index:valid_min = 0s ;
	byte frame_offset(rows, columns) ;
		frame_offset:_FillValue = -128b ;
		frame_offset:long_name = "Re-sampling along-track frame offset" ;
		frame_offset:valid_max = 15b ;
		frame_offset:valid_min = -15b ;
	float lambda0(bands, detectors) ;
		lambda0:_FillValue = -1.f ;
		lambda0:ancillary_variables = "detector_index FWHM" ;
		lambda0:long_name = "OLCI characterised central wavelength" ;
		lambda0:units = "nm" ;
		lambda0:valid_max = 1040.f ;
		lambda0:valid_min = 390.f ;
	float relative_spectral_covariance(bands, bands) ;
		relative_spectral_covariance:_FillValue = NaNf ;
		relative_spectral_covariance:ancillary_variables = "lambda0" ;
		relative_spectral_covariance:long_name = "Relative spectral covariance matrix" ;
	float solar_flux(bands, detectors) ;
		solar_flux:_FillValue = -1.f ;
		solar_flux:ancillary_variables = "detector_index lambda0" ;
		solar_flux:long_name = "In-band solar irradiance, seasonally corrected" ;
		solar_flux:units = "mW.m-2.nm-1" ;
		solar_flux:valid_max = 2500.f ;
		solar_flux:valid_min = 500.f ;

// global attributes:
		:absolute_orbit_number = 29437U ;
		:ac_subsampling_factor = 64US ;
		:al_subsampling_factor = 1US ;
		:comment = " " ;
		:contact = "[email protected]" ;
		:creation_time = "2023-12-20T07:20:24Z" ;
		:history = "  2023-12-20T07:20:24Z: PUGCoreProcessor JobOrder.3302865.xml" ;
		:institution = "PS2" ;
		:netCDF_version = "4.2 of Jan 13 2023 10:05:23 $" ;
		:processing_baseline = "OL__L1_.003.03.01" ;
		:product_name = "S3B_OL_1_EFR____20231220T045944_20231220T050110_20231220T072024_0085_087_290_1980_PS2_O_NR_003.SEN3" ;
		:references = "S3IPF PDS 004.1 - i2r6 - Product Data Format Specification - OLCI Level 1, S3IPF PDS 002 - i1r8 - Product Data Format Specification - Product Structures, S3IPF DPM 002 - i2r9 - Detailed Processing Model - OLCI Level 1" ;
		:resolution = "[ 270 294 ]" ;
		:source = "IPF-OL-1-EO 06.17" ;
		:start_time = "2023-12-20T04:59:43.719978Z" ;
		:stop_time = "2023-12-20T05:01:09.611725Z" ;
		:title = "OLCI Level 1b Product, Instrument Data Set" ;
}

The relative_spectral_covariance variable has duplicate dimensions. What do you suggest doing in such cases?

I guess this is related to #1378.

fwfichtner added the needs triage label on Jan 2, 2024

keewis (Collaborator) commented Jan 2, 2024

You should have received a warning when opening the file, with instructions on what to do (see also the issue you referenced):

In [5]: import xarray as xr
   ...: 
   ...: ds = xr.Dataset({"a": (("x", "x"), [[0, 1], [2, 3]])})
   ...: ds
.../xarray/namedarray/core.py:487: UserWarning: Duplicate dimension names present: dimensions {'x'} appear more than once in dims=('x', 'x'). We do not yet support duplicate dimension names, but we do allow initial construction of the object. We recommend you rename the dims immediately to become distinct, as most xarray functionality is likely to fail silently if you do not. To rename the dimensions you will need to set the ``.dims`` attribute of each variable, ``e.g. var.dims=('x0', 'x1')``.
  warnings.warn(
Out[5]: 
<xarray.Dataset>
Dimensions:  (x: 2)
Dimensions without coordinates: x
Data variables:
    a        (x, x) int64 0 1 2 3

The warning itself is not as helpful for duplicated dimensions on a variable within a dataset, though, since for DataArray objects the dimensions are not mutable. Instead, we can do the operation directly on the variable:

In [6]: ds.variables["a"].dims = ("x0", "x1")
   ...: ds
Out[6]: 
<xarray.Dataset>
Dimensions:  (x: 2)
Dimensions without coordinates: x
Data variables:
    a        (x0, x1) int64 0 1 2 3
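
As a sketch, the same idea can be wrapped in a small helper (hypothetical, not part of xarray) that makes duplicated dims distinct across a whole dataset by suffixing an index, relying on the settable .dims attribute mentioned in the warning:

import xarray as xr

def dedupe_dims(ds: xr.Dataset) -> xr.Dataset:
    # rename in place, e.g. ("x", "x") -> ("x0", "x1"); assumes the
    # suffixed names do not collide with existing dimensions
    for var in ds.variables.values():
        if len(set(var.dims)) < len(var.dims):
            var.dims = tuple(
                f"{d}{i}" if var.dims.count(d) > 1 else d
                for i, d in enumerate(var.dims)
            )
    return ds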

fwfichtner (Author) commented

Alright, thanks! So in this case the chunking fails unless the dimensions are renamed. The solution would therefore be something like:

# open without dask chunking first, then rename the duplicated dims
ds = xr.open_dataset("instrument_data.nc", decode_cf=True, mask_and_scale=True)
ds.variables["relative_spectral_covariance"].dims = ("x0", "x1")
ds = ds.chunk(chunks="auto")  # .chunk returns a new Dataset
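
Alternatively, if the covariance matrix is not needed, open_dataset's drop_variables argument can skip the offending variable so that chunks="auto" works at open time (a sketch, untested on this file):

ds = xr.open_dataset(
    "instrument_data.nc",
    decode_cf=True,
    mask_and_scale=True,
    drop_variables=["relative_spectral_covariance"],  # skip the (bands, bands) variable
    chunks="auto",
)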

djhoese (Contributor) commented Jan 2, 2024

So am I reading this correctly that there is no way to work around this if we want to use open_dataset with dask chunking (e.g. chunks="auto")? There is no real choice but to accept the performance penalty, right?

dcherian (Contributor) commented Jan 2, 2024

I think we can enable .chunk to handle duplicated dimensions. There's only one unambiguous interpretation IIUC: a chunk size given for a dimension name applies to every axis that carries that name. And clearly there's a use case for just opening files successfully.
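
To illustrate, a minimal sketch of that mapping (not xarray's actual implementation; plain dask is used to show the resulting chunk grid):

import dask.array as da
import numpy as np

# a (21, 21) variable with dims ("bands", "bands") and chunks={"bands": 10}:
# the same chunk size must apply to every axis named "bands"
dims = ("bands", "bands")
chunks_by_name = {"bands": 10}
positional = tuple(chunks_by_name.get(d, -1) for d in dims)  # (10, 10); -1 keeps an axis whole

arr = da.from_array(np.zeros((21, 21)), chunks=positional)
print(arr.chunks)  # ((10, 10, 1), (10, 10, 1))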

dcherian reopened this on Jan 2, 2024
dcherian changed the title from "Sentinel-3 OLCI files come with now disallowed duplicate dimensions" to "Allow .chunk for datasets with duplicated dimension names, e.g. Sentinel-3 OLCI files" on Jan 2, 2024
max-sixty added the topic-chunked-arrays label and removed the needs triage label on Feb 26, 2024
