xarray.Dataset should NOT be recognized as a valid tabular data type #3146

seisman · 2024-03-28T09:29:43Z

Currently, xarray.Dataset is recognized as a valid tabular data type in some places. For example:

Lines 108 to 109 in dbbc168

    
           if hasattr(data, "data_vars") and len(data.data_vars) < 3:  # xr.Dataset 
        
               raise GMTInvalidInput("data must provide x, y, and z columns.")

pygmt/pygmt/clib/session.py

Lines 1531 to 1547 in dbbc168

    
                   >>> from pygmt.helpers import GMTTempFile 
        
                   >>> import xarray as xr 
        
                   >>> data = xr.Dataset( 
        
                   ...     coords=dict(index=[0, 1, 2]), 
        
                   ...     data_vars=dict( 
        
                   ...         x=("index", [9, 8, 7]), 
        
                   ...         y=("index", [6, 5, 4]), 
        
                   ...         z=("index", [3, 2, 1]), 
        
                   ...     ), 
        
                   ... ) 
        
                   >>> with Session() as ses: 
        
                   ...     with ses.virtualfile_in(check_kind="vector", data=data) as fin: 
        
                   ...         # Send the output to a file so that we can read it 
        
                   ...         with GMTTempFile() as fout: 
        
                   ...             ses.call_module("info", fin + " ->" + fout.name) 
        
                   ...             print(fout.read().strip()) 
        
                   <vector memory>: N = 3 <7/9> <4/6> <1/3>

But I think it should NOT be like that. Here are the reasons.

1. xarray.Dataset is more like a collection of xarray.DataArrays, rather than a pandas.DataFrame:

As the official docs says:

A dataset resembles an in-memory representation of a NetCDF file, and consists of variables, coordinates and attributes which together form a self describing dataset.

Dataset implements the mapping interface with keys given by variable names and values given by DataArray objects for each variable name.

xarray.Dataset can represent tabular data, but it's more commonly used as a data structure to hold multiple xarray.DataArray objects.

2. It's unclear what/how data are passed.

Here is an example from the official documentation:

>>> import numpy as np
>>> import xarray as xr
>>> import pandas as pd
>>> np.random.seed(0)
>>> temperature = 15 + 8 * np.random.randn(2, 2, 3)
>>> precipitation = 10 * np.random.rand(2, 2, 3)
>>> lon = [[-99.83, -99.32], [-99.79, -99.23]]
>>> lat = [[42.25, 42.21], [42.63, 42.59]]
>>> time = pd.date_range("2014-09-06", periods=3)
>>> reference_time = pd.Timestamp("2014-09-05")

>>> ds = xr.Dataset(
...     data_vars=dict(
...         temperature=(["x", "y", "time"], temperature),
...         precipitation=(["x", "y", "time"], precipitation),
...     ),
...     coords=dict(
...         lon=(["x", "y"], lon),
...         lat=(["x", "y"], lat),
...         time=time,
...         reference_time=reference_time,
...     ),
...     attrs=dict(description="Weather related data."),
... )
>>> ds
<xarray.Dataset> Size: 288B
Dimensions:         (x: 2, y: 2, time: 3)
Coordinates:
    lon             (x, y) float64 32B -99.83 -99.32 -99.79 -99.23
    lat             (x, y) float64 32B 42.25 42.21 42.63 42.59
  * time            (time) datetime64[ns] 24B 2014-09-06 2014-09-07 2014-09-08
    reference_time  datetime64[ns] 8B 2014-09-05
Dimensions without coordinates: x, y
Data variables:
    temperature     (x, y, time) float64 96B 29.11 18.2 22.83 ... 16.15 26.63
    precipitation   (x, y, time) float64 96B 5.68 9.256 0.7104 ... 4.615 7.805
Attributes:
    description:  Weather related data.

Each data variable is a xarray.DataArray object:

>>> ds.temperature
<xarray.DataArray 'temperature' (x: 2, y: 2, time: 3)> Size: 96B
array([[[29.11241877, 18.20125767, 22.82990387],
        [32.92714559, 29.94046392,  7.18177696]],

       [[22.60070734, 13.78914233, 14.17424919],
        [18.28478802, 16.15234857, 26.63418806]]])
Coordinates:
    lon             (x, y) float64 32B -99.83 -99.32 -99.79 -99.23
    lat             (x, y) float64 32B 42.25 42.21 42.63 42.59
  * time            (time) datetime64[ns] 24B 2014-09-06 2014-09-07 2014-09-08
    reference_time  datetime64[ns] 8B 2014-09-05
Dimensions without coordinates: x, y

Then, in virtualfile_in, a list of multi-dimensional (3-D in this example) xarray.DataArray objects are passed to GMT modules which expects a list of 1-D arrays instead. It works without errors because in Session.put_vector, we pass the pointer of the 2-D array to the GMT C API function, but it likely won't work if the data is not C-contiguous (e.g., a slice of a dataset). So, the actual behavior is not well defined.

So, I think xarray.Dataset should not be recognized as a valid tabular data type, which not only makes more sense but also can simplify our codes/tests.

The text was updated successfully, but these errors were encountered:

weiji14 · 2024-03-29T00:52:09Z

For context, allowing xr.Dataset inputs was added in #619, and the background reason I wanted to enable this was because there are certain file formats like HDF5 that can be read directly into an xr.Dataset (via xr.open_dataset) but not into a pandas.DataFrame, and I wanted to be able to plot 1D arrays stored as xr.DataArray data variables in an xr.Dataset without needing to convert them to pandas as an intermediate step.

For example, with ICESat-2 ATL06 altimetry data, the data is stored as a HDF5 file, with 1D data variables like height for each altimetry measurement.

import xarray as xr

# https://n5eil01u.ecs.nsidc.org/ATLAS/ATL06.006/2018.10.14/ATL06_20181014001049_02350102_006_02.h5
ds: xr.Dataset = xr.open_dataset(
    "ATL06_20181014001049_02350102_006_02.h5", group="gt2l/land_ice_segments"
)
print(ds)

produces an xr.Dataset like:

<xarray.Dataset> Size: 320kB
Dimensions:                (delta_time: 7103)
Coordinates:
  * delta_time             (delta_time) datetime64[ns] 57kB 2018-10-14T00:17:...
    latitude               (delta_time) float64 57kB ...
    longitude              (delta_time) float64 57kB ...
Data variables:
    atl06_quality_summary  (delta_time) int8 7kB ...
    h_li                   (delta_time) float32 28kB ...
    h_li_sigma             (delta_time) float32 28kB ...
    segment_id             (delta_time) float64 57kB ...
    sigma_geo_h            (delta_time) float32 28kB ...
Attributes:
    Description:  The land_ice_height group contains the primary set of deriv...
    data_rate:    Data within this group are sparse.  Data values are provide...

I can make an along-track plot with PyGMT directly by passing in the xr.Dataset like so:

import pygmt
ds_height = ds[["segment_id", "h_li"]]  # along-track index and height

fig = pygmt.Figure()
fig.plot(
    data=ds_height, frame=["xaf+lsegment_id", "yaf+lHeight (m)"], style="c0.05c"
)
fig.show()

produces:

That said, I can see your point that we currently don't check whether the xr.DataArray variables are 1D arrays or 2D arrays, which results in undefined behaviour:

Then, in virtualfile_in, a list of multi-dimensional (3-D in this example) xarray.DataArray objects are passed to GMT modules which expects a list of 1-D arrays instead. It works without errors because in Session.put_vector, we pass the pointer of the 2-D array to the GMT C API function, but it likely won't work if the data is not C-contiguous (e.g., a slice of a dataset). So, the actual behavior is not well defined.

This sounds similar to the issue mentioned at #3086 where xyz2grd doesn't check if the inputs passed to the data parameter have the expected number of columns. Perhaps we need to have better checks on the input dimensions/shape before passing the data into virtualfile_from_vectors.

seisman · 2024-03-29T02:08:13Z

In your example, it's unclear to me (from a user's perspective) what data are passed to GMT. Users may expect to pass three columns (time_delta, segment_id and h_li, because segment_id and h_li are a function of time_delta), but actual only two columns (segment_id and h_li) are passed.

weiji14 · 2024-03-29T02:25:16Z

PyGMT is only able to read from an xr.Dataset's data_vars, not from the coords, so delta_time cannot actually be used or plotted in this case, unless that coordinate is turned into a data variable (using .reset_coords()), though that actually isn't possible in this case because delta_time is an index coordinate (denoted with the asterisk *).

We could theoretically apply .reset_coords() on 'longitude' and 'latitude' (since they are not index coords), and pass in 'longitude', 'latitude', 'h_li' to plot3d for example to make a 3D scatterplot.

seisman · 2024-03-29T02:58:31Z

I didn't mean we should pass the coordinates to GMT. I feel the behavior of passing an xr.Dataset to PyGMT is ambiguous (coordinates are passed or not) and may confuse users. Instead, passing a pandas.DataFrame or ndarray is more easier to understand.

weiji14 · 2024-03-29T03:10:59Z

Maybe we can add a section for xr.Dataset inputs under https://www.pygmt.org/v0.11.0/get_started/04_table_inputs.html, as a follow-up to #2722? Xref #1268.

seisman · 2024-03-29T03:17:00Z

Maybe we can add a section for xr.Dataset inputs under https://www.pygmt.org/v0.11.0/get_started/04_table_inputs.html, as a follow-up to #2722? Xref #1268.

OK. Also need to update the codes to check if the dataset only has 1-D variables.

seisman added the discussions Need more discussion before taking further actions label Mar 28, 2024

weiji14 mentioned this issue Mar 29, 2024

Add tutorial section on passing xr.Dataset with 1-D data_vars as input #3147

Draft

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

xarray.Dataset should NOT be recognized as a valid tabular data type #3146

xarray.Dataset should NOT be recognized as a valid tabular data type #3146

seisman commented Mar 28, 2024

weiji14 commented Mar 29, 2024

seisman commented Mar 29, 2024

weiji14 commented Mar 29, 2024 •

edited

Loading

seisman commented Mar 29, 2024

weiji14 commented Mar 29, 2024

seisman commented Mar 29, 2024

xarray.Dataset should NOT be recognized as a valid tabular data type #3146

xarray.Dataset should NOT be recognized as a valid tabular data type #3146

Comments

seisman commented Mar 28, 2024

weiji14 commented Mar 29, 2024

seisman commented Mar 29, 2024

weiji14 commented Mar 29, 2024 • edited Loading

seisman commented Mar 29, 2024

weiji14 commented Mar 29, 2024

seisman commented Mar 29, 2024

weiji14 commented Mar 29, 2024 •

edited

Loading