Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bccaqv2 split bbox and grid point #60

Merged
merged 33 commits into from
Jan 16, 2020

Conversation

davidcaron
Copy link
Collaborator

Overview

Add possibility to subset multiple grid points.

Changes:

  • The subset_gridpoint process accepts a list of comma separated floats, instead of a single float for lat and lon (same for bccaqv2 grid point subset process)
  • deprecate (but don't remove) lat0 and lon0 in the subset_ensemble_BCCAQv2 process
  • added more tests, and refactored to use less mocks
  • pin pywps~=4.2.3

Comments:

I thought that providing a list of floats was more intuitive for the user of the WPS process, instead of asking for multiple lat and lon parameters that have to be in the exact order to respect the coordinate pairs.

@davidcaron davidcaron requested a review from huard January 10, 2020 16:48
@davidcaron davidcaron changed the title Bccaqv2 split bbox and grid point [WIP] Bccaqv2 split bbox and grid point Jan 13, 2020
for lon, lat in zip(longitudes, latitudes):
subset = subset_gridpoint(dataset, lon=lon, lat=lat, start_date=start, end_date=end)
subset = subset.expand_dims(["lat", "lon"])
output_ds = output_ds.combine_first(subset) if output_ds is not None else subset
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The different sites are ordered along which dimension ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make more sense to have individual points with a multi-index coordinate of lon&lat together?
e.g. something like this:

index1 = pd.MultiIndex.from_arrays([longitudes, latitudes], names=['lon','lat'])
output = xr.Dataset(coords={'point_id': index1, 'time': subset.time}, attrs=subset.attrs)

@davidcaron I sent you a bit of code via email last week with an exemple not sure if you recieved?

@Zeitsperre
Copy link
Collaborator

You might want to bump the version of the process (currently at 0.1).

def __init__(self):
inputs = [
LiteralInput(
"variable",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest putting inputs that are being reused in multiple processes in wpsio.py.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, but I will keep it as is until I remove the lat0 and lon0 for the subset_point process (we have to support both to update the process properly)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@@ -64,6 +64,24 @@ def test_netcdf_to_csv_to_zip():
n_files = 15
assert len(z.infolist()) == n_files + n_calendar_types
assert sum(1 for f in z.infolist() if f.filename.startswith("metadata")) == n_files
data_filename = [n for n in z.namelist() if 'metadata' not in n]
for filename in data_filename:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could probably use pytest parameterized fixtures here.

@davidcaron
Copy link
Collaborator Author

Does it make more sense to have individual points with a multi-index coordinate of lon&lat together?

@tlogan2000 I wanted to avoid doing that... but maybe I'm wrong.

This process can be used to subset either a single cell, multiple grid cells, and it can output wither a csv or a netcdf. I wanted the outputs to be of the same format whether or not the user asked for a single grid cell or multiple.

So does it make sense for example to output a netcdf file for a single grid cell with this multi-index?

The different sites are ordered along which dimension ?

@huard I'm not sure what you mean by that... you mean the alignment of the datasets?

@huard
Copy link
Collaborator

huard commented Jan 13, 2020

@davidcaron The output is one single netCDF file with all requested points, correct ? Not multiple files.
In that case, the time series for each point are going to create a matrix (time, site_dimension), no ?
At the moment, my feeling (untested) is that you're creating a (time, lat, lon) matrix, but some of the lat-lon combinations will be empty.

Signature: xr.Dataset.combine_first(self, other:'Dataset') -> 'Dataset'
Docstring:
Combine two Datasets, default to data_vars of self.

The new coordinates follow the normal broadcasting and alignment rules
of ``join='outer'``.  Vacant cells in the expanded coordinates are
filled with np.nan.

@davidcaron
Copy link
Collaborator Author

Yes, to merge the grid cells, as it is actually, the single netcdf file will contain NaN cells, with lat and lon indices. The csv output can remove these NaN values.

This was my idea of what would be the best output, but you're the people actually using these datasets and outputs, so your opinions are better than mine.

@huard
Copy link
Collaborator

huard commented Jan 13, 2020

@tlogan2000
Copy link
Collaborator

At the moment, my feeling (untested) is that you're creating a (time, lat, lon) matrix, but some of the lat-lon combinations will be empty.

This is why I thought a multiindex coordinate name=='point_id' (or similar) would maybe work better. I'm not 100% sure either just my feeling

@davidcaron
Copy link
Collaborator Author

Ok, I'll go with region (or point_id? I'll let you guys decide) I should have something worked out today.

@tlogan2000
Copy link
Collaborator

Ok, I'll go with region (or point_id? I'll let you guys decide) I should have something worked out today.

With respect to the multindex stuff I was talking about .... the only advantage is that the coordinate values of the point_id dimension become the lon/lat coordinates of the sites versus a simple range of integers (0 to n)... Maybe a nice to gave but not critical?

@davidcaron
Copy link
Collaborator Author

davidcaron commented Jan 14, 2020

Hum... I think I'm running into pydata/xarray#1077 where you can't write a netcdf file to disk with a multi-index.

The error message is:

variable 'region' is a MultiIndex, which cannot yet be serialized to netCDF files (pydata/xarray#1077). Use reset_index() to convert MultiIndex levels into coordinate variables instead.

Would using coordinate variables be a good idea?

@tlogan2000
Copy link
Collaborator

Ok. I didn't realize we couldn't save to netcdf with the multiIndex.

What does reset_index() do exactly?

@davidcaron
Copy link
Collaborator Author

Here is a test script:

import xarray as xr
import pandas as pd
from xclim.subset import subset_gridpoint

longitudes = -72.8, -72.7, -72.9
latitudes = 46.0, 46.1, 46.1

dataset = xr.open_dataset("tests/data/bccaqv2_subset_sample/tasmin_subset.nc")
dataset.attrs = {}  # cleaner output

subsets = []
for lon, lat in zip(longitudes, latitudes):
    subset = subset_gridpoint(dataset, lon=lon, lat=lat)
    subsets.append(subset)

concatenated = xr.concat(subsets, dim="region")
multi_index = pd.MultiIndex.from_arrays(
    [concatenated.lon, concatenated.lat], names=["lon", "lat"]
)
concatenated = concatenated.drop(["lon", "lat"])
output_ds = xr.Dataset(
    coords={"region": multi_index, "time": concatenated.time}, attrs=concatenated.attrs,
)
for d in concatenated.data_vars:
    output_ds[d] = concatenated[d]

print(output_ds)
output_ds = output_ds.reset_index(["region"])
print(output_ds)

Here is the dataset before using reset_index (can't be written to disk):

<xarray.Dataset>
Dimensions:  (region: 3, time: 100)
Coordinates:
  * region   (region) MultiIndex
  - lon      (region) float64 -72.79 -72.71 -72.88
  - lat      (region) float64 46.04 46.12 46.12
  * time     (time) object 1950-01-01 12:00:00 ... 1950-04-10 12:00:00
Data variables:
    tasmin   (region, time) float32 -21.693748 -27.193098 ... -4.1428237

And after calling reset_index:


<xarray.Dataset>
Dimensions:  (region: 3, time: 100)
Coordinates:
  * time     (time) object 1950-01-01 12:00:00 ... 1950-04-10 12:00:00
    lon      (region) float64 -72.79 -72.71 -72.88
    lat      (region) float64 46.04 46.12 46.12
Dimensions without coordinates: region
Data variables:
    tasmin   (region, time) float32 -21.693748 -27.193098 ... -4.1428237

@tlogan2000
Copy link
Collaborator

Ok. I didn't realize we couldn't save to netcdf with the multiIndex.

What does reset_index() do exactly?

Looks like it simply brings the data back to the original xr.concat(dim='point_id') result... I suggest just forgetting about my MutiIndex comments for now.

@nilshempelmann
Copy link
Member

We had some similar issues in flyingpigen bird-house/flyingpigeon#171.
Just a note from my side: if you are writing back to netCDF try to reproduce the same format that 'cdo' subsetting is producing to keep compatibillity

@tlogan2000
Copy link
Collaborator

Here is a test script:

import xarray as xr
import pandas as pd
from xclim.subset import subset_gridpoint

longitudes = -72.8, -72.7, -72.9
latitudes = 46.0, 46.1, 46.1

dataset = xr.open_dataset("tests/data/bccaqv2_subset_sample/tasmin_subset.nc")
dataset.attrs = {}  # cleaner output

subsets = []
for lon, lat in zip(longitudes, latitudes):
    subset = subset_gridpoint(dataset, lon=lon, lat=lat)
    subsets.append(subset)

concatenated = xr.concat(subsets, dim="region")

By just stopping the script here : I get a ds concatenated that is identical to the final output_ds (i.e. after reset_index() )

@nilshempelmann
Copy link
Member

@davidcaron
Copy link
Collaborator Author

So I think I'm done, here is where I'm at:

For any grid cell subset (single or multiple) the dimensions will look like this (same as shown previously):

<xarray.Dataset>
Dimensions:  (region: 3, time: 100)
Coordinates:
  * time     (time) object 1950-01-01 12:00:00 ... 1950-04-10 12:00:00
    lon      (region) float64 -72.79 -72.71 -72.88
    lat      (region) float64 46.04 46.12 46.12
Dimensions without coordinates: region
Data variables:
    tasmin   (region, time) float32 -21.693748 -27.193098 ... -4.1428237

For bounding box subset, the lon and lat dimensions will be kept, like in the source dataset.

And I fixed the csv output so that there is one line for each combination of lat-lon pair and timestamp.

@nilshempelmann Thank you for your input, could you tell me if the way I organized the data in the multiple grid subset is compatible with the cdo outputs? From what I could tell, I believe it does, but I'm not sure.

@davidcaron davidcaron changed the title [WIP] Bccaqv2 split bbox and grid point Bccaqv2 split bbox and grid point Jan 15, 2020
@davidcaron
Copy link
Collaborator Author

From my point of view, I would be ready to merge, I would need at least an approval.

@davidcaron
Copy link
Collaborator Author

@huard Thanks, I'll merge and release tomorrow.

@davidcaron davidcaron merged commit 4f23921 into master Jan 16, 2020
@davidcaron davidcaron deleted the bccaqv2-split-bbox-and-grid-point branch January 16, 2020 18:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants