Bccaqv2 split bbox and grid point #60

davidcaron · 2020-01-10T16:48:20Z

Overview

Add possibility to subset multiple grid points.

Changes:

The subset_gridpoint process accepts a list of comma separated floats, instead of a single float for lat and lon (same for bccaqv2 grid point subset process)
deprecate (but don't remove) lat0 and lon0 in the subset_ensemble_BCCAQv2 process
added more tests, and refactored to use less mocks
pin pywps~=4.2.3

Comments:

I thought that providing a list of floats was more intuitive for the user of the WPS process, instead of asking for multiple lat and lon parameters that have to be in the exact order to respect the coordinate pairs.

(the dates were never used in production for this process)

SubsetGridPointBCCAQV2Process

use less mocks

separated list of floats

huard · 2020-01-13T19:52:47Z

finch/processes/wps_xsubsetpoint.py

+            for lon, lat in zip(longitudes, latitudes):
+                subset = subset_gridpoint(dataset, lon=lon, lat=lat, start_date=start, end_date=end)
+                subset = subset.expand_dims(["lat", "lon"])
+                output_ds = output_ds.combine_first(subset) if output_ds is not None else subset


The different sites are ordered along which dimension ?

Does it make more sense to have individual points with a multi-index coordinate of lon&lat together?
e.g. something like this:

index1 = pd.MultiIndex.from_arrays([longitudes, latitudes], names=['lon','lat']) output = xr.Dataset(coords={'point_id': index1, 'time': subset.time}, attrs=subset.attrs)

@davidcaron I sent you a bit of code via email last week with an exemple not sure if you recieved?

Zeitsperre · 2020-01-13T19:54:41Z

You might want to bump the version of the process (currently at 0.1).

huard · 2020-01-13T19:54:22Z

finch/processes/wps_xsubsetpoint_bccaqv2.py

+    def __init__(self):
+        inputs = [
+            LiteralInput(
+                "variable",


I suggest putting inputs that are being reused in multiple processes in wpsio.py.

I agree, but I will keep it as is until I remove the lat0 and lon0 for the subset_point process (we have to support both to update the process properly)

huard · 2020-01-13T19:56:05Z

tests/test_utils.py

@@ -64,6 +64,24 @@ def test_netcdf_to_csv_to_zip():
        n_files = 15
        assert len(z.infolist()) == n_files + n_calendar_types
        assert sum(1 for f in z.infolist() if f.filename.startswith("metadata")) == n_files
+        data_filename = [n for n in z.namelist() if 'metadata' not in n]
+        for filename in data_filename:


Could probably use pytest parameterized fixtures here.

davidcaron · 2020-01-13T21:01:43Z

Does it make more sense to have individual points with a multi-index coordinate of lon&lat together?

@tlogan2000 I wanted to avoid doing that... but maybe I'm wrong.

This process can be used to subset either a single cell, multiple grid cells, and it can output wither a csv or a netcdf. I wanted the outputs to be of the same format whether or not the user asked for a single grid cell or multiple.

So does it make sense for example to output a netcdf file for a single grid cell with this multi-index?

The different sites are ordered along which dimension ?

@huard I'm not sure what you mean by that... you mean the alignment of the datasets?

huard · 2020-01-13T21:14:38Z

@davidcaron The output is one single netCDF file with all requested points, correct ? Not multiple files.
In that case, the time series for each point are going to create a matrix (time, site_dimension), no ?
At the moment, my feeling (untested) is that you're creating a (time, lat, lon) matrix, but some of the lat-lon combinations will be empty.

Signature: xr.Dataset.combine_first(self, other:'Dataset') -> 'Dataset'
Docstring:
Combine two Datasets, default to data_vars of self.

The new coordinates follow the normal broadcasting and alignment rules
of ``join='outer'``.  Vacant cells in the expanded coordinates are
filled with np.nan.

davidcaron · 2020-01-13T21:22:17Z

Yes, to merge the grid cells, as it is actually, the single netcdf file will contain NaN cells, with lat and lon indices. The csv output can remove these NaN values.

This was my idea of what would be the best output, but you're the people actually using these datasets and outputs, so your opinions are better than mine.

huard · 2020-01-13T21:35:02Z

See example H.3 in http://cfconventions.org/cf-conventions/cf-conventions.html#discrete-sampling-geometries

tlogan2000 · 2020-01-14T13:31:31Z

At the moment, my feeling (untested) is that you're creating a (time, lat, lon) matrix, but some of the lat-lon combinations will be empty.

This is why I thought a multiindex coordinate name=='point_id' (or similar) would maybe work better. I'm not 100% sure either just my feeling

davidcaron · 2020-01-14T14:37:44Z

Ok, I'll go with region (or point_id? I'll let you guys decide) I should have something worked out today.

tlogan2000 · 2020-01-14T14:41:36Z

Ok, I'll go with region (or point_id? I'll let you guys decide) I should have something worked out today.

With respect to the multindex stuff I was talking about .... the only advantage is that the coordinate values of the point_id dimension become the lon/lat coordinates of the sites versus a simple range of integers (0 to n)... Maybe a nice to gave but not critical?

davidcaron · 2020-01-14T15:13:06Z

Hum... I think I'm running into pydata/xarray#1077 where you can't write a netcdf file to disk with a multi-index.

The error message is:

variable 'region' is a MultiIndex, which cannot yet be serialized to netCDF files (pydata/xarray#1077). Use reset_index() to convert MultiIndex levels into coordinate variables instead.

Would using coordinate variables be a good idea?

tlogan2000 · 2020-01-14T15:21:44Z

Ok. I didn't realize we couldn't save to netcdf with the multiIndex.

What does reset_index() do exactly?

davidcaron · 2020-01-14T15:29:10Z

Here is a test script:

import xarray as xr
import pandas as pd
from xclim.subset import subset_gridpoint

longitudes = -72.8, -72.7, -72.9
latitudes = 46.0, 46.1, 46.1

dataset = xr.open_dataset("tests/data/bccaqv2_subset_sample/tasmin_subset.nc")
dataset.attrs = {}  # cleaner output

subsets = []
for lon, lat in zip(longitudes, latitudes):
    subset = subset_gridpoint(dataset, lon=lon, lat=lat)
    subsets.append(subset)

concatenated = xr.concat(subsets, dim="region")
multi_index = pd.MultiIndex.from_arrays(
    [concatenated.lon, concatenated.lat], names=["lon", "lat"]
)
concatenated = concatenated.drop(["lon", "lat"])
output_ds = xr.Dataset(
    coords={"region": multi_index, "time": concatenated.time}, attrs=concatenated.attrs,
)
for d in concatenated.data_vars:
    output_ds[d] = concatenated[d]

print(output_ds)
output_ds = output_ds.reset_index(["region"])
print(output_ds)

Here is the dataset before using reset_index (can't be written to disk):

<xarray.Dataset>
Dimensions:  (region: 3, time: 100)
Coordinates:
  * region   (region) MultiIndex
  - lon      (region) float64 -72.79 -72.71 -72.88
  - lat      (region) float64 46.04 46.12 46.12
  * time     (time) object 1950-01-01 12:00:00 ... 1950-04-10 12:00:00
Data variables:
    tasmin   (region, time) float32 -21.693748 -27.193098 ... -4.1428237

And after calling reset_index:


<xarray.Dataset>
Dimensions:  (region: 3, time: 100)
Coordinates:
  * time     (time) object 1950-01-01 12:00:00 ... 1950-04-10 12:00:00
    lon      (region) float64 -72.79 -72.71 -72.88
    lat      (region) float64 46.04 46.12 46.12
Dimensions without coordinates: region
Data variables:
    tasmin   (region, time) float32 -21.693748 -27.193098 ... -4.1428237

tlogan2000 · 2020-01-14T15:30:09Z

Ok. I didn't realize we couldn't save to netcdf with the multiIndex.

What does reset_index() do exactly?

Looks like it simply brings the data back to the original xr.concat(dim='point_id') result... I suggest just forgetting about my MutiIndex comments for now.

nilshempelmann · 2020-01-14T16:12:16Z

We had some similar issues in flyingpigen bird-house/flyingpigeon#171.
Just a note from my side: if you are writing back to netCDF try to reproduce the same format that 'cdo' subsetting is producing to keep compatibillity

tlogan2000 · 2020-01-14T17:00:43Z

Here is a test script:

import xarray as xr
import pandas as pd
from xclim.subset import subset_gridpoint

longitudes = -72.8, -72.7, -72.9
latitudes = 46.0, 46.1, 46.1

dataset = xr.open_dataset("tests/data/bccaqv2_subset_sample/tasmin_subset.nc")
dataset.attrs = {}  # cleaner output

subsets = []
for lon, lat in zip(longitudes, latitudes):
    subset = subset_gridpoint(dataset, lon=lon, lat=lat)
    subsets.append(subset)

concatenated = xr.concat(subsets, dim="region")

By just stopping the script here : I get a ds concatenated that is identical to the final output_ds (i.e. after reset_index() )

nilshempelmann · 2020-01-14T17:18:13Z

Here is what we had played around in the early days:
https://github.com/bird-house/notebooks/blob/master/sandbox/cdo_ocgis.ipynb

and if you are subsetting over the 180 lon:
https://github.com/bird-house/notebooks/blob/master/sandbox/ocgis_subsettests.ipynb

davidcaron · 2020-01-14T18:46:31Z

So I think I'm done, here is where I'm at:

For any grid cell subset (single or multiple) the dimensions will look like this (same as shown previously):

<xarray.Dataset>
Dimensions:  (region: 3, time: 100)
Coordinates:
  * time     (time) object 1950-01-01 12:00:00 ... 1950-04-10 12:00:00
    lon      (region) float64 -72.79 -72.71 -72.88
    lat      (region) float64 46.04 46.12 46.12
Dimensions without coordinates: region
Data variables:
    tasmin   (region, time) float32 -21.693748 -27.193098 ... -4.1428237

For bounding box subset, the lon and lat dimensions will be kept, like in the source dataset.

And I fixed the csv output so that there is one line for each combination of lat-lon pair and timestamp.

@nilshempelmann Thank you for your input, could you tell me if the way I organized the data in the multiple grid subset is compatible with the cdo outputs? From what I could tell, I believe it does, but I'm not sure.

types should be strings

davidcaron · 2020-01-15T20:56:54Z

From my point of view, I would be ready to merge, I would need at least an approval.

finch/processes/wps_bccaqv2_heatwave.py

davidcaron · 2020-01-15T22:13:57Z

@huard Thanks, I'll merge and release tomorrow.

David Caron added 16 commits December 19, 2019 15:13

add start_date and end_date to bccaqv2 subset

805f544

(the dates were never used in production for this process)

split bccaqv2 bbox and grid point processes

8ef1ffa

deprecate lon0 and lat0 for ...

96646f9

SubsetGridPointBCCAQV2Process

add test for bccaqv2 boundingbox subset and ...

614260e

use less mocks

add exception message when request fails in tests

2ae93bc

read tests output in memory instead of writing to disk

f817ea4

change the point subset to accept a comma ...

e5253ef

separated list of floats

add test for deprecation of lat0 and lon0

107e6b5

extract monkeypatching in a single fixture

5d39de2

pin pywps~=4.2.3

81fe8e0

subset multiple grid cells

5aa7c03

keep global_attributes when merging datasets

58b99e2

fix dimensions for multiple grid points

446f43f

fix and test csv conversions

1e3e91b

skip online heat_wave test

b5156d3

flake8

a924510

davidcaron requested a review from huard January 10, 2020 16:48

David Caron added 6 commits January 10, 2020 12:06

typo and documentation

8acb667

speed up csv creation

253a4f0

docs

9071d14

explicitely close thread pool

623bb59

fix csv output dropna

63a132e

flake8

85f531a

davidcaron changed the title ~~Bccaqv2 split bbox and grid point~~ [WIP] Bccaqv2 split bbox and grid point Jan 13, 2020

huard reviewed Jan 13, 2020

View reviewed changes

David Caron added 2 commits January 14, 2020 12:10

fix multiple grid cells output and csv output

6ba2c34

formatting (black) and fix tests assertions

7cf9332

drop na values after dropping 'region' index

5c1aacc

David Caron added 5 commits January 14, 2020 13:48

bump xsubsetpoint version number

66c4a0e

extract common inputs to wpsio.py

affab20

clarify unit test

4980409

heatwave inherits from point subset, so the input...

3780ef7

types should be strings

remove unneeded optimization in heatwave process

e00401c

davidcaron changed the title ~~[WIP] Bccaqv2 split bbox and grid point~~ Bccaqv2 split bbox and grid point Jan 15, 2020

huard reviewed Jan 15, 2020

View reviewed changes

finch/processes/wps_bccaqv2_heatwave.py Outdated Show resolved Hide resolved

extract lat and lon to wpsio.py

4f7df80

huard approved these changes Jan 15, 2020

View reviewed changes

David Caron added 2 commits January 15, 2020 17:15

flake8

73617f3

Merge branch 'master' into bccaqv2-split-bbox-and-grid-point

ae632b5

davidcaron merged commit 4f23921 into master Jan 16, 2020

davidcaron deleted the bccaqv2-split-bbox-and-grid-point branch January 16, 2020 18:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bccaqv2 split bbox and grid point #60

Bccaqv2 split bbox and grid point #60

davidcaron commented Jan 10, 2020

huard Jan 13, 2020

tlogan2000 Jan 13, 2020

Zeitsperre commented Jan 13, 2020

huard Jan 13, 2020

davidcaron Jan 14, 2020

huard Jan 15, 2020

huard Jan 13, 2020

davidcaron commented Jan 13, 2020

huard commented Jan 13, 2020

davidcaron commented Jan 13, 2020

huard commented Jan 13, 2020

tlogan2000 commented Jan 14, 2020

davidcaron commented Jan 14, 2020

tlogan2000 commented Jan 14, 2020

davidcaron commented Jan 14, 2020 •

edited

Loading

tlogan2000 commented Jan 14, 2020

davidcaron commented Jan 14, 2020

tlogan2000 commented Jan 14, 2020

nilshempelmann commented Jan 14, 2020

tlogan2000 commented Jan 14, 2020

nilshempelmann commented Jan 14, 2020

davidcaron commented Jan 14, 2020

davidcaron commented Jan 15, 2020

davidcaron commented Jan 15, 2020

Bccaqv2 split bbox and grid point #60

Bccaqv2 split bbox and grid point #60

Conversation

davidcaron commented Jan 10, 2020

Overview

huard Jan 13, 2020

Choose a reason for hiding this comment

tlogan2000 Jan 13, 2020

Choose a reason for hiding this comment

Zeitsperre commented Jan 13, 2020

huard Jan 13, 2020

Choose a reason for hiding this comment

davidcaron Jan 14, 2020

Choose a reason for hiding this comment

huard Jan 15, 2020

Choose a reason for hiding this comment

huard Jan 13, 2020

Choose a reason for hiding this comment

davidcaron commented Jan 13, 2020

huard commented Jan 13, 2020

davidcaron commented Jan 13, 2020

huard commented Jan 13, 2020

tlogan2000 commented Jan 14, 2020

davidcaron commented Jan 14, 2020

tlogan2000 commented Jan 14, 2020

davidcaron commented Jan 14, 2020 • edited Loading

tlogan2000 commented Jan 14, 2020

davidcaron commented Jan 14, 2020

tlogan2000 commented Jan 14, 2020

nilshempelmann commented Jan 14, 2020

tlogan2000 commented Jan 14, 2020

nilshempelmann commented Jan 14, 2020

davidcaron commented Jan 14, 2020

davidcaron commented Jan 15, 2020

davidcaron commented Jan 15, 2020

davidcaron commented Jan 14, 2020 •

edited

Loading