Bccaqv2 split bbox and grid point #60
(the dates were never used in production for this process)
SubsetGridPointBCCAQV2Process
use less mocks
separated list of floats
finch/processes/wps_xsubsetpoint.py
Outdated
for lon, lat in zip(longitudes, latitudes):
    subset = subset_gridpoint(dataset, lon=lon, lat=lat, start_date=start, end_date=end)
    subset = subset.expand_dims(["lat", "lon"])
    output_ds = output_ds.combine_first(subset) if output_ds is not None else subset
Along which dimension are the different sites ordered?
Does it make more sense to have individual points with a multi-index coordinate of lon&lat together?
e.g. something like this:
index1 = pd.MultiIndex.from_arrays([longitudes, latitudes], names=['lon','lat'])
output = xr.Dataset(coords={'point_id': index1, 'time': subset.time}, attrs=subset.attrs)
@davidcaron I sent you a bit of code via email last week with an example; not sure if you received it?
You might want to bump the version of the process (currently at
def __init__(self):
    inputs = [
        LiteralInput(
            "variable",
I suggest putting inputs that are being reused in multiple processes in wpsio.py.
I agree, but I will keep it as is until I remove the lat0 and lon0 for the subset_point process (we have to support both to update the process properly)
ok
tests/test_utils.py
Outdated
@@ -64,6 +64,24 @@ def test_netcdf_to_csv_to_zip():
    n_files = 15
    assert len(z.infolist()) == n_files + n_calendar_types
    assert sum(1 for f in z.infolist() if f.filename.startswith("metadata")) == n_files
    data_filename = [n for n in z.namelist() if 'metadata' not in n]
    for filename in data_filename:
Could probably use pytest parameterized fixtures here.
@tlogan2000 I wanted to avoid doing that... but maybe I'm wrong. This process can be used to subset either a single grid cell or multiple grid cells, and it can output either a csv or a netcdf. I wanted the outputs to have the same format whether the user asked for a single grid cell or multiple. So does it make sense, for example, to output a netcdf file for a single grid cell with this multi-index?
@huard I'm not sure what you mean by that... do you mean the alignment of the datasets?
@davidcaron The output is one single netCDF file with all requested points, correct? Not multiple files.
Yes. To merge the grid cells as it is done currently, the single netcdf file will contain NaN cells, with lat and lon indices. The csv output can remove these NaN values. This was my idea of what would be the best output, but you're the people actually using these datasets and outputs, so your opinions are better than mine.
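To illustrate why merging individual points onto shared lat and lon indices produces NaN cells, here is a minimal pure-Python sketch. It does not use xarray or the process code; the point values are made up, and the coordinates are the ones used in the test script later in this thread.

```python
import math

# Example points (coordinates from the test script in this thread).
longitudes = [-72.8, -72.7, -72.9]
latitudes = [46.0, 46.1, 46.1]
values = [1.0, 2.0, 3.0]  # made-up values, one per point

# A lat/lon grid needs one cell per combination of unique coordinates.
grid_lons = sorted(set(longitudes))
grid_lats = sorted(set(latitudes))
grid = {(lon, lat): math.nan for lon in grid_lons for lat in grid_lats}

# Only the requested points receive a value; every other cell stays NaN.
for lon, lat, value in zip(longitudes, latitudes, values):
    grid[(lon, lat)] = value

n_missing = sum(1 for v in grid.values() if math.isnan(v))
print(f"{len(grid)} cells, {n_missing} are NaN")

# Dropping the NaN cells (as the csv output can) leaves one row per point.
rows = [(lon, lat, v) for (lon, lat), v in grid.items() if not math.isnan(v)]
```

With three points spread over three longitudes and two latitudes, half of the grid is padding, which is the motivation for a single point dimension discussed in the following comments.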
This is why I thought a multi-index coordinate named 'point_id' (or similar) would maybe work better. I'm not 100% sure either, it's just my feeling.
Ok, I'll go with region (or point_id? I'll let you guys decide). I should have something worked out today.
With respect to the multi-index stuff I was talking about... the only advantage is that the coordinate values of the point_id dimension become the lon/lat coordinates of the sites instead of a simple range of integers (0 to n)... Maybe a nice-to-have but not critical?
Hum... I think I'm running into pydata/xarray#1077 where you can't write a netcdf file to disk with a multi-index. The error message is:
Would using coordinate variables be a good idea?
Ok. I didn't realize we couldn't save to netcdf with the MultiIndex. What does reset_index() do exactly?
Here is a test script:
import xarray as xr
import pandas as pd
from xclim.subset import subset_gridpoint

longitudes = -72.8, -72.7, -72.9
latitudes = 46.0, 46.1, 46.1
dataset = xr.open_dataset("tests/data/bccaqv2_subset_sample/tasmin_subset.nc")
dataset.attrs = {}  # cleaner output
subsets = []
for lon, lat in zip(longitudes, latitudes):
    subset = subset_gridpoint(dataset, lon=lon, lat=lat)
    subsets.append(subset)
concatenated = xr.concat(subsets, dim="region")
multi_index = pd.MultiIndex.from_arrays(
    [concatenated.lon, concatenated.lat], names=["lon", "lat"]
)
concatenated = concatenated.drop(["lon", "lat"])
output_ds = xr.Dataset(
    coords={"region": multi_index, "time": concatenated.time}, attrs=concatenated.attrs,
)
for d in concatenated.data_vars:
    output_ds[d] = concatenated[d]
print(output_ds)
output_ds = output_ds.reset_index(["region"])
print(output_ds)
Here is the dataset before using reset_index (can't be written to disk):
And after calling reset_index:
Looks like it simply brings the data back to the original xr.concat(dim='point_id') result... I suggest just forgetting about my MultiIndex comments for now.
We had some similar issues in flyingpigeon: bird-house/flyingpigeon#171.
By just stopping the script here: I get a ds
Here is what we had played around with in the early days: and if you are subsetting over the 180 lon:
So I think I'm done, here is where I'm at: For any grid cell subset (single or multiple) the dimensions will look like this (same as shown previously):
For bounding box subset, the
And I fixed the csv output so that there is one line for each combination of lat-lon pair and timestamp.
@nilshempelmann Thank you for your input, could you tell me if the way I organized the data in the multiple grid subset is compatible with the cdo outputs? From what I could tell, I believe it is, but I'm not sure.
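As a rough illustration of the csv layout described above (one line per lat-lon pair and timestamp), here is a small stdlib-only sketch. The column names, the sample values, and the tasmin variable name are assumptions for the example; the actual csv conversion in finch may order or name its columns differently.

```python
import csv
import io

# Hypothetical data: values[i][j] is the value at point i and timestamp j.
points = [(46.0, -72.8), (46.1, -72.7)]  # (lat, lon) pairs
times = ["2000-01-01", "2000-01-02"]
values = [[-10.5, -12.0], [-9.8, -11.3]]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["lat", "lon", "time", "tasmin"])

# One row for each combination of lat-lon pair and timestamp.
for (lat, lon), series in zip(points, values):
    for time, value in zip(times, series):
        writer.writerow([lat, lon, time, value])

csv_text = buffer.getvalue()
print(csv_text)
```

Two points and two timestamps give four data rows plus the header, so no NaN padding appears in the csv.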
From my point of view, this is ready to merge, but I need at least one approval.
@huard Thanks, I'll merge and release tomorrow.
Overview
Add possibility to subset multiple grid points.
Changes:
lat and lon (same for bccaqv2 grid point subset process)
subset_ensemble_BCCAQv2 process
pywps~=4.2.3
Comments:
I thought that providing a list of floats was more intuitive for the user of the WPS process, instead of asking for multiple lat and lon parameters that have to be in the exact order to respect the coordinate pairs.
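A minimal sketch of how comma-separated lat and lon inputs could be parsed and paired is shown below. The helper names parse_float_list and pair_coordinates are hypothetical and are not taken from the finch code base; the length check is an assumption about how mismatched inputs might be handled.

```python
from typing import List, Tuple


def parse_float_list(text: str) -> List[float]:
    """Parse a comma-separated string such as "46.0, 46.1" into floats."""
    return [float(item) for item in text.split(",") if item.strip()]


def pair_coordinates(lon_text: str, lat_text: str) -> List[Tuple[float, float]]:
    """Pair longitudes and latitudes by position, rejecting mismatched lengths."""
    lons = parse_float_list(lon_text)
    lats = parse_float_list(lat_text)
    if len(lons) != len(lats):
        raise ValueError(
            f"Got {len(lons)} longitudes but {len(lats)} latitudes; "
            "each point needs one of each."
        )
    return list(zip(lons, lats))


points = pair_coordinates("-72.8, -72.7, -72.9", "46.0, 46.1, 46.1")
print(points)
```

Keeping each coordinate in a single string means the pairing depends only on position within the two lists, rather than on the order in which repeated request parameters arrive.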