Change resample_cube_temporal, align with GDAL and improve resampling descriptions in general #244
Conversation
proposals/resample_cube_temporal.json
How should the back-end compute, for example, the average? If we have data for every first day of each month and I want to resample to every 15th of each month, what should I compute? The average between the closest valid samples to the left and right in time? It is still not so clear from the definition, in my opinion, because computing the average over all valid samples doesn't make much sense in this case.
@clausmichele That's exactly the type of feedback I need to get (from back-ends/users), as I think it's better to define what software supports instead of just coming up with an arbitrary specification no one can support. I tried to figure out from other software how they handle temporal resampling but couldn't find much except in gdalcubes. So if you or anyone else has good pointers to implementations, please let me know.
My naive approach would be that the cube just spans time ranges until they are "half-way" to the neighbours.
The example is totally made up (I hope I counted days correctly) and we may not want to divide days, to erase any potential ambiguities. Even more likely, that's not the best approach or it is not backed by implementations, though.
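For illustration, a minimal Python sketch of the "half-way spans" idea: the function name, the data layout and the choice of the mean as aggregation are assumptions made up for this example, not specified behaviour.

```python
# Minimal sketch of the "half-way spans" idea: each target timestamp owns an
# interval reaching halfway to its neighbouring targets, and the mean of all
# source values inside that interval is assigned to it. Illustration only.
from datetime import datetime
from statistics import mean

def halfway_resample(source, targets):
    """source: dict mapping datetime -> value; targets: sorted list of datetimes."""
    result = {}
    for i, t in enumerate(targets):
        # Interval boundaries: halfway to the previous/next target, open-ended at the edges.
        lower = t - (t - targets[i - 1]) / 2 if i > 0 else datetime.min
        upper = t + (targets[i + 1] - t) / 2 if i < len(targets) - 1 else datetime.max
        values = [v for ts, v in source.items() if lower <= ts < upper]
        result[t] = mean(values) if values else None  # nodata if nothing falls inside
    return result

# Data on the first of each month, resampled to the 15th of each month.
source = {datetime(2021, m, 1): float(m) for m in range(1, 4)}
print(halfway_resample(source, [datetime(2021, m, 15) for m in range(1, 4)]))
```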
This seems really too complicated. I see a simpler use case:
If I want to merge their data, I see two possibilities:
In my opinion, those two possibilities are already enough; for other use cases we still have aggregate_temporal and aggregate_temporal_period.
I like the simplicity of @clausmichele's proposal. One note though: how do you handle NaNs? If your nearest neighbor is NaN, do you pick that, or do you look further?
We need to reason about this. A different scenario could be having cloud-masked data filled with NaNs in the cloud "holes": here, if you try to look further for valid data in the cloud-masked areas, you would get a "composite", which is not desired in my opinion. Concluding: if someone wants only valid data and no NaNs, they should use an interpolation method to fill the gaps.
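To make the two options concrete, a small sketch in plain Python; `pick` and `skip_nan` are made-up names and this is not prescribed behaviour for any back-end.

```python
# Sketch of the two NaN policies discussed above for nearest-neighbour picking:
# without skip_nan the nearest sample may be NaN; with skip_nan the search looks
# further and can mix acquisitions (the "composite" effect mentioned above).
import math

def pick(source, target_ts, skip_nan=False):
    """source: dict mapping timestamp -> value; returns the value closest in time to target_ts."""
    candidates = source.items()
    if skip_nan:
        candidates = [(ts, v) for ts, v in candidates if not math.isnan(v)]
    if not candidates:
        return math.nan
    _, value = min(candidates, key=lambda item: abs(item[0] - target_ts))
    return value

source = {1.0: 10.0, 2.0: math.nan, 3.0: 30.0}  # timestamps simplified to numbers
print(pick(source, 2.1))                 # nan: the nearest sample is invalid
print(pick(source, 2.1, skip_nan=True))  # 30.0: looks further for a valid sample
```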
I've simplified the process. It only supports nearest neighbor as the resampling method so that the process is easier to implement. I have not found a lot of details about nearest neighbor for temporal resampling. Does anyone have good documentation that we can link to? I'm especially looking for an indication of how ties are resolved, which we should document in the process. The other open question is NaN handling, as mentioned above.
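As a purely illustrative example, one conceivable tie-breaking rule; whether this (or any other) rule should be mandated is exactly the open question here.

```python
# One conceivable tie-breaking rule for equidistant neighbours: prefer the
# earlier source timestamp. This is only an assumption used for illustration.
def nearest_with_ties(source_timestamps, target_ts):
    # Sort key (distance, timestamp): on equal distance the earlier timestamp wins.
    return min(source_timestamps, key=lambda ts: (abs(ts - target_ts), ts))

print(nearest_with_ties([1, 5], 3))  # 1 and 5 are equally distant -> 1 is chosen
```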
@przell maybe something about the wet snow pipeline/use case could help clarify the behavior of this process? Do we have any public documentation for it?
The process resample_temporal was only foreseen as a helper in the wet snow use case to align the temporal dimension of the two collections to a common time series. So there is nothing official available from this side. Sorry.
@clausmichele @jdries @przell I've tried to clarify the behavior for ties and invalid values. As discussed in the dev telco, I've added a new parameter "valid_within". Does that all make sense to you?
Will merge by the end of next week at the latest if nothing major comes in...
Fine for me; having the valid_within parameter is a good trade-off.
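A minimal sketch of how the proposed valid_within parameter could behave, assuming nearest-neighbour assignment, ties resolved towards the earlier timestamp, and an inclusive bound; none of these details are confirmed by the specification and the names are illustrative.

```python
# Minimal sketch of nearest-neighbour assignment with the proposed valid_within
# tolerance: the nearest source value is only used if it lies within
# +/- valid_within days of the target label, otherwise nodata (None) is set.
# The inclusive bound and the earlier-timestamp tie rule are assumptions.
from datetime import datetime, timedelta

def resample_nearest(source, targets, valid_within=None):
    """source: dict mapping datetime -> value; targets: list of datetimes."""
    result = {}
    for t in targets:
        ts = min(source, key=lambda s: (abs(s - t), s))  # earlier timestamp wins ties
        too_far = valid_within is not None and abs(ts - t) > timedelta(days=valid_within)
        result[t] = None if too_far else source[ts]
    return result

source = {datetime(2021, 1, 1): 1.0, datetime(2021, 1, 10): 2.0}
print(resample_nearest(source, [datetime(2021, 1, 4)], valid_within=2))  # 3 days away -> None
```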
Hi Matthias, we could specify that valid_within must not be set to a value higher than the temporal resolution, or throw an error otherwise.
Yes, right now I'd assume that it looks beyond the surrounding time steps, so that values could be assigned twice. Should we leave this up to users to make the right decision and choose the right range? Should we add a warning to the parameter, such as:
Although assigning values twice may happen anyway, right? Let's say you have data on the 1st, 5th and 9th, and the other cube has data on the 3rd and 7th... What should we do? It seems obvious to assign the values twice by default. If you then choose a 2-day range, you'll get nodata for all values. If you choose 3 days, a value is assigned twice...
Ok, I see your point. I never thought about resampling a temporally sparse data cube to a temporally denser one. Then the same dates naturally have to be assigned multiple times.
And this, but it's quite complicated to follow:
Only applies if valid_within is given, of course.
For the example above, it's already true if the value for valid_within is half the temporal resolution. I'm not sure we can describe this in a concise way. Maybe we just need to say that values may be assigned multiple times in certain circumstances and give an example.
🤔 Is that three times per day or every three days?
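A small worked version of the 1st/5th/9th vs. 3rd/7th example above, assuming a strict bound and earlier-timestamp tie-breaking; both are assumptions, and whether the bound should be strict or inclusive is part of what is being discussed.

```python
# Worked version of the example above, with dates reduced to day numbers.
# A strict bound (distance < valid_within) and the earlier-timestamp tie rule
# are assumed here purely for illustration.
targets, sources = [1, 5, 9], [3, 7]
for valid_within in (2, 3):
    assignment = {}
    for t in targets:
        nearest = min(sources, key=lambda s: (abs(s - t), s))        # earlier wins ties
        assignment[t] = nearest if abs(nearest - t) < valid_within else None
    print(valid_within, assignment)
# valid_within=2 -> {1: None, 5: None, 9: None}  (nodata everywhere)
# valid_within=3 -> {1: 3, 5: 3, 9: 7}           (the sample from the 3rd is used twice)
```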
Changes:
- resample_cube_spatial and resample_spatial: Aligned with recent changes in GDAL and added the rms and sum options to the methods. Also added descriptions for each option (a small sketch of what these two options compute follows below).
- resample_cube_temporal: Replaced the callback-style parameter process with a parameter method that aligns with the spatial resampling processes. See resample_cube_temporal behavior #194 for context. Do all these methods make sense for temporal resampling, or should we add/remove some? Does this help with solving the issues at all? I've also sorted the options for a hopefully better user experience.

I'm still looking into improving the documentation, but I want to bring up the first draft ASAP to get feedback.
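For reference, a toy sketch of what the rms and sum options compute when downsampling by an integer factor, following GDAL's definitions (rms = root mean square of the contributing pixels, sum = their sum); the block-based helper below is illustrative only and not how GDAL or a back-end actually resamples.

```python
# Toy block-based illustration of the "rms" and "sum" resampling options.
# Ignores nodata, partial blocks and non-integer scale factors.
import numpy as np

def block_resample(data, factor, method):
    h, w = data.shape[0] // factor, data.shape[1] // factor
    blocks = data[:h * factor, :w * factor].reshape(h, factor, w, factor)
    if method == "sum":
        return blocks.sum(axis=(1, 3))
    if method == "rms":
        return np.sqrt((blocks ** 2).mean(axis=(1, 3)))
    raise ValueError(f"unsupported method: {method}")

data = np.arange(16, dtype=float).reshape(4, 4)
print(block_resample(data, 2, "sum"))  # [[10. 18.] [42. 50.]]
print(block_resample(data, 2, "rms"))
```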