
Change resample_cube_temporal, align with GDAL and improve resampling descriptions in general #244

Merged
merged 7 commits on Jun 25, 2021

Conversation

m-mohr
Member

@m-mohr m-mohr commented Apr 27, 2021

Changes:

  • resample_cube_spatial and resample_spatial: Aligned with recent changes in GDAL and added rms and sum options to methods. Also added descriptions for each option.
  • resample_cube_temporal: Replaced parameter process (callback-style) with a parameter method that aligns with the spatial resampling processes. See resample_cube_temporal behavior #194 for context. Do all these methods make sense for temporal resampling or should we add/remove some? Does this help with solving the issues at all?

I've also sorted the options for a hopefully better user experience.

I'm still looking into improving the documentation, but want to put the first draft up ASAP to get feedback.
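For illustration, a rough sketch of how the aligned spatial resampling options might look from the openeo Python client. The collection name, the extents, and whether a given back-end already accepts the new "rms" and "sum" methods are assumptions here, not part of this PR.

```python
import openeo

# Hypothetical back-end URL; any openEO back-end implementing the updated
# resample_spatial options would do.
connection = openeo.connect("https://openeo.example.org").authenticate_oidc()

cube = connection.load_collection(
    "SENTINEL2_L2A",
    spatial_extent={"west": 11.0, "south": 46.0, "east": 11.1, "north": 46.1},
    temporal_extent=["2020-01-01", "2020-02-01"],
    bands=["B04", "B08"],
)

# "rms" and "sum" are the options newly aligned with GDAL in this PR; support
# depends on the back-end implementation.
resampled = cube.resample_spatial(resolution=60, method="rms")
```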

@m-mohr m-mohr added this to the 1.1.0 milestone Apr 27, 2021
@m-mohr m-mohr linked an issue Apr 27, 2021 that may be closed by this pull request
@m-mohr m-mohr marked this pull request as draft April 27, 2021 16:17
Member

@clausmichele clausmichele left a comment

proposals/resample_cube_temporal.json
How should the back-end compute, for example, the average? If we have data for every first day of each month and I want to resample to the 15th of each month, what should be computed? The average between the closest valid samples before and after in time? In my opinion it is still not so clear from the definition, because computing the average over all valid samples doesn't make much sense in this case.
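To make the ambiguity concrete, a small pandas sketch (illustrative only, not part of the proposal) contrasting the two readings for data on the 1st of each month resampled to the 15th:

```python
import pandas as pd

# Source: one observation on the first day of each month.
src = pd.Series(
    [10.0, 20.0, 30.0],
    index=pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01"]),
)
target = pd.to_datetime(["2020-01-15", "2020-02-15"])

# Reading 1: average of the closest valid samples before and after each target date.
for t in target:
    before = src[src.index <= t].iloc[-1]
    after = src[src.index > t].iloc[0]
    print(t.date(), (before + after) / 2)

# Reading 2: average over *all* valid samples, which arguably makes little
# sense for a single target date.
print(src.mean())
```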

@m-mohr
Member Author

m-mohr commented Apr 28, 2021

@clausmichele That's exactly the type of feedback I need to get (from back-ends/users), as I think it's better to define what software supports instead of just coming up with an arbitrary specification no one can support. I tried to figure out how other software handles temporal resampling but couldn't find much except in gdalcubes. So if you or anyone else has good pointers to implementations, please let me know.

@m-mohr
Member Author

m-mohr commented Apr 28, 2021

My naive approach would be that each target timestamp just spans a time range until "half-way" to its neighbours.

So if your source datacube has timestamps 2020-01-01, 2020-01-02, ..., 2020-01-31 and your target datacube has timestamps 2020-01-04, 2020-01-14, 2020-01-23, 2020-01-31, I'd expect the following mapping:

  • 2020-01-04 uses 2020-01-01 - 2020-01-09 midday
  • 2020-01-14 uses 2020-01-09 midday - 2020-01-18
  • 2020-01-23 uses 2020-01-19 - 2020-01-27 midday
  • 2020-01-31 uses 2020-01-27 midday - 2020-01-31

The example is totally made up (I hope I counted the days correctly), and we may not want to split days, to avoid potential ambiguities. Even more likely, though, this isn't the best approach or isn't backed by implementations.
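A rough sketch of this half-way-interval idea (day-level boundaries, so it won't exactly match the made-up list above; purely illustrative):

```python
import pandas as pd

source = pd.date_range("2020-01-01", "2020-01-31", freq="D")
target = pd.to_datetime(["2020-01-04", "2020-01-14", "2020-01-23", "2020-01-31"])

# Boundaries half-way between neighbouring target timestamps; the first and
# last interval extend to the start/end of the source range.
midpoints = target[:-1] + (target[1:] - target[:-1]) / 2
edges = [source[0]] + list(midpoints) + [source[-1] + pd.Timedelta(days=1)]

for t, lo, hi in zip(target, edges[:-1], edges[1:]):
    members = source[(source >= lo) & (source < hi)]
    print(t.date(), "<-", members[0].date(), "to", members[-1].date())
```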

@clausmichele
Member

clausmichele commented Apr 30, 2021

This seems really too complicated. I see a simpler use case:

  • Datacube A: high temporal density
  • Datacube B: lower temporal density

If I want to merge their data I see two possibilities:

  1. Resample A based on B: reduce the temporal density of A to align with the time steps of B. We can either use the nearest-neighbor approach or interpolation (which can be linear or something more complex).
  2. Resample B based on A: increase the temporal density of B to align with the time steps of A. We can again use the nearest-neighbor approach or interpolation.

In my opinion those two possibilities are already enough; for other use cases we still have aggregate_temporal and aggregate_temporal_period.
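To illustrate those two options concretely, a small pandas sketch (illustrative only, not the openEO API; values and dates are made up):

```python
import numpy as np
import pandas as pd

# Datacube A: high temporal density (daily); datacube B: lower density.
a = pd.Series(np.arange(31, dtype=float),
              index=pd.date_range("2020-01-01", "2020-01-31", freq="D"))
b = pd.Series([0.0, 10.0, 20.0, 30.0],
              index=pd.to_datetime(["2020-01-03", "2020-01-10",
                                    "2020-01-17", "2020-01-24"]))

# 1. Resample A based on B: pick the nearest daily value for each of B's time steps.
a_on_b = a.reindex(b.index, method="nearest")

# 2. Resample B based on A: spread B onto A's daily time steps, either by
#    nearest neighbor or by (linear) interpolation.
b_on_a_nearest = b.reindex(a.index, method="nearest")
b_on_a_linear = b.reindex(a.index).interpolate(method="time")
```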

@m-mohr
Member Author

m-mohr commented Apr 30, 2021

I'd be happy to just provide nearest neighbor support without any additional methods. That makes the process spec so much simpler.
I guess we could provide the interpolation support in a different way? #233 #173

@clausmichele
Member

It would also be good to hear an opinion from @aljacob @przell @jdries @lforesta

@soxofaan
Member

I like the simplicity of @clausmichele's proposal.

One note though: how do you handle NaNs? If your nearest neighbor is NaN: do you pick that, or do you look further?

@clausmichele
Member

I like the simplicity of @clausmichele's proposal.

One note though: how do you handle NaNs? If your nearest neighbor is NaN: do you pick that, or do you look further?

We need to reason about this.
If we work with raster data, let's say Sentinel-2, there could be an empty area filled with NaNs due to the acquisition geometry.
So, if I want to do the temporal resampling, I'll keep the empty area as it is, because looking further in time wouldn't give any additional info if the area is always empty.

A different scenario could be having cloud masked data filled with NaNs in the cloud "holes": here if you try to look further for valid data in the cloud masked areas, you would get a "composite", which is not desired in my opinion.

In conclusion, if someone wants only valid data and no NaNs, they should use an interpolation method to fill the gaps.
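As a tiny illustration of the "keep the NaNs as they are" behavior described above (array shapes and values are made up; just a sketch):

```python
import numpy as np

# Three daily time slices of a small raster; NaNs mark masked/empty pixels.
t_src = np.array(["2020-01-01", "2020-01-02", "2020-01-03"], dtype="datetime64[D]")
data = np.array([
    [[1.0, np.nan], [3.0, 4.0]],
    [[np.nan, np.nan], [7.0, 8.0]],
    [[9.0, 10.0], [11.0, 12.0]],
])

# For a target time step, the nearest slice is taken verbatim, NaNs included;
# we do not search further in time for valid pixels.
target = np.datetime64("2020-01-02")
nearest = int(np.argmin(np.abs(t_src - target)))
print(data[nearest])  # keeps the NaN "holes" of that acquisition
```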

@m-mohr
Member Author

m-mohr commented May 31, 2021

I've simplified the process. It only supports nearest neighbor as resampling method so that the process gets easier to implement.

I have not found many details about nearest neighbor for temporal resampling. Does anyone have good documentation that we can link to? I'm especially looking for an indication of how ties are resolved, which we should document in the process. The other open question is NaN handling, as mentioned above.

@m-mohr m-mohr marked this pull request as ready for review May 31, 2021 10:30
@clausmichele
Member

I've simplified the process. It only supports nearest neighbor as resampling method so that the process gets easier to implement.

I have not found many details about nearest neighbor for temporal resampling. Does anyone have good documentation that we can link to? I'm especially looking for an indication of how ties are resolved, which we should document in the process. The other open question is NaN handling, as mentioned above.

@przell maybe something about the wetsnow pipeline/use case could be useful to clarify the behavior of this process? Do we have some public documentation for it?

@przell
Member

przell commented Jun 8, 2021

The process resample_temporal was just foreseen as a helper in the wet snow use case to align the temporal dimension of the two collections to a common timeseries. So nothing official is available from this side, sorry.
The topic, available options and use cases are quite manifold, so using nearest neighbor with fixed parameters for now is a good starting point.

  • ties: I would choose the first (just arbitrary for now)
  • NaNs: As far as I understand from the discussion, this is the case when the dense time series has no observation in the period between two dates of the sparse time series. In this case I would vote for creating a time step where the raster is completely filled with NaNs.
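A minimal sketch of the tie-breaking choice (the "first wins" rule is one possible convention, not yet what the spec says):

```python
import numpy as np
import pandas as pd

src_times = pd.to_datetime(["2020-01-01", "2020-01-05", "2020-01-09"])
src_vals = np.array([1.0, 2.0, 3.0])
target_times = pd.to_datetime(["2020-01-03", "2020-01-07"])

for t in target_times:
    dist = np.abs(src_times - t)
    # 2020-01-03 is equally far from the 1st and the 5th; argmin returns the
    # first minimum, i.e. the earlier source time step wins the tie.
    i = int(np.argmin(dist))
    print(t.date(), "->", src_times[i].date(), src_vals[i])
```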

@m-mohr
Member Author

m-mohr commented Jun 16, 2021

@clausmichele @jdries @przell I've tried to clarify the behavior for ties and invalid values. As discussed in the dev telco, I've added a new parameter "valid_within". Does that all make sense for you?

@m-mohr
Member Author

m-mohr commented Jun 16, 2021

Will merge end of next week at the latest if nothing major comes in...

@clausmichele
Member

Fine for me; having the valid_within parameter is a good trade-off.

@przell
Member

przell commented Jun 17, 2021

Hi Matthias,
this seems like a straightforward solution so far.
There is one unclear point for me though:
What happens if the user specifies a valid_within value larger than the temporal resolution of the target data set? It is not clear to me what happens then. Theoretically a date could be assigned twice, in the case where there is no nearest neighbor within a time step but the valid_within parameter allows taking one from the next/previous time step. (I hope this is clear somehow.)

We could specify that valid_within must not be set to a value higher than the temporal resolution, or throw an error if it is.

@m-mohr
Member Author

m-mohr commented Jun 18, 2021

Yes, right now I'd assume that it looks beyond the surrounding timesteps so that values could be assigned twice. Should we leave this up to the users to make the right decision and choose the right range? Should we add a warning to the parameter? Such as

Choosing a range that is close to or larger than the temporal resolution may lead to values being assigned to two target dates.

Although assigning values twice may happen anyway, right?

Let's say you have data on the 1st, 5th and 9th and the other cube has data on the 3rd and 7th... What to do? It seems obvious to assign the values twice by default. If you then choose a 2-day range you'll get nodata for all values. If you choose 3 days, values are assigned twice...
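To spell that example out, a small sketch with one hypothetical reading of valid_within as the maximum absolute distance between a target date and its nearest source date (the exact semantics are exactly what is being discussed here, so treat this as an assumption; the cut-off values below are chosen for this reading and don't map one-to-one onto the day counts above):

```python
import numpy as np
import pandas as pd

src = pd.to_datetime(["2020-01-01", "2020-01-05", "2020-01-09"])
target = pd.to_datetime(["2020-01-03", "2020-01-07"])

def assign(valid_within_days):
    for t in target:
        dist = np.abs(src - t)
        i = int(np.argmin(dist))  # ties resolve to the earlier date
        if dist[i] <= pd.Timedelta(days=valid_within_days):
            print(t.date(), "->", src[i].date())
        else:
            print(t.date(), "-> nodata")

assign(1)  # cut-off smaller than the 2-day gap -> all nodata
assign(4)  # cut-off larger than the gap -> values assigned; with denser
           # target dates the same source date could serve several targets
```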

@m-mohr m-mohr requested review from sophieherrmann and removed request for lforesta June 18, 2021 10:06
@przell
Member

przell commented Jun 18, 2021

Ok, I see your point. I never thought about resampling a temporally sparse data cube to a temporally denser one. Then the same dates naturally have to be assigned multiple times.
Is this clearly understandable?:

The function searches for available time steps within the date range given in "valid_within". If there are multiple dates available, the closest is chosen. On a tie the first is chosen. If "valid_within" is chosen equal to or larger than the temporal resolution of the target temporal dimension, time steps from the source data cube may be assigned multiple times.

And this, but it's quite complicated to follow:

When resampling a data cube with a dense temporal dimension (e.g., daily) to a target one with a sparse temporal dimension (e.g., three daily), the duplicated assignment of a time step can be avoided by choosing "valid_within" smaller than the temporal resolution of the target temporal dimension.
When resampling a data cube with a sparse temporal dimension (e.g., three daily) to a target one with a dense temporal dimension (e.g., daily), duplicated assignment of time steps naturally cannot be avoided.

@m-mohr
Member Author

m-mohr commented Jun 18, 2021

The function searches for available time steps within the date range given in "valid_within".

Only applies if valid_within is given, of course.

If "valid_within" is chosen equal to or larger than the temporal resolution of the target temporal dimension time steps from the source data cube may be assigned multiple times.

For the example above, it's already true if the value for valid_within is half the temporal resolution. I'm not sure we can describe this in a concise way. Maybe we just need to say that it may happen that values are assigned multiple times in certain circumstances and give an example.

three daily

🤔 Is that three times per day or every three days?

Successfully merging this pull request may close these issues.

resample_cube_temporal behavior