Flatten / Reshape data cube dimensions #308

Closed
m-mohr opened this issue Nov 26, 2021 · 8 comments · Fixed by #316

m-mohr (Member) commented Nov 26, 2021

There are use cases that sometimes need to "flatten" (or "stack", in xarray/pandas terms) data cube dimensions. Right now, VITO uses apply_dimension + target_bands as a workaround, but that may not be fully covered by the specification.

We need to check whether we really want to use that approach long-term; it is a bit odd to use a constant operation as the callback. The better approach could be to define a new process.

This is already required by multiple use cases: SRR2 UC3, SRR3 UC8. It has already been discussed as part of at least two other issues.
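
For illustration, this is roughly what the "stack" operation means in xarray (a minimal, hypothetical in-memory sketch; not an openEO process):

```python
import numpy as np
import xarray as xr

# A small (x, y, bands) cube with arbitrary values
cube = xr.DataArray(
    np.arange(2 * 3 * 2).reshape(2, 3, 2),
    dims=("x", "y", "bands"),
)

# "Flatten" x and y into a single new dimension; bands stays intact
flat = cube.stack(pixel=("x", "y"))  # dims: ("bands", "pixel"), 6 pixels

# The round trip recovers x and y (dimension order may change)
restored = flat.unstack("pixel")
```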

m-mohr added this to the 1.2.0 milestone Nov 26, 2021
m-mohr self-assigned this Nov 26, 2021

clausmichele (Member) commented Nov 29, 2021

So, the input could be:

  1. x, y, bands, time
  2. x, y, bands
  3. x, y, time
  4. other?

In my opinion, the result should have a shape of MxN, resulting from a combination of the dimensions available in the data cube. Hence, the process should allow recombining them depending on the user's input.

Practical example:

  • I train a random forest regressor with NDVI values as the target and [B04,B08] as predictors/features.
  • For training, if the data comes out of an aggregate_spatial process, it already has an MxN shape (M = number of polygons, N = number of bands). This is possible since aggregate_spatial removes the spatial dimensions (x, y).
  • For predicting, the data must again have the MxN shape. There is no problem if we use aggregate_spatial again. However, if we want to apply the prediction over a raster-cube, we need to flatten the data first.
  • If the input data for prediction is then x, y, bands (with bands = [B04,B08] for this particular example), we need to flatten it to MxN, with M = x * y and N = 2 (the number of bands).

If the input data also has a time dimension, we need to allow results like:

  • M = x * y * time, N = bands
  • M = x * y, N = bands * time (time series regression)

Either way, we will lose the information necessary for reshaping the output of the machine learning algorithm, so maybe we will also need another process to reshape the output, or a more general reshape process that allows flattening the data but also reconstructing it following a sample datacube (for instance, the data before flattening).

References:
  • numpy flatten: https://numpy.org/doc/stable/reference/generated/numpy.ndarray.flatten.html
  • numpy reshape: https://numpy.org/doc/stable/reference/generated/numpy.reshape.html
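
As a concrete sketch of the round trip described above, in plain numpy (the predict step is a hypothetical stand-in for a trained model):

```python
import numpy as np

# Hypothetical (x, y, bands) cube: x=10, y=20, bands=[B04, B08]
cube = np.random.rand(10, 20, 2)

# Flatten to MxN: M = x * y = 200 samples, N = 2 predictors
flat = cube.reshape(-1, cube.shape[-1])  # shape (200, 2)

# Stand-in for the regressor's prediction: one value per sample
pred = flat.mean(axis=1)                 # shape (200,)

# Reconstructing the output requires the original spatial shape, which is
# exactly the information lost by flattening -- hence the idea of a
# "sample datacube" to reshape against
ndvi = pred.reshape(cube.shape[:2])      # shape (10, 20)
```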

m-mohr modified the milestones: 1.2.0, 1.3.0 Nov 29, 2021
m-mohr added the ML label Dec 13, 2021

m-mohr (Member, Author) commented Dec 13, 2021

Porting over a discussion from #306 - posted by @jdries:

> in our case, the flattening has been taken care of by apply_dimension, but it's fine if another process is defined for that (doing the same thing)

We've just checked the process description closely and we believe that this behavior is not covered by the description of apply_dimension. As far as I've understood it: if you have a data cube x, y, t, b with 5 labels in t (passed as the dimension parameter) and 3 labels in b (passed as target_dimension), you somehow want to combine b with t (like a matrix multiplication) and end up with 15 labels. That is not covered by the specification, as you don't have "read" access to b; instead, you simply replace it as the target dimension:

> The pixel values in the target dimension get replaced by the computed pixel values.

It is possible that I've misunderstood how you envision the flattening approach through apply_dimension, and it would be good to look at an example. What I found in UC3, used for computeStats, doesn't really seem to be flattening, and as such the behavior in UC3 seems fine. That said, it seems better to make this possible through a new process right now. In that case, it would also not necessarily be required to generate new labels through the client. The flattening process would be based on existing implementations in e.g. xarray and should not be too hard to implement.

cc @edzer - How does a user flatten in stars? (edit: st_redimension)
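
For reference, the t x b combination described above, sketched with xarray's stack on hypothetical in-memory data (this shows the desired result, not what apply_dimension specifies):

```python
import numpy as np
import xarray as xr

# (x, y, t, b) cube with 5 labels in t and 3 labels in b
cube = xr.DataArray(np.zeros((4, 4, 5, 3)), dims=("x", "y", "t", "b"))

# Combining t with b yields one dimension with 5 * 3 = 15 labels --
# more than "replacing the pixel values in the target dimension"
combined = cube.stack(features=("t", "b"))
print(combined.sizes["features"])  # 15
```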

jdries (Contributor) commented Dec 14, 2021

Looking at the example above, we also typically solve that one without flattening.
What we would do is:
cube = (x, y, t, bands)
ndvi = cube.reduce_dimension(dimension='bands', callback=random_forest_inference)

The random_forest_inference callback then simply gets the 2 band values per pixel and timestep and predicts the NDVI. Our more complex cases based on deep learning work like that as well; no flattening or reshaping is needed.

The big problem lies more with training models, because that is a 'global' operation that cannot be split up using the callback approach. On the other hand, the point sampling through aggregate_spatial does solve the problem of 'flattening' the spatial dimensions.
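
In plain numpy terms, the callback pattern above looks roughly like this (predict_ndvi is a hypothetical stand-in for random_forest_inference):

```python
import numpy as np

def predict_ndvi(bands):
    # Gets the band values per pixel/timestep (vectorized here);
    # a real callback would invoke the trained model instead
    b04, b08 = bands[..., 0], bands[..., 1]
    return (b08 - b04) / (b08 + b04)

cube = np.random.rand(10, 20, 100, 2)  # (x, y, t, bands)
ndvi = predict_ndvi(cube)              # bands reduced away: shape (10, 20, 100)
```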

clausmichele (Member) commented:

In my opinion, we would need a reshape process that converts a raster-cube to a vector-cube and vice versa (or two separate processes).

When the input is a raster-cube, it flattens the data to a vector-cube:

Input: (x, y, time, bands) with shape (M, N, T, B) = (10, 20, 100, 2)
Parameters: "predictor dimension": "bands"
Output: vector-cube with shape (M * N * T, B) = (20000, 2)

Input: (x, y, time, bands) with shape (M, N, T, B) = (10, 20, 100, 2)
Parameters: "predictor dimension": "time"
Output: vector-cube with shape (M * N * B, T) = (400, 100)

Input: (x, y, time) with shape (M, N, T) = (10, 20, 100)
Parameters: "predictor dimension": "time"
Output: vector-cube with shape (M * N, T) = (200, 100)

Input: (x, y, bands) with shape (M, N, B) = (10, 20, 2)
Parameters: "predictor dimension": "bands"
Output: vector-cube with shape (M * N, B) = (200, 2)

When the input is a vector-cube, it reshapes the data to a raster-cube given a target cube. It would raise an error if the vector-cube can't be reshaped due to a mismatch.
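
A minimal numpy sketch of such a pair of operations (flatten_cube/restore_cube are hypothetical names, not proposed openEO processes):

```python
import numpy as np

def flatten_cube(cube, predictor_axis=-1):
    """Collapse all dimensions except the predictor axis into one sample axis."""
    moved = np.moveaxis(cube, predictor_axis, -1)
    return moved.reshape(-1, moved.shape[-1]), moved.shape

def restore_cube(flat, target_shape):
    """Reshape flattened data back, raising an error on a shape mismatch."""
    if flat.size != np.prod(target_shape):
        raise ValueError("vector-cube can't be reshaped to the target cube")
    return flat.reshape(target_shape)

# First example above: (10, 20, 100, 2) -> (20000, 2)
flat, shape = flatten_cube(np.random.rand(10, 20, 100, 2))
print(flat.shape)  # (20000, 2)
```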

Using combinations of apply_dimension and reduce_dimension might be difficult to understand for someone with a machine learning background, and as we have seen, it does not cover all possible scenarios.

jdries (Contributor) commented Dec 15, 2021

Could it be that we have some confusion about the 'vector-cube' concept? You seem to be thinking of something like a big matrix, which is a lot like a generalization of a raster-cube (in the sense that it doesn't have spatial dimensions). My idea of the concept is more similar to a 'FeatureCollection' in GeoJSON: a cube where the spatial dimensions are replaced with geometries (polygons/lines/points/...) and that can otherwise still have time and bands dimensions.
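
To make the terminology concrete, here is a hypothetical sketch of that FeatureCollection-like layout (the coordinate values are stand-ins for actual geometries):

```python
import numpy as np
import xarray as xr

# Vector-cube in the FeatureCollection sense: geometries replace x/y,
# while the time and bands dimensions remain
vcube = xr.DataArray(
    np.zeros((3, 100, 2)),  # 3 features, 100 timesteps, 2 bands
    dims=("geometry", "t", "bands"),
    coords={"geometry": ["poly_1", "poly_2", "poly_3"]},
)
```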

edzer (Member) commented Dec 16, 2021

> Could it be that we have some confusion about the 'vector-cube' concept?

I tried to clarify some of this in #68.

clausmichele (Member) commented:

> Could it be that we have some confusion about the 'vector-cube' concept? You seem to be thinking of something like a big matrix, which is a lot like a generalization of a raster-cube (in the sense that it doesn't have spatial dimensions). My idea of the concept is more similar to a 'FeatureCollection' in GeoJSON: a cube where the spatial dimensions are replaced with geometries (polygons/lines/points/...) and that can otherwise still have time and bands dimensions.

That's also fine, but to train an ML model, we do need this vector-cube (or whatever we want to call it) to have just 2 dimensions. So it could also have the structure you mentioned, where each row also carries the (x, y) or polygon property that generated it, but we still need a process that reshapes the data back and forth.

jdries (Contributor) commented Dec 16, 2021

Not sure if I agree. To train an ML model (and also for inference), we need to provide a matrix to the model, where the shape of that matrix indeed depends on the model.
Then it seems that we now have two proposals to achieve that:

  • Reshaping the entire raster cube into a more general matrix
  • Reusing existing callback based methods (reduce_dimension, apply_neighborhood,...)

My biggest problems with the reshaping proposal:

  • If implemented literally, it implies large data reorganization inside the backend, for no clear reason.
  • It consists of several steps: reshape -> apply ML model -> reshape back to raster-cube shape, with some kind of target cube (where does that come from?).
  • Backend-specific, but I would need to design entirely new data structures to work with this new type of cube. (It's basically a third type, next to raster and geometry.)

The main argument for reusing the existing processes is simply that we already have them, and we have to teach our users how to work with them anyway. I agree that these are not the simplest processes, but for EO researchers who have ambitions to use machine learning, and probably deep learning as well, this should be well within their skillset.

m-mohr changed the title from "Flatten data cube dimensions" to "Flatten / Reshape data cube dimensions" Dec 16, 2021
m-mohr added commits that referenced this issue Dec 20, 2021
m-mohr linked a pull request that will close this issue Dec 20, 2021
m-mohr closed this as completed Mar 9, 2022
m-mohr modified the milestones: 1.3.0, 2.0.0 Feb 1, 2023