
functions to pass data to other formats #35

Open
JessicaS11 opened this issue Mar 11, 2020 · 9 comments

@JessicaS11
Member

One of icepyx's primary goals is to make it easier to work with ICESat-2 data. The current class structure aims to facilitate switching between multiple data structures so users can capitalize on other resources for data analysis. Thus, we need a set of functions for the class (e.g. to_dataframe, to_geodataframe, to_dict, to_xarray, to_netcdf, to_hdf5) that enable the user to easily put their data into these formats for further analysis. How this proceeds is part of ongoing conversations about development directions, so please contribute to the conversation here or on Discourse even if you don't think you'll work on this issue directly.

@JessicaS11 JessicaS11 added enhancement New feature or request help wanted Extra attention is needed question Further information is requested labels Mar 11, 2020
@JessicaS11 JessicaS11 added this to the ICESat-2 2020 Hackweek milestone Mar 11, 2020
@weiji14
Member

weiji14 commented May 31, 2020

Hi @JessicaS11, I'm pretty sure that most of the to_* functions you mentioned are available if you load the HDF5 files into an xarray.Dataset, except for to_geodataframe. Was that what you had in mind? Plus you get all of the data variables loaded nicely; see the example code/screenshots below:

import xarray as xr

# ... (region_a is an existing icesat2data object with an active order)
order_id: str = region_a.granules.orderIDs[0]
download_path: str = f"icepyx_data/{order_id}"
region_a.download_granules(path=download_path)

# lazily open all downloaded granules for one beam group as a single Dataset
ds: xr.Dataset = xr.open_mfdataset(
    paths=f"{download_path}/**/processed_*h5",
    engine="h5netcdf",
    group="gt2l/land_ice_segments",
    combine="by_coords",
)
ds

[screenshot: xarray loading an ICESat-2 HDF5 file]

[screenshot: xarray.Dataset to_* methods]

I'd be keen to make this part of icepyx, just need to know which part of the library I should put it in 😄

@JessicaS11
Member Author

Excellent @weiji14! We'd be happy to have you join the team and help make some of these things happen. I agree that leveraging existing libraries is the way to go wherever possible. The purpose of icepyx is to make it easier to go from data acquisition to final data products in a single workspace (versus, e.g., a download API, a data processing API, and then a separate figure-making API), and to support scientists who are familiar with one data format but not necessarily comfortable transitioning between multiple formats (or using cloud-optimized ones).

I think I see conversions to other data formats happening as part of a two step process:

  1. Actually add the ability to bring ICESat-2 data files from the local file system in as an icesat2data object
  2. Add the functionality to read/manipulate/write the data

I think (1) would occur primarily within the icesat2py module itself (with appropriate calls to the validate_inputs module for checking inputs). (2) will likely happen primarily in the granules module, which was designed to be generic enough to handle granules (i.e. files) through the download process and/or from the file system (once the latter functionality is added).

There may also be some tools within captoolkit that we can leverage to accomplish some tasks, as they were designed specifically for working with altimetry data such as ICESat-2.
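To make step (2) more concrete, here is a minimal sketch of what such to_* conversion helpers might look like if the granule contents were already held as an xarray.Dataset; most of them would simply delegate to xarray, in line with leveraging existing libraries. Everything here (the class body, the _data attribute, the method implementations) is an illustrative assumption, not existing icepyx API:

import pandas as pd
import xarray as xr

class icesat2data:
    """Hypothetical sketch only; assumes granule data already read into an xarray.Dataset."""

    def __init__(self, data: xr.Dataset):
        self._data = data  # assumed internal store for the read-in granule data

    def to_xarray(self) -> xr.Dataset:
        return self._data

    def to_dataframe(self) -> pd.DataFrame:
        # xarray already knows how to flatten a Dataset into a tidy DataFrame
        return self._data.to_dataframe()

    def to_netcdf(self, path: str) -> None:
        # hand serialization off to xarray's existing writer
        self._data.to_netcdf(path)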

@weiji14
Member

weiji14 commented Jun 3, 2020

Cool, happy to get on board! I'll take a closer look soon. My first impression of the current 'icesat2data' class and granules.py is that they are purely download scripts at the moment, with fancy subsetting capabilities and some metadata properties. It's a bit confusing how the pieces fit together, but I'll try and wrap my head around it first.

@weiji14
Member

weiji14 commented Jun 4, 2020

I think I see conversions to other data formats happening as part of a two step process:

1. Actually add the ability to bring ICESat-2 data files from the local file system in as an icesat2data object

2. Add the functionality to read/manipulate/write the data

I think (1) would occur primarily within the icesat2py module itself (with appropriate calls to the validate_inputs module for checking inputs). (2) will likely happen primarily in the granules module, which was designed to be generic enough to handle granules (i.e. files) through the download process and/or from the file system (once the latter functionality is added).

Ok, I've made a small PR as a start in #59 (to get the actual filepaths in the download directory locally). It's quite hard to work on the data if you can't actually find it! We really need to sort out the unzipping mentioned in #33, since the files end up in so many subdirectories.

Also, I'm not too sure why you would need to bring the local HDF5 file(s) in as an icesat2data object for point (1). Is it to access specific metadata (which should already be embedded in the HDF5 file)? It seems quite hard to reverse-engineer a downloaded HDF5 granule back into an icesat2data object.

Loading the HDF5 data to place it in an icesat2data Python class object can be a memory-intensive task if there are many files, and it seems strange to then convert it to, say, an xarray object when you could just read the HDF5 files directly (and more efficiently) using those well-tested libraries.
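For instance, a downloaded granule can be inspected and read directly with h5py or xarray, with no intermediate object (the file path and group below are placeholders):

import h5py
import xarray as xr

granule = "icepyx_data/processed_ATL06_example.h5"  # placeholder path

# list the full group/variable tree without loading any data into memory
with h5py.File(granule, "r") as h5file:
    h5file.visit(print)

# or lazily open a single beam group as an xarray.Dataset
ds = xr.open_dataset(granule, engine="h5netcdf", group="gt2l/land_ice_segments")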

@JessicaS11
Member Author

You raise a bunch of good points, and perhaps it would be helpful to shift the conversation towards how best to handle this in the broader context of next development steps. I agree that there is limited overlap between the attributes needed for data acquisition and those needed for data processing. Maybe the solution is creating another overarching class within icepyx (similar to icesat2data, but for managing and manipulating data files rather than data access; we could always rename the icesat2data class/object to something more appropriate). Odds are that people don't actually want their data to stay in separate directories and files, but rather want it all in an xarray or some other format that gives them more spatial continuity (since granule boundaries are arbitrary). And as you say, this gets into dealing with #33.

I think the best way forward is to connect and have a brainstorming conversation about this. I can also try to engage others during the upcoming Hackweek to get the user and development team perspectives. Please feel free to contact me directly at jbscheick at gmail dot com so we can set something up!

@fspaolo

fspaolo commented Jun 13, 2020

I think this is in line with the topics we started discussing today. A few things to consider (from a heavy data user's perspective):

  • Do we want to keep the original nested-dictionary structure when converting to other formats? (most users only need a fraction of the information)
  • It won't be easy to map to xarray all the paths for each group/variable for each data product (in the past, these paths have even changed from version to version)
  • If dealing with HDF5, we want a bunch of small files in a single directory (for fast query and easy parallelization)

So maybe to facilitate/simplify things, we could take the following approach:

  1. Extract variables of interest from the original granules (x, y, t, h, std, ...). We can define mappings for all the data dictionaries, e.g. {'x': '/path/to/variable/x', ...}, so the user won't need to worry about hard-coding these paths (that's what we do now).
  2. Having all variables as 1D arrays, add option for filtering at the point level (e.g. using the quality flag, or user-defined thresholds on specific variables). This operation becomes trivial.
  3. Save these (clean) variables in whatever format we want. This then becomes easy to load into an xarray as we have {'x': [1, 2, 3, ...], 'y': [1, 2, 3, ...], 'h': [1, 2, 3, ...], ...}.
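A minimal sketch of that three-step pipeline, assuming ATL06-like variable paths and a quality flag where 0 means good (the path mapping and variable names are illustrative, not a fixed icepyx convention):

import h5py
import xarray as xr

# illustrative short-name -> HDF5-path mapping; real mappings would be defined per product/version
VAR_PATHS = {
    "x": "gt2l/land_ice_segments/longitude",
    "y": "gt2l/land_ice_segments/latitude",
    "t": "gt2l/land_ice_segments/delta_time",
    "h": "gt2l/land_ice_segments/h_li",
    "q": "gt2l/land_ice_segments/atl06_quality_summary",
}

def extract(granule: str, var_paths: dict) -> dict:
    # step 1: pull the user-selected variables out of the granule as flat 1D arrays
    with h5py.File(granule, "r") as f:
        return {name: f[path][:] for name, path in var_paths.items()}

def filter_points(data: dict, quality_var: str = "q") -> dict:
    # step 2: point-level filtering, e.g. keep only segments flagged as good quality
    good = data[quality_var] == 0
    return {name: arr[good] for name, arr in data.items()}

def save(data: dict, out_path: str) -> None:
    # step 3: write the cleaned 1D variables out; loading into xarray is now trivial
    xr.Dataset({name: ("point", arr) for name, arr in data.items()}).to_netcdf(out_path)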

@weiji14
Member

weiji14 commented Jun 13, 2020

Good to see the discussion flowing! You've raised a couple of good points, especially that a nested directory/dictionary structure is a pain to traverse efficiently. Personally I still find xarray to be a delight to use compared to h5py, but I've only been using the land ice products (ATL06 and ATL11), so I can only speak from my experience with those HDF5 files.

  • Do we want to keep the original nested-dictionary structure when converting to other formats? (most users only need a fraction of the information)

Depends on what formats they want to convert to 😄 xarray offers a way to load a dataset 'lazily' via dask (see http://xarray.pydata.org/en/stable/dask.html), so users can see all of the information and subset to what they need, all while loading nothing into memory (until they need to do calculations on it). But I agree that this might just be relevant for heavy data users.
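A rough illustration of that lazy pattern (the glob and variable name are placeholders based on the ATL06 example above):

import xarray as xr

# nothing is read into memory here; variables are backed by dask arrays
ds = xr.open_mfdataset(
    "icepyx_data/**/processed_*.h5",  # placeholder glob
    engine="h5netcdf",
    group="gt2l/land_ice_segments",
    combine="by_coords",
)

mean_height = ds["h_li"].mean()  # still lazy: just a task graph at this point
print(mean_height.compute())     # data is only loaded, chunk by chunk, on compute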

  • It won't be easy to map to xarray all the paths for each group/variable for each data product (in the past, these paths have even changed from version to version)

What I've done is to read across the different groups (e.g. 'gt1l', 'gt1r', 'gt2l', 'gt2r', 'gt3l', 'gt3r') and concatenate them together in xarray, roughly as sketched below. Ideally there would be a way to read such parallel 'groups' natively in xarray, and I do believe it's technically possible; we just need to make a case for it at https://github.com/pydata/xarray.
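A sketch of that group-by-group read and concatenation (the file path is a placeholder; aligning non-matching coordinates is left to xarray's default outer join):

import xarray as xr

granule = "icepyx_data/processed_ATL06_example.h5"  # placeholder path
beams = ["gt1l", "gt1r", "gt2l", "gt2r", "gt3l", "gt3r"]

# open each beam group separately and stack them along a new "beam" dimension
ds = xr.concat(
    [
        xr.open_dataset(granule, engine="h5netcdf", group=f"{beam}/land_ice_segments")
        for beam in beams
    ],
    dim="beam",
)
ds = ds.assign_coords(beam=beams)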

  • If dealing with HDF5, we want a bunch of small files in a single directory (for fast query and easy parallelization)

Agree, or you could try Zarr, which is another n-dimensional storage format optimized for cloud access, and I've personally found it to work much better than HDF5 locally as well.
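For example, an xarray.Dataset opened as above can be written straight out to a Zarr store (the output path here is a placeholder):

# write the Dataset to a chunked, cloud-friendly Zarr store
ds.to_zarr("icepyx_data/atl06.zarr", mode="w")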

  1. Extract variables of interest from the original granules (x, y, t, h, std, ...). We can define mappings for all the data dictionaries, e.g {'x': '/path/to/variable/x', ...} so the user won't need to worry about hard coding these paths (that's what we do now).

  2. Having all variables as 1D arrays, add option for filtering at the point level (e.g. using the quality flag, or user-defined thresholds on specific variables). This operation becomes trivial.

Might be better to just point to the top-level path (e.g. 'gt1l/land_ice_segments') instead of to every variable (since there are a lot!). The filtering could also be done directly using xarray (see http://xarray.pydata.org/en/latest/indexing.html#masking-with-where), as below. It should just be a matter of having some Jupyter notebooks showing people what variables are there and how to mask out poor-quality data points.
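For example, with a Dataset ds opened as above (atl06_quality_summary and h_li are ATL06 variable names used for illustration):

# keep only segments flagged as likely good (quality summary of 0)
good = ds.where(ds.atl06_quality_summary == 0, drop=True)

# or apply a user-defined threshold on a specific variable
good = good.where(good.h_li < 4000.0, drop=True)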

Also, I'm not too sure that you can safely assume everything in every ATLxx product is a 1D array. The ATL11 dataset, for example, has an 'h_corr' variable which is a 2D array ('ref_pt' x 'cycle_number').

I guess what I'm trying to say is that xarray is the best data structure to start with, since it comes built in with so many functions to do I/O conversions, parallelization, masking, plotting, etc. on complicated n-dimensional stuff, not to mention the variety of projects built on top of xarray itself. It would be worth talking to the xarray devs floating around on https://discourse.pangeo.io/ once we have a well-scoped-out issue.

@fspaolo

fspaolo commented Jun 13, 2020

so users can see all of the information, subset to what they need, all while loading nothing into memory (until they need to do calculations on it). But I agree that this might be relevant just be for heavy data users.

You don't need to load anything into memory to see the entire HDF5 structure; you don't even need to write code. From the command line:

h5ls -r file.h5

The first thing I do (with the original data files) is decide what variables I want/need (which is just a fraction of the original dataset). From that point on, all the steps are calculations, starting with filtering and applying geophysical corrections. (I am a heavy data user.)

Might be better to just point to the top level path (e.g. 'gt1l/land_ice_segments') instead of to every variable (since there's a lot!).

Again, why would you want to drag the entire data-dictionary structure through your processing pipeline? Not every variable, only the user-selected ones (which are usually a fraction of all the variables).

Nothing against xarray (I use it a lot). But the "best" data structure is relative to the task at hand. Here are some xarray limitations (from experience):

  • It was designed for n-dimensional arrays, in particular n > 1. It is not optimal for the sparse, scattered point data we have.
  • Even on 3D arrays, ad-hoc operations along specific dimensions are not very efficient. For example, applying a simple 2D Gaussian interpolation/smoothing (from another library) along the 3rd dimension (i.e. on each slice), or simply iterating over each x/y grid cell and performing a time-reduction operation such as a polynomial fit (along the 3rd dimension). These can be implemented in xarray but are much slower than operating on bare-bones NumPy arrays, so I end up defaulting to .values (see the sketch after this list).
  • xarray does not support basic operations such as interpolation on curvilinear grids (quite common with geographical data). See the discussion (and my comments) in pydata/xarray#2281: "Does interp() work on curvilinear grids (2D coordinates)?"
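To illustrate the kind of .values workaround described above (a toy example with synthetic data; scipy's gaussian_filter stands in for whatever per-slice operation is needed):

import numpy as np
import xarray as xr
from scipy.ndimage import gaussian_filter

# toy (y, x, time) cube standing in for a gridded height-change stack
cube = xr.DataArray(np.random.rand(100, 100, 24), dims=("y", "x", "time"))

# smooth each time slice by dropping to bare NumPy via .values
smoothed = np.stack(
    [gaussian_filter(cube.values[:, :, t], sigma=2) for t in range(cube.sizes["time"])],
    axis=-1,
)
smoothed = xr.DataArray(smoothed, dims=cube.dims, coords=cube.coords)

# per-pixel linear fit along time, again on plain NumPy arrays
t = np.arange(cube.sizes["time"])
coeffs = np.polyfit(t, cube.values.reshape(-1, cube.sizes["time"]).T, deg=1)
trend = coeffs[0].reshape(cube.shape[:2])  # slope at every x/y grid cell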

So perhaps the "best" approach forward is to find out what the average user needs. We want to make ICESat-2 data processing and analysis as close to trivial as possible.

@weiji14
Member

weiji14 commented Jun 16, 2020

So perhaps the "best" approach forward is to find out what the average user needs. We want to make ICESat-2 data processing and analysis as close to trivial as possible.

Agree. There's probably a whole spectrum of users, from those working on land/sea ice applications to atmospheric or forest canopy scientists who want to do different things. It would be nice to figure out whether most people are 'heavy' data users who value memory efficiency, 'light' data users who value metadata richness, or somewhere in between. Might be good to take a survey of ICESat-2 Hackweek participants to figure this one out 😄
