functions to pass data to other formats #35
Hi @JessicaS11, pretty sure that most of the # ...

```python
import xarray as xr

# region_a is an icepyx query object defined earlier
order_id: str = region_a.granules.orderIDs[0]
download_path: str = f"icepyx_data/{order_id}"
region_a.download_granules(path=download_path)

ds: xr.Dataset = xr.open_mfdataset(
    paths=f"{download_path}/**/processed_*h5",
    engine="h5netcdf",
    group="gt2l/land_ice_segments",
    combine="by_coords",
)
ds
```

I'd be keen to make this part of
Excellent @weiji14! We'd be happy to have you join the team and help make some of these things happen. I agree that leveraging existing libraries is the way to go wherever possible. The purpose of icepyx is to make it easier to go from data acquisition to final data products in a single workspace (versus, e.g., a download API, a data processing API, and then a separate figure-making API), and to support scientists who are familiar with one data format but not necessarily with transitioning between multiple formats (or with using cloud-optimized ones). I see conversions to other data formats happening as part of a two-step process:
I think (1) would occur primarily within the icesat2py module itself (with appropriate calls to the validate_inputs module for checking inputs). (2) will likely happen primarily in the granules module, which was designed to be generic enough to handle granules (i.e. files) through the download process and/or from the file system (once the latter functionality is added). There may also be some tools within captoolkit that we can leverage to accomplish some tasks, as they were designed specifically for working with altimetry data such as ICESat-2.
Cool, happy to get on board! I'll take a closer look soon. My first impression of the current 'icesat2data' class and granules.py is that they are purely download scripts at the moment, with fancy subsetting capabilities and some metadata properties. It's a bit confusing how the pieces fit together, but I'll try and wrap my head around it first.
Ok, I've made a small PR as a start in #59 (to get the actual filepaths in the local download directory). It's actually quite hard to work on the data if you can't find it! We really need to sort out the unzipping mentioned in #33, as the files end up in so many subdirectories. Also, I'm not too sure why you would need to bring the local HDF5 file(s) in as an icesat2data object for point (1). Is it to access specific metadata (which should already be embedded in the HDF5 file)? It seems quite hard to reverse engineer a downloaded HDF5 granule back into an icesat2data object. Loading the HDF5 data to place it in an icesat2data Python class object can be a memory-intensive task if there are many files, and it seems strange to then convert it to an
I think you raise a bunch of good points, and perhaps it could be helpful to shift the conversation towards what the best way to handle this is in the broader context of next development steps. I agree that there is limited overlap between the attributes needed for data acquisition and those needed for data processing. Maybe the solution is creating another overarching class within icepyx (similar to icesat2data, but for managing and manipulating data files rather than data access... we could always rename the icesat2data class/object to something more appropriate). Odds are that people don't actually want their data to stay in separate directories and files but rather want it all in an xarray or some other format that allows them to have more spatial continuity (since granule boundaries are arbitrary). And as you say, this gets into dealing with #33. I think the best way forward is to try and connect to have a brainstorming conversation about this. I can also try to engage others during the upcoming Hackweek to see what the user and development team perspective is. Please feel free to contact me directly at jbscheick at gmail dot com so we can set something up! |
I think this is in line with the topics we started discussing today. A few things to consider (from a heavy data user perspective):
So maybe to facilitate/simplify things, we could take the following approach:
Good to see the discussion flowing! You've raised a couple of good points, especially on the fact that a nested directory/dictionary structure is a pain to traverse efficiently. Personally I still find
Depends on what formats they want to convert to 😄
What I've done is to read across the different groups (e.g. 'gt1l', 'gt1r', 'gt2l', 'gt2r', 'gt3l', 'gt3r') and concatenate them together in xarray. Ideally there would be a way to read such parallel 'groups' using xarray directly, and I do believe it's technically possible; we just need to make a case for it at https://github.com/pydata/xarray.
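A minimal sketch of the group-concatenation approach described above, using small in-memory Datasets as stand-ins for what `xr.open_dataset(filename, group=f"{beam}/land_ice_segments")` would return for each beam (the variable names and sizes here are made up for illustration):

```python
import numpy as np
import xarray as xr

# Stand-ins for the six ICESat-2 beam groups; in practice each Dataset
# would come from xr.open_dataset(..., group=f"{beam}/land_ice_segments").
beams = ["gt1l", "gt1r", "gt2l", "gt2r", "gt3l", "gt3r"]
rng = np.random.default_rng(42)

per_beam = [
    xr.Dataset(
        {"h_li": ("delta_time", rng.random(5))},
        coords={"delta_time": np.arange(5)},
    ).expand_dims(beam=[beam])  # add a length-1 "beam" dimension
    for beam in beams
]

# Concatenate all beams along the new "beam" dimension.
combined = xr.concat(per_beam, dim="beam")
print(combined.sizes["beam"], combined.sizes["delta_time"])  # 6 5
```

This gives one Dataset with a `beam` dimension instead of six parallel groups, which is easier to index and plot.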
Agree, or you could try Zarr, which is another n-dimensional storage format optimized for cloud access, though I've personally found it to work much better than HDF5 locally as well.
Might be better to just point to the top-level path (e.g. 'gt1l/land_ice_segments') instead of to every variable (since there are a lot!). The filtering could also be done directly using xarray (see http://xarray.pydata.org/en/latest/indexing.html#masking-with-where). It should just be a matter of having some Jupyter notebooks showing people what variables are there and how to mask out poor-quality data points. Also, I'm not too sure that you can safely assume everything in every ATLxx product is a 1D array. The ATL11 dataset, for example, has an 'h_corr' variable which is a 2D array ('ref_pt' x 'cycle_number'). I guess what I'm trying to say is that
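The masking-with-where idea mentioned above can be sketched with synthetic data; the quality flag name below mimics ATL06's `atl06_quality_summary` (0 = good), but the values are made up for illustration:

```python
import numpy as np
import xarray as xr

# Tiny synthetic stand-in for a land_ice_segments group.
ds = xr.Dataset(
    {
        "h_li": ("delta_time", np.array([10.0, 20.0, 30.0, 40.0])),
        "atl06_quality_summary": ("delta_time", np.array([0, 1, 0, 1])),
    },
    coords={"delta_time": np.arange(4)},
)

# Keep only the segments flagged as good quality; drop the rest.
good = ds.where(ds.atl06_quality_summary == 0, drop=True)
print(good.h_li.values)  # [10. 30.]
```

With `drop=True`, xarray removes the coordinate labels where the condition is False rather than filling them with NaN.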
You don't need to load anything into memory to see the entire HDF5 structure, and you don't even need to write code (it can be done from the command line). The first thing I do (with the original data files) is to decide what variables I want/need (which is just a fraction of the original dataset). From this point on, all the steps are calculations, starting with filtering and applying geophysical corrections. (I am a heavy data user.)
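The actual command-line example appears to have been lost in this transcript; it was likely an HDF5 CLI tool such as `h5ls -r granule.h5` or `h5dump -n granule.h5`. A Python near-equivalent with h5py, which walks only the metadata tree without reading any array data (the demo file below is a made-up stand-in for a real granule):

```python
import os
import tempfile

import h5py
import numpy as np

# Build a tiny demo file standing in for a real ATL06 granule.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(path, "w") as f:
    f["gt2l/land_ice_segments/h_li"] = np.zeros(10)
    f["gt2l/land_ice_segments/latitude"] = np.zeros(10)

# Walk the tree; only metadata is touched, no array data is loaded.
names = []
with h5py.File(path, "r") as f:
    f.visit(names.append)

print(names)
```

`File.visit` calls the supplied function with every group and dataset name, so you get the full hierarchy without pulling a single data value into memory.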
Again, why would you want to drag the entire data-dictionary structure along your processing pipeline? Not every variable is needed, only the user-selected ones (which are usually a fraction of all the variables). Nothing against
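Reading only user-selected variables, as advocated above, can be sketched with h5py (the file, group path, and variable names here are hypothetical stand-ins for a real granule):

```python
import os
import tempfile

import h5py
import numpy as np

# Create a small stand-in granule with several variables.
path = os.path.join(tempfile.mkdtemp(), "granule.h5")
with h5py.File(path, "w") as f:
    grp = f.create_group("gt2l/land_ice_segments")
    grp["h_li"] = np.arange(5.0)
    grp["latitude"] = np.linspace(-80.0, -79.0, 5)
    grp["sigma_geo_h"] = np.ones(5)

# Read only the variables the user asked for; everything else stays on disk.
wanted = ["h_li", "latitude"]
with h5py.File(path, "r") as f:
    grp = f["gt2l/land_ice_segments"]
    data = {name: grp[name][:] for name in wanted}

print(sorted(data))  # ['h_li', 'latitude']
```

Slicing a dataset (`grp[name][:]`) reads just that array, so the memory footprint scales with the selection rather than with the whole file.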
So perhaps the "best" approach forward is to find out what the average user needs. We want to make ICESat-2 data processing and analysis as close to trivial as possible.
Agree. There's probably a whole spectrum of users, from those working on land/sea ice applications to atmospheric or forest canopy scientists, all wanting to do different things. It would be nice to figure out whether most people are 'heavy' data users who value memory efficiency, 'light' data users who value metadata richness, or somewhere in between. Might be good to take a survey of ICESat-2 Hackweek participants to figure this one out 😄
One of icepyx's primary goals is to make it easier to work with ICESat-2 data. The present class-based data structure aims to facilitate switching between multiple data formats to capitalize on other resources for data analysis. Thus, we need a set of functions on the class (e.g. to_dataframe, to_geodataframe, to_dict, to_xarray, to_netcdf, to_hdf5) that let the user easily put their data into these formats for further analysis. How this proceeds is part of ongoing conversations about development directions, so please contribute to the conversation here or on Discourse even if you don't think you'll work on this issue directly.
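A hypothetical sketch of what such conversion methods might look like on a data-holding class; the class name, internal layout, and variables below are assumptions for illustration, not icepyx's actual API:

```python
import numpy as np
import pandas as pd
import xarray as xr


class DataContainer:
    """Hypothetical holder of user-selected 1-D variables of equal length."""

    def __init__(self, variables: dict):
        self._vars = variables  # name -> 1-D numpy array

    def to_dict(self) -> dict:
        return dict(self._vars)

    def to_dataframe(self) -> pd.DataFrame:
        return pd.DataFrame(self._vars)

    def to_xarray(self) -> xr.Dataset:
        # Put all variables on a shared "index" dimension.
        return xr.Dataset({k: ("index", v) for k, v in self._vars.items()})


dc = DataContainer({"h_li": np.arange(3.0), "latitude": np.zeros(3)})
print(dc.to_dataframe().shape)  # (3, 2)
```

Keeping one internal representation and exposing thin `to_*` adapters means each output format is a one-line conversion rather than a parallel code path.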