local file operations for subsetting and file conversion #72

Closed
tsutterley opened this issue Jun 16, 2020 · 9 comments
Labels
enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@tsutterley
Member

Adding another layer of in-memory BytesIO objects for performing local operations on subsetted files from icepyx, for example:

  • subsetting to valid data points using provided or calculated quality flags
  • converting to different file formats not available from NSIDC (such as zarr)

A basic addition to https://github.com/icesat2py/icepyx/blob/master/icepyx/core/granules.py#L390 would look like this (with the file operations coming after):

import io
import os
import h5py

# z is the zipfile.ZipFile of the downloaded NSIDC order
for zfile in z.filelist:
    # remove the subfolder name from the filepath
    zfile.filename = os.path.basename(zfile.filename)
    # read the zipped granule into an in-memory bytes buffer
    fileID = io.BytesIO(z.read(zfile))
    fileID.seek(0)
    # open the in-memory HDF5 file and perform operations
    with h5py.File(fileID, 'r') as source:
        pass  # file operations (subsetting, conversion) go here
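
A hedged sketch of what those file operations could look like for the two bullets above: filter one ATL06 beam group with its quality flag and write the result to zarr. The group and variable names (gt1l, land_ice_segments, h_li, atl06_quality_summary) follow the ATL06 product layout, but the function name, output store path, and use of xarray are assumptions for illustration, not existing icepyx code.

import h5py
import xarray as xr

def subset_beam_to_zarr(fileID, beam="gt1l", out_store="ATL06_subset.zarr"):
    """Keep only the best-quality segments of one beam and persist them as zarr."""
    fileID.seek(0)
    with h5py.File(fileID, "r") as source:
        segs = source[beam]["land_ice_segments"]
        h_li = segs["h_li"][:]
        lat = segs["latitude"][:]
        lon = segs["longitude"][:]
        qual = segs["atl06_quality_summary"][:]
    # atl06_quality_summary == 0 flags the best-quality segments
    valid = qual == 0
    ds = xr.Dataset(
        {"h_li": ("segment", h_li[valid])},
        coords={
            "latitude": ("segment", lat[valid]),
            "longitude": ("segment", lon[valid]),
        },
    )
    ds.to_zarr(out_store, mode="w")
    return ds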
@tsutterley tsutterley added the enhancement New feature or request label Jun 16, 2020
@tsutterley tsutterley added the help wanted Extra attention is needed label Jun 16, 2020
@fspaolo

fspaolo commented Jun 16, 2020

@tsutterley I haven't used BytesIO before. What would be the advantage over simply working on the downloaded HDF5s (in the standard way)?

@tsutterley
Member Author

tsutterley commented Jun 16, 2020

@fspaolo mostly that the end user would only see the end result (the data they want, in the form they want). I think it would make using ICESat-2 data a bit more approachable (if we output the "valid" data in a simple form). We could even split the data into 6 beam-level datasets to work with captoolkit. Basically a little work on the "back end" to make things easier for people on the "front end". I think the captoolkit interface aspect could be pretty useful, though it's a fair question whether anyone would actually want this.
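
A hedged sketch of that beam-splitting idea: write each of the six beam groups of an in-memory ATL06 granule to its own flat HDF5 file so a tool like captoolkit can operate per beam. The variable selection, output naming, and function name are assumptions for illustration only.

import h5py

BEAMS = ["gt1l", "gt1r", "gt2l", "gt2r", "gt3l", "gt3r"]

def split_beams(fileID, stem="ATL06_subset", variables=("latitude", "longitude", "h_li")):
    """Write each beam's land_ice_segments variables to a separate HDF5 file."""
    fileID.seek(0)  # fileID is the in-memory BytesIO object from above
    with h5py.File(fileID, "r") as source:
        for beam in BEAMS:
            if beam not in source:
                continue  # some granules are missing beams
            segs = source[beam]["land_ice_segments"]
            with h5py.File(f"{stem}_{beam}.h5", "w") as out:
                for var in variables:
                    out.create_dataset(var, data=segs[var][:])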

@fspaolo

fspaolo commented Jun 16, 2020

Another use case we might want to think about is "what happens if the user finds out later that they need another variable?" Will the user need to download the data again (since we have not persisted the original files)?

@tsutterley
Member Author

tsutterley commented Jun 16, 2020

That is a good point. With the slimmed-down "valid" versions that certainly would be an issue. But I guess if users want to "level up" and get more variables, that would be their next step. Having these slimmed-down forms might also allow for some more exploratory uses of the data, which could be interesting (I am thinking of an animation from @joemacgregor that used an early version of the NSIDC subsetting API to look at glacier termini with ATM). On the other end of the spectrum, for cloud computing purposes, if we output as zarr we might not want to store two full copies of the data.

@fspaolo

fspaolo commented Jun 16, 2020

I think this can also be mitigated by providing some guidance on the "most common variables" for standard use cases, and perhaps also suggesting potential additional variables for more complex use cases.

@joemacgregor

Here are a couple of the videos Tyler mentioned for Greenland termini:
All Greenland 1993-2016: https://www.youtube.com/watch?v=8o4DnXLIhxc
Just Helheim: https://www.youtube.com/watch?v=BZR4czv2Kag
These used an early NSIDC subsetter informed by termini traced from satellite imagery, all integrated in MATLAB. It would be cool to show this with ICESat-2 as well.

@JessicaS11
Member

This is a great discussion and a critical one for where icepyx goes next. It will be important to have a way for people to use and interact with data locally that is not dependent on them having just downloaded it, which raises a few questions about where/when some of these subsetting and conversion operations should happen and what files are ultimately stored for the user.

The modus operandi I've been using can be summarized as "make most of these decisions automatically for the user based on best practices and recommendations from the science team, assuming users just want some basic data without having to make many decisions, but implement those defaults in a way (i.e. with flags and keywords) that makes it easy for the heavy-data user to choose something different". For instance, this is the idea behind the default automatic use of the NSIDC subsetter for spatial and temporal subsetting: most people don't need full granules if they've already created a region of interest, so we only give them data where they've asked for it, but if they really want full granules, it's easy to get them.

@fspaolo

fspaolo commented Jun 16, 2020

@tsutterley and @JessicaS11 so from our discussion, we can try a (very simple) example with BytesIO functionality in granules.py calling a function from (?) that does the same thing with an object in memory or on disk. The minimal steps that would be nice to have working are:

A) Request a granule specifying variables ['x', 'y', 'z'] / get an HDF5 with /x, /y, /z

B) Request a granule / get granule / call function( ) asking for ['x', 'y', 'z'] / get an HDF5 with /x, /y, /z

So for the above operations I can see 3 scenarios:

  1. Download only selected variables

region = icepyx.icesat2data.Icesat2Data(
    variables=["x", "y", "z"],
    ...
)

  2. Get selected variables from downloaded granules

region.get_vars(variables=["x", "y", "z"])

  3. Get selected variables from existing folder/file

icepyx.get_vars(fname='/path/to/data', variables=["x", "y", "z"])

Can we really split the functionality/code in this way?
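
A minimal sketch of the shared extraction helper implied by scenarios 2 and 3 above, assuming h5py and full HDF5 variable paths; the name get_vars, the path-or-BytesIO dual input, and the dict-of-arrays return value are hypothetical, not an existing icepyx API.

import io
import h5py

def get_vars(source, variables):
    """Read the requested variables from an HDF5 granule on disk or in memory."""
    if isinstance(source, (bytes, bytearray)):
        source = io.BytesIO(source)
    if hasattr(source, "seek"):
        source.seek(0)  # rewind an in-memory file-like object
    out = {}
    # h5py.File accepts both filesystem paths and file-like objects (h5py >= 2.9)
    with h5py.File(source, "r") as f:
        for var in variables:
            out[var] = f[var][:]  # e.g. "gt1l/land_ice_segments/h_li"
    return out

# usage against a downloaded granule (path and variable names are examples only)
# data = get_vars("/path/to/granule.h5", ["gt1l/land_ice_segments/h_li"])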

@fspaolo

fspaolo commented Jun 17, 2020

On further inspection of the codebase, I noticed there is an intention to have a "data" object similar to the region object above. So in order to implement a data object that represents and operates on local files, I think we first need to define the scope of the functionality we want icepyx to have for local files (what should this data object do?). That will then define the structure and methods of the data object, which should be different from the already-implemented request/download object.
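
To make the scoping question concrete, here is a hypothetical skeleton of such a local data object, reusing the get_vars helper sketched above; the class name, constructor signature, and methods are assumptions, not implemented icepyx functionality.

import glob
import os

class Icesat2LocalFiles:
    """Hypothetical object representing already-downloaded ICESat-2 granules on disk."""

    def __init__(self, path, pattern="ATL*.h5"):
        self.files = sorted(glob.glob(os.path.join(path, pattern)))

    def get_vars(self, variables):
        """Extract the requested variables from every local granule (see get_vars above)."""
        return {os.path.basename(f): get_vars(f, variables) for f in self.files}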
