
Allow reading files passing file objects #1299

Open
carloshorn opened this issue Aug 5, 2020 · 12 comments · May be fixed by #1470

Comments

@carloshorn

Feature Request

Is your feature request related to a problem? Please describe.
We are using satpy in pygac-fdr to process all AVHRR GAC level 1b data. The data (orbits) is packed in tarballs (one month of orbits per tarball), and each orbit file is individually gzip-compressed. In order to create the Scene object, we would need to extract all files from the tarballs just to obtain filenames. Furthermore, we would also need to gunzip all of these files, even though pygac is able to process gzipped files, because otherwise satpy would not recognize the filename due to the .gz suffix (maybe renaming would be sufficient to trick satpy...).

Describe the solution you'd like
I would like to see a clear distinction between file location and filename, so that I could pass an open file object, a pathlib.Path object, or a string pointing to a (potentially renamed) file, which, together with a specified reader, should be sufficient to read the product. If for some reason the filename is still needed, a potential strategy could be to allow the user to provide it explicitly and, if not given, to try to extract it from the file location, e.g. using the .name attribute of the file/Path object.
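
A minimal sketch of what such a call might look like if the proposed behaviour were available (this is not the current Scene API; the file name, reader name and channel are only illustrative):

# Hypothetical usage this request is aiming for, not an existing interface.
import gzip
from satpy import Scene

with gzip.open("some_orbit_file.l1b.gz") as fileobj:  # decompressed on the fly
    scn = Scene(filenames=[fileobj], reader="avhrr_l1b_gaclac")
    scn.load(["4"])  # e.g. one AVHRR channel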

Describe any changes to existing user workflow
If the pattern check were controlled by a Scene argument defaulting to True (i.e. do perform pattern checks against the filename), then there should be no impact on existing workflows, because passing a string as the file location would result in the current behavior.

Additional context
The discussion started here: pytroll/pygac-fdr#4

Whenever I say filename in the context of a Scene argument, I mean Scene(filenames=[filename], ...).

@djhoese (Member) commented Aug 5, 2020

Do you want this specifically for AVHRR GAC formats or for all readers? Some existing readers do support reading compressed formats because of how common the compression is, but this is currently only supported by uncompressing on the fly, which requires creating a temporary directory as far as I know. Other popular formats like HDF5 and NetCDF4 have built-in internal compression, which is preferable in my opinion. Additionally, I wanted to mention that xarray.open_dataset can read open file objects for NetCDF files using the h5netcdf library.
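
For reference, a small sketch of that xarray route (NetCDF files only; the filename here is made up), assuming the h5netcdf library is installed:

import io
import xarray as xr

# Any file-like object works here, e.g. one pulled out of an archive in memory.
with open("some_netcdf_file.nc", "rb") as f:
    buffer = io.BytesIO(f.read())

ds = xr.open_dataset(buffer, engine="h5netcdf")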

So depending on the format you want this for, are there other options that would be "good enough", or does it have to be support for gzipped archives?

@carloshorn (Author) commented Aug 5, 2020

Hi @djhoese,
Thanks for the swift reply. It is not about the compression; it is about providing a file object instead of a string as the file location (which is currently equivalent to the filename).
If this were possible, the user could handle whatever compression themselves and pass the decompressed file object.
Another example:
Imagine you are getting your data via a download URL. If you use the requests module to load the file into memory (or, better, just open the stream) and wrap it into an io.BytesIO object or equivalent, then you will not have a filename that is at the same time the file location.
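
A minimal sketch of that download scenario (the URL is made up): the data only ever exists as an in-memory file object, so there is no filename on disk to hand over.

import io
import requests

response = requests.get("https://example.com/data/some_orbit_file.l1b", stream=True)
response.raise_for_status()
fileobj = io.BytesIO(response.content)  # file-like object, but no path on disk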

@djhoese (Member) commented Aug 5, 2020

We've looked into something like this with S3. A lot of the time we depend on the underlying file reading library (ex. xarray, NetCDF4, HDF5, rasterio for geotiff) to handle this. Recently @gerritholl has looked at adding support for fsspec (https://filesystem-spec.readthedocs.io/en/latest/) to support S3 in the find_files_and_readers utility function. I'd like to get that added to the readers that could support it, to provide a reusable interface for accessing these files. See https://satpy.readthedocs.io/en/latest/api/satpy.readers.html#satpy.readers.find_files_and_readers for the usage of these "file system" objects.

I'm not saying we shouldn't support BytesIO objects, just that there might be other ways of accessing the data than the user providing the bytes directly.

@carloshorn (Author) commented:

If locating files on a local file system, the returned dictionary can be passed directly to the Scene object through the filenames keyword argument. If it points to a remote file system, it is the responsibility of the user to download the files first (directly reading from cloud storage is not currently available in Satpy).

It would be quite simple to write a sort of TarballFS class with a glob method which loops through all tarfile.TarInfo objects and returns the filenames. But the important part is the final parenthetical remark in the quote from the satpy documentation above: this is again equivalent to extracting all data from the archive instead of simply passing the file object(s).
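
A rough sketch of that idea, assuming fsspec's AbstractFileSystem as the base class (class name and method bodies are illustrative only, not an existing API):

import fnmatch
import tarfile

import fsspec


class TarballFS(fsspec.AbstractFileSystem):
    """Expose the members of a single tarball as a read-only file system."""

    def __init__(self, tarball_path, **kwargs):
        super().__init__(**kwargs)
        self._tar = tarfile.open(tarball_path)

    def glob(self, pattern, **kwargs):
        # Loop through all tarfile.TarInfo objects and return matching names.
        return [info.name for info in self._tar.getmembers()
                if fnmatch.fnmatch(info.name, pattern)]

    def _open(self, path, mode="rb", **kwargs):
        # Return a file object for a member without extracting it to disk.
        return self._tar.extractfile(path)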

In my vision, I have a processing pipeline where the data may come from any sort of stream, and as long as I can tell satpy which reader to use, it should not complain.

Of course, not all readers may support arbitrary streams. Readers calling open should be easy to fix, and I know that h5py accepts file objects. However, some other libraries are still struggling with it, e.g. Unidata/netcdf4-python#295, but those readers would be welcome to throw an exception.

@djhoese (Member) commented Aug 5, 2020

But the important part is the last parenthesis in the quote from the satpy documentation. This is again equivalent to extracting all data from the archive instead of simply passing the file object(s).

How so? Reading directly from cloud storage would use the proper HTTP requests to only read the data that is used from a file. This assumes that data is not stored on the cloud storage (ex. S3) in a compressed archive like a gzipped tarball.

Readers calling open should be easy to fix, and I know that h5py accepts file objects. However, some other libraries are still struggling with it, e.g. Unidata/netcdf4-python#295, but those readers would be welcome to throw an exception.

The NetCDF4 C library (and the Python wrapper) recently had functionality added to read from S3 using HTTP byte ranges:

https://twitter.com/dopplershift/status/1286415993347047425

HDF5 recently released version 1.12 with a read-only S3 virtual file driver; however, h5py doesn't currently use this as far as I know, since they haven't yet released a version that supports HDF5 1.12.

It would be quite simple to write a sort of TarballFS class with a glob method which loops through all tarfile.TarInfo objects and returns the filenames.

Looks like there is already a zip implementation: https://filesystem-spec.readthedocs.io/en/latest/_modules/fsspec/implementations/zip.html
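
For illustration, roughly how that zip implementation can be used (the archive and member names are made up); a tar equivalent would expose the same kind of interface:

import fsspec

fs = fsspec.filesystem("zip", fo="one_month_of_orbits.zip")
print(fs.glob("*.l1b"))            # list members matching a pattern
with fs.open("some_orbit_file.l1b") as f:
    header = f.read(512)           # read from a member without extracting it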

Another nice thing about using fsspec as an interface is that it should overlap really well with the python Intake library (https://intake.readthedocs.io/en/latest/index.html) and the work the Pangeo community has been doing with defining "catalogs" of data: https://pangeo.io/catalog.html

@djhoese (Member) commented Aug 5, 2020

fsspec has this in an example:

https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.open_files

>>> files = open_files('2015-*-*.csv')  
>>> files = open_files(
...     's3://bucket/2015-*-*.csv.gz', compression='gzip'
... )  

Not that this settles anything, but it shows the interface fsspec has for specifying compression on a remote file.

I think if we can wrap this functionality into a TarballFS as you've described, then that would be great. I do think support would have to be added reader by reader, as I'm not sure we can support it for all readers (as you've said already).

@carloshorn (Author) commented:

Okay, let's think about a possible implementation.
Imagine I have a TarFileSystem(fsspec.AbstractFileSystem) implementation (or any other file system class) and I want to use an instance of it to retrieve the data in my satpy application.
The applications that I have seen use satpy.Scene objects to access the data. Since satpy.Scene can handle multiple files of different formats, do you think that passing multiple FileSystem instances is better, or would it be sufficient to allow only a single instance and ask the user to build a sort of parent file system object, which serves as a mapping to the child file systems if the data comes from different sources?
I would prefer a single file system instance and would pass it as a keyword argument to the constructor. To ensure backward compatibility, the default should be the current behaviour. Beyond that point the internal satpy logic is not well known to me, but at some point it instantiates the reader.
To use the file system, I would set it as a reader attribute and implement an open_file method, where each reader can define how it deals with these file systems. The base class implementation would raise NotImplementedError to indicate that the reader does not support this feature yet.
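
A rough sketch of that proposal; the file_system keyword and open_file method are the names suggested above, not an existing satpy API:

class BaseReader:
    def __init__(self, config, file_system=None):
        self.config = config
        # e.g. a TarFileSystem instance; None keeps the current behaviour.
        self.file_system = file_system

    def open_file(self, file_identifier):
        """Return an open file object for the given file identifier."""
        if self.file_system is None:
            return open(file_identifier, "rb")  # plain local file, as today
        # Readers that support custom file systems override this method.
        raise NotImplementedError(
            "This reader does not support custom file systems yet")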

@djhoese, do you have any further suggestions?

@djhoese (Member) commented Aug 17, 2020

Disclaimer: I have very little experience with fsspec's FileSystem objects.

I think the single file system per reader makes sense. You could think of your normal local file system as the default/simplest case. The user gives you "file identifiers" (I'm making this term up) and the reader accesses those file objects from the provided file system.

There are a couple of complications I can think of, and some lower-level satpy details that I should describe here (some of which you probably know):

  1. The Scene can technically create multiple readers at once. To do this you call it as Scene(filenames={'abi_l1b': [ ... list of abi l1b files ...], 'ahi_hsd': [ ... list of ahi hsd files ...]}). Having a single keyword argument for the file system here may be difficult, but let's ignore this for now until we get further into the implementation and then think about ways to make this case work.
  2. In a basic Scene use case there is a single Reader object that passes each filename/file-object to individual file handlers. These file handlers do the actual file opening/reading. This isn't too big of a deal, except that it means a lot more code would have to be updated to support this file system reading, as there are a lot of custom file handler classes but only a couple of Reader classes (all based on the base classes in yaml_reader.py). That said, all file handlers are based on one base class (or on utility classes built on that base class).
  3. I wonder if this could be added as a "MixIn" class. This would allow file handler developers to opt in to the behavior without complicating the use of the file handler base classes for other developers. For example, someone who is using the NetCDF4FileHandler as their base class may not want this functionality for some reason... 🤔 nah, I changed my mind. This could still be opt-in-like behavior, but could be made available as something like an open_file_object(...) method in the base file handler (see the sketch after this list). The base class could just hold on to the handle of this file object, or in the case of readers like HDF5FileHandler or NetCDF4FileHandler they could pass it to the appropriate library for reading/parsing (xarray, netcdf4-python, h5py, etc.).
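
A very rough sketch of point 3, assuming an opt-in open_file_object(...) method on the base file handler; all attribute and method names here are hypothetical:

class BaseFileHandler:
    def __init__(self, filename, filename_info, filetype_info, file_system=None):
        self.filename = filename
        self.file_system = file_system
        self._file_obj = None

    def open_file_object(self):
        """Open self.filename through the attached file system, if any."""
        if self.file_system is not None:
            self._file_obj = self.file_system.open(self.filename, mode="rb")
        else:
            self._file_obj = open(self.filename, "rb")
        return self._file_obj


class SomeNetCDFFileHandler(BaseFileHandler):
    def open_dataset(self):
        # A NetCDF/HDF5-style handler could hand the open object to its library.
        import xarray as xr
        return xr.open_dataset(self.open_file_object(), engine="h5netcdf")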

@carloshorn (Author) commented:

Thanks for sharing these thoughts.

Disclaimer: I have very little experience with fsspec's FileSystem objects.

Me too, but let's change this :-)

  1. Even if the actual files are spread across multiple "physical" file systems, it will always be possible to write a more general FileSystem class that interacts with multiple FileSystem instances. This is what I meant by a parent FileSystem. In the simplest implementation, I guess it would just use itertools.chain on the output of the child FileSystem instances. A more advanced implementation might extract some info from the "file identifier" to find the relevant child FileSystem. To give an example, in my case the files sit in many tarballs, but not randomly distributed: from the "file identifier" I am able to find the corresponding tarball, and therefore do not need to open every single one. So if I write a TarFileSystem to access single tarballs as the child FileSystem, I can still write a MyMultiTarFileSystem as the parent FileSystem which implements the mapping from "file identifier" to tarball, and in the MyMultiTarFileSystem open, glob, get... methods it will use this mapping to avoid opening each tarball (see the sketch after this list). So the FileSystem can still be a single keyword argument of Scene, even in the multiple-readers case. It would just mean some extra work for the user to implement their own MultiFileSystem class.
  2. Now I see that I confused Reader with Handler in my previous post...
  3. I am totally with you; using an open_file_object(...) method in the base file handler reduces the implementation to a single location. Whether the specific file handler class uses this method is a different story (example: instead of using the FileSystem.open method to get a file object, the FileHandler could use the FileSystem.get method to download the entire file to a temporary directory and open the local file with some other method).
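
An illustrative sketch of the parent file system from point 1; all names are hypothetical, and the identifier-to-tarball mapping depends on the actual file naming scheme:

import itertools


class MyMultiTarFileSystem:
    """Parent file system delegating to per-tarball child file systems."""

    def __init__(self, children, tarball_for_file):
        # children: dict mapping tarball path -> child file system (e.g. TarFileSystem)
        # tarball_for_file: callable mapping a "file identifier" -> tarball path
        self._children = children
        self._tarball_for_file = tarball_for_file

    def open(self, file_identifier, mode="rb"):
        # Use the mapping to open only the tarball that contains the file.
        tarball = self._tarball_for_file(file_identifier)
        return self._children[tarball].open(file_identifier, mode=mode)

    def glob(self, pattern):
        # Simplest implementation: chain the results of all child file systems.
        return list(itertools.chain.from_iterable(
            child.glob(pattern) for child in self._children.values()))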

Hopefully I'll find some time this week to start the PR. At least I have an idea of where to start.

@djhoese (Member) commented Aug 17, 2020

Your response for 1 makes me think that, if it doesn't already exist in fsspec, they should add a CacheFileSystem that takes two file systems: the first acts as a cache; anything not in the cache is requested from the second one, opened, and written to the first file system (the cache). Oh boy, I should probably do my real work today before I get any more distracted thinking about this.

@gerritholl (Collaborator) commented:

Related: #1062

@gerritholl (Collaborator) commented:

@djhoese See fsspec remote caching.

carloshorn linked a pull request (#1470) on Dec 4, 2020 that will close this issue.