Passing file objects to netCDF4.Dataset doesn't work #295

Open
rabernat opened this issue Oct 5, 2014 · 30 comments

@rabernat

rabernat commented Oct 5, 2014

I am trying to port some code from scipy.io.netcdf_file to netCDF4.Dataset, and I have run into an issue that is pretty significant for me: netCDF4.Dataset expects a string as its argument and cannot accept an open file object. The issue can be seen in the following code:

import netCDF4
from scipy.io import netcdf_file

fobj = open('MODIS.nc', 'rb')
nc3 = netcdf_file(fobj)
fobj.close()

fobj = open('MODIS.nc', 'rb')
nc4 = netCDF4.Dataset(fobj) # this fails
fobj.close()

The second-to-last line raises

TypeError: expected string or Unicode object, file found

This may seem like an unnecessary feature (why not just pass the filename directly?), but the problem is that I have a large archive of bzip2-compressed netCDF files on disk. The way I usually read them is

import bz2
bz2_fobj = bz2.BZ2File('MODIS.nc.bz2')
nc3 = netcdf_file(bz2_fobj)

If I can't do this with netCDF4, I will have to design a clumsy workaround involving system commands to manually unzip the files.

I considered trying to add this feature myself, but then I realized that the whole library was written in C. Hopefully you will consider adding support for reading file objects.

@shoyer
Contributor

shoyer commented Oct 5, 2014

We would also love to have this, but unfortunately, I don't think it's an easy fix (as it would require delving into the netcdf C library). Hopefully @jswhit can elaborate.

@jswhit
Collaborator

jswhit commented Oct 6, 2014

scipy.io.netcdf is a pure Python module that reads and writes netCDF-3 formatted files directly. netcdf4-python is a Python interface to the netCDF C library, and can handle the HDF5-based netCDF file format. There's no way for the C library to utilize a Python file object.

@shoyer
Contributor

shoyer commented Oct 6, 2014

I doubt there's "no way", but I don't have a well-defined sense of how difficult it would be (or who actually knows enough to do it). There is, for example, a C-side API for working with Python file objects: https://docs.python.org/2/c-api/file.html

@jswhit
Collaborator

jswhit commented Oct 6, 2014

I should have said "impossible without extensive modifications to the HDF5 and netCDF C libs".

I suppose as a workaround we could dump the bytes from the open file object to a temp file, and then pass the name of that temp file to the netCDF C lib.

@shoyer
Contributor

shoyer commented Oct 6, 2014

> I suppose as a workaround we could dump the bytes from the open file object to a temp file, and then pass the name of that temp file to the netCDF C lib.

Indeed, this is the simplest way to solve this problem. But I would say that sort of solution belongs in user code, not this library.
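
For readers who want that user-code workaround, here is a minimal sketch. It assumes the file object is readable in binary mode; the helper name `as_temp_path` is hypothetical and not part of netCDF4.

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def as_temp_path(fileobj, suffix=".nc"):
    """Copy a readable binary file-like object to a temp file and yield its path.

    Hypothetical helper: the temp file is deleted when the block exits.
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as tmp:
            shutil.copyfileobj(fileobj, tmp)  # streams in chunks, not all at once
        yield path
    finally:
        os.remove(path)


# Usage (assumes netCDF4 is installed and MODIS.nc.bz2 exists):
#
#   import bz2, netCDF4
#   with bz2.BZ2File("MODIS.nc.bz2") as f, as_temp_path(f) as path:
#       with netCDF4.Dataset(path) as nc:
#           print(nc.variables.keys())
```

The dataset has to be closed before the temp file is removed, which is why the `Dataset` is opened inside the `as_temp_path` block.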

@marqh

marqh commented Oct 29, 2014

Maybe I'm missing something here, but a file handle has a .name attribute, so could the workaround, and by extension the fix in the Python layer, look like this:

fobj = open('MODIS.nc', 'rb')
nc4 = netCDF4.Dataset(fobj.name)
fobj.close()

?

It's not especially pretty, but at least it enables expected behaviour to be preserved.

@shoyer
Contributor

shoyer commented Oct 29, 2014

@marqh that would work for this example, but in general a Python file object only needs to adhere to the file API -- it need not be an actual file on disk (e.g., it could be a BytesIO object).
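
To illustrate the point: an in-memory `io.BytesIO` object has no `.name` attribute at all, so a name-based fallback can only be a best-effort check. A small sketch, where `backing_path` is a hypothetical helper:

```python
import io
import os


def backing_path(fileobj):
    """Return the on-disk path behind a file object, or None if there is none.

    Hypothetical helper: objects such as io.BytesIO have no .name, and a file
    opened from a descriptor reports an int, so both cases return None.
    """
    name = getattr(fileobj, "name", None)
    if isinstance(name, str) and os.path.isfile(name):
        return name
    return None


# An in-memory buffer has no path, so Dataset(fobj.name) cannot work for it.
assert backing_path(io.BytesIO(b"CDF\x01")) is None
```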

@niallrobinson

Hi everyone - I'd love to see a fix for this. It's coming up quite regularly in the "map reduce" world (i.e. Hadoop, Spark, Dask), where we want to be able to pass file objects around and read them quickly, that is, without dumping to disk. Is there anything on the horizon that might help out with this?

@dopplershift
Member

netCDF-C last fall gained support for reading directly from an in-memory buffer that contains the bytes of a netCDF file. It's been on my TODO list to expose this in the Cython wrappers here, but I haven't gotten to it. That's probably your best bet--it's not a file-like object, but at least you wouldn't have to have a file on disk any more.

@niallrobinson

great - thanks for the update

@jgerardsimcock

@dopplershift any update on this?

@rabernat
Author

rabernat commented Nov 3, 2016

As the original creator of this issue, I am pleased to see it is still alive. I am still very interested, although more for the reasons described by @niallrobinson. I believe the in-memory buffer solution could solve things. To clarify, would we be able to pass a BytesIO object?

@niallrobinson

yup - still actively thinking/worrying about this ;)

@dopplershift
Member

It's still on my todo list, but it hasn't bubbled to the top. I'll try to squeeze it in sooner rather than later (since I don't think it's that hard), but can't make any promises (especially before AMS annual meeting in January).

I don't see BytesIO being supported, since the core functionality would be to read the entire contents of a file into memory and point netCDF at it. BytesIO is about wrapping such a buffer so you can access it like a file. So in my mind it would work like this (borrowing from above):

import bz2
from netCDF4 import Dataset
bz2_fobj = bz2.BZ2File('MODIS.nc4.bz2')
nc4 = Dataset(bz2_fobj.read())

Would that serve the use cases mentioned here?

@shoyer
Contributor

shoyer commented Nov 3, 2016

An interface that accepts file images in the form of bytes would be a big improvement over what we have now.

The driver of performance is the number of memory copies. With scipy.io.netcdf and BytesIO, you can actually pull out np.memmap arrays from an in-memory file image with zero copies. In general, this is impossible for netCDF4, because HDF5's memory layout is (often) incompatible with NumPy. But if we can avoid making a copy in netCDF and simply reuse the raw bytes from Python, that would be very nice. If that's not possible, a memory copy is still an improvement over needing to read from disk.

@jswhit
Collaborator

jswhit commented Nov 3, 2016

Here's the documentation for the netCDF-C routine (nc_open_mem) that we could wrap in cython:

http://www.unidata.ucar.edu/software/netcdf/docs/group__datasets.html#gac12fdf7579a2619b2aeb238cea2e7377

@thehesiod
Contributor

thehesiod commented Apr 28, 2017

@jswhit nice!!! I'm going to see if I can get this to work in a fork. Update: I created the linked PR; unfortunately nc_open_mem is broken :(

@thehesiod
Contributor

Update for others on this thread: in master you can now open a file from memory (not released to PyPI yet, unfortunately).

@ReimarBauer

@thehesiod can you show an example, please? I am interested in using this with pyfilesystem2, e.g. for direct WebDAV or FTP access.

@dopplershift
Member

dopplershift commented Oct 23, 2017

You should be able to use:

Dataset('myname', memory=fobj.read())

There was a problem with myname needing to point to an existing (and valid) netCDF file, but that should be fixed in netCDF 4.5.0, which was just released.

@thehesiod
Contributor

thehesiod commented Oct 23, 2017

It still must be a non-empty name, I believe.
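
Putting the comments above together for the original bzip2 use case, a sketch might look like this. It assumes a netCDF4-python build that has the `memory` keyword and netCDF-C 4.5.0 or later; the helper names are hypothetical, and the first argument to `Dataset` is treated only as a non-empty label.

```python
import bz2
import os


def read_bz2_bytes(path):
    """Return the decompressed contents of a .bz2 file as bytes."""
    with bz2.open(path, "rb") as f:
        return f.read()


def open_bz2_dataset(path):
    """Open a bzip2-compressed netCDF file without writing it to disk.

    Hypothetical helper: assumes netCDF4 supports Dataset(..., memory=...).
    """
    import netCDF4  # imported lazily so read_bz2_bytes works without it

    data = read_bz2_bytes(path)
    # The name is only a label here, but it must not be empty.
    label = os.path.basename(path) or "in-memory"
    return netCDF4.Dataset(label, mode="r", memory=data)
```

Note that the whole decompressed file is held in memory, so this trades disk I/O for RAM.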

@kuchaale

kuchaale commented Apr 10, 2018

@ReimarBauer if still interested, here is the solution where I used pyfilesystem2 to read zipped netcdf files:

from fs.zipfs import ZipFS
import xarray as xr
import netCDF4

new_zip = ZipFS("results.zip")
# Read one archive member as raw bytes (renamed to avoid shadowing the built-in bytes).
nc_bytes = new_zip.getbytes(u'one_file_within_zip.nc')
nc4_ds = netCDF4.Dataset('name', mode='r', memory=nc_bytes)
store = xr.backends.NetCDF4DataStore(nc4_ds)
ds = xr.open_dataset(store)
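
If pyfilesystem2 is not available, the same bytes can probably be obtained with the standard library's `zipfile` module instead. This is a swapped-in alternative, not what the comment above uses, and `read_zip_member` is a hypothetical helper name.

```python
import zipfile


def read_zip_member(zip_source, member):
    """Return one member of a zip archive as bytes.

    zip_source may be a path or a seekable binary file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        return archive.read(member)


# Usage (assumes netCDF4 is installed and results.zip exists):
#
#   import netCDF4
#   nc_bytes = read_zip_member("results.zip", "one_file_within_zip.nc")
#   nc4_ds = netCDF4.Dataset("name", mode="r", memory=nc_bytes)
```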

@mir-una

mir-una commented Apr 23, 2018

Hello, I am retrieving a BytesIO object from a REST API response, and I would like to read the Dataset directly from it without first writing the object to disk. Is there a way to do this?

@dopplershift
Member

@mir-una Just like the other ones above:

data_bytes = response.read()
nc4_ds = netCDF4.Dataset('name', mode='r', memory=data_bytes)

@mir-una

mir-una commented Apr 23, 2018

@dopplershift thank you, but I cannot figure it out. I am using the requests package, and read() does not seem to be a supported method. I am doing the following:

response = requests.get(my_url, params=token, stream=True)
x = BytesIO(response.content)
y = Dataset('name', mode='r', memory=x)

I get an error:

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Dataset.__init__()
ValueError: memory mode only works with 'r' modes and must be bytes

I am using Python 2.7. Is this because Python 2.7 does not have a definition of bytes, or is it something else?

@dopplershift
Member

@mir-una Using BytesIO is unnecessary, try:

response = requests.get(my_url, params=token, stream=True)
y = Dataset('name', mode='r', memory=response.content)
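
Since the error above says `memory` must be bytes, one option is to normalize whatever you have into bytes before calling `Dataset`. A minimal sketch; `as_bytes` is a hypothetical helper, not part of netCDF4 or requests:

```python
import io


def as_bytes(source):
    """Normalize bytes-like or file-like input to a plain bytes object."""
    if isinstance(source, (bytes, bytearray, memoryview)):
        return bytes(source)
    if isinstance(source, io.BytesIO):
        # getvalue() returns the whole buffer regardless of the read position.
        return source.getvalue()
    # Fall back to the file API; assumes the object was opened in binary mode.
    return source.read()


# Usage (assumes requests and netCDF4 are installed; my_url and token as above):
#
#   response = requests.get(my_url, params=token)
#   y = Dataset('name', mode='r', memory=as_bytes(response.content))
```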

@tam203

tam203 commented Mar 29, 2019

I'm trying to read into a Dataset from memory as per the docs, but it's not working. I tried Python 2.7 and 3.7 and get the same error:

[ec2-user@ip-172-31-12-20 project]$ python3 inmem.py
Traceback (most recent call last):
  File "inmem.py", line 5, in <module>
    netCDF4.Dataset("in-mem-file", mode='r', memory=data)
  File "netCDF4/_netCDF4.pyx", line 2285, in netCDF4._netCDF4.Dataset.__init__
  File "netCDF4/_netCDF4.pyx", line 1855, in netCDF4._netCDF4._ensure_nc_success
FileNotFoundError: [Errno 2] No such file or directory: b'in-mem-file'

code:

import netCDF4
with open('./db8d6757c80a3fa51779a325ba76336451ea0344.nc','rb') as fp:
    data = fp.read()
ds = netCDF4.Dataset("in-mem-file", mode='r', memory=data)
print(ds)

netCDF4 version '1.5.0'

A FileNotFoundError seems irrelevant, since I'm trying to read from memory. Help much appreciated.

@jswhit
Collaborator

jswhit commented Apr 13, 2019

Can you post the file here? (attach to ticket as a gzipped tar file?)

@jswhit
Collaborator

jswhit commented Apr 13, 2019

Also, what version of netcdf-c are you using? (you can check by looking at the __netcdf4libversion__ module variable).

@tam203

tam203 commented Apr 15, 2019

Thanks @jswhit, you solved my issue over here.

It was version 4.4.1.1 of the lib.
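
For anyone hitting the same FileNotFoundError, a quick version check along these lines may help. It is a sketch that assumes 4.5.0 is the relevant netCDF-C threshold, as suggested earlier in this thread; the function names are hypothetical.

```python
import re


def version_tuple(version):
    """Parse the leading dotted-number part of a version string into a tuple."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError("unrecognized version string: %r" % (version,))
    return tuple(int(part) for part in match.group().split("."))


def memory_open_expected_to_work(libversion, minimum=(4, 5, 0)):
    """Guess whether Dataset(..., memory=...) works without a file on disk.

    The 4.5.0 minimum is an assumption taken from the discussion above.
    """
    return version_tuple(libversion) >= minimum


# Usage (assumes netCDF4 is installed):
#
#   import netCDF4
#   print(memory_open_expected_to_work(netCDF4.__netcdf4libversion__))
```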
