Replies: 3 comments 2 replies
-
Hi @etsmith14 — thanks for adding a question. Unfortunately it's really difficult to be helpful without a reproducible example; please see #5404. I recognize it's also really hard to provide an example with these sorts of problems when the data is large. At a minimum, could you show timings (e.g. …)?
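For example, something along these lines would already be useful (the file pattern, variable name, and point below are just placeholders):

```python
import time
import xarray as xr

t0 = time.perf_counter()
ds = xr.open_mfdataset("GFDL_pr_historical/*.nc")  # placeholder glob
print(f"open_mfdataset: {time.perf_counter() - t0:.2f} s")

t0 = time.perf_counter()
# placeholder variable name and lat/lon point
ts = ds["pr"].sel(lat=40.0, lon=255.0, method="nearest").load()
print(f"select + load one time series: {time.perf_counter() - t0:.2f} s")
```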
-
I'm surprised this chunking: …
-
The raw netcdf files are chunked {time=1, lat=360, lon=720}. I haven't actually done anything additional to set up my Dask cluster (apologies again for being a newbie); I'll dig deeper into the Dask documentation. However, I came across an older thread (#1385) that seemed to describe my problem: if I use decode_cf=False, I get a 10x performance increase. I also found that chunking along time only, with a chunk size of 25, was optimal (I'm not exactly sure why).
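For reference, here's roughly what ended up being fast for me (the file pattern, variable name, and point are placeholders, not my actual setup):

```python
import xarray as xr

# Skip CF decoding while combining the files (this is where #1385 suggested
# most of the time was going), and chunk along time only.
ds = xr.open_mfdataset(
    "GFDL_pr_historical/*.nc",  # placeholder glob
    decode_cf=False,
    chunks={"time": 25},
)

# Decode CF metadata afterwards on the combined dataset; this stays lazy.
ds = xr.decode_cf(ds)

# Selecting a single grid cell's time series is now much faster.
ts = ds["pr"].sel(lat=40.0, lon=255.0, method="nearest").load()
```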
-
I am trying to figure out the most efficient workflow for working with large datasets. I would like to open numerous files as an xarray dataset and access small pieces (single time series) within those files quickly (within a few seconds), without first loading the entire dataset into memory. I currently use open_mfdataset to open about 254 GB of netcdf files from an SSD. These files contain climate model data with dimensions of latitude (360 values), longitude (720 values), and time (~27,000 days). Opening is very fast (0.15 s); however, selecting a single time series (one lat/lon point) from this dataset (GFDL_pr_historical) and plotting it is very slow (254 s). Unsurprisingly, loading a single time series into memory is also very slow. In principle the single time series should be very small, so I assume there is something I am not considering that affects how the time series is read, such as chunking.

To that end, I have tried chunking the data with various lat, lon, and time chunk sizes. I read that I should choose a sufficiently large chunk, preferably along the time dimension, for better performance, but that didn't speed up the read. I am relatively new to Python and xarray, so any insight would be great. The exact code I use is below, with the time each section takes in a comment next to the corresponding print statement. I've tried many other chunk sizes with similar results.
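A rough sketch of what I'm doing (the file pattern, variable name, and chosen point are placeholders, and the chunk sizes shown are just one of the combinations I tried):

```python
import xarray as xr
import matplotlib.pyplot as plt

# Open ~254 GB of netcdf files lazily; this step itself is fast (~0.15 s).
ds = xr.open_mfdataset(
    "GFDL_pr_historical/*.nc",                    # placeholder glob
    chunks={"time": 365, "lat": 90, "lon": 180},  # one of the chunkings tried
)

# Select a single lat/lon time series and plot it; this is where the ~254 s goes.
ts = ds["pr"].sel(lat=40.0, lon=255.0, method="nearest")
ts.plot()
plt.show()
```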