Replies: 3 comments 2 replies
-
Hi @etsmith14 — thanks for adding a question. Unfortunately it's really difficult to be helpful without a reproducible example; please see #5404. I recognize it's also really hard to provide an example with these sorts of problems when the data is large. At a minimum, could you show timings (e.g. …)?
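For example, something along these lines would already be useful (the file pattern, variable name, and point below are just placeholders):

```python
import time
import xarray as xr

t0 = time.perf_counter()
ds = xr.open_mfdataset("GFDL_pr_historical/*.nc")  # placeholder glob
print(f"open_mfdataset: {time.perf_counter() - t0:.2f} s")

t0 = time.perf_counter()
# placeholder variable name and lat/lon point
ts = ds["pr"].sel(lat=40.0, lon=255.0, method="nearest").load()
print(f"select + load one time series: {time.perf_counter() - t0:.2f} s")
```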
-
I'm surprised this chunking: …
-
The raw netcdf files are chunked {time=1, lat=360, lon=720}. I haven't actually done anything additional to set up my Dask cluster (apologies again for being a newbie); I'll dig deeper into the Dask documentation. However, I came across an older thread (#1385) that seemed to describe my problem: if I use decode_cf=False, I get a 10x performance increase. I also found that chunking along time only, with a chunk size of 25, was optimal (I'm not exactly sure why).
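For reference, here's roughly what ended up being fast for me (the file pattern, variable name, and point are placeholders, not my actual setup):

```python
import xarray as xr

# Skip CF decoding while combining the files (this is where #1385 suggested
# most of the time was going), and chunk along time only.
ds = xr.open_mfdataset(
    "GFDL_pr_historical/*.nc",  # placeholder glob
    decode_cf=False,
    chunks={"time": 25},
)

# Decode CF metadata afterwards on the combined dataset; this stays lazy.
ds = xr.decode_cf(ds)

# Selecting a single grid cell's time series is now much faster.
ts = ds["pr"].sel(lat=40.0, lon=255.0, method="nearest").load()
```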
-
I am trying to figure out the most efficient workflow for working with large datasets. I would like to open numerous files as an xarray dataset and access small pieces (single time series) within those files quickly (within a few seconds), without first loading the entire dataset into memory. I currently use open_mfdataset to open about 254 GB of netcdf files from an SSD. These files contain climate model data with dimensions of latitude (360 values), longitude (720 values), and time (~27,000 days). Opening is very fast (0.15 s); however, selecting a single time series (one lat/lon point) from this dataset (GFDL_pr_historical) and plotting it is very slow (254 s). Unsurprisingly, loading a single time series into memory is also very slow. In principle the single time series should be very small, so I assume there is something I am not considering that affects how the time series is read, such as chunking.

To that end, I have tried chunking the data with various lat, lon, and time chunk sizes. I read that I should choose a sufficiently large chunk, preferably along the time dimension, for better performance, but that didn't speed up the read. I am relatively new to Python and xarray, so any insight would be great. The exact code I use is below, with the time each section takes in a comment next to the corresponding print statement. I've tried many other chunk sizes with similar results.
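A rough sketch of what I'm doing (the file pattern, variable name, and chosen point are placeholders, and the chunk sizes shown are just one of the combinations I tried):

```python
import xarray as xr
import matplotlib.pyplot as plt

# Open ~254 GB of netcdf files lazily; this step itself is fast (~0.15 s).
ds = xr.open_mfdataset(
    "GFDL_pr_historical/*.nc",                    # placeholder glob
    chunks={"time": 365, "lat": 90, "lon": 180},  # one of the chunkings tried
)

# Select a single lat/lon time series and plot it; this is where the ~254 s goes.
ts = ds["pr"].sel(lat=40.0, lon=255.0, method="nearest")
ts.plot()
plt.show()
```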