Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[New Implementation] Dramatically speed up dataset creation by caching geographic coordinates #341

Merged
merged 13 commits into from
Jun 2, 2023

Conversation

meridionaljet
Copy link
Contributor

This is an updated implementation of #338 , addressing a massive performance bottleneck when opening a GRIB file as an xarray dataset. Currently, cfgrib calls cfgrib.dataset.build_geography_coordinates() for every parameter in the index when creating a dataset. Each call requires eccodes's grib_get_array to be called, which reads coordinate arrays from disk. This is prohibitively expensive for large files with many records, and almost always unnecessary since GRIB files typically have identical grids for each record.

This pull request introduces automatic caching of geographic coordinate data by default when calling cfgrib.open_dataset() or cfgrib.open_datasets(). The caching logic is embedded into cfgrib.dataset.build_variable_components(), utilizing the md5sum of the Grid Definition Section of the GRIB file (thanks @iainrussell for that suggestion).

This approach reduces the cfgrib.open_dataset() time for a 262MB HRRR file from NCEP from 3.4 seconds to 45 milliseconds on my machine. If the full 400MB HRRR file with 43 different hypercube types is opened with cfgrib.open_datasets(), the time taken is reduced from 38 seconds to 2 seconds. This thus results in a speedup of 1-2 orders of magnitude, depending on the size of the file and the number of unique hypercubes.

The only possible negative side effect that I can see is a small one: the cache must be implemented globally and thus can theoretically grow unboundedly in a long-lived application wherein cfgrib opens many different grid geometries. I have thus included a mechanism for the user to opt out of coordinate caching by passing cache_geo_coords=False to backend_kwargs. Practically, this should be a rare need, since the total data size would cause memory issues for a typical user long before the coordinate cache would, and most workflows read a small number of unique grid geometries.

The speedup offered here releases a significant bottleneck in data processing workflows using xarray and cfgrib , especially for large files, making xarray dataset creation for GRIB almost as cheap as it is for other data formats like NetCDF and zarr.

@meridionaljet
Copy link
Contributor Author

Fixed the failing code format check

@codecov-commenter
Copy link

codecov-commenter commented Jun 1, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.03 🎉

Comparison is base (2b2e190) 95.62% compared to head (001f003) 95.65%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #341      +/-   ##
==========================================
+ Coverage   95.62%   95.65%   +0.03%     
==========================================
  Files          26       26              
  Lines        2056     2073      +17     
  Branches      236      238       +2     
==========================================
+ Hits         1966     1983      +17     
  Misses         59       59              
  Partials       31       31              
Impacted Files Coverage Δ
cfgrib/xarray_plugin.py 88.40% <ø> (ø)
cfgrib/dataset.py 98.45% <100.00%> (+0.05%) ⬆️
tests/test_40_xarray_store.py 100.00% <100.00%> (ø)

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

cfgrib/dataset.py Outdated Show resolved Hide resolved
tests/test_40_xarray_store.py Outdated Show resolved Hide resolved
@iainrussell
Copy link
Member

Many thanks @meridionaljet , this is a really nice improvement - I just added a couple of comments above, then I think we're close to merging it in!

@meridionaljet
Copy link
Contributor Author

Requested tweaks by @iainrussell have been implemented

@iainrussell iainrussell merged commit cccbdb7 into ecmwf:master Jun 2, 2023
@iainrussell
Copy link
Member

Thank you @meridionaljet ! I really like this improvement, and the fact that you added documentation, a test, and also a way to disable it in case of it being used as part of a long-running server. Thanks also for being patient with my suggestions and taking them on board, I think this solution is nice because it works 'out of the box' and does not have the risk of corrupted xarrays if the incoming GRIB file has multiple geometries. Thanks again!

@martindurant
Copy link

For kerchunk's use, we would really most like to simple not calculate coordinates at all, as we can store them elsewhere. If it were possible, then, to just skip the bytes that define the geometry to the actual measurements in a given message, all the better. Do you think this is possible?

@iainrussell
Copy link
Member

Hi @martindurant, could you create a new issue for this use case please? It would be good to see an example of a GRIB file and how you would like the resulting xarray to look. It's not clear if you want to remove all the coordinates, including the time and vertical dimensions, and if this is for performance, memory or aesthetics. So if if it really would be useful, pop it in another issue and we can discuss there!
Cheers,
Iain

@TAdeJong
Copy link

TAdeJong commented Jan 24, 2024

Edit: cfgrib 0.9.11.0 incorporating these changes has now been released! 😀

This pull_request greatly increases the speed of our workflow. However, installing from source is somewhat of a hassle.
@iainrussell, I see you are recently doing work on this repository again. Are there plans for a new release soon? It would greatly help us, and I am sure a lot of other people using grib files and xarray 😄 .

(I couldn't really think of another place to ask this, so I hope this way is OK.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants