Load esmvalcore.dataset.Dataset objects in parallel using Dask #2517

Open
wants to merge 6 commits into
base: main

@bouweandela (Member) commented Sep 9, 2024

Description

Load the individual files in a dataset in parallel using Dask, and add an option to get a dask.delayed.Delayed back from esmvalcore.dataset.Dataset.load that can be passed to dask.compute to obtain an iris.cube.Cube. This can considerably speed up loading datasets that consist of many files or, when the delayed option is used, loading multiple datasets at once.
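
A minimal sketch of the general pattern, not the implementation in this pull request: each file becomes one delayed task, a final task merges the per-file results, and several datasets are computed together with a single dask.compute call. The functions load_file, concatenate and load_dataset are hypothetical stand-ins invented for illustration; they do not touch esmvalcore or iris, and the file names are made up.

```python
import dask
from dask.delayed import Delayed


def load_file(path: str) -> list[float]:
    """Hypothetical stand-in for reading one file; returns fake data."""
    return [float(len(path))]


def concatenate(parts: list[list[float]]) -> list[float]:
    """Hypothetical stand-in for merging per-file results into one object."""
    return [value for part in parts for value in part]


def load_dataset(paths: list[str]) -> Delayed:
    """Build a lazy task graph; nothing is read until it is computed."""
    parts = [dask.delayed(load_file)(path) for path in paths]
    return dask.delayed(concatenate)(parts)


# Made-up file names for two datasets.
datasets = {
    "a": ["a_1.nc", "a_22.nc"],
    "b": ["b_333.nc"],
}

# One Delayed object per dataset, then a single compute call so that
# Dask can schedule the file loads of all datasets together.
delayed_results = [load_dataset(paths) for paths in datasets.values()]
results = dask.compute(*delayed_results)
```

Passing all Delayed objects to one dask.compute call, rather than computing them one by one, is what lets the scheduler run the loads of different datasets at the same time.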

Related to #2300 and #2316

Link to documentation: https://esmvaltool--2517.org.readthedocs.build/projects/ESMValCore/en/2517/api/esmvalcore.dataset.html#esmvalcore.dataset.Dataset.load


Before you get started

Checklist

It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.


To help with the number of pull requests:

@bouweandela added the "dask" label (related to improvements using Dask) on Sep 9, 2024

codecov bot commented Sep 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.84%. Comparing base (4b0dd41) to head (4057950).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2517   +/-   ##
=======================================
  Coverage   94.83%   94.84%           
=======================================
  Files         251      251           
  Lines       14191    14210   +19     
=======================================
+ Hits        13458    13477   +19     
  Misses        733      733           


@valeriupredoi (Contributor) left a comment

This is brilliant, bud! I've been meaning to get delayed into places in Core for some time. I've got one possible nagging comment though: https://docs.dask.org/en/stable/delayed-best-practices.html says "Every delayed task has an overhead of a few hundred microseconds. Usually this is ok, but it can become a problem if you apply dask.delayed too finely. In this case, it’s often best to break up your many tasks into batches or use one of the Dask collections to help you." I am guessing this only starts to matter at O(millions) of tasks (at least), but can we maybe run a test with one of those mega recipes that loads hundreds of datasets?
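
For reference, the batching that the Dask docs suggest could look roughly like the sketch below. It is only an illustration of the technique, not something this pull request does: load_file, load_batch and batched are hypothetical helpers, and the batch size of 100 is an arbitrary assumption.

```python
import dask


def load_file(path: str) -> int:
    """Hypothetical stand-in for loading a single file."""
    return len(path)


def load_batch(paths: list[str]) -> list[int]:
    """Load several files inside one task to amortize per-task overhead."""
    return [load_file(path) for path in paths]


def batched(items: list[str], size: int) -> list[list[str]]:
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Made-up file names; 1000 files become 10 delayed tasks instead of 1000.
paths = [f"file_{i}.nc" for i in range(1000)]
tasks = [dask.delayed(load_batch)(batch) for batch in batched(paths, 100)]

# Flatten the per-batch results back into one list, in the original order.
loaded = [item for batch in dask.compute(*tasks) for item in batch]
```

The trade-off is that larger batches reduce scheduler overhead but also reduce the amount of parallelism available.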

@valeriupredoi (Contributor)

Oh, and maybe add a line or two to the documentation? It's a bit of an advanced topic, so a very short reference would do.
