Load `esmvalcore.dataset.Dataset` objects in parallel using Dask #2517

bouweandela · 2024-09-09T07:52:13Z

Description

Load the individual files of a dataset in parallel using Dask, and add the option to get a `dask.delayed.Delayed` back from `esmvalcore.dataset.Dataset.load` that can be passed to `dask.compute` to obtain an `iris.cube.Cube`. This can considerably speed up loading datasets that consist of many files or, when used with the delayed option, speed up loading multiple datasets.
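For readers unfamiliar with the pattern, here is a minimal sketch of the idea using plain Dask. It is not the code in this pull request: `load_file`, `merge`, `load_dataset`, and the `compute` flag are hypothetical stand-ins, and the "files" are just strings rather than real data.

```python
import dask
from dask.delayed import Delayed


def load_file(path: str) -> dict:
    """Hypothetical stand-in for reading one file from disk."""
    return {"path": path, "n_points": len(path)}


def merge(parts: list[dict]) -> dict:
    """Hypothetical stand-in for combining the per-file results."""
    return {
        "paths": [part["path"] for part in parts],
        "n_points": sum(part["n_points"] for part in parts),
    }


def load_dataset(paths: list[str], compute: bool = True):
    """Build one task per file, then either run the graph or return it lazily."""
    parts = [dask.delayed(load_file)(path) for path in paths]
    result = dask.delayed(merge)(parts)
    return result.compute() if compute else result


# Eager use: the files of a single dataset are loaded in parallel.
single = load_dataset(["a.nc", "bb.nc"])

# Lazy use: collect several Delayed objects and compute them in one call,
# so the files of all datasets can be loaded in parallel.
lazy = [
    load_dataset(["a.nc", "bb.nc"], compute=False),
    load_dataset(["ccc.nc"], compute=False),
]
first, second = dask.compute(*lazy, scheduler="threads")
```

Returning the `Delayed` object instead of the computed result is what lets a caller batch several datasets into a single `dask.compute` call.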

Related to #2300 and #2316

Link to documentation: https://esmvaltool--2517.org.readthedocs.build/projects/ESMValCore/en/2517/api/esmvalcore.dataset.html#esmvalcore.dataset.Dataset.load

Before you get started

☝ Create an issue to discuss what you are going to do

Checklist

It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.

🧪 The new functionality is relevant and scientifically sound
🛠 This pull request has a descriptive title and labels
🛠 Code is written according to the code quality guidelines
🧪 and 🛠 Documentation is available
🛠 Unit tests have been added
🛠 Changes are backward compatible
🛠 The list of authors is up to date
🛠 All checks below this pull request were successful

To help with the number of pull requests:

🙏 We kindly ask you to review two other open pull requests in this repository

codecov · 2024-09-09T07:58:13Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.84%. Comparing base (4b0dd41) to head (4057950).

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #2517   +/-   ##
=======================================
  Coverage   94.83%   94.84%           
=======================================
  Files         251      251           
  Lines       14191    14210   +19     
=======================================
+ Hits        13458    13477   +19     
  Misses        733      733


valeriupredoi

this is brilliant, bud! I've been meaning to get delayed in places in Core for some time. Got one possible nagging comment though: the Dask best practices at https://docs.dask.org/en/stable/delayed-best-practices.html say "Every delayed task has an overhead of a few hundred microseconds. Usually this is ok, but it can become a problem if you apply dask.delayed too finely. In this case, it’s often best to break up your many tasks into batches or use one of the Dask collections to help you." I am guessing this only becomes a problem at O(millions) of tasks (at least), but can we maybe run a test with one of those mega recipes that loads hundreds of datasets?
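The batching advice quoted above can be sketched roughly as follows. This is an illustration only, not code from this pull request: `process`, `process_batch`, and `batched` are hypothetical helpers, and the batch size of 4 is arbitrary.

```python
import dask


def process(item: int) -> int:
    """Hypothetical cheap per-item operation."""
    return item * 2


def process_batch(batch: list[int]) -> list[int]:
    """Handle a whole batch inside one task to amortize the per-task overhead."""
    return [process(item) for item in batch]


def batched(items: list[int], size: int) -> list[list[int]]:
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i : i + size] for i in range(0, len(items), size)]


items = list(range(10))

# One delayed task per batch (3 tasks here) instead of one per item (10 tasks).
tasks = [dask.delayed(process_batch)(batch) for batch in batched(items, 4)]
chunks = dask.compute(*tasks, scheduler="threads")

# Flatten the per-batch results back into a single list.
results = [value for chunk in chunks for value in chunk]
```

With one task per file, the task count only reaches the range where this matters for very large numbers of files, which is why a test with a large recipe is a reasonable check.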

valeriupredoi · 2024-09-20T12:48:21Z

oh and maybe a line or two in the documentation? It's a bit of an advanced topic, so a very short reference would do

bouweandela added the `dask` label (related to improvements using Dask) Sep 9, 2024

bouweandela added 2 commits September 9, 2024 14:13

Parallel load datasets

4649d87

Add test and improve documentation

bc889ba

bouweandela force-pushed the parallel-load branch from 56d24a9 to bc889ba Compare September 9, 2024 12:13

bouweandela marked this pull request as ready for review September 11, 2024 07:15

This was referenced Sep 12, 2024

Performance improvement: recipe_extremes_wind_3h.yml #2301

Open

Performance improvement: recipe_easy_ipcc.yml #2300

Open

Merge branch 'main' into parallel-load

a21996d

valeriupredoi approved these changes Sep 20, 2024


bouweandela added 3 commits September 26, 2024 21:39

Use ruff formatting

0bd3da1

Merge remote-tracking branch 'origin/main' into parallel-load

7e232a3

Add type hint

4057950
