
Add support for configuring Dask distributed #2049

Merged
merged 15 commits into main from configure-dask-distributed on Jun 1, 2023

Conversation

bouweandela
Member

@bouweandela bouweandela commented May 19, 2023

Description

Add support for configuring the Dask scheduler/cluster that is used to run the tool. See the linked documentation for information on how to use the feature.

Use this together with ESMValGroup/ESMValTool#3151 to let the Python diagnostics also make use of the Dask cluster.

Closes #2040

Link to documentation: https://esmvaltool--2049.org.readthedocs.build/projects/ESMValCore/en/2049/quickstart/configure.html#dask-distributed-configuration
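
For a first impression of the configuration format: the cluster is described in a small dask.yml file (see the linked documentation for the exact location and the authoritative list of options). A minimal sketch with purely illustrative values:

cluster:
  type: distributed.LocalCluster
  n_workers: 8
  memory_limit: 32 GiB

The type key names the cluster class to create; the remaining keys are passed to it as keyword arguments, as the examples further down in this thread show.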


Before you get started

Checklist

It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.


To help with the number of pull requests:

@bouweandela bouweandela added the enhancement New feature or request label May 19, 2023
@bouweandela bouweandela added this to the v2.9.0 milestone May 22, 2023
@codecov

codecov bot commented May 22, 2023

Codecov Report

Merging #2049 (2b428a3) into main (3d6ed66) will increase coverage by 0.03%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #2049      +/-   ##
==========================================
+ Coverage   92.87%   92.90%   +0.03%     
==========================================
  Files         234      235       +1     
  Lines       12447    12506      +59     
==========================================
+ Hits        11560    11619      +59     
  Misses        887      887              
| Impacted Files | Coverage Δ |
|---|---|
| esmvalcore/_main.py | 90.87% <100.00%> (+0.07%) ⬆️ |
| esmvalcore/_task.py | 72.42% <100.00%> (+0.80%) ⬆️ |
| esmvalcore/config/_dask.py | 100.00% <100.00%> (ø) |
| esmvalcore/experimental/recipe.py | 90.16% <100.00%> (+0.16%) ⬆️ |

Contributor

@valeriupredoi valeriupredoi left a comment

small bits and bobs before I go into the Python scripts, bud 🍺

'`scheduler <https://docs.dask.org/en/stable/scheduling.html>`_'.
The default scheduler in Dask is rather basic, so it can only run on a single
computer and it may not always find the optimal task scheduling solution,
resulting in excessive memory use when using e.g. the
Contributor

Suggested change:
- resulting in excessive memory use when using e.g. the
+ resulting in excessive memory use when running an already memory-intensive task like e.g. the

Member Author

The problem isn't so much that it's memory-intensive, but that the task graph becomes too complicated for the built-in scheduler.

Contributor

yes - for regular Joe the Modeller: moar memory! Let's scare them before they even think of touching anything 😁

Contributor

@valeriupredoi valeriupredoi left a comment

couple more comments, bud. test_dask.py to follow, then we done here 😁

esmvalcore/config/_dask.py
if CONFIG_FILE.exists():
    config = yaml.safe_load(CONFIG_FILE.read_text(encoding='utf-8'))
    if config is not None:
        dask_args = config
Contributor

a warning would be nice, telling the user to have the config available and configured if they want to use dasky stuff

Member Author

Added in 25dc5ce

Contributor

@valeriupredoi valeriupredoi left a comment

very nice work @bouweandela 🎖️ Couple comments I left earlier that have not been resolved, up to you to address, no biggie. It looks very nice and hopefully it'll work well in practice too, i.e. iris will play well

@valeriupredoi valeriupredoi added the dask related to improvements using Dask label May 23, 2023
@valeriupredoi
Contributor

btw I've just created a label called "Dask" - let's use that for Dask-related improvements, and maybe even collect Dask items in the changelog, at least for this upcoming release, that'll be Dask-heavy. Wanted to call it Bouwe instead of Dask 😁

Contributor

@remi-kazeroni remi-kazeroni left a comment

Thanks a lot @bouweandela! Just taking a look at the well written Docs to get a better idea of these new developments.

Co-authored-by: Rémi Kazeroni <[email protected]>
@schlunma schlunma self-requested a review May 24, 2023 13:45
@schlunma
Contributor

I'd like to test this with a couple of recipes, please do not merge just yet 😁

@valeriupredoi
Contributor

I'd like to test this with a couple of recipes, please do not merge just yet 😁

good call, Manu! 🍺

Contributor

@schlunma schlunma left a comment

Hi @bouweandela, I just tested this thoroughly - this is really amazing stuff and works super well! 🚀 For the first time ever I was able to run ESMValTool on multiple nodes! In addition, monitoring an ESMValTool run with a dask dashboard is also super convenient. See here a nice visualization of our example recipe:

[Image: esmvaltool_dask, a Dask dashboard visualization of the example recipe run]

I have some general comments/questions:

  • How does this interact with our current multiprocessing-based parallelization? I tested this with different --max-parallel-tasks options and all seemed to work well, but I do not quite understand how. Does every process use its own cluster? Or do they share it? Is this safe?
  • As far as I understand the only way to use multiple nodes in a SLURMCluster is to use the cluster.scale(jobs=n_jobs) method, see here. Would it make sense to add a scale option to the configuration file where one could add keyword arguments for scale?
  • When running this, I always get UserWarning: Sending large graph of size 56.83 MiB. This may cause some slowdown. Consider scattering data ahead of time and using futures., even for the example recipe. Not sure if this is/will be a problem. Not something we need to tackle here though; just wanted to document it.

Thanks!! 👍

Comment on lines +264 to +267
Create a Dask distributed cluster on the
`Levante <https://docs.dkrz.de/doc/levante/running-jobs/index.html>`_
supercomputer using the `Dask-Jobqueue <https://jobqueue.dask.org/en/latest/>`_
package:
Contributor

It would be nice to mention that this needs to be installed by the user (e.g., mamba install dask-jobqueue) because it's not part of our environment.

Member Author

I like @valeriupredoi's suggestion of just adding it to the dependencies. It doesn't have any dependencies that we do not already have and it's a very small Python package.

Contributor

Sounds good, that's even better!

Member Author

Done in 25dc5ce

@valeriupredoi
Contributor

That's wonderful to see, Manu, awesome you tested, and of course, awesome work by Bouwe! I'll let Bouwe answer your q's (although I know the answer for the first q from code review hehe), but about the jobqueue dep - we should probably just add it in our env, I'd reckon whatever's Dask-related stuff should be tweaked at a minimum by the user. About data scattering - I've done it before, and have seen marginal performance improvement, but I used fairly small data sizes, so it could be better for busloads of data, as we have 😁

@bouweandela
Member Author

Thanks for testing @schlunma!

How does this interact with our current multiprocessing-based parallelization? I tested this with different --max-parallel-tasks options and all seemed to work well, but I do not quite understand how. Does every process use its own cluster? Or do they share it? Is this safe?

When no cluster is configured, every process will use its own. This is what is causing memory errors at the moment. With the pull request, this is fixed (provided that a user sets up a cluster), by passing the scheduler address to the process running the task and making it connect to the cluster. You can see that here in the code:

# Imports shown for context:
import contextlib

from distributed import Client

# The scheduler address is passed along to the subprocess that runs the task:
future = pool.apply_async(_run_task,
                          [task, scheduler_address])


def _run_task(task, scheduler_address):
    """Run task and return the result."""
    if scheduler_address is None:
        client = contextlib.nullcontext()
    else:
        client = Client(scheduler_address)
    with client:
        output_files = task.run()
    return output_files, task.products

The multiprocessing configuration is used to parallelize the metadata crunching (i.e. all the stuff with coordinates etc). In a future pull request, I intend to make it possible to also run that on a Dask cluster, see #2041.

As far as I understand the only way to use multiple nodes in a SLURMCluster is to use the cluster.scale(jobs=n_jobs) method, see here. Would it make sense to add a scale option to the configuration file where one could add keyword arguments for scale?

This can also be configured with the parameter n_workers:
https://jobqueue.dask.org/en/latest/generated/dask_jobqueue.SLURMCluster.html#dask-jobqueue-slurmcluster
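
For example, a dask.yml cluster section along these lines (queue name and resource values are placeholders) asks for 16 workers up front, and the cluster then submits as many SLURM jobs as are needed to host them:

cluster:
  type: dask_jobqueue.SLURMCluster
  queue: compute
  cores: 128
  memory: 256 GiB
  n_workers: 16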

When running this, I always get UserWarning: Sending large graph of size 56.83 MiB. This may cause some slowdown. Consider scattering data ahead of time and using futures., even for the example recipe. Not sure if this is/will be a problem. Not something we need to tackle here though; just wanted to document it.

I suspect this warning occurs because some of the preprocessor functions are not lazy, e.g. multi_model_statistics.

@schlunma
Contributor

When no cluster is configured, every process will use its own. This is what is causing memory errors at the moment.

Could you elaborate on that? What do you mean by "at the moment"? The current main? Do you have an example for such a memory error?

With the pull request, this is fixed (provided that a user sets up a cluster), by passing the scheduler address to the process running the task and making it connect to the cluster. The multiprocessing configuration is used to parallelize the metadata crunching (i.e. all the stuff with coordinates etc).

True, could have guessed that from the code 😁 I suppose the cluster is smart enough to handle input from different processes? Are there any recommendations for the ratio max_parallel_tasks/n_workers? E.g., does it make sense to use 16 parallel tasks with just 1 worker?

In a future pull request, I intend to make it possible to also run that on a Dask cluster, see #2041.

Cool!

This can also be configured with the parameter n_workers: https://jobqueue.dask.org/en/latest/generated/dask_jobqueue.SLURMCluster.html#dask-jobqueue-slurmcluster

I tried that with n_workers=4, but just got 1 node. I now used n_workers=16 and got 4 nodes. So I guess the number of nodes is a combination of the different parameters.

@schlunma Did you also have a chance to test this with some recipes using a large amount of data? I would be interested in your results.

Not yet, but I can try to run a heavy recipe with this 👍

@schlunma
Contributor

Alright, I did a lot of tests now. Note: Varying max_parallel_tasks did not significantly change the outcome in any test below.

Test with existing heavy recipe using distributed.LocalCluster on a 512 GiB compute node

Bad news first: I couldn't get one of the heaviest recipes I know (recipe_schlund20esd.yml) to run. I tried many different options (see table below), but always got a concurrent.futures._base.CancelledError (see log below).

Test with simple heavy recipe that uses a purely lazy preprocessor using distributed.LocalCluster on a 512 GiB compute node

I then created a very simple heavy recipe that uses some high-res CMIP6 data:

preprocessors:
  test:
    regrid:
      target_grid: 2x2
      scheme: linear

diagnostics:

  test:
    variables:
      ta:
        preprocessor: test
        mip: Amon
        project: CMIP6
        exp: hist-1950
        start_year: 1995
        end_year: 2014
    additional_datasets:
      - {dataset: CMCC-CM2-VHR4,   ensemble: r1i1p1f1, grid: gn}
      - {dataset: CNRM-CM6-1-HR,   ensemble: r1i1p1f2, grid: gr}
      - {dataset: ECMWF-IFS-HR,    ensemble: r1i1p1f1, grid: gr}
      - {dataset: HadGEM3-GC31-HM, ensemble: r1i1p1f1, grid: gn}
      - {dataset: MPI-ESM1-2-XR,   ensemble: r1i1p1f1, grid: gn}
    scripts:
      null

The good news: With a purely lazy preprocessor (regrid), it worked pretty well and I also got a significant speed boost.

| n_workers | memory_limit | run time | memory usage | comment |
|---|---|---|---|---|
| 4 | 128 GiB | 3:59 min | 3.6 GB | |
| 16 | 32 GiB | 2:07 min | 17.9 GB | default settings when using distributed.LocalCluster() without any arguments |
| 8 | 32 GiB | 1:56 min | 8.7 GB | |
| - | - | 9:31 min | 15.7 GB | no cluster (i.e., current main) |

As you can see, I could easily get a speed boost of ~5×!

Test with simple heavy recipe that uses a non-lazy preprocessor using distributed.LocalCluster on a 512 GiB compute node

preprocessors:
  test:
    area_statistics:
      operator: mean

diagnostics:

  test:
    variables:
      ta:
        preprocessor: test
        mip: Amon
        project: CMIP6
        exp: hist-1950
        start_year: 1995
        end_year: 2014
    additional_datasets:
      - {dataset: CMCC-CM2-VHR4,   ensemble: r1i1p1f1, grid: gn}
      - {dataset: CNRM-CM6-1-HR,   ensemble: r1i1p1f2, grid: gr}
      - {dataset: ECMWF-IFS-HR,    ensemble: r1i1p1f1, grid: gr}
      - {dataset: HadGEM3-GC31-HM, ensemble: r1i1p1f1, grid: gn}
      - {dataset: MPI-ESM1-2-XR,   ensemble: r1i1p1f1, grid: gn}
    scripts:
      null

However, when using a non-lazy preprocessor (area_statistics), I always got the same concurrent.futures._base.CancelledError as for the other heavy recipe, regardless of the settings (main_log_debug.txt).

I also tried combinations where the cluster only had a small fraction of the memory (e.g., n_workers: 4, memory_limit: 64 GiB, i.e., 256 GiB remain!), which also did not work.

I'm not familiar enough with dask to tell why this is not working, but it looks like this is related to using a non-lazy function. I thought one could maybe address this by allocating enough memory for the "non-dask" part, but that also didn't work.

Tests with dask_jobqueue.SLURMCluster

I got basically the same behavior as above. With the following settings the lazy recipe ran in 8:27 min (memory usage of 0.8 GB; not quite sure if that is correct or wrong because the heavy stuff is now done on a different node):

cluster:
  type: dask_jobqueue.SLURMCluster
  ...
  queue: compute
  cores: 128
  memory: 32 GiB
  n_workers: 8

There's probably a lot to optimize here, but tbh I don't fully understand all the options for SLURMCluster yet 😁

To sum this all up: for lazy preprocessors this works exceptionally well! 🚀

@bouweandela
Member Author

Could you elaborate on that? What do you mean by "at the moment"? The current main? Do you have an example for such a memory error?

Yes, in the current main branch. Maybe the recipe #1193 could be an example?

@valeriupredoi
Contributor

hello, cool stuff! Very cool indeed @schlunma 💃 Here are some prelim opinions:

  • lazy preprocs are the only ones that work as expected because this whole machinery is optimized for lazy stuff
  • settings depend on the amount of data and computation needed - and resources used really do vary based on the cluster settings - afraid there is no single best set of settings (e.g. try half the data for your lazy regrid runs and you'll see that performance changes and other settings return the best balance)
  • the concurrent Futures issue seems to be coming from very small scattered data packets that get created, then deleted, then recreated, see e.g. Getting concurrent.futures._base.CancelledError from simple binary tree built from futures dask/distributed#4612 - it'd be good to first scatter the data (since it's not lazy) and then do the computations (see the sketch below), since once scattered the data is smaller, and default scattering by the cluster may not happen anymore (that is bad, since the cluster doesn't really know how to do it properly)

Very cool, still 🍺
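
For reference, scattering data ahead of a computation is a generic dask.distributed pattern rather than anything ESMValCore does at the moment; a minimal sketch (the scheduler address and array size are made up for illustration):

import numpy as np
from distributed import Client

client = Client("tcp://127.0.0.1:8786")  # hypothetical scheduler address

# Ship the non-lazy array to the workers once, instead of embedding it in
# every task graph that needs it.
data = np.random.rand(1000, 1000)
data_future = client.scatter(data, broadcast=True)

# Tasks submitted to the cluster then receive the scattered data by reference.
result = client.submit(np.mean, data_future).result()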

@bouweandela
Member Author

I suppose the cluster is smart enough to handle input from different processes? Are there any recommendations for the ratio max_parallel_tasks/n_workers?

Provided that all preprocessor functions are lazy, I would recommend setting max_parallel_tasks equal to the number of CPU cores in the machine, since coordinates are mostly realized, meaning this will use on average 1 CPU core per task. I suspect that in most cases it does not make sense to have n_workers larger than max_parallel_tasks until we have delayed saving mentioned in #2042, since a task can only save one file at a time and I would expect a worker to be able to process all data required for a single output file. Depending on what you're actually trying to compute of course.
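
To make that rule of thumb concrete with a purely illustrative setup (not a recommendation for any particular machine): on a 16-core node one might combine max_parallel_tasks: 16 in the user configuration file with a dask.yml along the lines of

cluster:
  type: distributed.LocalCluster
  n_workers: 8
  threads_per_worker: 2
  memory_limit: 32 GiB

so that the number of workers stays at or below the number of parallel tasks.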

@valeriupredoi
Contributor

and I would expect a worker to be able to process all data required for a single output file

that's the ideal case, right? Minimum data transfer between workers - that's why Manu's jobs are failing now for non-lazy pp's - too much ephemeral data exchange between workers, I think

@valeriupredoi
Contributor

this post says the Cancelled error comes from some node/worker being idle for too long - interesting - so it's either too short a computation that deletes the key on the worker, but the Client still needs it/recreates it, or the thing's too long - it'd be worth examining the "timeout" options, see https://docs.dask.org/en/stable/configuration.html - beats me which one though; there used to be a scheduler_timeout
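
For anyone who wants to experiment with those timeouts: they can be raised in Dask's own configuration file (e.g. ~/.config/dask/distributed.yaml); which key, if any, actually matters for this error is an open question, but as an illustration:

distributed:
  comm:
    timeouts:
      connect: 60s  # allow workers more time to reach the scheduler
      tcp: 60s      # allow more time before an unresponsive connection is dropped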

@schlunma
Contributor

I also tested the (fully lazy) recipe of #1193 now (which used to not run back then):

# ESMValTool
---
documentation:
 description: Test ERA5.

 authors:
   - schlund_manuel

 references:
   - acknow_project


preprocessors:

 mean:
   zonal_statistics:
     operator: mean


diagnostics:

 ta_obs:
   variables:
     ta:
       preprocessor: mean
       mip: Amon
       additional_datasets:
         - {dataset: ERA5, project: native6, type: reanaly, version: v1, tier: 3, start_year: 1990, end_year: 2014}
   scripts:
     null

The good news is: it is now running without any dask configuration, and I could also get a significant speed-up (5:41 min -> 1:56 min) by using a LocalCluster.

Replacing the lazy zonal_statistics with the non-lazy area_statistics led to the same error as already reported.

Comment on lines 18 to 24
logger.warning(
    "Using the Dask basic scheduler. This may lead to slow "
    "computations and out-of-memory errors. See https://docs."
    "esmvaltool.org/projects/ESMValCore/en/latest/quickstart/"
    "configure.html#dask-distributed-configuration for information "
    "on how to configure the Dask distributed scheduler."
)
Contributor

I am not too happy with this being a warning at the current stage: as shown by my tests, using a distributed scheduler can actually lead to recipes not running anymore. Thus, I would be very careful recommending this to users at the moment.

We should either phrase this more carefully (maybe add that this is an "experimental feature") or convert it to an info message. Once we are more confident with this we can change it back.

Contributor

Also: this does not raise a warning if the dask config file exists but is empty. Should we also consider this case?

Member Author

I can change it back to INFO if @valeriupredoi agrees, because he asked for a WARNING in #2049 (comment). Note that not configuring the scheduler can also lead to recipes not running, so the best setting rather depends on what you're trying to do, as also noted in the documentation.

Contributor

@valeriupredoi valeriupredoi May 26, 2023

yeah that was me before Manu's actual testing. But alas, Bouwe is right too - I'd keep it as a warning but add what Manu suggests - experimental feature with twitchy settings that depend on the actual run

Contributor

Also: this does not raise a warning if the dask config file exists but is empty. Should we also consider this case?

Won't that default to a LocalCluster etc?

Contributor

I'd keep it as a warning but add what Manu suggests - experimental feature with twitchy settings that depend on the actual run

Fine for me!

Won't that default to a LocalCluster etc?

No, this also results in the basic scheduler: https://github.com/ESMValGroup/ESMValCore/pull/2049/files#diff-b046a48e3366bf6517887e3c39fe7ba6f46c0833ac02fbbb9062fb23654b06bdR64-R69

Contributor

@valeriupredoi valeriupredoi May 26, 2023

aw. Not good. That needs to be communicated to the user methinks - ah it is, didn't read the very first sentence 😆 Fine for me then

Member Author

Done in 80f5c62

Contributor

@schlunma schlunma left a comment

Thanks for all the changes @bouweandela! I love the changes to the documentation; with this I could get a recipe to run in 30 s that previously took 6 min!

Just out of curiosity: where did you get that n_jobs = n_workers / processes from? I searched for that yesterday but couldn't find anything.

Please make sure to adapt the warning as discussed in the corresponding comment. Will approve now since I will be traveling next week.

Thanks again, this is really cool! 🚀

@valeriupredoi
Contributor

where did you get that n_jobs = n_workers / processes from?

I think that's just Bouwe being sensible, as he usually is 😁 i.e. you'd ideally want one process per worker

@valeriupredoi
Contributor

awesome testing @schlunma - cheers for that! I'll dig deeper into those errors you've been seeing once I start testing myself, sometime next week 👍

Contributor

@sloosvel sloosvel left a comment

Thanks @bouweandela ! I managed to create some SLURM clusters just fine with these changes. Regarding the configuration, is it possible to have multiple configurations in dask.yml? Not every recipe will require the same type of resources.

However, I am running on a machine that has quite modest technical specifications, to put it nicely, and even running in a distributed way does not seem to help much with performance and memory usage.

Some issues that were found for instance:

  • The concatenation of multiple files seems to slow down
  • The heaviest jobs all seem to die during the saving of the files.
  • Even for jobs that finish successfully, I get errors in the middle of the debug log about workers dying, not being responsive and communications dying. Not sure where these are coming from.

It could very well be that I am not configuring the jobs properly though.

It is true though that everything that is already lazy performs even faster. The lazy regridder went from taking 4 minutes to run in a high-resolution experiment to taking less than 2 minutes.

@fnattino fnattino left a comment

Very nice @bouweandela, looks great to me! Nice that you have both the cluster and the client config sections, so that the following also works as a 'shortcut' to create a local cluster:

client:
  n_workers: 2
  threads_per_worker: 4

@bouweandela
Member Author

bouweandela commented May 30, 2023

Just out of curiosity: where did you get that n_jobs = n_workers / processes from? I searched for that yesterday but couldn't find anything.

@schlunma From the documentation of the processes argument:

Cut the job up into this many processes.

i.e. processes is the number of workers launched by a single job.
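
To spell out the arithmetic with illustrative numbers: a cluster section such as

cluster:
  type: dask_jobqueue.SLURMCluster
  cores: 128
  memory: 256 GiB
  processes: 4
  n_workers: 16

would submit n_jobs = n_workers / processes = 16 / 4 = 4 SLURM jobs, each hosting 4 worker processes with 128 / 4 = 32 threads and 256 GiB / 4 = 64 GiB of memory per worker.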

@bouweandela
Member Author

bouweandela commented May 30, 2023

Regarding the configuration, is it possible to have multiple configurations in dask.yml? Not every recipe will require the same type of resources.

Good idea @sloosvel. @ESMValGroup/technical-lead-development-team Shall I add a warning that the current configuration file format is experimental and may change in the next release? We can then discuss how to best add this to the configuration at the SMHI workshop ESMValGroup/Community#98.

@bouweandela
Member Author

@schlunma and @sloosvel Thank you for running the tests! Really good to learn from your experiences. Regarding the dying workers: I suspect they are using more than the configured amount of memory and are then killed, so giving them more memory may solve the problem, though if the preprocessor functions you're using are not lazy, that may not help.

Regarding the area_statistics preprocessor function: this is actually lazy (except for median), but there seems to be something odd going on. If I run the example recipe provided by @schlunma in #2049 (comment) with just one model and one year of data, I already get the warning that 1.5 GB of data is sent to the workers (i.e. the size of the weights array), even though the weights array is lazy. Maybe #47 is applicable here.

@bouweandela
Member Author

The concatenation of multiple files seems to slow down

@sloosvel If this is a problem for your work, could you please open an issue and add an example recipe so we can look into it?

@bouweandela
Member Author

@ESMValGroup/technical-lead-development-team I think this is ready to be merged. Could someone please do a final check and merge?

@remi-kazeroni
Contributor

Thanks a lot @bouweandela for your work on this great extension! It should definitely be one of the highlights of the next Core release! And thanks to everyone who reviewed this PR and tested the new Dask capabilities! Merging now 👍

@remi-kazeroni remi-kazeroni merged commit 1c1e6f1 into main Jun 1, 2023
@remi-kazeroni remi-kazeroni deleted the configure-dask-distributed branch June 1, 2023 09:09
@remi-kazeroni remi-kazeroni mentioned this pull request Jun 1, 2023