
Proof of concept: running metadata and data computations on Dask #2316

Draft
wants to merge 7 commits into base: main
Conversation

@bouweandela (Member) commented Jan 31, 2024

Description

This pull request splits the computation up into three stages:

  1. Preprocessor functions are run in parallel using Dask without saving data
  2. Preprocessor files are populated with data in parallel using Dask
  3. Diagnostic scripts are run

This only works with `max_parallel_tasks: 1` at the moment.
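The three stages above can be sketched as a staged pipeline. This is a minimal illustration using the standard library's `ThreadPoolExecutor` as a stand-in for Dask delayeds; the function and variable names (`run_preprocessor`, `save_result`, `run_diagnostic`) are hypothetical and not taken from the ESMValCore code base.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for ESMValCore internals; names are illustrative only.
def run_preprocessor(metadata):
    # Stage 1: build the (lazy) preprocessing result without saving data.
    return {"name": metadata, "cube": f"lazy-{metadata}"}

def save_result(result):
    # Stage 2: realize the data and populate the preprocessor file.
    return f"{result['name']}.nc"

def run_diagnostic(files):
    # Stage 3: diagnostic scripts only start once all files are written.
    return sorted(files)

metadata = ["tas_model1", "tas_model2"]
with ThreadPoolExecutor() as pool:
    results = list(pool.map(run_preprocessor, metadata))  # stage 1 in parallel
    files = list(pool.map(save_result, results))          # stage 2 in parallel
print(run_diagnostic(files))
```

In the actual PR the first two stages would be Dask graphs rather than thread-pool calls; the point here is only the ordering constraint between the stages.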

Ideas for further improvements:

  1. optimize multi-model functions, as these limit parallelism
  2. use one delayed per group in multi-model/ensemble means to increase parallelism
  3. try to make delayed operations 'pure', e.g. by copying the input cubes in preprocess before calling the preprocessor function
  4. see if splitting `Dataset.load`, prior to the `concatenate` preprocessor step, up into multiple delayeds improves parallelism
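Idea 3 above (making delayed operations 'pure' by copying the input cubes) could be sketched as a wrapper that deep-copies its inputs before calling the preprocessor function. This is a hedged sketch with hypothetical names (`make_pure`, `shift`), not code from the PR:

```python
import copy

def make_pure(func):
    # Copy the inputs so the wrapped preprocessor function cannot mutate
    # shared cubes; without this, in-place mutation makes the delayed impure.
    def wrapper(cubes, **kwargs):
        return func(copy.deepcopy(cubes), **kwargs)
    return wrapper

def shift(cubes, offset=0):
    # Stand-in preprocessor function that mutates its input in place.
    for i, value in enumerate(cubes):
        cubes[i] = value + offset
    return cubes

pure_shift = make_pure(shift)
data = [1, 2, 3]
print(pure_shift(data, offset=10))  # [11, 12, 13]
print(data)                         # original untouched: [1, 2, 3]
```

A pure delayed always produces the same output for the same inputs, which lets Dask deduplicate identical tasks in the graph.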

Blocking issues

These are things that block this from being used in practice.

  1. ESMPy crashes if you try to use it from a different thread than the main one. Example script that reproduces the crash:

    ```python
    import threading

    import numpy as np


    def run():
        import esmpy
        m = esmpy.Manager(debug=True)
        esmpy.Grid(np.array((10, 20)),
                   num_peri_dims=1,
                   staggerloc=[esmpy.StaggerLoc.CENTER])


    def main():
        thread = threading.Thread(target=run)
        thread.start()
        thread.join()


    if __name__ == '__main__':
        main()
    ```

    This results in `Segmentation fault (core dumped)` and a log file called `PET0.ESMF_LogFile` is written by ESMF with the following content:

    ```
    20240226 150217.785 INFO             PET0 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    20240226 150217.785 INFO             PET0 !!! THE ESMF_LOG IS SET TO OUTPUT ALL LOG MESSAGES !!!
    20240226 150217.785 INFO             PET0 !!!     THIS MAY CAUSE SLOWDOWN IN PERFORMANCE     !!!
    20240226 150217.785 INFO             PET0 !!! FOR PRODUCTION RUNS, USE:                      !!!
    20240226 150217.785 INFO             PET0 !!!                   ESMF_LOGKIND_Multi_On_Error  !!!
    20240226 150217.785 INFO             PET0 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    20240226 150217.785 INFO             PET0 Running with ESMF Version   : 8.4.2
    20240226 150217.785 INFO             PET0 ESMF library build date/time: "Apr 26 2023" "11:27:56"
    20240226 150217.785 INFO             PET0 ESMF library build location : /home/conda/feedstock_root/build_artifacts/esmf_1682507633250/work
    20240226 150217.785 INFO             PET0 ESMF_COMM                   : mpiuni
    20240226 150217.785 INFO             PET0 ESMF_MOAB                   : enabled
    20240226 150217.785 INFO             PET0 ESMF_LAPACK                 : enabled
    20240226 150217.785 INFO             PET0 ESMF_NETCDF                 : enabled
    20240226 150217.785 INFO             PET0 ESMF_PNETCDF                : disabled
    20240226 150217.785 INFO             PET0 ESMF_PIO                    : disabled
    20240226 150217.785 INFO             PET0 ESMF_YAMLCPP                : enabled
    20240226 150217.785 ERROR            PET0 ESMCI_VM.C:2169 ESMCI::VM::getCurrent() Internal error: Bad condition  - - Could not determine current VM
    ```

    Issue reported via the ESMF support mailing list.
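Since the crash above only occurs when ESMPy is used off the main thread, one conceivable mitigation (not proposed in the PR itself, and purely a sketch) is to funnel all ESMPy work to the main thread through a queue. The `regrid` function below is a stand-in for an actual `esmpy` call:

```python
import queue
import threading

# Hypothetical pattern: worker threads enqueue work; only the main thread
# executes the ESMPy-touching code. `regrid` is a stand-in, not real esmpy.
tasks = queue.Queue()
results = []

def regrid(x):
    # Stand-in for an esmpy.Grid / regridding operation.
    return x * x

def worker(x):
    done = threading.Event()
    tasks.put((x, done))
    done.wait()  # block until the main thread has processed this task

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
# The main thread performs the ESMPy work on behalf of the workers.
for _ in range(3):
    x, done = tasks.get()
    results.append(regrid(x))
    done.set()
for t in threads:
    t.join()
print(sorted(results))  # [0, 1, 4]
```

Whether this is workable with Dask's scheduling model is an open question; it only illustrates the constraint the crash imposes.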

Concerns

These are things that we need to be careful about, but that should not be a problem.

  1. thread safety, known unsafe libraries:
    • NetCDF4 library
  2. custom configuration (config-developer, extra facets, custom cmor tables) may not be available on Dask workers
  3. is provenance correctly updated with results from preprocessing before saving?
  4. the potential for re-using parts of the computation seems limited, because `da.store` loses dependency information (dask/dask#8380)
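Concern 1 (the NetCDF4 library not being thread-safe) is commonly mitigated by serializing all NetCDF calls behind a single lock. This is a minimal stdlib sketch of that pattern; `write_chunk` and `output` are hypothetical stand-ins for real netCDF4 write calls:

```python
import threading

# netCDF4 is not thread-safe, so one option is a module-level lock that
# serializes every call into the library. `write_chunk` is a stand-in.
_NC_LOCK = threading.Lock()
output = []

def write_chunk(chunk):
    with _NC_LOCK:
        # Only one thread at a time reaches the (stand-in) NetCDF write.
        output.append(chunk)

threads = [threading.Thread(target=write_chunk, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(output))
```

With a distributed scheduler, a `dask.distributed.Lock` would be needed instead of a `threading.Lock`, since the writes may happen on different workers.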


@bouweandela bouweandela changed the title Proof of concept of running metadata and data computations on Dask Proof of concept: running metadata and data computations on Dask Feb 5, 2024
@valeriupredoi (Contributor) commented:

I dig this PR 😍 we should talk about the bigger picture though - may be able to suggest some novel stuffs 🍺
