
Recipe test results for ESMValCore v2.11.0rc1 #2421

Closed
chrisbillowsMO opened this issue May 16, 2024 · 15 comments

@chrisbillowsMO
Contributor

chrisbillowsMO commented May 16, 2024

Recipe test results for v2.11.0rc1

This is the initial output from the recipe testing done for the ESMValCore v2.11.0rc1 release. Please see the following comment for our evaluation of the failures.

Recipe running session 2024-05-15

Setup

mamba version

levante5> mamba --version
mamba 1.5.8
conda 24.5.0

ESMValTool version

levante5> esmvaltool version
ESMValCore: 2.11.0rc1
ESMValTool: 2.11.0.dev75+g4734caf5a.d20240515

Recipes that ran successfully (132 out of 160)

  • recipe_albedolandcover.yml
  • recipe_anav13jclim.yml
  • recipe_arctic_ocean.yml
  • recipe_autoassess_landsurface_permafrost.yml
  • recipe_autoassess_landsurface_soilmoisture.yml
  • recipe_autoassess_landsurface_surfrad.yml
  • recipe_autoassess_stratosphere.yml
  • recipe_bock20jgr_fig_1-4.yml
  • recipe_bock20jgr_fig_6-7.yml
  • recipe_capacity_factor.yml
  • recipe_climate_change_hotspot.yml
  • recipe_climwip_brunner2019_med.yml
  • recipe_climwip_brunner20esd.yml
  • recipe_climwip_test_basic.yml
  • recipe_climwip_test_performance_sigma.yml
  • recipe_clouds_bias.yml
  • recipe_clouds_ipcc.yml
  • recipe_cmug_h2o.yml
  • recipe_concatenate_exps.yml
  • recipe_consecdrydays.yml
  • recipe_correlation.yml
  • recipe_cox18nature.yml
  • recipe_cvdp.yml
  • recipe_daily_era5.yml
  • recipe_deangelis15nat.yml
  • recipe_deangelis15nat_fig1_fast.yml
  • recipe_decadal.yml
  • recipe_diurnal_temperature_index.yml
  • recipe_eady_growth_rate.yml
  • recipe_ecs.yml
  • recipe_ecs_constraints.yml
  • recipe_ecs_scatter.yml
  • recipe_ensclus.yml
  • recipe_era5-land.yml
  • recipe_esacci_lst.yml
  • recipe_esacci_oc.yml
  • recipe_extract_shape.yml
  • recipe_extreme_index.yml
  • recipe_eyring06jgr.yml
  • recipe_flato13ipcc_figure_914.yml
  • recipe_flato13ipcc_figure_924.yml
  • recipe_flato13ipcc_figure_942.yml
  • recipe_flato13ipcc_figure_945a.yml
  • recipe_flato13ipcc_figure_96.yml
  • recipe_flato13ipcc_figure_98.yml
  • recipe_flato13ipcc_figures_926_927.yml
  • recipe_flato13ipcc_figures_92_95.yml
  • recipe_flato13ipcc_figures_938_941_cmip3.yml
  • recipe_flato13ipcc_figures_938_941_cmip6.yml
  • recipe_galytska23jgr.yml
  • recipe_gier2020bg.yml
  • recipe_globwat.yml
  • recipe_heatwaves_coldwaves.yml
  • recipe_hydro_forcing.yml
  • recipe_hype.yml
  • recipe_iht_toa.yml
  • recipe_impact.yml
  • recipe_ipccwg1ar6ch3_fig_3_42_b.yml
  • recipe_ipccwg1ar6ch3_fig_3_43.yml
  • recipe_ipccwg1ar6ch3_fig_3_9.yml
  • recipe_kcs.yml
  • recipe_landcover.yml
  • recipe_lauer13jclim.yml
  • recipe_lauer22jclim_fig1_clim.yml
  • recipe_lauer22jclim_fig1_clim_amip.yml
  • recipe_lauer22jclim_fig2_taylor.yml
  • recipe_lauer22jclim_fig2_taylor_amip.yml
  • recipe_lauer22jclim_fig6_interannual.yml
  • recipe_lauer22jclim_fig7_seas.yml
  • recipe_lauer22jclim_fig8_dyn.yml
  • recipe_lauer22jclim_fig9-11c_pdf.yml
  • recipe_li17natcc.yml
  • recipe_lisflood.yml
  • recipe_marrmot.yml
  • recipe_meehl20sciadv.yml
  • recipe_model_evaluation_basics.yml
  • recipe_model_evaluation_clouds_clim.yml
  • recipe_model_evaluation_clouds_cycles.yml
  • recipe_model_evaluation_precip_zonal.yml
  • recipe_modes_of_variability.yml
  • recipe_monitor.yml
  • recipe_monitor_with_refs.yml
  • recipe_mpqb_xch4.yml
  • recipe_multimodel_products.yml
  • recipe_my_personal_diagnostic.yml
  • recipe_ncl.yml
  • recipe_ocean_Landschuetzer2016.yml
  • recipe_ocean_amoc.yml
  • recipe_ocean_bgc.yml
  • recipe_ocean_example.yml
  • recipe_ocean_ice_extent.yml
  • recipe_ocean_multimap.yml
  • recipe_ocean_scalar_fields.yml
  • recipe_perfmetrics_CMIP5.yml
  • recipe_perfmetrics_CMIP5_4cds.yml
  • recipe_perfmetrics_land_CMIP5.yml
  • recipe_preprocessor_test.yml
  • recipe_psyplot.yml
  • recipe_pv_capacity_factor.yml
  • recipe_python.yml
  • recipe_python_for_CI.yml
  • recipe_quantilebias.yml
  • recipe_r.yml
  • recipe_radiation_budget.yml
  • recipe_rainfarm.yml
  • recipe_runoff_et.yml
  • recipe_russell18jgr.yml
  • recipe_schlund20jgr_gpp_abs_rcp85.yml
  • recipe_schlund20jgr_gpp_change_1pct.yml
  • recipe_schlund20jgr_gpp_change_rcp85.yml
  • recipe_sea_surface_salinity.yml
  • recipe_seaborn.yml
  • recipe_seaice.yml
  • recipe_seaice_drift.yml
  • recipe_seaice_feedback.yml
  • recipe_shapeselect.yml
  • recipe_smpi.yml
  • recipe_smpi_4cds.yml
  • recipe_snowalbedo.yml
  • recipe_spei.yml
  • recipe_tcr.yml
  • recipe_thermodyn_diagtool.yml
  • recipe_toymodel.yml
  • recipe_validation.yml
  • recipe_validation_CMIP6.yml
  • recipe_variable_groups.yml
  • recipe_weigel21gmd_figures_13_16.yml
  • recipe_wenzel14jgr.yml
  • recipe_wenzel16nat.yml
  • recipe_wflow.yml
  • recipe_williams09climdyn_CREM.yml
  • recipe_zmnam.yml

Recipes that failed because the diagnostic script failed (11 out of 160)

  • recipe_combined_indices.yml
  • recipe_extreme_events.yml
  • recipe_hyint.yml
  • recipe_hyint_extreme_events.yml
  • recipe_martin18grl.yml
  • recipe_miles_block.yml
  • recipe_miles_eof.yml
  • recipe_miles_regimes.yml
  • recipe_pcrglobwb.yml
  • recipe_schlund20esd.yml
  • recipe_wenzel16jclim.yml

Recipes that failed because of missing data (3 out of 160)

  • recipe_aod_aeronet_assess.yml
  • recipe_bock20jgr_fig_8-10.yml
  • recipe_check_obs.yml

Recipes that failed because the run took too long (6 out of 160)

  • recipe_carvalhais14nat.yml
  • recipe_eyring13jgr_12.yml
  • recipe_ipccwg1ar6ch3_fig_3_19.yml
  • recipe_ipccwg1ar6ch3_fig_3_42_a.yml
  • recipe_lauer22jclim_fig5_lifrac.yml
  • recipe_lauer22jclim_fig9-11ab_scatter.yml

Recipes that failed for other reasons or are still running (7 out of 160)

  • recipe_collins13ipcc.yml
  • recipe_easy_ipcc.yml
  • recipe_ipccwg1ar6ch3_atmosphere.yml
  • recipe_lauer22jclim_fig3-4_zonal.yml
  • recipe_ocean_quadmap.yml
  • recipe_preprocessor_derive_test.yml
  • recipe_tebaldi21esd.yml

Recipes that are known to be broken (1 out of 160)

  • recipe_julia.yml
chrisbillowsMO added this to the v2.11.0 milestone May 16, 2024
chrisbillowsMO self-assigned this May 16, 2024
@chrisbillowsMO
Contributor Author

chrisbillowsMO commented May 16, 2024

Hi @ESMValGroup/technical-lead-development-team @bouweandela @valeriupredoi

Any comments on the following evaluation please? (The original output from running the recipes for the first time is above).

1. R diagnostic failures

The following are R recipes with various errors. Would anyone with R knowledge please take a look?

The errors were one of the following:

Error in (models_dataset == reference_dataset) && (models_exp == reference_exp) :
  'length = 2' in coercion to 'logical(1)'

                     ^ Operator >remapcon2< not found!

2. Python diagnostic failures

We have the capacity to address these errors - should we? Or does anyone already know how to solve these?

KeyError: 'Provenance record for /scratch/b/b382148/esmvaltool_output/recipe_martin18grl_20240515_142625/plots/spi_collect/spi_collect/SPI_time_series_Bremen_Observations.png already exists.'

iris.exceptions.ConcatenateError: failed to concatenate into a single cube.
  Cube metadata differs for phenomenon: precipitation_flux

TypeError: unhashable type: 'CubeAttrsDict'
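The ConcatenateError above means that cubes which should be joined carry differing metadata; differing attributes are a frequent cause. iris provides `iris.util.equalise_attributes`, which deletes attributes that are not identical across all cubes before they are concatenated. The stdlib-only sketch below mimics that idea on plain dictionaries. `equalise_attrs` and the example attribute values are hypothetical, and real cube metadata also covers names, units and coordinates, which this sketch ignores.

```python
from __future__ import annotations

from typing import Any


def equalise_attrs(attr_dicts: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return copies of attr_dicts keeping only keys whose values agree everywhere.

    Hypothetical helper mimicking the idea behind iris.util.equalise_attributes.
    """
    if not attr_dicts:
        return []
    # Keys present in every dictionary.
    common = set(attr_dicts[0]).intersection(*attr_dicts[1:])
    # Of those, keep only the keys that have the same value in every dictionary.
    identical = {
        key for key in common
        if all(d[key] == attr_dicts[0][key] for d in attr_dicts[1:])
    }
    return [{k: v for k, v in d.items() if k in identical} for d in attr_dicts]


# Two hypothetical attribute sets that differ only in "history".
cube_a = {"source": "model X", "history": "regridded on day 1"}
cube_b = {"source": "model X", "history": "regridded on day 2"}
print(equalise_attrs([cube_a, cube_b]))
# [{'source': 'model X'}, {'source': 'model X'}]
```

Dropping the conflicting keys, rather than picking one value, avoids silently attaching one input's history to data from another.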

3. NCL diagnostic failures

There is one NCL recipe with an error. Would anyone with NCL knowledge please take a look?

INFO    fatal: in uajet_sh850, cannot read plev and latrange

4. Recipes that failed because of missing data

We recognise that recipe_check_obs.yml is a known broken recipe, but should we open a new issue with ESMValGroup/obs-maintainers to resolve the missing data issues?

5. Recipes that failed because the run took too long

  • recipe_eyring13jgr_12.yml
  • recipe_ipccwg1ar6ch3_fig_3_19.yml
  • recipe_ipccwg1ar6ch3_fig_3_42_a.yml
  • recipe_lauer22jclim_fig5_lifrac.yml

We've increased the time on all of these except for recipe_ipccwg1ar6ch3_fig_3_42_a.yml, which was already at the maximum time. Is there anything we can do about this?

  • recipe_carvalhais14nat.yml
  • recipe_lauer22jclim_fig9-11ab_scatter.yml

We also had to increase the time on these, which were originally in the "Recipes that failed for other reasons or are still running" section.

6. Recipes that failed because model data couldn't be downloaded

  • recipe_easy_ipcc.yml
  • recipe_ocean_quadmap.yml

7. Recipes that failed because of an HDF5 error

  • recipe_collins13ipcc.yml
  • recipe_ipccwg1ar6ch3_atmosphere.yml
  • recipe_tebaldi21esd.yml

These three failures are the same as in the v2.10 recipe test results.

  • recipe_preprocessor_derive_test.yml

This is a new entry.

8. Recipes that fail because of - we think! - an ESMValCore issue

ValueError: Chunks and shape must be of the same length/dimension. Got chunks=(), shape=(1,)
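The message says that a chunk specification with zero entries (`chunks=()`, as for a scalar) was paired with a one-dimensional shape `(1,)`: Dask expects one chunk entry per array dimension. The stdlib-only sketch below reproduces that consistency check so the mismatch is easy to see. `check_chunks` is a hypothetical helper, not Dask's implementation, and where the empty chunk tuple originates in ESMValCore is not established here.

```python
def check_chunks(chunks: tuple, shape: tuple) -> None:
    """Raise ValueError if the chunk spec and the shape differ in dimensionality.

    Hypothetical helper illustrating the rule stated in the Dask error message.
    """
    if len(chunks) != len(shape):
        raise ValueError(
            "Chunks and shape must be of the same length/dimension. "
            f"Got chunks={chunks}, shape={shape}"
        )


check_chunks((1,), (1,))  # fine: one chunk entry for one dimension

message = ""
try:
    check_chunks((), (1,))  # scalar-style chunks for a 1-D array
except ValueError as err:
    message = str(err)
print(message)
```

So a likely place to look is code that builds chunks for a 0-d array but applies them to a length-1 dimension, or vice versa.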

@valeriupredoi
Contributor

valeriupredoi commented May 16, 2024

great summary and work @chrisbillowsMO and @ehogan 🍺

Here is the issue with those three HDF5-related failures, as posted by @bouweandela back in December last year, when they were working on the 2.10 release: ESMValGroup/ESMValTool#3463 (comment)

This is an HDF5 thread-safety issue; it is flaky, but it appears to be mostly reproducible (positive flakiness, or was it negative? doesn't matter). This has to be fixed, most probably by adding a file lock() statement somewhere; I'll have a look myself, but IMO don't set it as a roadblock for the release.
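HDF5 is commonly built without thread safety, so concurrent access from several threads in one process can fail intermittently, which would be consistent with the flakiness described. The general pattern behind "adding a file lock" is to serialise every access to the non-thread-safe library behind a single process-wide lock. The sketch below shows that pattern with `threading.Lock`; `read_variable` is a hypothetical stand-in for a netCDF/HDF5 read, and this is not the fix eventually applied in ESMValCore.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# One process-wide lock guarding all access to the non-thread-safe library.
HDF5_LOCK = threading.Lock()

_active = 0         # threads currently inside the guarded section
max_concurrent = 0  # highest concurrency observed inside the guarded section


def read_variable(path: str) -> str:
    """Hypothetical stand-in for a read through a non-thread-safe HDF5 binding."""
    global _active, max_concurrent
    with HDF5_LOCK:  # only one thread at a time gets past this point
        _active += 1
        max_concurrent = max(max_concurrent, _active)
        result = f"data from {path}"
        _active -= 1
    return result


paths = [f"file_{i}.nc" for i in range(20)]
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(read_variable, paths))

print(max_concurrent)  # 1: accesses were serialised
```

The trade-off is that I/O through the locked library no longer runs in parallel; computation outside the lock still can.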

@bouweandela
Member

> This Julia recipe has the following error:
>
> recipe_rainfarm.yml
>
> ERROR: LoadError: ArgumentError: Package YAML [ddb6d928-2868-570f-bddf-ab3f9cf99eb6] is required but does not seem to be installed:

Did you install the Julia dependencies?

@valeriupredoi
Contributor

fairly sure no is the answer to that q, bud 😁

@ehogan
Contributor

ehogan commented May 17, 2024

> This Julia recipe has the following error:
> recipe_rainfarm.yml
> ERROR: LoadError: ArgumentError: Package YAML [ddb6d928-2868-570f-bddf-ab3f9cf99eb6] is required but does not seem to be installed:
>
> Did you install the Julia dependencies?

No, I had missed the esmvaltool install Julia step. Both Julia recipes now succeed, so I will update the first and second comments to reflect this 👍

@schlunma
Contributor

schlunma commented May 17, 2024

> 10. Recipes that never ran
>
>   • recipe_schlund20jgr_gpp_abs_rcp85.yml
>   • recipe_schlund20jgr_gpp_change_1pct.yml
>   • recipe_schlund20jgr_gpp_change_rcp85.yml
>
> These have been excluded from the generate.py script. @schlunma might you need to run these?

Successfully tested them 👍 I'll update the comment above to reflect this.

@ehogan
Contributor

ehogan commented May 17, 2024

> 5. Recipes that failed because the run took too long
>
>   • recipe_climate_change_hotspot.yml
>   • recipe_eyring06jgr.yml
>   • recipe_eyring13jgr_12.yml
>   • recipe_ipccwg1ar6ch3_fig_3_19.yml
>   • recipe_ipccwg1ar6ch3_fig_3_42_a.yml
>   • recipe_ipccwg1ar6ch3_fig_3_42_b.yml
>   • recipe_lauer22jclim_fig5_lifrac.yml
>
> We've increased the time on all of these except for recipe_ipccwg1ar6ch3_fig_3_42_a.yml, which was already at the maximum time. Is there anything we can do about this?
>
>   • recipe_carvalhais14nat.yml
>   • recipe_lauer22jclim_fig9-11ab_scatter.yml
>
> We also had to increase the time on these, which were originally in the "Recipes that failed for other reasons or are still running" section.

The following recipes are now running successfully, so I will update the comments above:

  • recipe_climate_change_hotspot.yml
2024-05-16 13:42:09,525 UTC [170675] INFO    Time for running the recipe was: 4:20:19.772793
2024-05-16 13:42:10,337 UTC [170675] INFO    Maximum memory used (estimate): 50.4 GB
[...]
2024-05-16 13:42:12,725 UTC [170675] INFO    Run was successful
  • recipe_eyring06jgr.yml
2024-05-16 14:24:00,524 UTC [88405] INFO    Time for running the recipe was: 4:58:26.498892
2024-05-16 14:24:01,288 UTC [88405] INFO    Maximum memory used (estimate): 97.0 GB
[...]
2024-05-16 14:24:01,415 UTC [88405] INFO    Run was successful
  • recipe_ipccwg1ar6ch3_fig_3_42_b.yml
2024-05-16 13:57:25,039 UTC [76122] INFO    Time for running the recipe was: 4:32:29.955802
2024-05-16 13:57:25,700 UTC [76122] INFO    Maximum memory used (estimate): 225.9 GB
[...]
2024-05-16 13:57:27,644 UTC [76122] INFO    Run was successful

Should I update the time for these recipes in SPECIAL_RECIPES in generate.py?

What should we do with the recipes that don't run within 8 hours?
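For whoever updates SPECIAL_RECIPES, one way to choose a new time limit is to take the runtime reported in the log and add headroom. The sketch below parses the "Time for running the recipe was:" line shown above and rounds up to a SLURM time string in HH:MM:SS. The 25% margin, the whole-hour rounding and the `suggest_time_limit` helper are assumptions for illustration, not values taken from generate.py.

```python
import math
import re

# Matches e.g. "Time for running the recipe was: 4:20:19.772793".
# Runs longer than a day (logged with a "1 day, " prefix) are not handled.
RUNTIME_RE = re.compile(
    r"Time for running the recipe was: (\d+):(\d{2}):(\d{2}(?:\.\d+)?)"
)


def suggest_time_limit(log_line: str, margin: float = 0.25) -> str:
    """Return a SLURM time limit (HH:MM:SS) with headroom, rounded up to whole hours.

    Hypothetical helper; the margin and the rounding are illustrative assumptions.
    """
    match = RUNTIME_RE.search(log_line)
    if match is None:
        raise ValueError("no runtime found in log line")
    hours, minutes, seconds = match.groups()
    runtime_s = int(hours) * 3600 + int(minutes) * 60 + float(seconds)
    limit_h = math.ceil(runtime_s * (1 + margin) / 3600)
    return f"{limit_h:02d}:00:00"


line = ("2024-05-16 13:42:09,525 UTC [170675] INFO    "
        "Time for running the recipe was: 4:20:19.772793")
print(suggest_time_limit(line))  # 06:00:00
```

Rounding to whole hours keeps the values in generate.py easy to read; recipes whose suggested limit exceeds the partition maximum are the ones that need a different fix.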

@ehogan
Contributor

ehogan commented May 17, 2024

> 6. Recipes that failed because they used too much memory
>
>   • recipe_model_evaluation_basics.yml
>
> We've increased the memory on this one.

The following recipe is now running successfully, so I will update the comments above:

2024-05-16 09:28:34,122 UTC [86954] INFO    Time for running the recipe was: 0:01:42.672771
2024-05-16 09:28:34,977 UTC [86954] INFO    Maximum memory used (estimate): 73.2 GB
[...]
2024-05-16 09:28:35,092 UTC [86954] INFO    Run was successful

This recipe was added after ESMValTool v2.10.0, so it will need to be added to SPECIAL_RECIPES in generate.py.

@ehogan
Contributor

ehogan commented May 17, 2024

@bouweandela, @valeriupredoi, would it be possible to get some guidance on what to do now, please? How many of the failures above must we fix before moving onto the ESMValTool freeze and testing stages? Can all the diagnostic and data issues wait until ESMValTool testing? 🤔

@valeriupredoi
Contributor

Super work, guys! Here's me 3 cents (2 cents adjusted for inflation):

  • Julia example recipe is in the broken recipes list because the plot it produces is rubbish, see ESMValTool#3476 ("[Julia] Use NCDatasets instead of netCDF - masked values are treated as masked only in NCDatasets")
  • it'd be good to have a look at the recipes that failed due to a diagnostic error - please add a link against each of those pointing to the output, so we can tell whether it's the same stuff as in the last release (in which case we should prob put those in the broken recipes) or a new barf, in which case it'd need fixing
  • I'll have a looksee myself, but if a diagnostic is broken because of the diagnostic itself, i.e. not because of some ESMValCore functionality, it's best to ask the diagnostic developers by tagging them here (if there are none, then let's see who'd know best)

@schlunma
Contributor

One possible reason for some of these failures is iris' new attribute handling: since version 3.8, iris distinguishes between local and global attributes. We adopted this new behavior in #2398.

This was the reason for the errors in recipe_schlund20esd.yml (fixed in ESMValGroup/ESMValTool#3605) and recipe_wenzel16jclim.yml (fixed in ESMValGroup/ESMValTool#3603).
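Related to attribute handling: mapping types such as iris's CubeAttrsDict are not hashable, which is what the `TypeError: unhashable type: 'CubeAttrsDict'` reported earlier says. A generic workaround, when a diagnostic needs to group or deduplicate by attributes, is to convert them to a hashable, order-independent form first. The stdlib-only sketch below reproduces the error with a hypothetical `AttrsLike` stand-in class and applies that workaround; it is an illustration, not the fix applied in the linked ESMValTool PRs.

```python
from collections.abc import Mapping


class AttrsLike(dict):
    """Hypothetical stand-in for a mapping type such as iris's CubeAttrsDict."""


def attrs_key(attrs: Mapping) -> frozenset:
    """Return a hashable, order-independent key for a mapping of attributes.

    Assumes all attribute values are themselves hashable (str, int, float, ...);
    array-valued attributes would need converting, e.g. to tuples, first.
    """
    return frozenset(attrs.items())


a = AttrsLike(source="model X", frequency="mon")
b = AttrsLike(frequency="mon", source="model X")

error_message = ""
try:
    hash(a)  # mutable mappings cannot be hashed
except TypeError as err:
    error_message = str(err)
print(error_message)

print(len({attrs_key(a), attrs_key(b)}))  # 1: same attributes give the same key
```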

@ehogan
Contributor

ehogan commented May 22, 2024

> Super work, guys! Here's me 3 cents (2 cents adjusted for inflation):

Apologies @valeriupredoi, you did say this previously, and I promptly forgot! I will update the comment above appropriately 👍

@valeriupredoi
Contributor

Not a worry, Emma, release time is a very busy one 🙂

@bouweandela
Member

bouweandela commented May 23, 2024

> @bouweandela, @valeriupredoi, would it be possible to get some guidance on what to do now, please? How many of the failures above must we fix before moving onto the ESMValTool freeze and testing stages? Can all the diagnostic and data issues wait until ESMValTool testing? 🤔

If you suspect it is an ESMValCore issue, I would recommend fixing it before moving on to testing ESMValTool, but otherwise you should be fine to move on.

> Should I update the time for these recipes in SPECIAL_RECIPES in generate.py?

Yes, that would be helpful for the next release manager.

> What should we do with the recipes that don't run within 8 hours?

Are these recipes still running after 8 hours? In my experience, processes sometimes get killed without SLURM telling you. If there are no more log messages in the debug log or the diagnostic script logs long before the 8 hours are over, it seems likely that the process has silently crashed. If this is the case, you could try reducing the number of workers used by Dask. This can be done by configuring the distributed scheduler, or if there are non-lazy preprocessor functions #674 in the recipe, you can use the default scheduler and create a file called ~/.config/dask/dask.yml and put

num_workers: 16

in it. That will use just 16 threads instead of the default 128 on a standard Levante compute node, leaving 256 GB / 16 = 16 GB of RAM per thread instead of just 2 GB.
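The same arithmetic generalises: choose the number of Dask workers so that each one gets enough memory, without exceeding the cores on the node. The sketch below uses the Levante figures quoted above (256 GB, 128 threads by default); `suggest_num_workers` and the 16 GB-per-worker target are assumptions for illustration. The resulting number is what would go into `num_workers` in ~/.config/dask/dask.yml.

```python
def suggest_num_workers(node_memory_gb: float, cores: int,
                        memory_per_worker_gb: float) -> int:
    """Largest worker count giving each worker at least memory_per_worker_gb.

    Hypothetical helper; the result is capped at the number of cores and is at least 1.
    """
    if memory_per_worker_gb <= 0:
        raise ValueError("memory_per_worker_gb must be positive")
    return max(1, min(cores, int(node_memory_gb // memory_per_worker_gb)))


NODE_MEMORY_GB = 256  # Levante compute node, as quoted above
CORES = 128           # default number of Dask threads on that node

print(NODE_MEMORY_GB / CORES)             # 2.0 GB per thread by default
workers = suggest_num_workers(NODE_MEMORY_GB, CORES, 16)
print(workers, NODE_MEMORY_GB / workers)  # 16 16.0
```

Fewer workers means less parallelism, so this only pays off when the run is memory-bound rather than CPU-bound.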

@ehogan
Contributor

ehogan commented Jul 1, 2024

Closing this issue in favour of #2468 😊

ehogan closed this as completed Jul 1, 2024
5 participants