Add workflow to run model with supplied nudging tendency #215
Conversation
Some over-arching points:
- There's some configuration currently provided using global constants which should probably be provided in a yaml file somewhere. This workflow could take them in as command-line arguments, or via the fv3config.yml.
- The "nudging" runfile seems to actually be a "forcing" runfile. I would suggest refactoring the names to reflect that you are forcing a model using saved nudging data, rather than nudging the model against a reference.
- Some of the helper methods around config dictionaries should probably be added to fv3config rather than fv3net.
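For instance, the global constants that appear later in this workflow's scripts could be grouped under a custom key in the fv3config.yml. The key names and structure below are only illustrative, not an actual fv3config schema:

```yaml
# Hypothetical fv3config.yml fragment: workflow settings moved out of
# Python global constants. Key names here are illustrative only.
nudging_workflow:
  root_url: gs://vcm-ml-data/2020-03-30-learned-nudging-FV3GFS-runs
  latlon_variables: [u_dt_nudge, v_dt_nudge, t_dt_nudge, ps_dt_nudge, delp_dt_nudge]
```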
fv3net/runtime/config.py
Outdated
def get_config():
    """Return fv3config dictionary"""
    with open("fv3config.yml") as f:
        config = yaml.safe_load(f)
    return config
This should be added to fv3config as a `config_from_yaml` routine to match `config_to_yaml`. I would imagine `fv3config.yml` should be a global constant somewhere, possibly in fv3net.runtime.
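A minimal sketch of the suggested pairing, assuming both routines simply wrap PyYAML (the real fv3config implementations may differ):

```python
import yaml  # PyYAML


def config_to_yaml(config: dict, path: str) -> None:
    """Write an fv3config configuration dictionary to a YAML file."""
    with open(path, "w") as f:
        yaml.safe_dump(config, f)


def config_from_yaml(path: str) -> dict:
    """Load an fv3config configuration dictionary from a YAML file."""
    with open(path) as f:
        return yaml.safe_load(f)
```

With a helper like this, the `get_config` function above reduces to a single call that loads the configuration from a known path.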
I added a config_to_yaml function to fv3config (will put up a PR in a moment) and added a TODO to refactor to use that once fv3net points to an updated fv3config version.
Can you just bump the patch version on fv3config also, and point fv3net to the updated fv3config? You could do the version bumping and new features in the same PR, as described in the release instructions. That would avoid having to keep track of this and fix it later.
Done.
fv3net/runtime/config.py
Outdated
def get_timestep():
    """Return model timestep in seconds"""
    return get_namelist()["coupler_nml"]["dt_atmos"]
This should be added as a helper method in fv3config (in derive).
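A hedged sketch of what such a helper could look like in fv3config's derive module, taking the config dictionary as an argument instead of reading it from disk (the function fv3config actually provides may differ):

```python
def get_timestep(config: dict) -> int:
    """Return the model timestep in seconds from an fv3config dictionary.

    Assumes the fv3config layout in which the Fortran namelist is stored
    under the "namelist" key of the configuration dictionary.
    """
    return config["namelist"]["coupler_nml"]["dt_atmos"]
```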
Added a function to fv3config, and put in TODO to refactor that usage.
Refactored workflow to use fv3config version of get_timestep().
Is it a workflow which has issues with this, or a package like fv3net or vcm? If it's a package, can this also be added to the
Noah ran into this issue in the new one-step workflow, and I ran into it in this workflow. These both use the prognostic run image, so just pinning the version there should handle it. Looks like it's already been pinned in fv3net's
FYI, I think the bug you are referring to has been temporarily accommodated in the latest release of xarray (0.15.1), pydata/xarray#3764, and will ultimately be fixed in pandas in their next release (1.1.0), pandas-dev/pandas#32905.
Thanks Oli. This looks like a lot of work, and overall quite clean.
Apart from some minor comments and FYIs below, my main concern is the number of different scripts used by this workflow. There are scripts for preparing, submitting, and preprocessing the jobs, and it is not 100% clear what the entry point of this workflow is, since it is not documented in the README. My preference would be that running `make -C workflows/run_with_learned_nudging` from the fv3net root would do everything that needs to happen (including preprocessing) for this job to complete successfully.
Also please add some explanation to HISTORY.rst.
@@ -106,6 +106,23 @@ sources:
     access: read_only
     urlpath: "gs://vcm-ml-data/2020-02-25-additional-november-C3072-simulation-C384-diagnostics/atmos_8xdaily_C3072_to_C384.zarr"

+  GFS_analysis_T85_2015_2016:
Thanks for adding this. I think this will be very helpful for us.
@@ -63,9 +62,9 @@ def _parse_categories(diagnostic_categories, rundir):
     return diagnostic_categories


-def _parse_diagnostic_dir(diagnostic_dir, rundir):
+def _get_diagnostic_dir(diagnostic_dir, rundir):
nit: This function does a very general thing, but has an overly specific name. I would just move this logic into the `run` function:

    diagnostic_dir = rundir if args.diagnostic_dir is None else args.diagnostic_dir
Done.
]


def _get_and_write_nudge_files_description_asset(
This function has an "and" in the name. I would move the listing and writing into two separate function calls in `update_config_for_nudging`.
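A sketch of the suggested split, with illustrative names that do not match the actual fv3net code: the listing and the writing become two single-purpose helpers composed inside `update_config_for_nudging`.

```python
def get_nudge_files_description(nudge_file_names):
    """Build the text listing of nudge files, one file name per line."""
    return "".join(f"{name}\n" for name in nudge_file_names)


def write_nudge_files_description(description, path):
    """Write the nudge-files listing to disk as a run-directory asset."""
    with open(path, "w") as f:
        f.write(description)


def update_config_for_nudging(config, nudge_file_names, description_path):
    """Illustrative composition: list the nudge files, then write the asset.

    The config handling itself is elided here; this only shows the
    two-call structure suggested in the review.
    """
    write_nudge_files_description(
        get_nudge_files_description(nudge_file_names), description_path
    )
    return config
```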
Done.
@@ -1,7 +1,7 @@
 ## Diagnostics-to-zarr workflow
 This workflow takes a path/url to a run directory as an input and saves zarr stores
 of the diagnostic model output to a specified location. This workflow requires a
-specific xarray version (0.14.0) and so to run locally, one must ensure your
+specific xarray version (0.15.0) and so to run locally, one must ensure your
I don't think we should mention this in the README, since it is bound to get out of sync, and you already have it in the setup.py.
Removed the reference to a particular version number.
@@ -0,0 +1,5 @@
+This workflow allows an external nudging tendency be applied to FV3GFS runs.
Can you add a title at the `##` level so that this README is correctly interpreted by the sphinx docs?
Likewise, please mention the path `workflows/run_with_learned_nudging` so people will know how to get to this folder when looking at the sphinx docs.
Done.
@@ -0,0 +1,28 @@
+IMAGE = us.gcr.io/vcm-ml/prognostic_run:mean_nudging
It is not clear where this docker image is built from this PR and what is in it. Are you planning to adjust this to point at a versioned tag of the prognostic_run image before merging?
Okay, added another Dockerfile in `fv3net/docker`. I didn't add this new docker image to the `build_images` rule in fv3net's Makefile, since it's not used by any of the steps in the main end-to-end pipeline. There are instructions in the README about building the image, though.
# constants
ROOT_URL=gs://vcm-ml-data/2020-03-30-learned-nudging-FV3GFS-runs
LATLON_VAR_LIST=DLWRFsfc,DSWRFsfc,DSWRFtoa,LHTFLsfc,PRATEsfc,SHTFLsfc,ULWRFsfc,ULWRFtoa,USWRFsfc,USWRFtoa,TMP2m,TMPsfc,SOILM,PRESsfc,ucomp,vcomp,temp,sphum,ps_dt_nudge,delp_dt_nudge,u_dt_nudge,v_dt_nudge,t_dt_nudge
There are some improvements in #153 that could avoid explicitly listing these in the future.
Good to know.
if [ "$DO_REGRID" = true ]; then
    for RUN in $RUNS; do
        # regrid certain monthly-mean variables to lat-lon grid
        argo --cluster $ARGO_CLUSTER submit workflows/fregrid_cube_netcdfs/pipeline.yaml \
If you pass `-w` to argo submit, this command will block until the job completes.
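As a sketch, the suggested change to the command above (the `-w`/`--wait` flag is from the argo CLI; the other arguments are copied from the script):

```shell
# Adding -w makes "argo submit" wait until the workflow completes
# before returning, so a later step can depend on its output.
argo --cluster $ARGO_CLUSTER submit -w workflows/fregrid_cube_netcdfs/pipeline.yaml
```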
Okay. `postprocess.sh` is the last step, so no need to block, but good to know.
config_url = os.path.abspath(config_url)
config = prepare_config(template, base_config, args.nudge_label, config_url)
fs = vcm.cloud.get_fs(config_url)
fs.mkdirs(config_url, exist_ok=True)
FYI, the new run_k8s can be passed the config dictionary directly, without needing it to be uploaded first.
Got it, thanks for the reminder. Already uploading other assets, so I'll leave this be.
@@ -0,0 +1,47 @@
+#!/bin/bash
Can you include this postprocessing as a dependency in the makefile someplace? It is currently not clear how it fits into the rest of the pipeline. Ideally, typing `make` should result in everything being run that needs to be run.
Added some comments to the README as we discussed.
Thanks for the heads up, Spencer!
LGTM, thanks for the changes!
LGTM. Thanks!
-This workflow allows an external nudging tendency be applied to FV3GFS runs.
+## Run with learned nudging workflow
+
+This workflow (in `workflows/run_with_learned_nudging`) allows an external nudging
Thanks for improving the readme.
Major changes

- Code from the `single_fv3gfs_runs` workflow has now been moved to `fv3net/pipelines/kube_jobs` and tests have been added. This code move goes against our new guideline (I did the refactor before that guideline was discussed). I think the correct ultimate solution is to move the entire `kube_jobs` set of modules to an external package, as they are now being used by four workflows (one_step_jobs, prognostic_c48_run, single_fv3gfs_run, and the new run_with_learned_nudging). I'd rather not do that now, since I don't want to mess with the core ML workflow steps.

Minor changes

- Added `get_protocol` to `vcm.cloud.__init__.py`
- Updated the `diagnostics-to-zarr` workflow so that by default it saves Zarr stores of diagnostic output in the rundir of the given experiment instead of in the parent of the rundir.