Enhance Ensemble-Stat to apply the HiRA method to ensembles #1583

TaraJensen · 2020-11-24T16:26:14Z

Describe the New Feature

The MET Office will need the computation of the HiRA method applied to ensembles.

This task is to enhance Ensemble-Stat to incorporate HiRA methodology into Ensemble-Stat.

In Point-Stat, the HiRA logic is controlled by the "hira" dictionary in the config file. It defines a whole separate set of processing logic and write ensemble output statistics and line types.

The incorporation in Ensemble-Stat should be much simpler. Ensemble-Stat already writes ensemble output statistics and line types. The only wrinkle added by HiRA is providing more flexibility as to which ensemble member values are used. That logic is currently controlled by the "interp" dictionary. Output statistics are generated for each interpolation type defined in the config file. Right now, all of these methods compute a SINGLE forecast value for each point observation value (e.g. the NEAREST, MIN, MAX, UW_MEAN, or DW_MEAN). Instead of computing this summary over nearby forecast grid points, we just want to use ALL of the nearby grid point values. That increases the ensemble size from N ensemble members to N members * M points in the HiRA neighborhood.

Rather than adding a "hira" dictionary to the Ensemble-Stat config file, recommend supporting a new "interp.type.method" option. See the existing supported methods listed here. Plan to name the new option "HIRA" to be specific about its intended use. But that name could be something else, like "ALL" for "all grid points in the neighborhood".

Most of the code changes to support this should lie in the PairDataEnsemble class, although there will likely be changes to the Ensemble-Stat application code. Also, add checks to make sure that "method = HIRA" is ONLY supported by Ensemble-Stat. All other tools should error out if the method is requested.

Note that Ensemble-Stat can be run with both point and gridded observations. If interp.type.method = HIRA, compute output when verifying against POINT observations, but do not compute output when verifying against GRIDDED analyses. Just log a message stating the HIRA only applies to points and the gridded vx step is being skipped for that interpolation method.

The config entry would look something like this:

interp = {
   field = OBS;
   vld_thresh = 1.0;

   type = [
      { method = HIRA;
        width  = 3;
        shape = SQUARE; }
   ];
}

@mpm-meto please review this logic and advise.

Acceptance Testing

Time Estimate

Estimate the amount of work required here.
Issues should represent approximately 1 to 3 days of work.

Sub-Issues

Consider breaking the new feature down into sub-issues.

Add a checkbox for each sub-issue here.

Relevant Deadlines

Jun 2021

Funding Source

2799991 Met Office

Define the Metadata

Assignee

Select engineer(s) or no engineer required
Select scientist(s) or no scientist required

Labels

Select component(s)
Select priority
Select requestor(s)

Projects and Milestone

Review projects and select relevant Repository and Organization ones or add "alert:NEED PROJECT ASSIGNMENT" label
Select milestone to next major version milestone or "Future Versions"

Define Related Issue(s)

Consider the impact to the other METplus components.

METplus, MET, METdatadb, METviewer, METexpress, METcalcpy, METplotpy
Loading HiRA output with many columns into METdatadb is problematic and will require code changes. The current upper limit for METdatadb is 150 columns.
METdatadb issue for handling very long lines: Handle variable length lines with more than 125 columns after the header, especially for HiRA METdataio#78

New Feature Checklist

See the METplus Workflow for details.

The text was updated successfully, but these errors were encountered:

mpm-meto · 2022-01-07T15:52:40Z

Applying HiRA to ensembles effectively means that what is done for a single deterministic forecast in terms of the extraction of neighbourhoods from the forecast field has to be repeated n times before the stats are computed, i.e. if you have a 3 x 3 neighbourhood it gives a pseudo ensemble of 9 for which the scores are calculated in the deterministic case. When the same 3 x 3 neighbourhood is applied to an ensemble of 18 members then 3 x 3 x 18 values are extracted to create a pseudo ensemble from which the scores are calculated. Put another way the neighbourhood relevant to a given location has to be extracted first from each ensemble member; all the values need to be concatenated to create one (long) vector before being fed into the statistical calculation.

One other thing to bear in mind is that we calculate two kinds of bias ratios with HiRA. A mean bias ratio and a bias ratio which is more like a frequency bias. It would be great if these could also be considered for inclusion at some point.

The mean bias ratio is what it says on the tin, calculating the mean of the neighbourhood values (across a site or all sites) normalised by the single or mean for a site or all sites in a list.
The bias ratio is defined so:

…d data is present in the ensemble mean.

…ult.

… dictionary rather than ens.field, which fails when ens.field is an empty array.

…nsemble members when creating the output txt files... esp when HiRA is not selected. ci-run-unit

JohnHalleyGotway · 2022-02-17T05:48:18Z

This is now working. Still need to:

Add hira dictionary to existing Ensemble-Stat config files
Add/modify a unit test to demonstrate
Update documentation

Two considerations noted during development:

Since we set the ensemble mean to bad data for HiRA, the ECNT ME and RMSE stats are bad data in the output. Also SSVAR is not written since "skill" is computed relative to the ensemble mean which is bad data for HiRA.
With HiRA the number of ensemble members does NOT remain constant in a single run. So the ORANK, RELP, and RHIST line types no longer have a consistent number of columns. The only problem with that is that now the header columns of the _orank.txt, _relp.txt, and _rhist.txt optional ascii files are no longer accurate. The header columns are currently written based on the MAXIMUM number of ensemble members. I could perhaps pad out the shorter lines with a bunch of NA's to make things line up well... but that'd cause more problems than its worth. Stat-Analysis wouldn't know how to handle that padded output and METdatadb wouldn't know how to load it.

…HiRA as an interpolation method for Ensemble-Stat. Having the entire HiRA dictionary in Ensemble-Stat brings unecessary complications when the only part that we actually want/need is the size and shape. This simpler solution requires less documentation updates and should have not upstream impacts on METplus.

… simple interpolation option.

…t. This is also run by unit_met_test_scripts.xml. Also, update the downstream STAT-Analysis job to produce output -by INTERP_MTHD,INTERP_PNTS.

JohnHalleyGotway · 2022-02-23T01:30:40Z

@mpm-meto I have a followup question for you about HiRA in Ensemble-Stat.

@venitahagerty, @TatianaBurek, and I were discussing loading the Ensemble-Stat HiRA output into METdatadb. @venitahagerty was surprised to see an output RHIST line containing 151 ranks. That's the result of running HiRA with a 5x5 neighborhood on an ensemble of size 6 (6 members * 25 neighborhood points + 1 rank = 151).

My question is this. Should we constrain what output line types are written when the interpolation method is set to HiRA in Ensemble-Stat? For example, we could consider skipping the variable length lines, like RELP, RHIST, and ORANK. Line types with a fixed number of columns (like ECNT, RPS, PHIST, and SSVAR) aren't problematic. But the ones that depend on the number of ensemble "members" can get VERY long VERY fast.

Should we skip RELP, RHIST, and ORANK when interp.method = HIRA? Or should we just let the user configure it in whatever way they'd like?

mpm-meto · 2022-02-24T09:04:23Z

@JohnHalleyGotway "But the ones that depend on the number of ensemble "members" can get VERY long VERY fast.".... yes we know that! We currently DO write these out, because of the way our current system works. We also use these values to compute the CRPS and its components, for example. Our output file format is binary however. Even so the files become VERY large and with the output frequency they are a pain. With the 11 x 11 neighbourhoods (the largest neighbourhood used for MOGREPS-UK), which runs hourly with 18 members.... therefore there are 2178 values for each observing location every hour, out to t+120h!

In terms of utility, as the code for deterministic and ensemble is the same in our system at the moment, we have used the rank histograms for the deterministic forecasts.

So I have some questions of my own:

Is the metdatadb coping with the deterministic HiRA RHIST files ok?
Does "skipping" mean that the option of writing these line types (and downstream support) is forever removed? Would it compromise anything else? (e.g. the CRPS calculation)?
Also, wouldn't it introduce an inconsistency with the deterministic HiRA?

Given the urgency I would suggest that these are just set to default OFF for both deterministic and ensembles for now. Switching them on should come with a huge health warning in the documentation too. Though, if this "breaks" the metdatadb then skipping them will be the only option at this point! We do need to revisit this in terms of how useful these line types are more generally and whether we can just switch them off for good. But, I can't make that judgement call at short notice. It requires a little bit more thought. @venitahagerty @TatianaBurek

JohnHalleyGotway · 2022-02-24T15:41:25Z

@mpm-meto thanks for the feedback. I was looking for your confirmation that the MetOffice does indeed use the HiRA derived ranked histograms. I wasn't sure if that was the case or if you only used CRPS from the ECNT line type, for example.

So I'll make no changes to the software. If you request RHIST output and enable HiRA, then you'll get what you get. Two details do come to mind:

In Ensemble-Stat, the output_flag can be defined separately for each vx task... or at least that's supposed to be the case! But during my current development for MET Enhance Ensemble-Stat to compute probabilistic statistics for user-defined or climatology-based thresholds. #1259, I found that that was NOT happening. It was an easy fix and that's included in the Enhance Ensemble-Stat to compute probabilistic statistics for user-defined or climatology-based thresholds. #1259 feature branch.
It occurs to me that the PHIST line type might be very useful for HiRA. The Probability Integral Transform rescales the observation rank (which is totally dependent on the number of members) to a number between 0 and 1. Using PHIST would enable the ranks to be compared across multiple HiRA neighborhood sizes. Have you used PIT Histograms on HiRA output in the past?

@venitahagerty and @TatianaBurek we'll just need to figure out how to handle outputs with 2000+ columns.

JohnHalleyGotway · 2022-03-08T18:43:05Z

The GitHub action testing workflow for METplus flagged differences in the use case output from MET. On 3/8/22, @georgemccabe and I examined those differences and determined that 2 of the 3 of them are the direct and expected result of recent development and fixes in MET.

However, the 3rd difference in the output from Ensemble-Stat was unexpected and likely indicates insufficient logic in the changes for this issue. Specifically, this line of ens_stats.cc likely needs to change:

   // HiRA data has no ensemble mean
   bool skip_mean = pd.mn_na.has(bad_data_double);

The Ensemble-Stat differences appeared in the WEST and CONUS masking regions, but not the EAST and LMV regions. Here's the TMP/Z2 ensemble mean field produced in this case:

Since the CONUS and EAST regions contain missing data values, that "skip_mean" flag was set to true, and therefore all ensemble mean-realted outputs, like ME, RMSE, and SSVAR are missing from the output of this run. Need to refine the skip_mean logic so that it only applies to HiRA outputs.

…sed statistics and SSVAR output. Instead of skipping for ANY instance of bad data in the ensemble mean, only skip when the ensemble mean is entirely bad data.

TaraJensen added this to the MET 10.1 milestone Nov 24, 2020

TaraJensen removed the alert: NEED CYCLE ASSIGNMENT Need to assign to a release development cycle label May 25, 2021

TaraJensen added the required: FOR OFFICIAL RELEASE Required to be completed in the official release for the assigned milestone label May 25, 2021

JohnHalleyGotway removed the component: application code label Jun 10, 2021

JohnHalleyGotway mentioned this issue Sep 15, 2021

Enhance Ensemble-Stat to compute probabilistic statistics for user-defined or climatology-based thresholds. #1259

Closed

21 tasks

TaraJensen added priority: blocker Blocker and removed priority: high labels Jan 6, 2022

TaraJensen assigned TaraJensen and JohnHalleyGotway and unassigned JohnHalleyGotway Jan 18, 2022

JohnHalleyGotway added a commit that referenced this issue Feb 16, 2022

Per #1583, update PairDataEnsemble::compute_ssvar() to bail out if ba…

627b8dc

…d data is present in the ensemble mean.

JohnHalleyGotway added a commit that referenced this issue Feb 16, 2022

Per #1583, check for null pointer in process_n_vld() to avoid a segfa…

d4fe286

…ult.

JohnHalleyGotway added a commit that referenced this issue Feb 16, 2022

Per #1583, parse nbrhd_prob and nmep_smooth from the top-level config…

53d09c2

… dictionary rather than ens.field, which fails when ens.field is an empty array.

JohnHalleyGotway added a commit that referenced this issue Feb 17, 2022

Per #1583, fix logic in ensemble_stat for determining the number of e…

e2fe9a2

…nsemble members when creating the output txt files... esp when HiRA is not selected. ci-run-unit

JohnHalleyGotway added a commit that referenced this issue Feb 17, 2022

Per #1583, update Ensemble-Stat application code to support HiRA as a…

f8440eb

… simple interpolation option.

JohnHalleyGotway added a commit that referenced this issue Feb 17, 2022

Per #1583, ci-run-unit.

89c8308

JohnHalleyGotway added a commit that referenced this issue Feb 17, 2022

Per #1583, update warning messages about when HiRA can be used.

15dc749

JohnHalleyGotway linked a pull request Feb 17, 2022 that will close this issue

Feature 1583 es hira #2056

Merged

14 tasks

JohnHalleyGotway added a commit that referenced this issue Feb 17, 2022

Per #1583, fix documentation indent warning.

d17ebf3

JohnHalleyGotway closed this as completed Feb 18, 2022

This was referenced Feb 18, 2022

Add new bias ratio statistic to the ECNT line type from Ensemble-Stat and for HiRA #2058

Closed

Update develop-ref after #2056, #2057, and #2060 #2059

Merged

JohnHalleyGotway changed the title ~~Enhance Ensemble-Stat to apply the HiRA method to Ensembles~~ Enhance Ensemble-Stat to apply the HiRA method to ensembles Mar 2, 2022

This was referenced Mar 2, 2022

Handle variable length lines with more than 125 columns after the header, especially for HiRA dtcenter/METdataio#78

Closed

Handle variable length lines with more than 125 columns after the header, especially for HiRA (issue #78) dtcenter/METdataio#79

Merged

JohnHalleyGotway reopened this Mar 8, 2022

georgemccabe mentioned this issue Mar 8, 2022

feature 1490 Python Embedding for Point Observations dtcenter/METplus#1491

Merged

13 tasks

JohnHalleyGotway linked a pull request Mar 8, 2022 that will close this issue

Feature 1583 skip_mean #2090

Merged

15 tasks

georgemccabe mentioned this issue Mar 9, 2022

Update Develop-ref after #1491, #1492, and MET changes dtcenter/METplus#1499

Merged

JohnHalleyGotway closed this as completed Mar 11, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enhance Ensemble-Stat to apply the HiRA method to ensembles #1583

Enhance Ensemble-Stat to apply the HiRA method to ensembles #1583

TaraJensen commented Nov 24, 2020 •

edited by JohnHalleyGotway

Loading

mpm-meto commented Jan 7, 2022

JohnHalleyGotway commented Feb 17, 2022

JohnHalleyGotway commented Feb 23, 2022

mpm-meto commented Feb 24, 2022 •

edited

Loading

JohnHalleyGotway commented Feb 24, 2022 •

edited

Loading

JohnHalleyGotway commented Mar 8, 2022 •

edited

Loading

Enhance Ensemble-Stat to apply the HiRA method to ensembles #1583

Enhance Ensemble-Stat to apply the HiRA method to ensembles #1583

Comments

TaraJensen commented Nov 24, 2020 • edited by JohnHalleyGotway Loading

Describe the New Feature

Acceptance Testing

Time Estimate

Sub-Issues

Relevant Deadlines

Funding Source

Define the Metadata

Assignee

Labels

Projects and Milestone

Define Related Issue(s)

New Feature Checklist

mpm-meto commented Jan 7, 2022

JohnHalleyGotway commented Feb 17, 2022

JohnHalleyGotway commented Feb 23, 2022

mpm-meto commented Feb 24, 2022 • edited Loading

JohnHalleyGotway commented Feb 24, 2022 • edited Loading

JohnHalleyGotway commented Mar 8, 2022 • edited Loading

TaraJensen commented Nov 24, 2020 •

edited by JohnHalleyGotway

Loading

mpm-meto commented Feb 24, 2022 •

edited

Loading

JohnHalleyGotway commented Feb 24, 2022 •

edited

Loading

JohnHalleyGotway commented Mar 8, 2022 •

edited

Loading