
Consolidate regression test baseline set #1217

Merged
25 commits merged into OpenFAST:dev on Aug 18, 2022

Conversation

@rafmudaf (Collaborator) commented Aug 16, 2022

Feature or improvement description

This pull request changes the regression test infrastructure to support a single set of baseline solutions for all supported compiler and platform combinations. Whereas the existing method for comparing results uses a norm for each channel, the new method uses an element-wise comparison along with a tolerance level that is derived from the channel itself.

New method

The primary change is to compare the test and baseline results with two element-wise tolerance values. A relative tolerance determines the allowed deviation from the baseline, and an absolute tolerance sets the smallest values that are considered meaningful. Both tolerances are expressed as orders of magnitude less than the order of magnitude of the range of the channel. The default values are RTOL = 2 and ATOL = 1.9, and these have been tuned to the existing r-test suite. The comparison has the form

$$ | baseline - test | \le atol + rtol \cdot | baseline | $$

$$ atol = 10^{\log_{10}( \max (baseline) - \min (baseline) ) - ATOL} $$

$$ rtol = 10^{- RTOL} $$

atol is a function of the range of the baseline channel and sets the level of precision required to pass. The rtol term scales with the baseline channel values themselves, and it can be thought of as the level of accuracy required to pass the regression test. This comparison allows the threshold to scale with the magnitude of the baseline so that, for example, a deviation of 10 passes for a data point near 1e6 but fails for a data point near 1e2.
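For concreteness, here is a minimal NumPy sketch of the comparison described above for a single channel. The function name and signature are illustrative, not the PR's actual implementation, and a constant channel (zero range) would need special handling that is not shown:

```python
import numpy as np

def channel_passes(baseline: np.ndarray, test: np.ndarray,
                   RTOL: float = 2.0, ATOL: float = 1.9) -> bool:
    """Element-wise regression check for one output channel (1D time series)."""
    # Absolute tolerance scales with the range of the baseline channel:
    # atol = 10 ** (log10(max(baseline) - min(baseline)) - ATOL)
    atol = 10.0 ** (np.log10(np.max(baseline) - np.min(baseline)) - ATOL)
    # Relative tolerance is a fixed number of orders of magnitude
    rtol = 10.0 ** (-RTOL)
    # Element-wise test at every time step:
    # |baseline - test| <= atol + rtol * |baseline|
    return bool(np.all(np.abs(baseline - test) <= atol + rtol * np.abs(baseline)))
```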

Old method

The existing method replaced by this pull request uses a "relative L2 norm" and a single tolerance value:

$$ norm = || baseline - test ||_2 $$

$$ norm \le tolerance $$

When $\| baseline \|_2 \ge 1$, the error norm is divided by the L2 norm of the baseline.
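For reference, a sketch of that original check under the definitions above; the tolerance is left as a parameter here since the default used by the old scripts is not shown in this post:

```python
import numpy as np

def old_check_passes(baseline: np.ndarray, test: np.ndarray, tolerance: float) -> bool:
    """Single-tolerance 'relative L2 norm' check replaced by this PR (illustrative)."""
    norm = np.linalg.norm(baseline - test, 2)
    # Normalize by the baseline norm only when it is at least 1
    if np.linalg.norm(baseline, 2) >= 1.0:
        norm /= np.linalg.norm(baseline, 2)
    return norm <= tolerance
```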

Methods compared

For comparison, the plot below shows the difference between Intel and GNU compiler results for the HydroMxi channel in 5MW_ITIBarge_DLL_WTurb_WavesIrr. By visual inspection, this should pass the regression test.

*[plot: gnu vs intel]*

The old method for determining whether this passes the regression test is shown below. In this case, the test fails since the calculated norm exceeds the default threshold.

*[plot: norm vs tolerance]*

The new method of comparison is shown below. The threshold is directly related to the baseline data.

*[plot: error vs threshold]*

To get a sense for how much the test results are allowed to deviate from the baseline, I've plotted the upper and lower limits around the baseline channel and zoomed in to the portion of time with the most deviation. The dotted red line is the test result and the solid red line is the baseline. In this case, the bounds could be tighter around the baseline data, but this is a significant improvement over the old method.

*[plot: bounds around the baseline channel]*

OS / Compiler

This has been tested on the following OS / compiler combinations:

| OS      | Intel | GNU |
| ------- | ----- | --- |
| macOS   | ✔️    | ✔️  |
| Ubuntu  | ✔️    | ✔️  |
| Windows | ✔️    |     |

Disabled tests

A few tests have been consistently inconsistent across operating systems, compilers, and code changes. These tests have been disabled:

  • UAE_Dnwind_YRamp_WSt
  • SWRT_YFree_VS_WTurb

The two tests above had errors that were outliers compared to all the rest. Additionally, for the sake of simplifying the test suite and the effort required to update it, the following tests have been disabled; however, they should be investigated and reenabled in the future:

  • 5MW_OC4Jckt_DLL_WTurb_WavesIrr_MGrowth
  • 5MW_ITIBarge_DLL_WTurb_WavesIrr

Future work

This is a step toward improving the regression test infrastructure, and I intend to continue to iterate on this method as I get a sense for how well it captures changes to the results. Better tuning of the tolerance parameters would enable more sensitivity in the comparison. Additionally, it might be worthwhile to allow for specifying tolerances specific to each test case or channel. Further improvements will also involve improving the tests and their outputs to provide more meaningful data for comparison.
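As a rough idea of what per-case or per-channel tolerances could look like, here is a hypothetical configuration sketch; none of these names or override values exist in the PR, and the default values are the ones stated above:

```python
# Hypothetical override table: fall back to the defaults (RTOL=2, ATOL=1.9)
# unless a case/channel pair is listed. The override values are placeholders.
TOLERANCE_OVERRIDES = {
    "5MW_ITIBarge_DLL_WTurb_WavesIrr": {
        "HydroMxi": {"RTOL": 1.5, "ATOL": 1.5},
    },
}

def tolerances_for(case: str, channel: str) -> tuple:
    defaults = {"RTOL": 2.0, "ATOL": 1.9}
    cfg = TOLERANCE_OVERRIDES.get(case, {}).get(channel, defaults)
    return cfg["RTOL"], cfg["ATOL"]
```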

Additional supporting information

This allows for regression test baseline results to be updated from a single system (i.e. my MacBook Pro or eventually automatically through GitHub Actions). In the short term, I will continue to manually inspect the differences between solutions to ensure that there are no false-passing cases.

@rafmudaf added this to the v3.3.0 milestone on Aug 16, 2022
@rafmudaf self-assigned this on Aug 16, 2022
@jjonkman (Collaborator) commented:

@rafmudaf -- This is definitely a welcome change to the regression testing infrastructure; thanks for that! Just a few comments:

  • From the first equation in your post, my understanding is the comparison is made on a time-step-by-time-step basis. So, the equation is tested at every time step and the max() and min() are used to find the maximum and minimum value over the entire time series; you may want to state that clearly, e.g., by adding "(t)" to the relevant variables.

  • I would avoid using the term "current" when referring to the method, as it was not clear to me when I first read the post if you were referring to the original method or the proposed method. I suggest using "original" and "proposed" (or "new", or the like) to clarify.

  • Your post doesn't mention the "precision" used to compile. I assume the "baseline" and "test" solutions you are using are based on compiling in double precision. I would also be curious to know what the new tolerances have to be when comparing single to double precision, given that most users of OpenFAST run in single precision, which tends to have a much quicker time to solution.

Best regards,

@rafmudaf (Collaborator, Author) commented Aug 16, 2022

Thanks for the comments @jjonkman. Good points on the wording, I'll update the post to be more clear.

As for precision, this is all based on double precision. I didn't test in single precision since it is known that the solution can differ for reasons aside from numerical precision, such as those described in #1209 (comment). That being said, I'll do a quick study to get a general idea of the differences.

@rafmudaf (Collaborator, Author) commented:

In single precision, the following tests fail:

  2 - AWT_WSt_StartUp_HighSpShutDown (Failed)
  3 - AWT_YFree_WSt (Failed)
  4 - AWT_YFree_WTurb (Failed)
  6 - AOC_WSt (Failed)
  7 - AOC_YFree_WTurb (Failed)
  8 - AOC_YFix_WSt (Failed)
 11 - WP_VSP_ECD (Failed)
 13 - SWRT_YFree_VS_EDG01 (Failed)
 15 - 5MW_Land_DLL_WTurb (Failed)
 16 - 5MW_OC3Mnpl_DLL_WTurb_WavesIrr (Failed)
 17 - 5MW_OC3Trpd_DLL_WSt_WavesReg (Failed)
 19 - 5MW_TLP_DLL_WTurb_WavesIrr_WavesMulti (Failed)
 20 - 5MW_OC3Spar_DLL_WTurb_WavesIrr (Failed)
 21 - 5MW_OC4Semi_WSt_WavesWN (Failed)
 22 - 5MW_Land_BD_DLL_WTurb (Failed)
 23 - 5MW_OC4Jckt_ExtPtfm (Failed)
 24 - HelicalWake_OLAF (Failed)
 26 - StC_test_OC4Semi (Failed)

Most failing channels have a few portions of time where the error spikes briefly. Below is an example from 5MW_Land_BD_DLL_WTurb. I consider these false failures since comparing the test and baseline results (left plot) visually shows no difference. With the old method of checking regression test results, this sort of failing case would pass after visually confirming that the results are close.
*[screenshot: error spike example from 5MW_Land_BD_DLL_WTurb]*

However, some real failures are caught, such as these large spikes in the HelicalWake_OLAF case.
*[screenshot: large error spikes in HelicalWake_OLAF]*

Commits pushed during review:

  • These test cases are too sensitive to be used reliably for software testing
  • The function to determine the passing channels was incorrectly using a global minimum instead of the channel minimum to determine the absolute tolerance (see the sketch below).
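The second commit message describes a common NumPy pitfall; a minimal illustration with hypothetical data (not from the test suite) of the difference between a global and a per-channel minimum:

```python
import numpy as np

# Rows are time steps, columns are output channels.
data = np.array([[0.0, 100.0],
                 [1.0, 200.0]])

np.min(data)          # 0.0 -- global minimum, wrong for a per-channel atol
np.min(data, axis=0)  # array([  0., 100.]) -- one minimum per channel
```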
@jjonkman (Collaborator) commented:

Thanks, @rafmudaf. What are the two curves (red, blue) in your plots on the right involving ABS(local-baseline)?

@rafmudaf (Collaborator, Author) commented:

The red line is the error threshold for the regression test and the blue line is the error, abs(baseline - test). Thanks for asking about this, I'll add a legend to those plots.

    @@ -421,6 +421,7 @@ def check_error(self):
        def check_input_motions(self,nodePos,nodeVel,nodeAcc):
            # make sure number of nodes didn't change for some reason
            if self._initNumNodePts != self.numNodePts:
                # @ANDY TODO: `time` is not available here so this would be a runtime error
@andrew-platt (Collaborator) commented Aug 18, 2022:
Good catch. This error never occurred in the test cases, so it never got triggered. I have a few other revisions to make to this file, so I'll correct it then.

@andrew-platt (Collaborator) commented:

I'll hold off on merging until the legend addition is complete.

@rafmudaf merged commit 2ea9bcb into OpenFAST:dev on Aug 18, 2022
@bjonkman (Contributor) commented:

The manualRegressionTest.py script should also be changed so that it works with the updated execute*RegressionCase.py files.
