
Consolidate regression test baseline set #1217

Merged
25 commits merged into OpenFAST:dev on Aug 18, 2022

Conversation

@rafmudaf (Collaborator) commented Aug 16, 2022

Feature or improvement description

This pull request changes the regression test infrastructure to support a single set of baseline solutions for all supported compiler and platform combinations. Whereas the existing method for comparing results uses a norm for each channel, the new method uses an element-wise comparison along with a tolerance level that is derived from the channel itself.

New method

The primary change is to compare the test and baseline results with two element-wise tolerance values. A relative tolerance determines the allowed deviation from the baseline, and an absolute tolerance sets the smallest values that are considered meaningful. Both tolerances are expressed as orders of magnitude less than the order of magnitude of the range of the channel. The default values are RTOL = 2 and ATOL = 1.9, and these have been tuned to the existing r-test suite. The comparison has the form

$$ | baseline - test | \le atol + rtol \cdot | baseline | $$

$$ atol = 10^{\log_{10}( \max (baseline) - \min (baseline) ) - ATOL} $$

$$ rtol = 10^{- RTOL} $$

atol is a function of the range of the baseline channel and sets the level of precision required to pass. The rtol term scales with the baseline channel values themselves, and it can be thought of as the level of accuracy required to pass the regression test. This comparison allows the threshold to scale with the magnitude of the baseline so that, for example, a deviation of 10 passes for a data point near 1e6 but fails for a data point near 1e2.
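For concreteness, here is a minimal NumPy sketch of the comparison described above for a single channel. The function name and signature are illustrative, not the PR's actual implementation, and a constant channel (zero range) would need special handling that is not shown:

```python
import numpy as np

def channel_passes(baseline: np.ndarray, test: np.ndarray,
                   RTOL: float = 2.0, ATOL: float = 1.9) -> bool:
    """Element-wise regression check for one output channel (1D time series)."""
    # Absolute tolerance scales with the range of the baseline channel:
    # atol = 10 ** (log10(max(baseline) - min(baseline)) - ATOL)
    atol = 10.0 ** (np.log10(np.max(baseline) - np.min(baseline)) - ATOL)
    # Relative tolerance is a fixed number of orders of magnitude
    rtol = 10.0 ** (-RTOL)
    # Element-wise test at every time step:
    # |baseline - test| <= atol + rtol * |baseline|
    return bool(np.all(np.abs(baseline - test) <= atol + rtol * np.abs(baseline)))
```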

Old method

The existing method replaced by this pull request uses a "relative L2 norm" and a single tolerance value:

$$ norm = || baseline - test ||_2 $$

$$ norm \le tolerance $$

When $\| baseline \|_2 \ge 1$, the error norm is divided by the L2 norm of the baseline.
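For reference, a sketch of that original check under the definitions above; the tolerance is left as a parameter here since the default used by the old scripts is not shown in this post:

```python
import numpy as np

def old_check_passes(baseline: np.ndarray, test: np.ndarray, tolerance: float) -> bool:
    """Single-tolerance 'relative L2 norm' check replaced by this PR (illustrative)."""
    norm = np.linalg.norm(baseline - test, 2)
    # Normalize by the baseline norm only when it is at least 1
    if np.linalg.norm(baseline, 2) >= 1.0:
        norm /= np.linalg.norm(baseline, 2)
    return norm <= tolerance
```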

Methods compared

For comparison, the plot below shows the difference between Intel and GNU compiler results for the HydroMxi channel in 5MW_ITIBarge_DLL_WTurb_WavesIrr. By visual inspection, this should pass the regression test.

*[plot: gnu vs intel]*

The old method for determining whether this passes the regression test is shown below. In this case, the test fails since the calculated norm exceeds the default threshold.

*[plot: norm vs tolerance]*

The new method of comparison is shown below. The threshold is directly related to the baseline data.

*[plot: error vs threshold]*

To get a sense for how much the test results are allowed to deviate from the baseline, I've plotted the upper and lower limits around the baseline channel and zoomed in to the portion of time with the most deviation. The dotted red line is the test result and the solid red line is the baseline. In this case, the bounds could be tighter around the baseline data, but this is a significant improvement over the old method.

*[plot: bounds around the baseline channel]*

OS / Compiler

This has been tested on the following OS / compiler combinations:

| OS      | Intel | GNU |
| ------- | ----- | --- |
| macOS   | ✔️    | ✔️  |
| Ubuntu  | ✔️    | ✔️  |
| Windows | ✔️    |     |

Disabled tests

A few tests have been consistently inconsistent across operating systems, compilers, and code changes. These tests have been disabled:

  • UAE_Dnwind_YRamp_WSt
  • SWRT_YFree_VS_WTurb

The two tests above had errors that were outliers compared to all the rest. Additionally, for the sake of simplifying the test suite and the effort required to update it, the following tests have been disabled; however, they should be investigated and reenabled in the future:

  • 5MW_OC4Jckt_DLL_WTurb_WavesIrr_MGrowth
  • 5MW_ITIBarge_DLL_WTurb_WavesIrr

Future work

This is a step toward improving the regression test infrastructure, and I intend to continue to iterate on this method as I get a sense for how well it captures changes to the results. Better tuning of the tolerance parameters would enable more sensitivity in the comparison. Additionally, it might be worthwhile to allow for specifying tolerances specific to each test case or channel. Further improvements will also involve improving the tests and their outputs to provide more meaningful data for comparison.
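As a rough idea of what per-case or per-channel tolerances could look like, here is a hypothetical configuration sketch; none of these names or override values exist in the PR, and the default values are the ones stated above:

```python
# Hypothetical override table: fall back to the defaults (RTOL=2, ATOL=1.9)
# unless a case/channel pair is listed. The override values are placeholders.
TOLERANCE_OVERRIDES = {
    "5MW_ITIBarge_DLL_WTurb_WavesIrr": {
        "HydroMxi": {"RTOL": 1.5, "ATOL": 1.5},
    },
}

def tolerances_for(case: str, channel: str) -> tuple:
    defaults = {"RTOL": 2.0, "ATOL": 1.9}
    cfg = TOLERANCE_OVERRIDES.get(case, {}).get(channel, defaults)
    return cfg["RTOL"], cfg["ATOL"]
```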

Additional supporting information

This allows for regression test baseline results to be updated from a single system (i.e. my MacBook Pro or eventually automatically through GitHub Actions). In the short term, I will continue to manually inspect the differences between solutions to ensure that there are no false-passing cases.

@rafmudaf added this to the v3.3.0 milestone on Aug 16, 2022
@rafmudaf self-assigned this on Aug 16, 2022
@jjonkman (Collaborator) commented:

@rafmudaf -- This is definitely a welcome change to the regression testing infrastructure; thanks for that! Just a few comments:

  • From the first equation in your post, my understanding is the comparison is made on a time-step-by-time-step basis. So, the equation is tested at every time step and the max() and min() are used to find the maximum and minimum value over the entire time series; you may want to state that clearly, e.g., by adding "(t)" to the relevant variables.

  • I would avoid using the term "current" when referring to the method, as it was not clear to me when I first read the post if you were referring to the original method or the proposed method. I suggest using "original" and "proposed" (or "new", or the like) to clarify.

  • Your post doesn't mention the "precision" used to compile. I assume the "baseline" and "test" solutions you are using are based on compiling in double precision. I would also be curious to know what the new tolerances have to be when comparing single to double precision, given that most users of OpenFAST run in single precision, which tends to have a much quicker time to solution.

Best regards,

@rafmudaf (Collaborator, Author) commented Aug 16, 2022

Thanks for the comments @jjonkman. Good points on the wording, I'll update the post to be more clear.

As for precision, this is all based on double precision. I didn't test in single precision since it is known that the solution can differ for reasons aside from numerical precision, such as those described in #1209 (comment). That being said, I'll do a quick study to get a general idea of the differences.

@rafmudaf (Collaborator, Author) commented:

In single precision, the following tests fail:

  2 - AWT_WSt_StartUp_HighSpShutDown (Failed)
  3 - AWT_YFree_WSt (Failed)
  4 - AWT_YFree_WTurb (Failed)
  6 - AOC_WSt (Failed)
  7 - AOC_YFree_WTurb (Failed)
  8 - AOC_YFix_WSt (Failed)
 11 - WP_VSP_ECD (Failed)
 13 - SWRT_YFree_VS_EDG01 (Failed)
 15 - 5MW_Land_DLL_WTurb (Failed)
 16 - 5MW_OC3Mnpl_DLL_WTurb_WavesIrr (Failed)
 17 - 5MW_OC3Trpd_DLL_WSt_WavesReg (Failed)
 19 - 5MW_TLP_DLL_WTurb_WavesIrr_WavesMulti (Failed)
 20 - 5MW_OC3Spar_DLL_WTurb_WavesIrr (Failed)
 21 - 5MW_OC4Semi_WSt_WavesWN (Failed)
 22 - 5MW_Land_BD_DLL_WTurb (Failed)
 23 - 5MW_OC4Jckt_ExtPtfm (Failed)
 24 - HelicalWake_OLAF (Failed)
 26 - StC_test_OC4Semi (Failed)

Most failing channels have a few portions of time where the error spikes briefly. Below is an example from 5MW_Land_BD_DLL_WTurb. I consider these false failures since comparing the test and baseline results (left plot) visually shows no difference. With the old method of checking regression test results, this sort of failing case would pass after visually confirming that the results are close.
*[screenshot: error spike example from 5MW_Land_BD_DLL_WTurb]*

However, some real failures are caught, such as these large spikes in the HelicalWake_OLAF case.
*[screenshot: large error spikes in HelicalWake_OLAF]*

Commits pushed during review:

  • These test cases are too sensitive to be used reliably for software testing
  • The function to determine the passing channels was incorrectly using a global minimum instead of the channel minimum to determine the absolute tolerance (see the sketch below).
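The second commit message describes a common NumPy pitfall; a minimal illustration with hypothetical data (not from the test suite) of the difference between a global and a per-channel minimum:

```python
import numpy as np

# Rows are time steps, columns are output channels.
data = np.array([[0.0, 100.0],
                 [1.0, 200.0]])

np.min(data)          # 0.0 -- global minimum, wrong for a per-channel atol
np.min(data, axis=0)  # array([  0., 100.]) -- one minimum per channel
```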
@jjonkman (Collaborator) commented:

Thanks, @rafmudaf. What are the two curves (red, blue) in your plots on the right involving ABS(local-baseline)?

@rafmudaf (Collaborator, Author) commented:

The red line is the error threshold for the regression test and the blue line is the error, abs(baseline - test). Thanks for asking about this, I'll add a legend to those plots.

    @@ -421,6 +421,7 @@ def check_error(self):
        def check_input_motions(self,nodePos,nodeVel,nodeAcc):
            # make sure number of nodes didn't change for some reason
            if self._initNumNodePts != self.numNodePts:
                # @ANDY TODO: `time` is not available here so this would be a runtime error
@andrew-platt (Collaborator) commented Aug 18, 2022:
Good catch. This error never occurred in the test cases, so it never got triggered. I have a few other revisions to make to this file, so I'll correct it then.

@andrew-platt (Collaborator) commented:

I'll hold off on merging until the legend addition is complete.

@rafmudaf merged commit 2ea9bcb into OpenFAST:dev on Aug 18, 2022
@bjonkman (Contributor) commented:

The manualRegressionTest.py script should also be changed so that it works with the updated execute*RegressionCase.py files.
