
Improvements to the CSV compare tool for zero reference signals #4421

Open
casella opened this issue Jun 18, 2024 · 12 comments
@casella
Contributor

casella commented Jun 18, 2024

While reviewing regressions from MSL 4.0.0 to MSL 4.1.0 (#4341) with @tobolar, we stumbled upon several reference signals which were very close to zero, but not exactly zero due to numerical approximation. In some cases (e.g. ModelicaTest.MultiBody.Joints.Universal.diff.universal.w_a) the variable, which ideally has an equilibrium trajectory at a constant zero value, is actually unstable, so any small numerical error grows exponentially.

For non-zero variables, numerical errors are normally not a problem, because their relative magnitude is small, so they easily fall within the tolerance tube. Unfortunately, this is not the case for variables that are meant to be zero. In fact, the more precise the solution is, the more absurdly small the high and low limits will be. We had regressions from MSL 4.0.0 to MSL 4.1.0 because of variables that were off by 1.3e-16 😨

IMHO there are two proper solutions to this problem.

The first could be to somehow include the nominal attribute of each variable in the CSV data. This would allow checking the error with a criterion such as

abs(v - v_ref)/max(abs(v), v.nominal) < tol

which would be robust also for variables that are ideally zero, since the relative scaling would then be performed with their nominal value.
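
As a purely illustrative sketch (in Python, not the csv-compare tool's actual implementation), assuming the nominal value is supplied alongside the reference data and the helper name is mine:

# Hypothetical helper: relative error with the denominator floored at the
# variable's nominal value, so ideally-zero signals are judged on an
# absolute scale instead of an ever-shrinking relative one.
def within_tolerance(v, v_ref, nominal, tol=1e-3):
    return all(abs(x - x_ref) / max(abs(x), nominal) <= tol
               for x, x_ref in zip(v, v_ref))

# A value off by 1.3e-16 from a zero reference passes when the nominal
# (assumed 1.0 here) provides the scaling.
print(within_tolerance([1.3e-16], [0.0], nominal=1.0))  # True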

The second is to take into account the fact that in most (but not all) cases these variables are parts of a vector, e.g. the three Cartesian coordinates of a position, velocity, or acceleration signal, the components of a quaternion, or the Real and Imaginary parts of a Complex number. In this case, the error should of course be evaluated by looking at the entire vector, e.g.

abs(v[j] - v_ref[j])/max(sqrt(sum(v[k]^2 for k in 1:size(v, 1))), v.nominal) < tol

for 3D vectors or

abs(v.re - v_ref.re)/max(sqrt(v.re^2 + v.im^2), v.nominal) < tol

for Complex numbers.
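
A minimal sketch of the vector-scaled variant (again only illustrative; the helper name and the assumption that the whole vector is available to the comparison are mine):

import math

# Hypothetical helper: each component's error is scaled by the Euclidean
# norm of the whole vector, floored at the nominal value.
def vector_within_tolerance(v, v_ref, nominal, tol=1e-3):
    scale = max(math.sqrt(sum(x * x for x in v)), nominal)
    return all(abs(x - x_ref) / scale <= tol for x, x_ref in zip(v, v_ref))

# A component that should be zero is accepted as long as the other
# components keep the vector norm well above the nominal value.
print(vector_within_tolerance([1.3e-16, 0.0, 2.0], [0.0, 0.0, 2.0], nominal=1e-3))  # True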

Unfortunately, there is no clear Modelica semantics to declare that an array is a 3D vector or a quaternion, or that a Complex operator record is actually a vector in the Gauss plane. Maybe we need some annotations in the language to express that unambiguously? This could be added to the base classes in MSL, without having to add this information in every user model.

Until these issues are resolved, I am convinced that we should avoid including reference signals that are expected to be exactly zero in the analytic solution, because they are basically just misleading. Of course you can reproduce them with the same tool, but even different versions of the same tool may produce completely different results, not to mention comparing different tools that use different numerical methods. We should avoid wasting time over "fake regressions" on such variables.

What do you think?

@casella casella changed the title Improvements to the CVS compare tool for zero reference signals. Improvements to the CSV compare tool for zero reference signals. Jun 18, 2024
@AHaumer
Contributor

AHaumer commented Jun 18, 2024

I think @casella is right but it could be tedious to define nominal values for all reference signals manually.
As far as I know, all tools should determine some sort of nominal values to calculate integration errors and determine the proper step size of variable-step-size integration algorithms. It's just a question of how to access these nominal values to utilize them ...
@HansOlsson any idea?

@casella
Contributor Author

casella commented Jun 19, 2024

@AHaumer I meant the nominal attributes. They are often defined in basic SI types (e.g. Pressure has nominal = 1e5)

@HansOlsson
Contributor

A simple and pragmatic solution would be to assume that all of the variables have a nominal of at least 1e-3, and update the test-program based on that. We might miss some cases that should have a tighter tolerance - but we can deal with that later - without running the risk of accidentally removing those signals in this step.

@HansOlsson
Contributor

HansOlsson commented Jun 19, 2024

I think @casella is right but it could be tedious to define nominal values for all reference signals manually. As far as I know, all tools should determine some sort of nominal values to calculate integration errors and determine the proper step size of variable-step-size integration algorithms. It's just a question of how to access these nominal values to utilize them ... @HansOlsson any idea?

After additional thought I think this will be quite complicated, whereas just updating the tool to use a fixed nominal value (of 1e-3 or just 1) will be straightforward and solve the problem; a sketch of where such a floor would enter is given after the list below.
Note that due to how nominals should work, it will only be relevant for variables whose nominal is lower than this fixed value. Pressures with a nominal of 1e5 will work as well (or badly) as before.

The reasons are due to the internals:

  • The comparison is based on result-files converted to csv-files; and there is no obvious place to store the nominal values in the CSV-files.
    • Storing it in the header would be the most obvious, but would mess up other uses of the CSV-files.
    • Storing it as a value with fake-time would "work", but would also mess it up in other cases.
  • Storing it in comparisonSignals.txt as a second column would have the additional problems that the information isn't directly used by csv-compare, as far as I understand, and that it would also be problematic to update.
  • The result files (at least from Dymola) don't currently contain this information so it cannot be copied to the csv-files. It would be possible to add it, but with similar issues.
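
To illustrate where the fixed floor would enter (this is only a sketch, not the actual csv-compare tube algorithm; the relative tolerance value is assumed for the example):

# Sketch only: assume the tolerance tube's half-width is derived from the
# magnitude of the reference signal; flooring the scale at a fixed minimal
# nominal (1e-3) keeps the tube from collapsing around ideally-zero signals,
# while a pressure with nominal 1e5 behaves exactly as before.
MIN_NOMINAL = 1e-3

def tube_half_width(ref_value, rel_tol=2e-3):
    return rel_tol * max(abs(ref_value), MIN_NOMINAL)

print(tube_half_width(0.0))  # 2e-06 instead of 0.0
print(tube_half_width(1e5))  # 200.0, unchanged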

@henrikt-ma
Contributor

The second is to take into account the fact that in most (but not all) cases these variables are parts of a vector, e.g. the three Cartesian coordinates of a position, velocity, or acceleration signal, the components of a quaternion, or the Real and Imaginary parts of a Complex number. In this case, the error should of course be evaluated by looking at the entire vector, e.g.

For what it's worth, this is a feature we use for our own correctness comparisons. However, we don't have any automatic mechanism for identifying when the feature should be used, so we apply it manually as needed.

A closely related feature is that when the unit quaternion is actually marked as a parameterization of SO(3), we can account for the quotient space effect of q and -q representing the same orientation. In practice, we rarely need this feature, but it addresses a fundamental problem when initializing orientations from complicated nonlinear equation systems where it is hard (or not desirable) to control exactly which of the two equivalent quaternions to expect.

@casella
Contributor Author

casella commented Jun 28, 2024

A simple and pragmatic solution would be to assume that all of the variables have a nominal of at least 1e-3, and update the test-program based on that. We might miss some cases that should have a tighter tolerance - but we can deal with that later - without running the risk of accidentally removing those signals in this step.

It's not the perfect solution but I see it as a reasonable one.

@HansOlsson the question is: who can take care of this and how long would that take? Can we make it in time for MSL 4.1.0 without delaying it unnecessarily?

@beutlich
Member

beutlich commented Sep 10, 2024

What I propose is to add an extra (and optional) row (explicitly marked by the keyword "nominal") for the nominal values of each signal. For example:

"time", "v1", "v2"
0.0, 0.0, 1e-8 
1.0, 1.0, -2e-8
2.0, 1.5, 1.5e-9
3.0, 1.0, 5e-9
4.0, 0.0, -1e-8

becomes

"time", "v1", "v2"
"nominal", 1.0, 1e-8
0.0, 0.0, 1e-8 
1.0, 1.0, -2e-8
2.0, 1.5, 1.5e-9
3.0, 1.0, 5e-9
4.0, 0.0, -1e-8

This is all simple, self-contained within the simulation result file, optional and explicit.
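
Purely as an illustration, a sketch of how such a file could be read in Python (the function name and the default nominal of 1.0 for files without that row are my assumptions, not part of the proposal):

import csv

# Hypothetical reader for the proposed format: if the first record after the
# header starts with the keyword "nominal", it is split off as per-signal
# nominal values; otherwise a default is used for all signals.
def read_csv_with_nominals(path, default_nominal=1.0):
    with open(path, newline="") as f:
        rows = list(csv.reader(f, skipinitialspace=True))
    header = rows[0]
    if rows[1][0] == "nominal":
        nominals = dict(zip(header[1:], map(float, rows[1][1:])))
        data = [[float(x) for x in row] for row in rows[2:]]
    else:
        nominals = {name: default_nominal for name in header[1:]}
        data = [[float(x) for x in row] for row in rows[1:]]
    return header, nominals, data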

@beutlich beutlich changed the title Improvements to the CSV compare tool for zero reference signals. Improvements to the CSV compare tool for zero reference signals Sep 10, 2024
@beutlich beutlich pinned this issue Sep 10, 2024
@casella
Contributor Author

casella commented Sep 17, 2024

Would this still qualify as a CSV file?

@HansOlsson
Contributor

Would this still qualify as a CSV file?

I would say it would kind of qualify, but most tools using CSV files for signals expect that after the (optional) header there is just numerical data, not text, so it wouldn't work well.

To me "kind of CSV" would lose the main portability advantage of a normal CSV-file.

@beutlich
Member

I see, it's only common to have one optional header record in CSV files (which are non-standardized anyway), not more than one. Still, I consider my proposal convenient enough for the above-mentioned reasons.

@casella
Contributor Author

casella commented Sep 17, 2024

Maybe one could have two CSV output files, one for the simulation results and one for the nominal values. This wouldn't cause problems when reading the CSV file with functions meant for CSV files.
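
For example (file name and layout purely illustrative, not part of the proposal), the companion file could simply repeat the signal names with a single row of nominal values:

"v1", "v2"
1.0, 1e-8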

@bilderbuchi

bilderbuchi commented Sep 18, 2024

I see, it's only common to have one optional header record in CSV files (which are non-standardized anyway), not more than one. Still, I consider my proposal convenient enough for the above-mentioned reasons.

If you have to go down that route, I'd recommend at least putting such a preamble before the column header lines, i.e. before the regular content, and marking it with a comment character.

# "nominal", 1.0, 1e-8
"time", "v1", "v2"
0.0, 0.0, 1e-8 
1.0, 1.0, -2e-8
2.0, 1.5, 1.5e-9
3.0, 1.0, 5e-9
4.0, 0.0, -1e-8

Some CSV parsing APIs have a way to specify how many lines to skip at the beginning (e.g. pandas' read_csv() skiprows argument), so that the "regular" content can at least be parsed correctly without having to manually delete the first line and without confusing potential datatype detection (otherwise the first value of the time column is a string).

Also, some APIs have a way to indicate a comment character like # with lines that should be ignored (again, read_csv(comment='#',...)), so the leading # helps in that case.

Finally, if you need to get at the content of the "nominal" row, just reading in the first line of the file is simpler than having to count the lines.
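
A quick sketch of that reading pattern, assuming the commented-preamble layout above and an illustrative file name:

import pandas as pd

# Regular data: the '#'-prefixed preamble line is ignored entirely.
data = pd.read_csv("reference.csv", comment="#", skipinitialspace=True)

# Nominal values: read only the first line and strip the comment marker.
with open("reference.csv") as f:
    first_line = f.readline()
nominals = [float(x) for x in first_line.lstrip("#").split(",")[1:]]
print(nominals)  # [1.0, 1e-08]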
