Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Patch minimal_surface Regression Test #2674

Merged

Conversation

JacksonBurns
Copy link
Contributor

@JacksonBurns JacksonBurns commented Jun 9, 2024

Edit 9/15/24: the runner is crashing due to OOM errors. There is a solution to this that involves reducing the memory usage in the regression tests, but implementing this solution is hard. We have opted for an easier solution: severely limit the size of the mechanism. This does not cover as much 'science' as we would like the regression test to do, but does at least offer some basic correctness checks. At this point this is preferable to not testing at all, as we are currently doing since the 'better' solution is quite challenging.

Original PR content follows:

Problem

I discovered in #2595 and #2669 that the minimal_surface regression test comparison causes the runners to crash with exit code 143. To allow things to still be merged in the meantime, I have removed that test from the CI and intend to fix it in this PR.

TODO:

  • diagnose the issue(s)
  • implement a patch
  • drop 862adb8
  • merge, hopefully passing all tests (unit and regression)

Calling in @sevyharris for some help on this one!

@JacksonBurns
Copy link
Contributor Author

I think it is crashing at cat "$regr_test-edge.log" - maybe the contents of that file are too big for the minimal_surface example?

@JacksonBurns
Copy link
Contributor Author

It's actually not that, but instead:

              if python-jl rmgpy/tools/regression.py \
                test/regression/"$regr_test"/regression_input.py \
                $REFERENCE/"$regr_test"/chemkin \
                test/regression/"$regr_test"/chemkin

@sevyharris
Copy link
Contributor

sevyharris commented Jun 10, 2024

Not particularly informative, but I pulled the latest from main and checked that the minimal_surface example runs without issue. I can call regression.py on the result (using the same chemkin folder for old and new) and it passes, so there are no obvious problems with the surface example itself.

I'm going to create a second PR like this one so I can play around with the CI.yml and hopefully get some additional diagnostic info.

@JacksonBurns
Copy link
Contributor Author

Beat me to it! Thanks for running that. Given that it works locally, that leads me to believe that we are probably just running out of memory on the GitHub actions runners...

Copy link

Regression Testing Results

Detailed regression test results.

Regression test minimal_surface:

Reference: Execution time (DD:HH:MM:SS): 00:00:00:43
Current: Execution time (DD:HH:MM:SS): 00:00:00:43
Reference: Memory used: 2885.76 MB
Current: Memory used: 2885.76 MB

minimal_surface Passed Core Comparison ✅

Original model has 11 species.
Test model has 11 species. ✅
Original model has 3 reactions.
Test model has 3 reactions. ✅

minimal_surface Passed Edge Comparison ✅

Original model has 38 species.
Test model has 38 species. ✅
Original model has 38 reactions.
Test model has 38 reactions. ✅
Complete log is available in the GitHub actions artifact.

Observables Test Case: minimal_surface Comparison

✅ All Observables varied by less than 0.500 on average between old model and new model in all conditions!

minimal_surface Passed Observable Testing ✅

beep boop this comment was written by a bot 🤖

@sevyharris
Copy link
Contributor

It looks like you're right about the memory/model size being an issue since it passed with the size limits. ^

Richard has suggested that since it looks like it's failing after it's already generated the model, the problem could be that the regression is trying to compare too many time points. I will see if that's an obvious culprit and if so if it can be reduced to a more reasonable number.

@JacksonBurns
Copy link
Contributor Author

Sounds like a plan! Feel free to edit on this PR.

I will defer to your and Richard's judgement about how to reduce the computational power needed for this test while still keeping it scientifically meaningful. I've cut the number of species by about a factor of 10, which I am assuming is too much...

@rwest
Copy link
Member

rwest commented Jun 10, 2024

Richard has suggested that since it looks like it's failing after it's already generated the model, the problem could be that the regression is trying to compare too many time points. I will see if that's an obvious culprit and if so if it can be reduced to a more reasonable number.

Looking a little bit more at this (I was not recalling how the regression comparisons are done) it is also running Cantera simulations at this step, via /rmgpy/tools/observablesregression.py which calls into rmgpy/tools/canteramodel.py. So it could be any of these steps that are taking the resources - running cantera or comparing the cantera results. Should probably do a bit more instrumentation before guessing the cause.

@JacksonBurns
Copy link
Contributor Author

@sevyharris I just realized we never added the minimal_surface regression test to RMG-database, so we should do that once we have something acceptable here.

@JacksonBurns JacksonBurns marked this pull request as draft June 12, 2024 00:32
Copy link

This pull request is being automatically marked as stale because it has not received any interaction in the last 90 days. Please leave a comment if this is still a relevant pull request, otherwise it will automatically be closed in 30 days.

@github-actions github-actions bot added the stale stale issue/PR as determined by actions bot label Sep 10, 2024
@sevyharris sevyharris removed the stale stale issue/PR as determined by actions bot label Sep 10, 2024
@JacksonBurns
Copy link
Contributor Author

hi @sevyharris i realized this PR and #2675 fell off my radar. Do we want to merge an 'easy' fix like the one that I have here? Or still try and figure out the root cause and get a 'better' fix?

@sevyharris
Copy link
Contributor

hi @sevyharris i realized this PR and #2675 fell off my radar. Do we want to merge an 'easy' fix like the one that I have here? Or still try and figure out the root cause and get a 'better' fix?

Hey @JacksonBurns, I'm in favor of merging this version now. (I totally forgot about this too!) I'll try to come up with a fix to bring down the memory usage for testing without limiting the number edge species so severely, but it'd be great to have at least something in place. Thanks!

github actions runners have relatively limited memory

the mechanism size that would result from this would crash the runners with OOM errors when attempting to run cantera simulations

this fixes the issue by limiting the size of the mechanism. this does remove the scientific validity of the test, but at least covers some basic functionality and correctness checks
@JacksonBurns JacksonBurns changed the title Fix minimal_surface Regression Test Patch minimal_surface Regression Test Sep 15, 2024
@JacksonBurns JacksonBurns marked this pull request as ready for review September 15, 2024 17:53
@JacksonBurns
Copy link
Contributor Author

@sevyharris and/or @rwest please review. This should fail CI since there are no "stable" results for this test to compare against, so I will bypass and merge the failing CI with an approval here. Thanks!

@JacksonBurns
Copy link
Contributor Author

I confirm that the CI is failing where expected.

Copy link
Contributor

@sevyharris sevyharris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@JacksonBurns JacksonBurns merged commit a2e1e60 into ReactionMechanismGenerator:main Sep 16, 2024
3 of 4 checks passed
@JacksonBurns JacksonBurns deleted the ci/minimal_surface branch September 16, 2024 13:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants