-
Notifications
You must be signed in to change notification settings - Fork 284
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix finding of failed tests in output of PyTorch test step #2859
base: develop
Are you sure you want to change the base?
Conversation
…easyblock test step + add test for it
…e-letter local variable 'm')
…led_tests_info in PyTorch easyblock
… extend test for that function
@Flamefire suggested changes implemented and test enhanced, please take another look before I fire off test builds again with this? |
…style checker (E101 indentation contains mixed spaces and tabs)
LGTM except for the failing check for the signaled test. Also I'd suggest to use the named tuple members in the easyblock, not just the test. |
…in test step of PyTorch easyblock
"distributed/fsdp/test_fsdp_input (2 total tests, failures=2)", | ||
"distributions/test_distributions (216 total tests, errors=4)", | ||
"test_autograd (464 total tests, failures=1, skipped=52, expected failures=1)", | ||
"test_fx (123 total tests, errors=2, skipped=2)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This or the SIGSEGV test needs to be another one or failed_test_suites
could pass when the signal was not detected as in the case of a forced termination due to a signla there won't be a test summary, will there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we have a bug here. For the output of
Running distributions/rpc/test_tensorpipe_agent ... [2023-01-12 09:06:37.093571]
...
Ran 123 tests in 7.549s
FAILED (errors=2, skipped=2)
...
test_fx failed! Received signal: SIGSEGV
it must NOT match the RegExp. The original, correct output is:
FAILED (errors=2, skipped=2)
distributed/rpc/test_tensorpipe_agent failed!
So the RegExp must be made to only add the counting of "FAILED" if the next line is the failed test, see https://gist.github.com/Flamefire/dc1403ccefdebfc3412c6fbb2d5cbabd#file-pytorch-1-9-0-foss-2020b_partial-log-L480
I would actually add this wrong output as a test against regressions. With that snipped it should only find test_fx
as failed (with an unknown number of tests)
error_cnt += get_count_for_pattern(r"([0-9]+) error", failure_summary) | ||
failed_test_suites.append(test_suite) | ||
failed_tests_info = extract_failed_tests_info(tests_out) | ||
failure_report = failed_tests_info.failure_report |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: To avoid long distracting lines like sorted(set(failed_tests_info.failed_test_suites))
(2 times "failed") maybe either unwrap the named tuple here: failure_report, failed_test_suites = failed_tests_info.failure_report, failed_tests_info.failed_test_suites
etc. or rename to test_info
as test_info.failed_test_suites
is easier to read.
And maybe the test_cnt
extraction can be done in there too which would fit the name even better (then drop the "failed" from the function name).
@boegel @Flamefire Status on this one? |
This conflicts with #3085 after that got merged possibly making all changes to the EasyBlock file unnecessary. Not fully sure though if there were any changes to the patterns that are not included yet. I also have #3255 which further improves the function by pulling the last bits of the parsing out of the easyblock. |
Test reports in easybuilders/easybuild-easyconfigs#16286 and easybuilders/easybuild-easyconfigs#16385 showed that installation was failing because no details were found on
distributed/test_c10d_gloo
andtest_jit_cuda_fuser
That's fixed in this PR, and I've also fleshed out the code that extracts the details on the failing test suites/groups into a dedicated function, so we can test it in CI
@Flamefire Please take a look?