Tests are extremely flaky #4433

Closed
AndrewJGaut opened this issue Mar 31, 2023 · 9 comments
Assignees: AndrewJGaut
Labels: p1 (Do it in the next two weeks.)

Comments

@AndrewJGaut (Contributor) commented Mar 31, 2023

The CI tests are very flaky. Often, tests will fail and then pass when they are re-run.

AndrewJGaut added the p1 label Mar 31, 2023
AndrewJGaut self-assigned this Mar 31, 2023
@AndrewJGaut (Contributor, Author) commented Apr 27, 2023

After looking through some CI failures that occurred for commits to master (i.e. cases where the CI tests had already passed for the PR but then failed on master), I've identified the following tests that failed (and are therefore flaky):

- Kill
  - Examples: here
  - Status: Unsure about problem (see below comment)
- run
  - Examples: shared fs and Azure blob
  - Status: Same as kill (see below comment)
- Mimic
  - Examples: here and here
  - Status: PR ready for review.
- Make
  - Examples: here
  - Status: Found the problem. Want to chat about the solution.
- Resources
  - Examples: here and here
  - Status: Unsure about problem.
- General
  - Setup tests: Sometimes we get setup test failures. Not sure what causes this... e.g. here

Will now start looking through each of these in a PR...

@AndrewJGaut (Contributor, Author) commented Apr 29, 2023

Kill/run

First, note that the run tests that fail are due to failures in kill; see the examples above to see why.

This one perplexes me.

So, in the worker logs for the sharedFS run that failed here, we see the following:

2023-03-22 19:09:32,441 Bundle 0x468e9f41ecee4108895cec7001e9f01d is transitioning from RUN_STAGE.RUNNING to RUN_STAGE.CLEANING_UP. Reason: the bundle was killed /opt/codalab/worker/worker_run_state.py 142
2023-03-22 19:09:34,345 Bundle 0x468e9f41ecee4108895cec7001e9f01d is transitioning from RUN_STAGE.CLEANING_UP to RUN_STAGE.FINALIZING. Reason: Bundle is killed. uuid: 0x468e9f41ecee4108895cec7001e9f01d. failure message: Kill requested: User time quota exceeded. To apply for more quota, please visit the following link: https://codalab-worksheets.readthedocs.io/en/latest/FAQ/#how-do-i-request-more-disk-quota-or-time-quota /opt/codalab/worker/worker_run_state.py 142
2023-03-22 19:09:39,384 Got websocket message, got data: fv-az564-273, going to check in now. /opt/codalab/worker/worker.py 316
2023-03-22 19:09:39,689 Connected! Successful check in! /opt/codalab/worker/worker.py 505
2023-03-22 19:09:39,689 Received kill message: {'type': 'kill', 'uuid': '0x468e9f41ecee4108895cec7001e9f01d', 'kill_message': 'Kill requested: User time quota exceeded. To apply for more quota, please visit the following link: https://codalab-worksheets.readthedocs.io/en/latest/FAQ/#how-do-i-request-more-disk-quota-or-time-quota'} /opt/codalab/worker/worker.py 525
2023-03-22 19:09:39,689 Received mark_finalized message: {'type': 'mark_finalized', 'uuid': '0x468e9f41ecee4108895cec7001e9f01d'} /opt/codalab/worker/worker.py 525

Note first that 0x468e9f41ecee4108895cec7001e9f01d is the UUID of the bundle for which the failure occurs in the linked GHA test.

From these logs, we see Bundle 0x468e9f41ecee4108895cec7001e9f01d is transitioning from RUN_STAGE.RUNNING to RUN_STAGE.CLEANING_UP. Reason: the bundle was killed. This corresponds to the code here, which indicates the kill was indeed received. Moreover, we see Bundle 0x468e9f41ecee4108895cec7001e9f01d is transitioning from RUN_STAGE.CLEANING_UP to RUN_STAGE.FINALIZING. Reason: Bundle is killed., which again confirms the bundle was killed, and we see the failure message Kill requested: User time quota exceeded. To apply for more quota, please visit the following link: https://codalab-worksheets.readthedocs.io/en/latest/FAQ/#how-do-i-request-more-disk-quota-or-time-quota. This corresponds to the code here. Directly after this, the run_state has the new failure message added. Later on, the worker checks in and sends those runs to the server (see here).

Now, a bundle's terminal state is set here in the rest server. For the terminal state to be READY, both the failure_message and the exit_code would have to be None. Therefore, that bundle did not have a failure_message when its status was sent up to the rest server -- even though it must have had a failure message at some point, since one was logged while it was transitioning to finalizing. (Note: before doing this, the failure_message and exit_code are added to the bundle metadata in the transition_bundle_finalizing function here.)
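
To make that logic concrete, the terminal-state decision boils down to roughly the following (a paraphrased sketch based on the behavior described above, not the actual rest-server code; the function name is made up):

    # Rough sketch of the terminal-state decision described above.
    # NOT the actual rest-server code; names are illustrative.
    def terminal_state(failure_message, exit_code):
        if failure_message is None and exit_code is None:
            return "ready"
        return "failed"  # or "killed", depending on the failure message

    # The puzzle: the worker logged a failure message while finalizing,
    # yet the server computed READY, which means both fields were None
    # by the time the run was checked in.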

So, what is happening here? Currently, I'm not sure. In the PR that fixes the tests, I've added logging of worker_run.as_dict so that the next time a kill failure occurs we can see exactly what the worker sent up, but I'm genuinely unsure what's happening here. We could also try setting exclude_final_states to True when calling wait_until_state and hope that the bundle eventually reaches the KILLED state, but that makes me uncomfortable.
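
For reference, the exclude_final_states idea amounts to a polling loop roughly like this (an illustrative sketch with made-up names and semantics, not the helper from the CodaLab test suite):

    import time

    def wait_until_state(get_state, target, timeout=300, exclude_final_states=False,
                         final_states=("ready", "failed", "killed")):
        # Illustrative only: with exclude_final_states=True we keep polling even
        # if the bundle briefly reports some other final state, hoping it
        # eventually flips to `target` (e.g. KILLED) before the timeout.
        deadline = time.time() + timeout
        while time.time() < deadline:
            state = get_state()
            if state == target:
                return state
            if not exclude_final_states and state in final_states:
                raise AssertionError("reached final state %r, expected %r" % (state, target))
            time.sleep(1)
        raise TimeoutError("bundle never reached %r" % (target,))

The discomfort is that this can mask the real bug: the test would pass by waiting out a transient wrong final state rather than explaining why the failure message was lost.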

@AndrewJGaut (Contributor, Author) commented Apr 29, 2023

Make

The test that's failing in the example given is this one here (uploading local files to Azure Blob storage). The error is azure.core.exceptions.ResourceNotFoundError: Operation returned an invalid status 'The specified blob does not exist.' ErrorCode:BlobNotFound, which occurs because of the following error: Not found: azfs://devstoreaccount1/bundles/0x5982329733ef43c8aadfa08b21a3a29d/contents.gz. Interestingly, that's actually the UUID of the make bundle; in the bundle manager logs, we see these lines:

2023-04-05 22:28:00,662 Staging 0x5982329733ef43c8aadfa08b21a3a29d
2023-04-05 22:28:00,677 Making bundle 0x5982329733ef43c8aadfa08b21a3a29d
2023-04-05 22:28:01,201 Failing bundle 0x5982329733ef43c8aadfa08b21a3a29d: [Errno 2] Not found: azfs://devstoreaccount1/bundles/0x5982329733ef43c8aadfa08b21a3a29d/contents.gz

My hypothesis is this: for the 'make' command in the bundle CLI, we see that the make bundle is first created -- meaning that the rest server will add it to the database as a MAKE bundle -- and then the bundle location for it is added slightly later. My thinking right now is that there's a race condition wherein the bundle-manager starts making the bundle in the interval between when it's added to the database and when the bundle location is added.
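
To make the suspected race concrete, here is a small self-contained simulation of the ordering (illustrative only; the data structures and names below are made up and are not the CodaLab code): the client creates the bundle record first and registers its location a moment later, while the bundle manager polls for staged make bundles and can pick one up in that window.

    import threading
    import time

    bundles = {}    # uuid -> state; stands in for the bundle table
    locations = {}  # uuid -> storage URL; stands in for the bundle-location table
    lock = threading.Lock()

    def client_make(uuid):
        # Step 1: create the MAKE bundle record (what the CLI does first).
        with lock:
            bundles[uuid] = "staged"
        time.sleep(0.05)  # extra round trip before the location is registered
        # Step 2: register the bundle location slightly later.
        with lock:
            locations[uuid] = "azfs://devstoreaccount1/bundles/%s/contents.gz" % uuid

    def bundle_manager():
        # Polls for staged make bundles and starts making them immediately.
        for _ in range(50):
            with lock:
                for uuid, state in bundles.items():
                    if state == "staged":
                        bundles[uuid] = "making"
                        if uuid in locations:
                            print("Making bundle %s at %s" % (uuid, locations[uuid]))
                        else:
                            print("Failing bundle %s: Not found (no location registered yet)" % uuid)
            time.sleep(0.01)

    threading.Thread(target=bundle_manager, daemon=True).start()
    client_make("0x5982329733ef43c8aadfa08b21a3a29d")
    time.sleep(0.5)

If this is the right diagnosis, one possible direction for a fix is to register the location before (or atomically with) the bundle record, or to have the bundle manager skip make bundles whose location hasn't been registered yet.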

For the fix, maybe Jiani and Ashwin can help out since they worked on this portion of the CLI.

@AndrewJGaut (Contributor, Author) commented Apr 30, 2023

Resources

The two linked CI failures here and here are actually different failures within resources, which is interesting.

For the first failure, it looks like the bundle with the command python -c "import urllib.request; urllib.request.urlopen('https://www.google.com/').read()" failed, but there is no failure message (failure message was logged as None in the worker logs) even though it shows Exit Code 1. Thus, this one will be hard to debug unless I can replicate it locally, which I haven't yet been able to do.
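
One quick way to try reproducing this outside CodaLab (just a sketch; it only exercises the command itself, not the worker) is to run the same command in a loop and dump stderr for any nonzero exit code, since the CI logs didn't surface it:

    import subprocess

    CMD = """python -c "import urllib.request; urllib.request.urlopen('https://www.google.com/').read()" """

    for i in range(50):
        proc = subprocess.run(CMD, shell=True, capture_output=True, text=True)
        if proc.returncode != 0:
            # Print the exit code and stderr, which the CI worker logs did not show.
            print("run %d: exit code %d\n%s" % (i, proc.returncode, proc.stderr))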

Not sure what happened with the second one; the failure message is Error. Memory limit 10m exceeded., and it occurs when we run perl main.pl 1 1 5, where main.pl is the stress_test.pl file (which is uploaded here in the CLI test). I'm not sure why it fails for Kubernetes in this case; the args 1 1 5 indicate that only 1 MB of RAM should be used, yet it apparently goes over the memory limit. Perhaps @epicfaace has an idea here? This could be related to Kubernetes.

@epicfaace (Member)

Maybe for the first one we should try not connecting to Google / a live site?

@epicfaace (Member)

> the args 1 1 5 indicate that only 1 MB of RAM should be used, yet it apparently goes over the memory limit. Perhaps @epicfaace has an idea here? This could be related to Kubernetes.

Maybe it's not allocating 10 MB to the kubernetes pod...

@AndrewJGaut (Contributor, Author) commented May 1, 2023

> Maybe for the first one we should try not connecting to Google / a live site?

I think it's intentionally connecting to a live site because it's testing network access? The lines of code for that test are here:

    # Test network access
    REQUEST_CMD = """python -c "import urllib.request; urllib.request.urlopen('https://www.google.com').read()" """
    # Network access is set to true by default
    wait(_run_command([cl, 'run', REQUEST_CMD], request_memory="10m"), 0)
    # --request-network should behave the same as above
    wait(_run_command([cl, 'run', '--request-network', REQUEST_CMD], request_memory="10m"), 0)

> Maybe it's not allocating 10 MB to the kubernetes pod...

Yeah, that could be...

@epicfaace (Member)

> I think it's intentionally connecting to a live site because it's testing network access?

Yeah but is it possible that Google servers are doing something weird? Maybe try connecting to another server like https://en.wikipedia.org/?

epicfaace added a commit that referenced this issue May 1, 2023
Related to #4433 -- let's see if this fixes flakiness.
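
Presumably the commit just swaps the URL in REQUEST_CMD, along these lines (a sketch mirroring the test snippet quoted above; the exact diff is in the linked commit):

    # Test network access against Wikipedia instead of Google (presumed change)
    REQUEST_CMD = """python -c "import urllib.request; urllib.request.urlopen('https://en.wikipedia.org/').read()" """
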
@epicfaace (Member)

Testing in #4457

mergify bot added a commit that referenced this issue May 11, 2023
* Test - change google to wikipedia for network request

Related to #4433 -- let's see if this fixes flakiness.

* Update test_cli.py

---------

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>