Tests are extremely flaky #4433

Closed
AndrewJGaut opened this issue Mar 31, 2023 · 9 comments
Assignees: AndrewJGaut
Labels: p1 (Do it in the next two weeks.)

Comments

@AndrewJGaut (Contributor) commented Mar 31, 2023

The CI tests are very flaky. Often, tests will fail and then pass when they are re-run.

AndrewJGaut added the p1 label Mar 31, 2023
AndrewJGaut self-assigned this Mar 31, 2023
@AndrewJGaut (Contributor, Author) commented Apr 27, 2023

After looking through some CI failures that occurred for commits to master (i.e. cases where the CI tests had already passed for the PR but then failed on master), I've identified the following tests that failed (and are therefore flaky):

- Kill
  - Examples: here
  - Status: Unsure about problem (see below comment)
- run
  - Examples: shared fs and Azure blob
  - Status: Same as kill (see below comment)
- Mimic
  - Examples: here and here
  - Status: PR ready for review.
- Make
  - Examples: here
  - Status: Found the problem. Want to chat about the solution.
- Resources
  - Examples: here and here
  - Status: Unsure about problem.
- General
  - Setup tests: Sometimes we get setup test failures. Not sure what causes this... e.g. here

Will now start looking through each of these in a PR...

@AndrewJGaut (Contributor, Author) commented Apr 29, 2023

Kill/run

First, note that the run tests that fail are due to failures in kill; see the examples above to see why.

This one perplexes me.

So, in the worker logs for the sharedFS run that failed here, we see the following:

2023-03-22 19:09:32,441 Bundle 0x468e9f41ecee4108895cec7001e9f01d is transitioning from RUN_STAGE.RUNNING to RUN_STAGE.CLEANING_UP. Reason: the bundle was killed /opt/codalab/worker/worker_run_state.py 142
2023-03-22 19:09:34,345 Bundle 0x468e9f41ecee4108895cec7001e9f01d is transitioning from RUN_STAGE.CLEANING_UP to RUN_STAGE.FINALIZING. Reason: Bundle is killed. uuid: 0x468e9f41ecee4108895cec7001e9f01d. failure message: Kill requested: User time quota exceeded. To apply for more quota, please visit the following link: https://codalab-worksheets.readthedocs.io/en/latest/FAQ/#how-do-i-request-more-disk-quota-or-time-quota /opt/codalab/worker/worker_run_state.py 142
2023-03-22 19:09:39,384 Got websocket message, got data: fv-az564-273, going to check in now. /opt/codalab/worker/worker.py 316
2023-03-22 19:09:39,689 Connected! Successful check in! /opt/codalab/worker/worker.py 505
2023-03-22 19:09:39,689 Received kill message: {'type': 'kill', 'uuid': '0x468e9f41ecee4108895cec7001e9f01d', 'kill_message': 'Kill requested: User time quota exceeded. To apply for more quota, please visit the following link: https://codalab-worksheets.readthedocs.io/en/latest/FAQ/#how-do-i-request-more-disk-quota-or-time-quota'} /opt/codalab/worker/worker.py 525
2023-03-22 19:09:39,689 Received mark_finalized message: {'type': 'mark_finalized', 'uuid': '0x468e9f41ecee4108895cec7001e9f01d'} /opt/codalab/worker/worker.py 525

Note first that 0x468e9f41ecee4108895cec7001e9f01d is the UUID of the bundle for which the failure occurs in the linked GHA test.

From these logs, we see Bundle 0x468e9f41ecee4108895cec7001e9f01d is transitioning from RUN_STAGE.RUNNING to RUN_STAGE.CLEANING_UP. Reason: the bundle was killed. This corresponds to the code here, which indicates the kill was indeed received. Moreover, we see Bundle 0x468e9f41ecee4108895cec7001e9f01d is transitioning from RUN_STAGE.CLEANING_UP to RUN_STAGE.FINALIZING. Reason: Bundle is killed., which again confirms the bundle was killed, and we see the failure message Kill requested: User time quota exceeded. To apply for more quota, please visit the following link: https://codalab-worksheets.readthedocs.io/en/latest/FAQ/#how-do-i-request-more-disk-quota-or-time-quota. This corresponds to the code here. Directly after this, the run_state has the new failure message added. Later on, the worker checks in and sends those runs to the server (see here).

Now, a bundle's terminal state is set here in the rest server. For the terminal state to be READY, both the failure_message and the exit_code would have to be None. Therefore, that bundle did not have a failure_message when its status was sent up to the rest server -- even though it must have had a failure message at some point, since one was logged while it was transitioning to finalizing. (Note: before doing this, the failure_message and exit_code are added to the bundle metadata in the transition_bundle_finalizing function here.)
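
To make that logic concrete, the terminal-state decision boils down to roughly the following (a paraphrased sketch based on the behavior described above, not the actual rest-server code; the function name is made up):

    # Rough sketch of the terminal-state decision described above.
    # NOT the actual rest-server code; names are illustrative.
    def terminal_state(failure_message, exit_code):
        if failure_message is None and exit_code is None:
            return "ready"
        return "failed"  # or "killed", depending on the failure message

    # The puzzle: the worker logged a failure message while finalizing,
    # yet the server computed READY, which means both fields were None
    # by the time the run was checked in.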

So, what is happening here? Currently, I'm not sure. In the PR that fixes the tests, I've added logging of worker_run.as_dict so that the next time a kill failure occurs we can see exactly what the worker sent up, but I'm genuinely unsure what's happening here. We could also try setting exclude_final_states to True when calling wait_until_state and hope that the bundle eventually reaches the KILLED state, but that makes me uncomfortable.
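
For reference, the exclude_final_states idea amounts to a polling loop roughly like this (an illustrative sketch with made-up names and semantics, not the helper from the CodaLab test suite):

    import time

    def wait_until_state(get_state, target, timeout=300, exclude_final_states=False,
                         final_states=("ready", "failed", "killed")):
        # Illustrative only: with exclude_final_states=True we keep polling even
        # if the bundle briefly reports some other final state, hoping it
        # eventually flips to `target` (e.g. KILLED) before the timeout.
        deadline = time.time() + timeout
        while time.time() < deadline:
            state = get_state()
            if state == target:
                return state
            if not exclude_final_states and state in final_states:
                raise AssertionError("reached final state %r, expected %r" % (state, target))
            time.sleep(1)
        raise TimeoutError("bundle never reached %r" % (target,))

The discomfort is that this can mask the real bug: the test would pass by waiting out a transient wrong final state rather than explaining why the failure message was lost.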

@AndrewJGaut (Contributor, Author) commented Apr 29, 2023

Make

The test that's failing in the example given is this one here (uploading local files to Azure Blob storage). The error is azure.core.exceptions.ResourceNotFoundError: Operation returned an invalid status 'The specified blob does not exist.' ErrorCode:BlobNotFound, which occurs because of the following error: Not found: azfs://devstoreaccount1/bundles/0x5982329733ef43c8aadfa08b21a3a29d/contents.gz. Interestingly, that's actually the UUID of the make bundle; in the bundle manager logs, we see these lines:

2023-04-05 22:28:00,662 Staging 0x5982329733ef43c8aadfa08b21a3a29d
2023-04-05 22:28:00,677 Making bundle 0x5982329733ef43c8aadfa08b21a3a29d
2023-04-05 22:28:01,201 Failing bundle 0x5982329733ef43c8aadfa08b21a3a29d: [Errno 2] Not found: azfs://devstoreaccount1/bundles/0x5982329733ef43c8aadfa08b21a3a29d/contents.gz

My hypothesis is this: for the 'make' command in the bundle CLI, we see that the make bundle is first created -- meaning that the rest server will add it to the database as a MAKE bundle -- and then the bundle location for it is added slightly later. My thinking right now is that there's a race condition wherein the bundle-manager starts making the bundle in the interval between when it's added to the database and when the bundle location is added.
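
To make the suspected race concrete, here is a small self-contained simulation of the ordering (illustrative only; the data structures and names below are made up and are not the CodaLab code): the client creates the bundle record first and registers its location a moment later, while the bundle manager polls for staged make bundles and can pick one up in that window.

    import threading
    import time

    bundles = {}    # uuid -> state; stands in for the bundle table
    locations = {}  # uuid -> storage URL; stands in for the bundle-location table
    lock = threading.Lock()

    def client_make(uuid):
        # Step 1: create the MAKE bundle record (what the CLI does first).
        with lock:
            bundles[uuid] = "staged"
        time.sleep(0.05)  # extra round trip before the location is registered
        # Step 2: register the bundle location slightly later.
        with lock:
            locations[uuid] = "azfs://devstoreaccount1/bundles/%s/contents.gz" % uuid

    def bundle_manager():
        # Polls for staged make bundles and starts making them immediately.
        for _ in range(50):
            with lock:
                for uuid, state in bundles.items():
                    if state == "staged":
                        bundles[uuid] = "making"
                        if uuid in locations:
                            print("Making bundle %s at %s" % (uuid, locations[uuid]))
                        else:
                            print("Failing bundle %s: Not found (no location registered yet)" % uuid)
            time.sleep(0.01)

    threading.Thread(target=bundle_manager, daemon=True).start()
    client_make("0x5982329733ef43c8aadfa08b21a3a29d")
    time.sleep(0.5)

If this is the right diagnosis, one possible direction for a fix is to register the location before (or atomically with) the bundle record, or to have the bundle manager skip make bundles whose location hasn't been registered yet.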

For the fix, maybe Jiani and Ashwin can help out since they worked on this portion of the CLI.

@AndrewJGaut (Contributor, Author) commented Apr 30, 2023

Resources

The two linked CI failures here and here are actually different failures within resources, which is interesting.

For the first failure, it looks like the bundle with the command python -c "import urllib.request; urllib.request.urlopen('https://www.google.com/').read()" failed, but there is no failure message (failure message was logged as None in the worker logs) even though it shows Exit Code 1. Thus, this one will be hard to debug unless I can replicate it locally, which I haven't yet been able to do.
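
One quick way to try reproducing this outside CodaLab (just a sketch; it only exercises the command itself, not the worker) is to run the same command in a loop and dump stderr for any nonzero exit code, since the CI logs didn't surface it:

    import subprocess

    CMD = """python -c "import urllib.request; urllib.request.urlopen('https://www.google.com/').read()" """

    for i in range(50):
        proc = subprocess.run(CMD, shell=True, capture_output=True, text=True)
        if proc.returncode != 0:
            # Print the exit code and stderr, which the CI worker logs did not show.
            print("run %d: exit code %d\n%s" % (i, proc.returncode, proc.stderr))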

Not sure what happened with the second one; the failure message is Error. Memory limit 10m exceeded., and it occurs when we run perl main.pl 1 1 5, where main.pl is the stress_test.pl file (which is uploaded here in the CLI test). I'm not sure why it fails for Kubernetes in this case; the args 1 1 5 indicate that only 1 MB of RAM should be used, yet it apparently goes over the memory limit. Perhaps @epicfaace has an idea here? This could be related to Kubernetes.

@epicfaace (Member)

Maybe for the first one we should try not connecting to Google / a live site?

@epicfaace (Member)

> the args 1 1 5 indicate that only 1 MB of RAM should be used, yet it apparently goes over the memory limit. Perhaps @epicfaace has an idea here? This could be related to Kubernetes.

Maybe it's not allocating 10 MB to the kubernetes pod...

@AndrewJGaut (Contributor, Author) commented May 1, 2023

> Maybe for the first one we should try not connecting to Google / a live site?

I think it's intentionally connecting to a live site because it's testing network access? The lines of code for that test are here:

    # Test network access
    REQUEST_CMD = """python -c "import urllib.request; urllib.request.urlopen('https://www.google.com').read()" """
    # Network access is set to true by default
    wait(_run_command([cl, 'run', REQUEST_CMD], request_memory="10m"), 0)
    # --request-network should behave the same as above
    wait(_run_command([cl, 'run', '--request-network', REQUEST_CMD], request_memory="10m"), 0)

> Maybe it's not allocating 10 MB to the kubernetes pod...

Yeah, that could be...

@epicfaace (Member)

> I think it's intentionally connecting to a live site because it's testing network access?

Yeah but is it possible that Google servers are doing something weird? Maybe try connecting to another server like https://en.wikipedia.org/?

epicfaace added a commit that referenced this issue May 1, 2023
Related to #4433 -- let's see if this fixes flakiness.
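
Presumably the commit just swaps the URL in REQUEST_CMD, along these lines (a sketch mirroring the test snippet quoted above; the exact diff is in the linked commit):

    # Test network access against Wikipedia instead of Google (presumed change)
    REQUEST_CMD = """python -c "import urllib.request; urllib.request.urlopen('https://en.wikipedia.org/').read()" """
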
@epicfaace (Member)

Testing in #4457

mergify bot added a commit that referenced this issue May 11, 2023
* Test - change google to wikipedia for network request

Related to #4433 -- let's see if this fixes flakiness.

* Update test_cli.py

---------

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>