[air] Fix test_tune_torch_get_device_gpu race condition #35004
Conversation
Thanks for the fix.
Just a random suggestion.
Feel free to merge!
```
run_config=RunConfig(
    # Use a unique name to avoid using the same
    # experiment directory
    name=f"test_tune_torch_get_device_gpu_{uuid.uuid4()}"
```
Maybe simply `os.getpid()` will do.
I'll leave it with the uuid for now, it's just a test anyway :-)
Super nice explanation!
I think we should do this if the user doesn't provide a name. Make sure each exp dir is unique:
Please note that we may want to detect if two parallel runs are using the same experiment directory, as this will always lead to conflicts.
Because I think this can happen even without multi-tenancy. What if a user has 2 separate clusters writing to the same cloud storage, and they automate running on each cluster to happen at the same time? (a few milliseconds apart)
Good point! And yes, agreed - though it will be even harder to detect that for cloud storage (time-based cloud lock?). Let's put this on our P2 backlog though.
Why are these changes needed?
The `test_tune_torch_get_device_gpu` test is flaky. Recently, the flakiness has increased after switching to the new execution backend (presumably because of speedups in experiment start).

Due to the way the test is constructed, it keeps a Ray cluster alive. This then leads later tests in the same test suite to fail, as they try to re-initialize a Ray cluster.
This PR fixes the underlying cause of the race condition and implements a mitigation.
Background
There is a race condition in `test_tune_torch_get_device_gpu`. The test starts three train runs in parallel, i.e. it utilizes multi-tenancy (which is not officially supported/endorsed).

When the timing is right (or, I guess, wrong), the runs can get the same experiment directory. The experiment directory is just a name with a date suffix (whole-second granularity), so the likelihood of conflicts is relatively high.
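To make the collision concrete, here is a small sketch of why runs started within the same second clash. The exact naming scheme Ray Tune uses may differ; the function below only illustrates the whole-second granularity problem:

```python
# Illustration only (not Ray Tune's actual naming code): a date suffix
# with whole-second granularity means that runs started within the same
# second end up with the same directory name.
import datetime


def experiment_dir_name(prefix: str) -> str:
    suffix = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    return f"{prefix}_{suffix}"


# Two runs launched a few milliseconds apart:
first = experiment_dir_name("test_tune_torch_get_device_gpu")
second = experiment_dir_name("test_tune_torch_get_device_gpu")
print(first == second)  # almost always True
```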
Then, when the timing is right again, the runs may try to save an experiment checkpoint at the same time. The experiment checkpointing works like this:
https://github.com/ray-project/ray/blob/master/python/ray/tune/execution/trial_runner.py#L369-L375
If this happens at the same time, one run will call `os.replace`, and the next run will call it right after. But since the first run has already moved the shared file away, the second run fails with an error.

Further, the test uses a context manager to start and stop the cluster. Because an error is raised, the context is not fully exited and the Ray cluster remains partly alive, as running `ray status` shows. I believe this is an error in itself and I will raise a separate issue for it. However, this is the reason why the failed state carries over to subsequent tests.
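To make the failure mode concrete, here is a minimal, self-contained sketch of the `os.replace` race described above. This is not the actual Tune code; the paths and state contents are made up for illustration:

```python
# Sketch of the race: two runs share one experiment directory and both
# checkpoint via "write a temp file, then os.replace it into place".
import json
import os
import tempfile

experiment_dir = tempfile.mkdtemp()
tmp_path = os.path.join(experiment_dir, ".tmp_checkpoint")
final_path = os.path.join(experiment_dir, "experiment_state.json")


def write_tmp(state: dict) -> None:
    with open(tmp_path, "w") as f:
        json.dump(state, f)


# Both runs have written the shared temp file...
write_tmp({"runner": "A"})
write_tmp({"runner": "B"})

# ...and then both try to move it into its final location.
os.replace(tmp_path, final_path)      # run A succeeds; tmp_path is gone
try:
    os.replace(tmp_path, final_path)  # run B: the temp file no longer exists
except FileNotFoundError as exc:
    print(f"second run fails: {exc}")
```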
Mitigation
We mitigate the described issues in three ways:
I will open follow-up issues where necessary, but the changes in the PR should make the current master more stable.
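As a sketch of the kind of cleanup that prevents the carried-over failure state (an illustration only, not necessarily one of the three mitigations in this PR), a test can shut the cluster down in a `finally` block so that an exception inside the test body cannot leave the cluster running:

```python
# Sketch only: always tear down the Ray cluster a test started, even if
# the test body raises, so later tests can call ray.init() again.
import ray


def run_test_body() -> None:
    # Hypothetical placeholder for the actual test logic.
    pass


ray.init(num_cpus=2)
try:
    run_test_body()
finally:
    ray.shutdown()
```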
Related issue number
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.