
[air] Fix test_tune_torch_get_device_gpu race condition #35004

Merged 2 commits into ray-project:master on May 3, 2023

Conversation

@krfricke (Contributor) commented May 3, 2023

Why are these changes needed?

The test_tune_torch_get_device_gpu test is flaky. Recently, the flakiness has increased after switching to the new execution backend (presumably because of speedups in experiment start).

Due to the way the test is constructed, it keeps a Ray cluster alive. This then leads later tests in the same test suite to fail, as they try to re-initialize a Ray cluster:

[Screenshot: error output from the failing test, 2023-05-03]

This PR fixes the underlying cause of the race condition and implements a mitigation.

Background

There is a race condition in test_tune_torch_get_device_gpu. The test starts three train runs in parallel - i.e. it utilizes multi-tenancy (which is not officially supported/endorsed).

When the timing is right (or, I guess, wrong), the runs can get the same experiment directory. The experiment directory is just a name with a date suffix (whole second granularity), so the likelihood of conflicts is relatively high.
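For illustration only (this is not the actual Tune naming code), a name with a whole-second date suffix might be built like the sketch below, matching the TorchTrainer_2023-05-03_09-06-33 directory visible in the error further down:

    from datetime import datetime

    # Illustrative sketch, not the actual Tune code: a run name made of a
    # trainer prefix plus a date suffix with whole-second granularity, e.g.
    # "TorchTrainer_2023-05-03_09-06-33". Two runs started within the same
    # second get identical names and therefore share an experiment directory.
    name = f"TorchTrainer_{datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}"
    print(name)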

Then, when the timing is right again, the runs may try to save an experiment checkpoint at the same time. Experiment checkpointing works like this:

https://github.com/ray-project/ray/blob/master/python/ray/tune/execution/trial_runner.py#L369-L375

        with open(tmp_file_name, "w") as f:
            json.dump(runner_state, f, indent=2, cls=TuneFunctionEncoder)

        os.replace(
            tmp_file_name,
            os.path.join(experiment_dir, self.experiment_state_file_name),
        )

If this happens at the same time, one run will call os.replace, and the other run will call it right after. But since the first run already moved (and thereby removed) the shared temporary file, the second call fails with:

FileNotFoundError: [Errno 2] No such file or directory: '/root/.cache/bazel/_bazel_root/5fe90af4e7d1ed9fcf52f59e39e126f5/execroot/com_github_ray_project_ray/_tmp/220cfcce86786bd11d4235639cbf0f53/TorchTrainer_2023-05-03_09-06-33/.tmp_experiment_state' -> '/root/.cache/bazel/_bazel_root/5fe90af4e7d1ed9fcf52f59e39e126f5/execroot/com_github_ray_project_ray/_tmp/220cfcce86786bd11d4235639cbf0f53/TorchTrainer_2023-05-03_09-06-33/experiment_state-2023-05-03_09-06-33.json'
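A minimal, self-contained sketch of this failure mode (made-up paths and state, not Ray code):

    import json
    import os
    import tempfile
    import threading

    # Two "runs" that share the same experiment directory also share the same
    # temporary file name for the experiment state.
    experiment_dir = tempfile.mkdtemp()
    tmp_file_name = os.path.join(experiment_dir, ".tmp_experiment_state")
    final_file_name = os.path.join(experiment_dir, "experiment_state.json")


    def save_experiment_state(run_id: int) -> None:
        with open(tmp_file_name, "w") as f:
            json.dump({"run": run_id}, f)
        # With the right (wrong) timing, the slower run reaches this line after
        # the faster run has already moved the shared temporary file away, and
        # os.replace raises FileNotFoundError.
        os.replace(tmp_file_name, final_file_name)


    threads = [threading.Thread(target=save_experiment_state, args=(i,)) for i in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()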

Further, the test uses a context manager to start and stop the cluster. But because an error is raised, the context is not fully exited and the Ray cluster remains partially alive - running ray status yields:

    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused"
	debug_error_string = "UNKNOWN:Failed to pick subchannel {created_time:"2023-05-03T14:35:01.329305+01:00", children:[UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused {created_time:"2023-05-03T14:35:01.329302+01:00", grpc_status:14}]}"
>
Ray cluster is not found at 127.0.0.1:63032

I believe this is a bug and will raise a separate issue for it. However, this is the reason why the failed state carries over to subsequent tests.

Mitigation

We mitigate the described issues in three ways:

  1. We use a unique experiment name for each parallel run. The runs won't conflict anymore and should succeed. Please note that we may want to detect if two parallel runs are using the same experiment directory, as this will always lead to conflicts.
  2. We use a unique temporary filename for writing the experiment checkpoint state (a sketch of the idea follows this list).
  3. We capture errors in the test and raise them only after the context manager has been exited and the Ray cluster has been shut down (also sketched below).
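
A sketch of the second point, assuming a uuid-based suffix for the temporary filename (illustrative only, not the exact change in the PR):

    import json
    import os
    import uuid


    def save_experiment_state(experiment_dir: str, runner_state: dict) -> None:
        # Sketch only: give each writer its own temporary file so that
        # concurrent runs never move the same file out from under each other.
        # os.replace still swaps in the shared target file atomically.
        tmp_file_name = os.path.join(
            experiment_dir, f".tmp_experiment_state_{uuid.uuid4().hex}"
        )
        with open(tmp_file_name, "w") as f:
            json.dump(runner_state, f, indent=2)
        os.replace(
            tmp_file_name,
            os.path.join(experiment_dir, "experiment_state.json"),
        )

And a sketch of the third point, where start_ray_cluster and run_parallel_trainers are hypothetical stand-ins for the fixtures used in the actual test:

    from contextlib import contextmanager


    @contextmanager
    def start_ray_cluster():
        # Hypothetical stand-in for the test's cluster fixture. The point is
        # that shutdown must run even if a training run fails.
        try:
            yield
        finally:
            pass  # the cluster shutdown would happen here


    def run_parallel_trainers():
        # Hypothetical stand-in for the three parallel train runs.
        raise RuntimeError("simulated training failure")


    errors = []
    with start_ray_cluster():
        try:
            run_parallel_trainers()
        except Exception as exc:
            # Capture instead of re-raising so the context manager exits
            # cleanly and the cluster is shut down.
            errors.append(exc)

    # The cluster is down at this point; now surface any captured failure.
    if errors:
        raise errors[0]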

I will open follow-up issues where necessary, but the changes in the PR should make the current master more stable.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Kai Fricke <[email protected]>
@gjoliver (Member) left a comment


Thanks for the fix.
Just a random suggestion.
Feel free to merge!

run_config=RunConfig(
    # Use a unique name to avoid using the same
    # experiment directory
    name=f"test_tune_torch_get_device_gpu_{uuid.uuid4()}",
)
Member
Member


maybe simply os.getpid() will do.

Contributor Author
Contributor Author


I'll leave it with the uuid for now, it's just a test anyway :-)

@justinvyu (Contributor) left a comment


Super nice explanation!

I think we should do this if the user doesn't provide a name. Make sure each exp dir is unique:

Please note that we may want to detect if two parallel runs are using the same experiment directory, as this will always lead to conflicts.

Because I think this can happen even without multi-tenancy. What if a user has 2 separate clusters writing to the same cloud storage, and they automate running on each cluster to happen at the same time? (a few milliseconds apart)

@krfricke krfricke merged commit dc6fd82 into ray-project:master May 3, 2023
@krfricke krfricke deleted the air/fix-torch-trainer-test branch May 3, 2023 22:14
@krfricke (Contributor, Author) commented May 3, 2023

Because I think this can happen even without multi-tenancy. What if a user has 2 separate clusters writing to the same cloud storage, and they automate running on each cluster to happen at the same time? (a few milliseconds apart)

Good point! And yes, agreed - though it will be even harder to detect that for cloud storage (time-based cloud lock?). Let's put this on our P2 backlog.

architkulkarni pushed a commit to architkulkarni/ray that referenced this pull request May 16, 2023
…t#35004)

The `test_tune_torch_get_device_gpu` test is flaky. Recently, the flakiness has been increased after switching to the new execution backend (presumably because of speedups in experiment start).

Due to the way the test is constructed, it keeps a Ray cluster alive. This then leads later tests in the same test suite to fail, as they try to re-initialize a Ray cluster.

This PR fixes the underlying cause of the race condition and implements a mitigation.

Signed-off-by: Kai Fricke <[email protected]>