
[air] Fix test_tune_torch_get_device_gpu race condition #35004

Merged 2 commits into ray-project:master on May 3, 2023

Conversation

@krfricke (Contributor) commented May 3, 2023

Why are these changes needed?

The test_tune_torch_get_device_gpu test is flaky. Recently, the flakiness has increased after switching to the new execution backend (presumably because of speedups in experiment start).

Due to the way the test is constructed, it keeps a Ray cluster alive. This then leads later tests in the same test suite to fail, as they try to re-initialize a Ray cluster:

[Screenshot: error output from the failing test, 2023-05-03]

This PR fixes the underlying cause of the race condition and implements a mitigation.

Background

There is a race condition in test_tune_torch_get_device_gpu. The test starts three train runs in parallel - i.e. it utilizes multi-tenancy (which is not officially supported/endorsed).

When the timing is right (or, I guess, wrong), the runs can get the same experiment directory. The experiment directory is just a name with a date suffix (whole second granularity), so the likelihood of conflicts is relatively high.
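For illustration only (this is not the actual Tune naming code), a name with a whole-second date suffix might be built like the sketch below, matching the TorchTrainer_2023-05-03_09-06-33 directory visible in the error further down:

    from datetime import datetime

    # Illustrative sketch, not the actual Tune code: a run name made of a
    # trainer prefix plus a date suffix with whole-second granularity, e.g.
    # "TorchTrainer_2023-05-03_09-06-33". Two runs started within the same
    # second get identical names and therefore share an experiment directory.
    name = f"TorchTrainer_{datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}"
    print(name)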

Then, when the timing is right again, the runs may try to save an experiment checkpoint at the same time. Experiment checkpointing works like this:

https://github.com/ray-project/ray/blob/master/python/ray/tune/execution/trial_runner.py#L369-L375

        with open(tmp_file_name, "w") as f:
            json.dump(runner_state, f, indent=2, cls=TuneFunctionEncoder)

        os.replace(
            tmp_file_name,
            os.path.join(experiment_dir, self.experiment_state_file_name),
        )

If this happens at the same time, one run will call os.replace, and the other run will call it right after. But since the first run already moved (and thereby removed) the shared temporary file, the second call fails with:

FileNotFoundError: [Errno 2] No such file or directory: '/root/.cache/bazel/_bazel_root/5fe90af4e7d1ed9fcf52f59e39e126f5/execroot/com_github_ray_project_ray/_tmp/220cfcce86786bd11d4235639cbf0f53/TorchTrainer_2023-05-03_09-06-33/.tmp_experiment_state' -> '/root/.cache/bazel/_bazel_root/5fe90af4e7d1ed9fcf52f59e39e126f5/execroot/com_github_ray_project_ray/_tmp/220cfcce86786bd11d4235639cbf0f53/TorchTrainer_2023-05-03_09-06-33/experiment_state-2023-05-03_09-06-33.json'
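A minimal, self-contained sketch of this failure mode (made-up paths and state, not Ray code):

    import json
    import os
    import tempfile
    import threading

    # Two "runs" that share the same experiment directory also share the same
    # temporary file name for the experiment state.
    experiment_dir = tempfile.mkdtemp()
    tmp_file_name = os.path.join(experiment_dir, ".tmp_experiment_state")
    final_file_name = os.path.join(experiment_dir, "experiment_state.json")


    def save_experiment_state(run_id: int) -> None:
        with open(tmp_file_name, "w") as f:
            json.dump({"run": run_id}, f)
        # With the right (wrong) timing, the slower run reaches this line after
        # the faster run has already moved the shared temporary file away, and
        # os.replace raises FileNotFoundError.
        os.replace(tmp_file_name, final_file_name)


    threads = [threading.Thread(target=save_experiment_state, args=(i,)) for i in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()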

Further, the test uses a context manager to start and stop the cluster. But because an error is raised, the context is not fully exited and the Ray cluster remains partially alive - running ray status yields:

    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused"
	debug_error_string = "UNKNOWN:Failed to pick subchannel {created_time:"2023-05-03T14:35:01.329305+01:00", children:[UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused {created_time:"2023-05-03T14:35:01.329302+01:00", grpc_status:14}]}"
>
Ray cluster is not found at 127.0.0.1:63032

I believe this is a bug and will raise a separate issue for it. However, this is the reason why the failed state carries over to subsequent tests.

Mitigation

We mitigate the described issues in three ways:

  1. We use a unique experiment name for each parallel run. The runs won't conflict anymore and should succeed. Please note that we may want to detect if two parallel runs are using the same experiment directory, as this will always lead to conflicts.
  2. We use a unique temporary filename for writing the experiment checkpoint state (a sketch of the idea follows this list).
  3. We capture errors in the test and raise them only after the context manager has been exited and the Ray cluster has been shut down (also sketched below).
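
A sketch of the second point, assuming a uuid-based suffix for the temporary filename (illustrative only, not the exact change in the PR):

    import json
    import os
    import uuid


    def save_experiment_state(experiment_dir: str, runner_state: dict) -> None:
        # Sketch only: give each writer its own temporary file so that
        # concurrent runs never move the same file out from under each other.
        # os.replace still swaps in the shared target file atomically.
        tmp_file_name = os.path.join(
            experiment_dir, f".tmp_experiment_state_{uuid.uuid4().hex}"
        )
        with open(tmp_file_name, "w") as f:
            json.dump(runner_state, f, indent=2)
        os.replace(
            tmp_file_name,
            os.path.join(experiment_dir, "experiment_state.json"),
        )

And a sketch of the third point, where start_ray_cluster and run_parallel_trainers are hypothetical stand-ins for the fixtures used in the actual test:

    from contextlib import contextmanager


    @contextmanager
    def start_ray_cluster():
        # Hypothetical stand-in for the test's cluster fixture. The point is
        # that shutdown must run even if a training run fails.
        try:
            yield
        finally:
            pass  # the cluster shutdown would happen here


    def run_parallel_trainers():
        # Hypothetical stand-in for the three parallel train runs.
        raise RuntimeError("simulated training failure")


    errors = []
    with start_ray_cluster():
        try:
            run_parallel_trainers()
        except Exception as exc:
            # Capture instead of re-raising so the context manager exits
            # cleanly and the cluster is shut down.
            errors.append(exc)

    # The cluster is down at this point; now surface any captured failure.
    if errors:
        raise errors[0]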

I will open follow-up issues where necessary, but the changes in the PR should make the current master more stable.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Kai Fricke <[email protected]>
@gjoliver (Member) left a comment


Thanks for the fix.
Just a random suggestion.
Feel free to merge!

run_config=RunConfig(
    # Use a unique name to avoid using the same
    # experiment directory
    name=f"test_tune_torch_get_device_gpu_{uuid.uuid4()}",
)
Member
Member


maybe simply os.getpid() will do.

Contributor Author
Contributor Author


I'll leave it with the uuid for now, it's just a test anyway :-)

@justinvyu (Contributor) left a comment


Super nice explanation!

I think we should do this if the user doesn't provide a name. Make sure each exp dir is unique:

Please note that we may want to detect if two parallel runs are using the same experiment directory, as this will always lead to conflicts.

Because I think this can happen even without multi-tenancy. What if a user has 2 separate clusters writing to the same cloud storage, and they automate running on each cluster to happen at the same time? (a few milliseconds apart)

@krfricke krfricke merged commit dc6fd82 into ray-project:master May 3, 2023
@krfricke krfricke deleted the air/fix-torch-trainer-test branch May 3, 2023 22:14
@krfricke (Contributor, Author) commented May 3, 2023

Because I think this can happen even without multi-tenancy. What if a user has 2 separate clusters writing to the same cloud storage, and they automate running on each cluster to happen at the same time? (a few milliseconds apart)

Good point! And yes, agreed - though it will be even harder to detect that for cloud storage (time-based cloud lock?). Let's put this on our P2 backlog.

architkulkarni pushed a commit to architkulkarni/ray that referenced this pull request May 16, 2023
…t#35004)

The `test_tune_torch_get_device_gpu` test is flaky. Recently, the flakiness has been increased after switching to the new execution backend (presumably because of speedups in experiment start).

Due to the way the test is constructed, it keeps a Ray cluster alive. This then leads later tests in the same test suite to fail, as they try to re-initialize a Ray cluster.

This PR fixes the underlying cause of the race condition and implements a mitigation.

Signed-off-by: Kai Fricke <[email protected]>