Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Dashboard] Fix gRPC GCS healthcheck thread #19360

Merged
merged 5 commits into from
Oct 18, 2021

Conversation

simon-mo
Copy link
Contributor

It turns out that grpc doesn't work well when the process is
multi-threaded. Even though we are careful about creating channels
and making them thread local. There are some hidden global context in
gRPC making it hard to have two async channels in different thread.

See grpc/grpc#25364

Closes #19200

Why are these changes needed?

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

It turns out that grpc doesn't work well when the process is
multi-threaded. Even though we are careful about creating channels
and making them thread local. There are some hidden global context in
gRPC making it hard to have two async channels in different thread.

See grpc/grpc#25364

Closes ray-project#19200
@edoakes
Copy link
Contributor

edoakes commented Oct 13, 2021

@simon-mo what is the testing plan for this? Can we add a release test to avoid regressions in the future?

@simon-mo
Copy link
Contributor Author

@edoakes I digged around, there is actually a test that's passing in master that test this functionality. So the code/logic actually works, it just prints these annoy error messages.

def test_gcs_check_alive(fast_gcs_failure_detection, ray_start_with_dashboard):
assert (wait_until_server_available(ray_start_with_dashboard["webui_url"])
is True)
all_processes = ray.worker._global_node.all_processes
dashboard_info = all_processes[ray_constants.PROCESS_TYPE_DASHBOARD][0]
dashboard_proc = psutil.Process(dashboard_info.process.pid)
gcs_server_info = all_processes[ray_constants.PROCESS_TYPE_GCS_SERVER][0]
gcs_server_proc = psutil.Process(gcs_server_info.process.pid)
assert dashboard_proc.status() in [
psutil.STATUS_RUNNING, psutil.STATUS_SLEEPING, psutil.STATUS_DISK_SLEEP
]
gcs_server_proc.kill()
gcs_server_proc.wait()
# The dashboard exits by os._exit(-1)
assert dashboard_proc.wait(10) == 255

I have manually tested that the warning/errors go away and the functionality still works.

@simon-mo simon-mo merged commit a081579 into ray-project:master Oct 18, 2021
@simon-mo simon-mo mentioned this pull request Oct 18, 2021
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] [Dashboard] Dashboard.log spammy error message
2 participants