Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] investigate why Ray hangs with grpcio==1.48.0 #27299

Closed
scv119 opened this issue Jul 30, 2022 · 12 comments · Fixed by #28623
Closed

[Core] investigate why Ray hangs with grpcio==1.48.0 #27299

scv119 opened this issue Jul 30, 2022 · 12 comments · Fixed by #28623
Assignees
Labels
bug Something that is supposed to be working; but isn't core Issues that should be addressed in Ray Core P1 Issue that should be fixed within a few weeks Ray 2.1

Comments

@scv119
Copy link
Contributor

scv119 commented Jul 30, 2022

What happened + What you expected to happen

We need to dive deep down to why ray.init hangs with grpcio==1.48.0.

Bonus point: we need to figure out our strategy moving forward, since grpcio has been caused several issues.
502c3e1
e3051eb
fda3453

Versions / Dependencies

latest ray

Reproduction script

pip install ray
pip install grpcio==1.48.0

python

import ray
ray.init()

This will hang.

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@scv119 scv119 added bug Something that is supposed to be working; but isn't P1 Issue that should be fixed within a few weeks core Issues that should be addressed in Ray Core Ray 2.1 labels Jul 30, 2022
@scv119 scv119 added P0 Issues that should be fixed in short order and removed P1 Issue that should be fixed within a few weeks labels Aug 19, 2022
@cadedaniel cadedaniel assigned cadedaniel and unassigned jjyao Sep 6, 2022
@cadedaniel
Copy link
Member

Repro'd. I see an interesting error message when installing grpcio==1.48.0 on Ubuntu, dumping here to take a look later

WARNING: The candidate selected for download or install is a yanked version: 'grpcio' candidate (version 1.48.0 at https://files.pythonhosted.org/packages/10/82/cfe40a55c28b487fe0d15d9afef68551fe90a7bd4e3b04120321f300e22b/grpcio-1.48.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl#sha256=656c6f6f7b815bca3054780b8cdfa1e4e37cd36c887a48558d00c2cf85f3169
7 (from https://pypi.org/simple/grpcio/) (requires-python:>=3.6))
Reason for being yanked: Deadlock observed in Apache Beam.

@cadedaniel
Copy link
Member

It looks like grpcio yanked 1.48.0 from PyPi, see grpc/grpc#30446 (comment). If I install grpcio 1.48.1, the hang no longer reproduces.

@scv119 @jjyao I think the hang issue is not due to Ray -- it's because grpc shipped a bug causing hangs in multithreaded libraries. Want to brainstorm ways to harden our processes against this in the future? The bug was originally seen and fixed for apache Beam -- the grpc folks suggested they test against the latest grpcio rc in the future to catch these issues. It's not clear to me how that would help us. Also, it seems that the Ray inconsistent dependencies between runtime envs and system is orthogonal to this, how should I look into that?

cadedaniel added a commit to cadedaniel/ray that referenced this issue Sep 6, 2022
Signed-off-by: Cade Daniel <[email protected]>
@jjyao
Copy link
Collaborator

jjyao commented Sep 7, 2022

Can we revert #27244 then given 1.48.0 is yanked?

Since we don't pin down our dependencies so I don't know how we can avoid this kind of things from happening in the future.

@cadedaniel
Copy link
Member

Yes, I think we can widen the range of grpcio versions to include 1.48.1. Do you know why the original Allow grpcio >= 1.48 PR forbade versions in [1.45, 1.48)?

I'll work on this in parallel with current AIR benchmarking work.

@jjyao
Copy link
Collaborator

jjyao commented Sep 7, 2022

cc @rkooo567 @iycheng do you know why given you approved the original PR #26765. Should we still disallow [1.45, 1.48)?

@scv119
Copy link
Contributor Author

scv119 commented Sep 8, 2022

FWIW, [1.45 - 1.48) suffers from spammy logs issue https://github.com/grpc/grpc/commit/e8ca82b9a4374c8f87c2d190509c39055deda42a

@h-vetinari
Copy link

h-vetinari commented Sep 8, 2022

FWIW, [1.45 - 1.48) suffers from spammy logs issue https://github.com/grpc/grpc/commit/e8ca82b9a4374c8f87c2d190509c39055deda42a

I don't think spammy logs are a good enough reason to forbid a version, much less such a big range of them. On the conda-forge side, we try to be very faithful to the restrictions specified by the project, but non-essential restrictions like this just make that work a lot harder.

For context, we need to rebuild packages for newer host dependencies, and grpc (together with abseil + protobuf) are a lot of work to keep consistent across the ecosystem. We cannot seriously maintain more than 2-3 versions of any package at a time, and pinning ray to grpcio <=1.43 means it will not be co-installable with all the other packages (e.g. tensorflow, etc.) that have already moved on (and been rebuilt for newer protobuf 3.21, abseil 20220623, grpc 1.46/1.47, etc.)

@cadedaniel
Copy link
Member

I think you have a good point @h-vetinari. Let me check to see if there were any other reasons (besides the spammy logs) which caused us to limit versions [1.45, 1.48). If there's nothing else then I'll create a PR for people to comment on.

@mwtian
Copy link
Member

mwtian commented Sep 9, 2022

#22518 has most of the context. The extra logs show up in the otherwise sparse stdout as well, which is distracting to users and they opened issues in the Ray repo. But if the restriction for [1.45, 1.48) is removed now, maybe most of the new Ray installations will not get these versions anyway.

@h-vetinari
Copy link

Cool, thanks! So we'll open up the range to <=1.47 for now. If people have issues with the spammy logs, we can patch the grpc within conda-forge to include that fix.

@zhe-thoughts
Copy link
Collaborator

Hi @jjyao @cadedaniel : could you help provide an update on this one (since it's P0)? Thanks for working on this!

@jjyao
Copy link
Collaborator

jjyao commented Sep 19, 2022

@cadedaniel will work on it on Monday. We already found the root cause and know how to fix.

@cadedaniel cadedaniel linked a pull request Sep 19, 2022 that will close this issue
@scv119 scv119 added P1 Issue that should be fixed within a few weeks and removed P0 Issues that should be fixed in short order labels Sep 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something that is supposed to be working; but isn't core Issues that should be addressed in Ray Core P1 Issue that should be fixed within a few weeks Ray 2.1
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants