[ray] Modin on ray causes ray.tune to hang #3479
Comments
Upon discussing the issue further with @Yard1, what also works is […]
Hi @jmakov, thanks for posting! This is a tricky one. Each of our Ray workers occupies 1 CPU so that Ray can schedule properly; otherwise Ray's scheduler will place Modin tasks inefficiently and potentially oversubscribe the system. Happy to discuss with the Ray team (@Yard1 @richardliaw @simon-mo) alternative ways of efficiently sharing resources between Modin and other Ray libraries. Do you folks think we need to designate a custom resource for Modin, and how can the scheduler make sure there are enough resources for both Modin and Tune?
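For context, Ray already lets a cluster declare custom resources and lets tasks request them, which is the mechanism a dedicated Modin resource would build on. Below is a minimal sketch of that mechanism only; the "modin" resource name and the amounts are hypothetical, not an existing Modin feature.

```python
import ray

# Hypothetical setup: declare a custom "modin" resource on the node so that
# data-processing tasks could be scheduled against it instead of competing
# with Tune trials for CPU slots. The name and quantities are illustrative.
ray.init(resources={"modin": 8})

@ray.remote(num_cpus=0, resources={"modin": 1})
def dataframe_task(x):
    # Consumes one unit of the custom resource and zero CPU slots,
    # leaving the CPU pool free for Tune trials.
    return x * 2

print(ray.get([dataframe_task.remote(i) for i in range(8)]))
```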
I love my Ray developer friends; they are always so happy to blame me 😄. Jokes aside, I'm not sure it's as simple as one library or another's fault: as I mention above, it's something we have to coordinate with them.
In the short term, we should improve the documentation so that users are aware that libraries like Modin use the same resource pool as Tune. I'll be getting a PR out to improve that in Tune on Monday; it would be great if a similar mention could be put into Modin's docs!
Great, let me know! We can add something to the Modin docs as well. I think it would also be good to have documentation on how external libraries can interoperate with Ray libraries. Otherwise we are just creating libraries in a vacuum and they won't be generally useful to the Ray community. Does a document like this exist, and/or what are the best practices here?
I'll close the issue since we now have better docs in Ray.
System information
Modin version (modin.__version__): 0.10.2

Describe the problem
tune.run() starts work on a local cluster. After a couple of minutes, fewer and fewer CPUs are used. Eventually no CPU is utilized, yet tune.run() still hasn't finished. The expected behavior is that all cluster resources stay utilized until tune.run() finishes.
As discussed with the Ray devs, it seems to be a Modin issue :): ray-project/ray#18808. @Yard1: "Each Ray trial takes up 1 CPU resource. Modin operations inside those trials also take up 1 resource each. Because all resources are taken up by trials, the Modin operations cannot progress, as they are waiting for resources to become free - which will never happen because the trials are waiting on the Modin operations to finish. Classic deadlock. And this is also why limiting concurrency works, as it allows some CPUs to stay free and thus usable by Modin."
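To make the failure mode concrete, here is a minimal sketch of the pattern (not the original reproduction script): a Tune trainable that calls Modin inside each trial, together with the concurrency-limit workaround. The max_concurrent_trials argument is assumed to be available (recent Ray versions); the trainable and config are illustrative only.

```python
import modin.pandas as pd
import numpy as np
import ray
from ray import tune

ray.init()  # e.g. an 8-CPU machine

def trainable(config):
    # The Modin operation inside the trial spawns its own Ray tasks,
    # which need free CPUs of their own to make progress.
    df = pd.DataFrame(np.random.rand(100_000, 4))
    score = float(df.sum().sum()) * config["factor"]
    tune.report(score=score)

# Leaving concurrency unlimited lets trials occupy every CPU, so the Modin
# tasks they spawn can never be scheduled -- the deadlock described above.
# Capping concurrent trials keeps some CPUs free for Modin.
tune.run(
    trainable,
    config={"factor": tune.grid_search([1, 2, 3, 4])},
    max_concurrent_trials=2,  # assumes a Ray version that has this option
)
```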
Additional info:
ray monitor cluster.yaml shows that all CPUs are in use.
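The same saturation can also be confirmed programmatically with standard Ray APIs (shown here as a generic check, not output from the original cluster):

```python
import ray

ray.init(address="auto")  # attach to the already running cluster

# While the hang is happening, the "CPU" entry in available_resources()
# stays at (or near) zero even though cluster_resources() reports them all.
print(ray.cluster_resources())
print(ray.available_resources())
```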
Source code / logs