
[ray] Modin on ray causes ray.tune to hang #3479

Closed
jmakov opened this issue Sep 24, 2021 · 5 comments
Labels
bug 🦗 Something isn't working

Comments

@jmakov

jmakov commented Sep 24, 2021

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 21.04
  • Modin version (modin.__version__): 0.10.2
  • Python version: 3.7

Describe the problem

tune.run() starts working on a local cluster. After a couple of minutes, fewer and fewer CPUs are used. Even after CPU utilization drops to zero, tune.run() still hasn't finished. The expected behavior is that all cluster resources stay utilized until tune.run() finishes.

As discussed with the Ray devs, it seems to be a Modin issue :): ray-project/ray#18808. @Yard1: "Each Ray trial takes up 1 CPU resource. Modin operations inside those trials also take up 1 resource each. Because all resources are taken up by trials, the Modin operations cannot progress as they are waiting for resources to become free - which will never happen because the trials are waiting on the Modin operations to finish. Classic deadlock. And this is also why limiting concurrency works, as it allows some CPUs to be free and thus usable by Modin."
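The deadlock described above is just resource accounting. The sketch below is a hypothetical, Ray-free illustration (TOTAL_CPUS, free_cpus, and the trial counts are made-up names, not Ray APIs): when Tune's trials reserve every CPU slot, the Modin tasks they spawn can never be scheduled.

```python
TOTAL_CPUS = 8  # assumed cluster size for illustration

def free_cpus(running_trials):
    """CPU slots left for Modin after each Tune trial reserves one."""
    return TOTAL_CPUS - running_trials

trials = TOTAL_CPUS                  # Tune fills every slot with a trial...
deadlocked = free_cpus(trials) == 0  # ...so Modin tasks can never start.

trials = TOTAL_CPUS - 1              # cap concurrency one below the CPU count...
unblocked = free_cpus(trials) >= 1   # ...and Modin has a slot to make progress.
```

This is why the `max_concurrent` workaround mentioned in the linked discussion helps: it keeps at least one CPU unreserved.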

Additional info:
ray monitor cluster.yaml shows that all CPUs are in use.

Source code / logs

import modin.pandas as pd 
import ray
from ray import tune
from ray.tune.suggest.basic_variant import BasicVariantGenerator


ray.init(address='auto', _redis_password='xxx')


def easy_objective(config, data):
    data_df = data[0]

    # Here be dragons. If any of the lines below is included, Tune hangs.
    score = int(pd.DataFrame(pd.Series(data_df.test), columns=["test"]).explode(["test"]).test.sum())
    # pd.DataFrame(pd.Series(data_df.test), columns=["test"]).explode(["test"])
    # pd.DataFrame(pd.Series(data_df.test), columns=["test"]).sum()

    tune.report(score=score)


tune.run(
    # df is a pre-existing modin.pandas DataFrame with a "test" column
    tune.with_parameters(easy_objective, data=[df]),
    name="test_study",
    time_budget_s=3600*24*3,
    num_samples=-1,
    verbose=3,
    fail_fast=True,
    config={
            "steps": 100,
            "width": tune.uniform(0, 20),
            "height": tune.uniform(-100, 100),
            "activation": tune.grid_search(["relu", "tanh"])
        },
    metric="score", 
    mode="max",
# but works with this enabled
#    search_alg=BasicVariantGenerator(max_concurrent=CLUSTER_AVAILABLE_LOGICAL_CPUS - 1),  #N.B. "-1", else hangs
)
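The commented-out workaround above caps trial concurrency one below the cluster's CPU count. A minimal sketch of that cap, assuming CLUSTER_AVAILABLE_LOGICAL_CPUS is a value you obtain yourself (e.g. from the cluster's resource report; the helper name here is made up for illustration):

```python
def max_concurrent_trials(cluster_cpus: int) -> int:
    """Cap Tune concurrency so at least one CPU stays free for Modin tasks."""
    # Reserve one CPU for Modin; otherwise Tune's trials occupy every slot
    # and the Modin operations inside them deadlock.
    return max(1, cluster_cpus - 1)

# e.g. pass max_concurrent=max_concurrent_trials(CLUSTER_AVAILABLE_LOGICAL_CPUS)
# to BasicVariantGenerator, as in the commented-out line above.
```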
@jmakov
Author

jmakov commented Sep 24, 2021

Upon further discussion with @Yard1: resources_per_trial={"cpu": 0, "extra_cpu": 1} also works as a workaround.
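For reference, a hedged sketch of how that workaround would be wired into the repro above (the tune.run call is commented out because it needs a live Ray cluster; easy_objective and df are the names from the earlier snippet):

```python
# Request 0 "cpu" and 1 "extra_cpu" per trial, so Tune's bookkeeping
# does not reserve the CPU slots that Modin's own tasks need.
resources_per_trial = {"cpu": 0, "extra_cpu": 1}

# tune.run(
#     tune.with_parameters(easy_objective, data=[df]),
#     resources_per_trial=resources_per_trial,
#     ...
# )
```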

@devin-petersohn
Collaborator

Hi @jmakov, thanks for posting! This is a tricky one. Our Ray workers each occupy 1 CPU for Ray to schedule properly, otherwise Ray's scheduler will inefficiently place Modin tasks and potentially oversubscribe the system. Happy to discuss alternative ways of efficiently sharing resources between other Ray libraries with the Ray team, @Yard1 @richardliaw @simon-mo. Do you folks think we need to designate a custom resource for Modin, and how can the scheduler make sure there are enough resources for both Modin and Tune?

As discussed with ray devs, it seems to be a modin issue :):

I love my Ray developer friends, they are always so happy to blame me 😄. Jokes aside, I'm not sure it's as simple as one library or another's fault: as I mention above it's something we have to coordinate with them.

@Yard1

Yard1 commented Sep 25, 2021

In the short term, we should strive to improve documentation to ensure that users are aware that libraries like Modin use the same resource pool as Tune. I'll be getting out a PR to improve that in Tune on Monday - would be great if a similar mention could be put into Modin's docs!

@devin-petersohn
Collaborator

I'll be getting out a PR to improve that in Tune on Monday - would be great if a similar mention could be put into Modin's docs!

Great, let me know! We can add something to the Modin docs as well.

I think it would also be good to have documentation on how external libraries can interoperate with Ray libraries. Otherwise we are just creating libraries in a vacuum and they won't be generally useful to the Ray community. Does a document like this exist, and/or what are the best practices here?

@jmakov
Author

jmakov commented Oct 1, 2021

I'll close the issue since we now have better docs in Ray.
