[ray] Modin on ray causes ray.tune to hang #3479
Comments
Upon discussing the issue further with @Yard1, what also works is […]
Hi @jmakov, thanks for posting! This is a tricky one. Each of our Ray workers occupies 1 CPU so that Ray can schedule properly; otherwise Ray's scheduler will place Modin tasks inefficiently and potentially oversubscribe the system. Happy to discuss with the Ray team (@Yard1 @richardliaw @simon-mo) alternative ways of efficiently sharing resources between Modin and other Ray libraries. Do you folks think we need to designate a custom resource for Modin, and how can the scheduler make sure there are enough resources for both Modin and Tune?
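For context, Ray already lets a cluster declare custom resources and lets tasks request them, which is the mechanism a dedicated Modin resource would build on. Below is a minimal sketch of that mechanism only; the "modin" resource name and the amounts are hypothetical, not an existing Modin feature.

```python
import ray

# Hypothetical setup: declare a custom "modin" resource on the node so that
# data-processing tasks could be scheduled against it instead of competing
# with Tune trials for CPU slots. The name and quantities are illustrative.
ray.init(resources={"modin": 8})

@ray.remote(num_cpus=0, resources={"modin": 1})
def dataframe_task(x):
    # Consumes one unit of the custom resource and zero CPU slots,
    # leaving the CPU pool free for Tune trials.
    return x * 2

print(ray.get([dataframe_task.remote(i) for i in range(8)]))
```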
I love my Ray developer friends; they are always so happy to blame me 😄. Jokes aside, I'm not sure it's as simple as one library or another's fault: as I mention above, it's something we have to coordinate with them.
In the short term, we should improve the documentation so that users are aware that libraries like Modin use the same resource pool as Tune. I'll be getting a PR out to improve that in Tune on Monday; it would be great if a similar mention could be put into Modin's docs!
Great, let me know! We can add something to the Modin docs as well. I think it would also be good to have documentation on how external libraries can interoperate with Ray libraries. Otherwise we are just creating libraries in a vacuum and they won't be generally useful to the Ray community. Does a document like this exist, and/or what are the best practices here?
I'll close the issue since we now have better docs in Ray.
System information
Modin version (modin.__version__): 0.10.2

Describe the problem
tune.run() starts work on a local cluster. After a couple of minutes, fewer and fewer CPUs are used. Eventually no CPU is utilized, yet tune.run() still hasn't finished. The expected behavior is that all cluster resources stay utilized until tune.run() finishes.
As discussed with the Ray devs, it seems to be a Modin issue :): ray-project/ray#18808. @Yard1: "Each Ray trial takes up 1 CPU resource. Modin operations inside those trials also take up 1 resource each. Because all resources are taken up by trials, the Modin operations cannot progress, as they are waiting for resources to become free - which will never happen because the trials are waiting on the Modin operations to finish. Classic deadlock. And this is also why limiting concurrency works, as it allows some CPUs to stay free and thus usable by Modin."
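To make the failure mode concrete, here is a minimal sketch of the pattern (not the original reproduction script): a Tune trainable that calls Modin inside each trial, together with the concurrency-limit workaround. The max_concurrent_trials argument is assumed to be available (recent Ray versions); the trainable and config are illustrative only.

```python
import modin.pandas as pd
import numpy as np
import ray
from ray import tune

ray.init()  # e.g. an 8-CPU machine

def trainable(config):
    # The Modin operation inside the trial spawns its own Ray tasks,
    # which need free CPUs of their own to make progress.
    df = pd.DataFrame(np.random.rand(100_000, 4))
    score = float(df.sum().sum()) * config["factor"]
    tune.report(score=score)

# Leaving concurrency unlimited lets trials occupy every CPU, so the Modin
# tasks they spawn can never be scheduled -- the deadlock described above.
# Capping concurrent trials keeps some CPUs free for Modin.
tune.run(
    trainable,
    config={"factor": tune.grid_search([1, 2, 3, 4])},
    max_concurrent_trials=2,  # assumes a Ray version that has this option
)
```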
Additional info:
ray monitor cluster.yaml shows that all CPUs are in use.
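The same saturation can also be confirmed programmatically with standard Ray APIs (shown here as a generic check, not output from the original cluster):

```python
import ray

ray.init(address="auto")  # attach to the already running cluster

# While the hang is happening, the "CPU" entry in available_resources()
# stays at (or near) zero even though cluster_resources() reports them all.
print(ray.cluster_resources())
print(ray.available_resources())
```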
Source code / logs