[Bug] ray monitor reports negative usage #19237
Comments
We made a fix for this issue in master. Is it possible for you to verify that?
Related PR: #19138
We can include the fix in 1.7.1.
Cc @sven1977
Do I have to install nightly (Ray 2.0), or is there a simpler procedure?
Testing it with the nightly wheel should be sufficient!
Just installed the nightly on all nodes. Other observations:
Hmm, interesting. I will investigate this next week. Btw, can you provide a self-contained script that I can just copy, paste, and run?
Took the cluster down and up again (with --no-config-cache, as before the ray nightly installation) and cannot reproduce anymore. Will report if I can reproduce with a minimal script. Observations:
Hmm, there might be some flaky conditions. I will make one more PR to minimize the race conditions of placement group removal, but let me know if you find a repro! I will prioritize fixing them!
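The kind of race described here can be illustrated with a minimal, hypothetical sketch. This is not Ray's actual accounting code; the `ResourcePool` class and its methods are invented for illustration. The point is that if two code paths (say, cleanup-on-exit and explicit removal) both return the same placement group reservation to the pool, the available-resource counter overshoots its total and the reported usage goes negative:

```python
# Hypothetical model of the double-release race (NOT Ray's real code).
class ResourcePool:
    def __init__(self, total_cpus):
        self.total = total_cpus
        self.available = total_cpus

    def reserve(self, cpus):
        # A placement group takes CPUs out of the pool.
        self.available -= cpus

    def release(self, cpus):
        # Bug: no guard against releasing the same reservation twice.
        self.available += cpus


pool = ResourcePool(total_cpus=52)
pool.reserve(2)  # placement group created: 50/52 free

# Race: two code paths both release the same 2-CPU reservation.
pool.release(2)
pool.release(2)

used = pool.total - pool.available
print(f"Usage: {used}/{pool.total} CPU")  # Usage: -2/52 CPU
```

A guard such as tracking which reservations are still outstanding (and ignoring a second release) would keep the counter consistent; minimizing the window in which both paths can run is the race-condition fix referred to above.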
In Ray 1.7.1, `ray monitor` shows -61/44, and I cannot run any script. What helps is to restart the cluster after (almost) every run. This script randomly causes negative values in the `ray monitor` usage:

```python
import os
import random
import time

os.environ["TUNE_DISABLE_AUTO_CALLBACK_LOGGERS"] = "1"  # https://github.com/ray-project/ray/issues/18903
os.environ["TUNE_DISABLE_AUTO_CALLBACK_SYNCER"] = "1"  # https://github.com/ray-project/ray/issues/18903
os.environ["TUNE_RESULT_BUFFER_LENGTH"] = "0"

import numpy as np

import ray
from ray import tune
from ray.tune.suggest import optuna
from ray.tune.suggest import basic_variant


def evaluation_fn():
    time.sleep(1)
    return random.randint(1, 10_000)


def easy_objective(config, data):
    intermediate_score = evaluation_fn()
    tune.report(mean_loss=intermediate_score)


if __name__ == "__main__":
    ray.init(address="auto", _redis_password="5241590000000000")
    df = np.zeros(10_000_000)
    search_optuna = optuna.OptunaSearch()
    search_basic = basic_variant.BasicVariantGenerator()
    analysis = tune.run(
        tune.with_parameters(easy_objective, data=df),
        name="test",
        metric="mean_loss",
        mode="max",
        search_alg=search_basic,
        num_samples=-1,
        config={
            # A random function
            "alpha": tune.sample_from(lambda _: np.random.uniform(100)),
            # Use the `spec.config` namespace to access other hyperparameters
            "beta": tune.sample_from(lambda spec: spec.config.alpha * np.random.normal()),
        },
        reuse_actors=True,
        fail_fast=True,
        verbose=2,
    )
```
Ah, the fix was probably not included in 1.7.1... :( Maybe we should release 1.7.2 with this fix included. Just to make sure: you haven't seen this issue on master, right?
@rkooo567 I didn't test nightly. Is the master branch stable (can I always use nightly instead of waiting for releases)?
We definitely recommend you use 1.7.1. I was asking because the fix was merged in master and we forgot to cherry-pick the commit onto 1.7.1, but I wanted to make sure the fix commit actually fixes the issue.
Btw, we will have a release for 1.8 very soon. We can revisit this if it still happens on 1.8.
It hasn't happened again yet on Ray 1.7.1. Will reopen if it does. Thanks for the quick responses!
Search before asking
Ray Component
Ray Core, Monitoring & Debugging
What happened + What you expected to happen
Started a script, left it running for a couple of hours, interrupted it with Ctrl+C; the script exits, and `ray monitor` shows negative usage:
```
Usage:
 -2.0/52.0 CPU (0.0 used of 2.0 reserved in placement groups)
 0.0/2.0 GPU
 0.0/1.0 accelerator_type:G
 0.0/1.0 accelerator_type:GT
 0.00/103.633 GiB memory
 0.00/48.406 GiB object_store_memory
Demands:
 (no resource demands)
```
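One way to flag this broken state programmatically is to parse the usage line and check for negative values. A hedged sketch follows: the `<used>/<total> CPU` format is taken from the output in this report, and the `cpu_usage` helper is hypothetical, not part of Ray's API.

```python
import re


def cpu_usage(monitor_line):
    """Extract (used, total) CPU from a monitor usage line.

    Assumes the '<used>/<total> CPU' format shown in this issue.
    """
    m = re.search(r"(-?\d+(?:\.\d+)?)/(-?\d+(?:\.\d+)?)\s+CPU", monitor_line)
    if m is None:
        raise ValueError("no CPU usage found in line")
    return float(m.group(1)), float(m.group(2))


line = "Usage: -2.0/52.0 CPU (0.0 used of 2.0 reserved in placement groups)"
used, total = cpu_usage(line)
print(used, total)  # -2.0 52.0
print(used < 0)     # True: the accounting has gone negative
```

Any `used < 0` or `used > total` reading indicates the resource bookkeeping is inconsistent, which is the symptom reported here.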
Reproduction script
Anything else
ray 1.7.0
Are you willing to submit a PR?