[Bug] ray monitor reports negative usage #19237
Comments
We made a fix for this issue in master. Is it possible for you to verify that?
Related PR: #19138
We can include the fix in 1.7.1.
Cc @sven1977
Do I have to install nightly (Ray 2.0), or is there a simpler procedure?
Testing it with the nightly wheel should be sufficient!
Just installed the nightly on all nodes. Other observations:
Hmm, interesting. I will investigate this next week. Btw, can you provide a self-contained script that I can just copy, paste, and run?
Took the cluster down and up again (with --no-config-cache, as before the ray nightly installation) and cannot reproduce anymore. Will report if I can reproduce with a minimal script. Observations:
Hmm, there might be some flaky conditions. I will make one more PR to minimize the race conditions of placement group removal, but let me know if you find a repro! I will prioritize fixing them!
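The kind of race described here can be illustrated with a minimal, hypothetical sketch. This is not Ray's actual accounting code; the `ResourcePool` class and its methods are invented for illustration. The point is that if two code paths (say, cleanup-on-exit and explicit removal) both return the same placement group reservation to the pool, the available-resource counter overshoots its total and the reported usage goes negative:

```python
# Hypothetical model of the double-release race (NOT Ray's real code).
class ResourcePool:
    def __init__(self, total_cpus):
        self.total = total_cpus
        self.available = total_cpus

    def reserve(self, cpus):
        # A placement group takes CPUs out of the pool.
        self.available -= cpus

    def release(self, cpus):
        # Bug: no guard against releasing the same reservation twice.
        self.available += cpus


pool = ResourcePool(total_cpus=52)
pool.reserve(2)  # placement group created: 50/52 free

# Race: two code paths both release the same 2-CPU reservation.
pool.release(2)
pool.release(2)

used = pool.total - pool.available
print(f"Usage: {used}/{pool.total} CPU")  # Usage: -2/52 CPU
```

A guard such as tracking which reservations are still outstanding (and ignoring a second release) would keep the counter consistent; minimizing the window in which both paths can run is the race-condition fix referred to above.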
In Ray 1.7.1, `ray monitor` shows -61/44, and I cannot run any script. What helps is to restart the cluster after (almost) every run. This script randomly causes negative values in the `ray monitor` usage:

```python
import os
import random
import time

os.environ["TUNE_DISABLE_AUTO_CALLBACK_LOGGERS"] = "1"  # https://github.com/ray-project/ray/issues/18903
os.environ["TUNE_DISABLE_AUTO_CALLBACK_SYNCER"] = "1"  # https://github.com/ray-project/ray/issues/18903
os.environ["TUNE_RESULT_BUFFER_LENGTH"] = "0"

import numpy as np

import ray
from ray import tune
from ray.tune.suggest import optuna
from ray.tune.suggest import basic_variant


def evaluation_fn():
    time.sleep(1)
    return random.randint(1, 10_000)


def easy_objective(config, data):
    intermediate_score = evaluation_fn()
    tune.report(mean_loss=intermediate_score)


if __name__ == "__main__":
    ray.init(address="auto", _redis_password="5241590000000000")
    df = np.zeros(10_000_000)
    search_optuna = optuna.OptunaSearch()
    search_basic = basic_variant.BasicVariantGenerator()
    analysis = tune.run(
        tune.with_parameters(easy_objective, data=df),
        name="test",
        metric="mean_loss",
        mode="max",
        search_alg=search_basic,
        num_samples=-1,
        config={
            # A random function
            "alpha": tune.sample_from(lambda _: np.random.uniform(100)),
            # Use the `spec.config` namespace to access other hyperparameters
            "beta": tune.sample_from(lambda spec: spec.config.alpha * np.random.normal()),
        },
        reuse_actors=True,
        fail_fast=True,
        verbose=2,
    )
```
Ah, the fix was probably not included in 1.7.1... :( Maybe we should release 1.7.2 with this fix included. Just to make sure: you haven't seen this issue on master, right?
@rkooo567 I didn't test nightly. Is the master branch stable (can I always use nightly instead of waiting for releases)?
We definitely recommend you use 1.7.1. I was asking because the fix was merged in master and we forgot to cherry-pick the commit onto 1.7.1, but I wanted to make sure the fix commit actually fixes the issue.
Btw, we will have a release for 1.8 very soon. We can revisit this if it still happens on 1.8.
It hasn't happened again yet on Ray 1.7.1. Will reopen if it does. Thanks for the quick responses!
Search before asking
Ray Component
Ray Core, Monitoring & Debugging
What happened + What you expected to happen
Started a script, left it running for a couple of hours, interrupted it with Ctrl+C; the script exits, and `ray monitor` shows negative usage:
```
Usage:
 -2.0/52.0 CPU (0.0 used of 2.0 reserved in placement groups)
 0.0/2.0 GPU
 0.0/1.0 accelerator_type:G
 0.0/1.0 accelerator_type:GT
 0.00/103.633 GiB memory
 0.00/48.406 GiB object_store_memory
Demands:
 (no resource demands)
```
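One way to flag this broken state programmatically is to parse the usage line and check for negative values. A hedged sketch follows: the `<used>/<total> CPU` format is taken from the output in this report, and the `cpu_usage` helper is hypothetical, not part of Ray's API.

```python
import re


def cpu_usage(monitor_line):
    """Extract (used, total) CPU from a monitor usage line.

    Assumes the '<used>/<total> CPU' format shown in this issue.
    """
    m = re.search(r"(-?\d+(?:\.\d+)?)/(-?\d+(?:\.\d+)?)\s+CPU", monitor_line)
    if m is None:
        raise ValueError("no CPU usage found in line")
    return float(m.group(1)), float(m.group(2))


line = "Usage: -2.0/52.0 CPU (0.0 used of 2.0 reserved in placement groups)"
used, total = cpu_usage(line)
print(used, total)  # -2.0 52.0
print(used < 0)     # True: the accounting has gone negative
```

Any `used < 0` or `used > total` reading indicates the resource bookkeeping is inconsistent, which is the symptom reported here.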
Reproduction script
Anything else
ray 1.7.0
Are you willing to submit a PR?