[Core, Observability] Create metric for running ray jobs #47438

Closed
alanwguo opened this issue Aug 31, 2024 — with Slack · 5 comments · Fixed by #47793
Labels
core: Issues that should be addressed in Ray Core
enhancement: Request for new feature and/or capability
@external-author-action-required: Alternate tag for PRs where the author doesn't have labeling permission.
good first issue: Great starter issue for someone just starting to contribute to Ray
observability: Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling
P1: Issue that should be fixed within a few weeks

Comments

Contributor

Tracking the number of running Ray jobs would be useful. Alerts could then be created for Ray clusters that are left alive without a running job.

alanwguo added the good first issue label on Aug 31, 2024 (with Slack)
alanwguo added the core and observability labels on Aug 31, 2024
jjyao added the enhancement and triage (Needs triage, e.g. priority, bug/not-bug, and owning component) labels on Sep 3, 2024

ekdnam commented Sep 4, 2024

Hi @alanwguo @jjyao! I would like to work on this. Could you please let me know where I should start?

anyscalesam added the P2 (Important issue, but not time-critical) label and removed the triage label on Sep 4, 2024
anyscalesam added the @external-author-action-required label on Sep 4, 2024
Contributor Author

alanwguo commented Sep 4, 2024

@jjyao can we actually have someone from ray core offer guidance here? I'm not familiar with the core code that emits metrics.

This will likely require a C++ change, @ekdnam, if you're comfortable with that.

Collaborator

jjyao commented Sep 23, 2024

@dentiny do you want to take this one? Check metrics for actors as a starting point. The code is in gcs_actor_manager.cc:

ray::stats::STATS_actors.Record(
    num_actors,
    {{"State", rpc::ActorTableData::ActorState_Name(key.first)},
     {"Name", key.second},
     {"Source", "gcs"},
     {"JobId", ""}});

For jobs, the code will be in gcs_job_manager.cc. We want a ray_jobs gauge metric with a State label (possible values: RUNNING and FINISHED).
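
As a rough illustration of that suggestion (not the actual Ray code), a job-state gauge could be sketched along these lines. Everything here is a hypothetical stand-in: JobInfo, its is_dead flag, FakeGauge, and RecordJobMetrics are invented for the example, and FakeGauge only mimics the Record(value, tags) call shape from the actor snippet. The real change would presumably live in gcs_job_manager.cc and use Ray's own stats definitions.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for one row of the GCS job table.
struct JobInfo {
  std::string job_id;
  bool is_dead;  // assumption: true once the job has finished
};

// Hypothetical gauge: keeps the last value recorded for each State tag.
// It only mimics the Record(value, tags) call shape of the actor snippet.
class FakeGauge {
 public:
  void Record(double value, const std::map<std::string, std::string> &tags) {
    // Every sample in this sketch must carry a State tag.
    assert(tags.count("State") == 1);
    last_value_by_state_[tags.at("State")] = value;
  }

  // Returns the last recorded value for a state, or 0 if none was recorded.
  double Get(const std::string &state) const {
    auto it = last_value_by_state_.find(state);
    return it == last_value_by_state_.end() ? 0.0 : it->second;
  }

 private:
  std::map<std::string, double> last_value_by_state_;
};

// Counts jobs per state and records one gauge sample per state, so a state
// with zero jobs is still reported as 0 rather than keeping an old value.
void RecordJobMetrics(const std::vector<JobInfo> &jobs, FakeGauge &gauge) {
  std::int64_t running = 0;
  std::int64_t finished = 0;
  for (const auto &job : jobs) {
    if (job.is_dead) {
      ++finished;
    } else {
      ++running;
    }
  }
  gauge.Record(static_cast<double>(running), {{"State", "RUNNING"}});
  gauge.Record(static_cast<double>(finished), {{"State", "FINISHED"}});
}
```

Recording both states on every pass, including zeros, is what would make the alert from the issue description possible: a cluster whose RUNNING value stays at 0 for some time has no running job.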

jjyao added the P1 label and removed the P2 label on Sep 23, 2024
Contributor

dentiny commented Sep 23, 2024

Thanks @jjyao! I would love to take a look.


ekdnam commented Sep 23, 2024

Hi @alanwguo! Sorry for not taking this up; I have been caught up in some personal stuff for the past few weeks.


5 participants