[core] Add metrics for gcs jobs #47793

Merged
7 commits merged on Oct 11, 2024

Conversation

@dentiny (Contributor) commented Sep 23, 2024

Why are these changes needed?

This PR adds metrics for job states within the job manager.

In detail, a gauge stat is sent via the OpenCensus exporter, so running Ray jobs can be tracked and alerts can be created later on.

Fault tolerance is not considered; according to the doc, state is re-constructed at restart.

On testing, the best way is to observe via an OpenCensus backend (e.g. a Google Cloud Monitoring dashboard), but that is not easy for open-source contributors; the alternative is a mock / fake exporter implementation, which I don't find in the code base.
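A minimal, self-contained C++ sketch of the idea; the class and function names (RunningJobsGauge, emit_gauge) are illustrative only, not the actual GcsJobManager or stats API:

#include <string>
#include <unordered_set>

// Sketch only: track the set of running job IDs and emit its size as a gauge
// through whatever exporter hook the stats layer provides, so running Ray jobs
// can be charted and alerted on.
class RunningJobsGauge {
 public:
  // `emit_gauge` stands in for the real exporter call.
  explicit RunningJobsGauge(void (*emit_gauge)(double)) : emit_gauge_(emit_gauge) {}

  void OnJobStarted(const std::string &job_id) {
    running_job_ids_.insert(job_id);
    emit_gauge_(static_cast<double>(running_job_ids_.size()));
  }

  void OnJobFinished(const std::string &job_id) {
    running_job_ids_.erase(job_id);
    emit_gauge_(static_cast<double>(running_job_ids_.size()));
  }

 private:
  void (*emit_gauge_)(double);
  std::unordered_set<std::string> running_job_ids_;
};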

Related issue number

Closes #47438

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@dentiny dentiny requested a review from a team as a code owner September 23, 2024 06:46
@@ -342,8 +342,8 @@ GcsActorManager::GcsActorManager(
actor_gc_delay_(RayConfig::instance().gcs_actor_table_min_duration_ms()) {
RAY_CHECK(worker_client_factory_);
RAY_CHECK(destroy_owned_placement_group_if_needed_);
actor_state_counter_.reset(
dentiny (Contributor Author):

Creating a shared pointer with new leads to two allocations, while std::make_shared needs only one.
Ref: Effective Modern C++, Item 21
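A small illustration of the difference; Counter here is only a stand-in for the real counter type held by actor_state_counter_:

#include <memory>

struct Counter {
  int value = 0;
};

int main() {
  // Two heap allocations: one by `new Counter()` for the object, and a second
  // one inside shared_ptr's constructor for the control block.
  std::shared_ptr<Counter> a(new Counter());

  // One heap allocation: std::make_shared places the object and the control
  // block in a single allocation (Effective Modern C++, Item 21).
  auto b = std::make_shared<Counter>();

  return a->value + b->value;  // use both to avoid unused-variable warnings
}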

@@ -104,11 +117,6 @@ class GcsJobManager : public rpc::JobInfoHandler {

/// The cached core worker clients which are used to communicate with workers.
rpc::CoreWorkerClientPool core_worker_clients_;

void ClearJobInfos(const rpc::JobTableData &job_data);
dentiny (Contributor Author):

The Google C++ style guide has a recommended declaration order: https://google.github.io/styleguide/cppguide.html#Declaration_Order
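For illustration, a skeleton following that recommended order (types first, then constructors, then other functions, with data members last); the class and member names are hypothetical, not the actual GcsJobManager declaration:

#include <functional>
#include <string>
#include <unordered_set>

class JobManagerSketch {
 public:
  // 1. Types and type aliases.
  using JobFinishedCallback = std::function<void(const std::string &job_id)>;

  // 2. Constructors / destructor.
  JobManagerSketch() = default;
  ~JobManagerSketch() = default;

  // 3. All other functions.
  void HandleAddJob(const std::string &job_id) { running_job_ids_.insert(job_id); }

 private:
  // Private helper functions still precede data members.
  void ClearJobInfos(const std::string &job_id) { running_job_ids_.erase(job_id); }

  // 4. Data members come last.
  std::unordered_set<std::string> running_job_ids_;
};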

@dentiny dentiny force-pushed the hjiang/gcs-job-metrics branch 4 times, most recently from 9437c3e to 6934d0e Compare September 23, 2024 09:00
Signed-off-by: dentiny <[email protected]>
@rynewang (Contributor) commented:

This PR does not consider the HA situation. "State is re-constructed at restart", yes, but that needs GcsJobManager's own code. In GcsJobManager::Initialize you receive a Redis snapshot of the previous state. Since you added a new member running_job_ids_, you need to initialize it in that method.

We also have job_id in the metrics. This means each job creates a metric stream, which gives us cardinality problems. I am thinking we may only need a single {State} label without the job ID.

The metric is currently updated whenever a job starts/ends. I guess we can instead expose it periodically, in line with the other managers' RecordMetrics method, as in

gcs_task_manager_->RecordMetrics();
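A rough sketch of that periodic-reporting pattern, with illustrative names rather than the actual Ray GCS code: state transitions only update in-memory counts, and a RecordMetrics() method called on a periodic tick is the only place that emits stats.

#include <cstdint>
#include <functional>
#include <utility>

class JobMetricsReporter {
 public:
  explicit JobMetricsReporter(std::function<void(int64_t running, int64_t finished)> emit)
      : emit_(std::move(emit)) {}

  // Cheap bookkeeping on job start/end; no metric emission here.
  void SetCounts(int64_t running, int64_t finished) {
    running_ = running;
    finished_ = finished;
  }

  // Invoked on the periodic metrics interval, similar in spirit to how
  // gcs_task_manager_->RecordMetrics() is called.
  void RecordMetrics() const { emit_(running_, finished_); }

 private:
  std::function<void(int64_t, int64_t)> emit_;
  int64_t running_ = 0;
  int64_t finished_ = 0;
};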

src/ray/stats/metric_defs.cc (outdated review thread, resolved)
DEFINE_stats(jobs,
"Current number of jobs currently in a particular state.",
// State: latest state for the particular job.
// JobId: ID in hex format for this job.
A collaborator commented:

No need to have the JobId label: we just want two time series, one for running and one for finished.

@dentiny (Contributor Author) replied on Sep 24, 2024:

Updated.

Edit: I split the job status into two parts:

  • Gauge for running jobs;
  • Counter for finished jobs.

The collaborator replied:

Oh, sorry for the confusion. What I meant is that we still have one stat called jobs, but with a State label of cardinality 2 (RUNNING and FINISHED). This will be stored as two time series in Prometheus.
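A self-contained sketch of that shape (not the Ray stats API): one metric name, one State label with two possible values, hence exactly two time series.

#include <cstdint>
#include <iostream>
#include <map>
#include <string>

int main() {
  std::map<std::string, int64_t> jobs_by_state = {{"RUNNING", 0}, {"FINISHED", 0}};

  // A job starts: the RUNNING series goes up.
  jobs_by_state["RUNNING"] += 1;

  // The job finishes: RUNNING goes down, FINISHED goes up.
  jobs_by_state["RUNNING"] -= 1;
  jobs_by_state["FINISHED"] += 1;

  for (const auto &[state, value] : jobs_by_state) {
    std::cout << "jobs{State=\"" << state << "\"} = " << value << "\n";
  }
  return 0;
}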

A contributor commented:

Note: maybe you don't need to save all running_job_ids, since you can keep both as counters; when the counter resets on GCS restart, Prometheus can take care of the reset.

@dentiny (Contributor Author) replied on Sep 25, 2024:

maybe you don't need to save all running_job_ids since you can keep both as a counter

Well, you mentioned that HandleAddJob needs to be idempotent (I guess it could be invoked repeatedly due to reasons like retries), so if we used a plain integer here, incrementing it directly would lead to mis-reporting.

The reason I use a set for the running job count (gauge) and an integer for the finished job count (counter) is (see the sketch after this list):

  • A gauge is queried for its current value, so it is better to be precise (i.e. having 1 running job vs. 0 running jobs differs a lot).
  • A counter is queried via rate (as you mentioned above), where we care more about the rate of change than the absolute value.
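A minimal sketch of this split, with hypothetical names rather than the actual GcsJobManager members: the set keeps the running-job gauge exact even if the add-job handler is retried, while the finished-job counter only needs to increase monotonically.

#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_set>

class JobStateTracker {
 public:
  void OnJobStarted(const std::string &job_id) {
    // insert() is a no-op for an already-present key, so a retried
    // "add job" call does not inflate the running-job gauge.
    running_job_ids_.insert(job_id);
  }

  void OnJobFinished(const std::string &job_id) {
    // Count the RUNNING -> FINISHED transition only once, even if retried.
    if (running_job_ids_.erase(job_id) > 0) {
      finished_jobs_count_ += 1;
    }
  }

  size_t RunningJobs() const { return running_job_ids_.size(); }  // gauge value
  int64_t FinishedJobs() const { return finished_jobs_count_; }   // counter value

 private:
  std::unordered_set<std::string> running_job_ids_;
  int64_t finished_jobs_count_ = 0;
};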

@dentiny (Contributor Author) commented Sep 24, 2024

This PR does not consider the HA situation. "State is re-constructed at restart", yes, but that needs GcsJobManager's own code. In GcsJobManager::Initialize you receive a Redis snapshot of the previous state. Since you added a new member running_job_ids_, you need to initialize it in that method.

Thanks for the code pointer! State recovery logic added (see the sketch after this comment).

We also have job_id in the metrics. This means each job creates a metric stream, which gives us cardinality problems. I am thinking we may only need a single {State} label without the job ID.

Sounds good. I'm curious, what cardinality do you expect for jobs?

Terminology-wise, a job is a collection of tasks.
In my personal experience at an AV company, it's not uncommon to have tens of thousands of tasks running concurrently, but the number of jobs is usually 100-200 per day.

But yeah, removing the job id sounds good.

The metric is currently updated whenever a job starts/ends. I guess we can instead expose it periodically, in line with the other managers' RecordMetrics method.

Thanks for the code pointer! Updated to unify metrics reporting.
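As a rough sketch of that recovery step (JobRecord and is_dead are illustrative stand-ins for the actual job table schema, not the Ray protobuf types):

#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>

struct JobRecord {
  std::string job_id;
  bool is_dead = false;  // set to true once the job has finished
};

class JobMetricsState {
 public:
  // On GCS restart, Initialize receives a snapshot of the job table and the
  // in-memory running-job set is rebuilt from it.
  void InitializeFromSnapshot(
      const std::unordered_map<std::string, JobRecord> &snapshot) {
    running_job_ids_.clear();
    for (const auto &[job_id, record] : snapshot) {
      if (!record.is_dead) {
        running_job_ids_.insert(job_id);  // job was still running before restart
      }
    }
  }

  size_t RunningJobs() const { return running_job_ids_.size(); }

 private:
  std::unordered_set<std::string> running_job_ids_;
};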

@dentiny dentiny force-pushed the hjiang/gcs-job-metrics branch 4 times, most recently from d66d066 to 44be6d5 Compare September 24, 2024 10:19
@dentiny dentiny requested a review from jjyao September 24, 2024 10:21
@anyscalesam added labels on Sep 24, 2024: triage (Needs triage, e.g. priority, bug/not-bug, and owning component), core (Issues that should be addressed in Ray Core), observability (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling)
src/ray/stats/metric_defs.cc (outdated review thread, resolved)
src/ray/gcs/gcs_server/gcs_job_manager.h (outdated review thread, resolved)
Signed-off-by: dentiny <[email protected]>
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
This PR adds metrics for job states within the job manager.

In detail, a gauge stat is sent via the OpenCensus exporter, so running Ray
jobs can be tracked and alerts can be created later on.

Fault tolerance is not considered; according to the
[doc](https://docs.ray.io/en/latest/ray-core/fault_tolerance/gcs.html),
state is re-constructed at restart.

On testing, the best way is to observe via an OpenCensus backend (e.g. a
Google Cloud Monitoring dashboard), but that is not easy for open-source
contributors; the alternative is a mock / fake exporter implementation,
which I don't find in the code base.

Signed-off-by: dentiny <[email protected]>
Co-authored-by: Ruiyang Wang <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
This PR is a follow-up to comment
ray-project#47793 (comment),
and adds thread checking to the GCS job manager callbacks to make sure there
is no concurrent access to data members.

Signed-off-by: dentiny <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
@alanwguo (Contributor) commented:

@dentiny @jjyao is there a ticket for the follow-up of having a metric to track job state changes? We have a user who would like to alert whenever a Ray job fails.

Labels
  • core: Issues that should be addressed in Ray Core
  • go: add ONLY when ready to merge, run all tests
  • observability: Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling
  • P1: Issue that should be fixed within a few weeks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Core, Observability] Create metric for running ray jobs
5 participants