[core] Add metrics for gcs jobs #47793

Merged
7 commits merged on Oct 11, 2024

Conversation

@dentiny (Contributor) commented Sep 23, 2024

Why are these changes needed?

This PR adds metrics for job states within the job manager.

In detail, a gauge stat is sent via the OpenCensus exporter, so running Ray jobs can be tracked and alerts can be created later on.

Fault tolerance is not considered; according to the doc, state is re-constructed at restart.

On testing, the best way is to observe via an OpenCensus backend (e.g. a Google Cloud Monitoring dashboard), but that is not easy for open-source contributors; the alternative is a mock / fake exporter implementation, which I don't find in the code base.
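A minimal, self-contained C++ sketch of the idea; the class and function names (RunningJobsGauge, emit_gauge) are illustrative only, not the actual GcsJobManager or stats API:

#include <string>
#include <unordered_set>

// Sketch only: track the set of running job IDs and emit its size as a gauge
// through whatever exporter hook the stats layer provides, so running Ray jobs
// can be charted and alerted on.
class RunningJobsGauge {
 public:
  // `emit_gauge` stands in for the real exporter call.
  explicit RunningJobsGauge(void (*emit_gauge)(double)) : emit_gauge_(emit_gauge) {}

  void OnJobStarted(const std::string &job_id) {
    running_job_ids_.insert(job_id);
    emit_gauge_(static_cast<double>(running_job_ids_.size()));
  }

  void OnJobFinished(const std::string &job_id) {
    running_job_ids_.erase(job_id);
    emit_gauge_(static_cast<double>(running_job_ids_.size()));
  }

 private:
  void (*emit_gauge_)(double);
  std::unordered_set<std::string> running_job_ids_;
};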

Related issue number

Closes #47438

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@dentiny dentiny requested a review from a team as a code owner September 23, 2024 06:46
@@ -342,8 +342,8 @@ GcsActorManager::GcsActorManager(
actor_gc_delay_(RayConfig::instance().gcs_actor_table_min_duration_ms()) {
RAY_CHECK(worker_client_factory_);
RAY_CHECK(destroy_owned_placement_group_if_needed_);
actor_state_counter_.reset(
dentiny (Contributor Author):

Creating a shared pointer with new leads to two allocations, while std::make_shared needs only one.
Ref: Effective Modern C++, Item 21
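A small illustration of the difference; Counter here is only a stand-in for the real counter type held by actor_state_counter_:

#include <memory>

struct Counter {
  int value = 0;
};

int main() {
  // Two heap allocations: one by `new Counter()` for the object, and a second
  // one inside shared_ptr's constructor for the control block.
  std::shared_ptr<Counter> a(new Counter());

  // One heap allocation: std::make_shared places the object and the control
  // block in a single allocation (Effective Modern C++, Item 21).
  auto b = std::make_shared<Counter>();

  return a->value + b->value;  // use both to avoid unused-variable warnings
}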

@@ -104,11 +117,6 @@ class GcsJobManager : public rpc::JobInfoHandler {

/// The cached core worker clients which are used to communicate with workers.
rpc::CoreWorkerClientPool core_worker_clients_;

void ClearJobInfos(const rpc::JobTableData &job_data);
dentiny (Contributor Author):

The Google C++ style guide has a recommended declaration order: https://google.github.io/styleguide/cppguide.html#Declaration_Order
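For illustration, a skeleton following that recommended order (types first, then constructors, then other functions, with data members last); the class and member names are hypothetical, not the actual GcsJobManager declaration:

#include <functional>
#include <string>
#include <unordered_set>

class JobManagerSketch {
 public:
  // 1. Types and type aliases.
  using JobFinishedCallback = std::function<void(const std::string &job_id)>;

  // 2. Constructors / destructor.
  JobManagerSketch() = default;
  ~JobManagerSketch() = default;

  // 3. All other functions.
  void HandleAddJob(const std::string &job_id) { running_job_ids_.insert(job_id); }

 private:
  // Private helper functions still precede data members.
  void ClearJobInfos(const std::string &job_id) { running_job_ids_.erase(job_id); }

  // 4. Data members come last.
  std::unordered_set<std::string> running_job_ids_;
};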

@dentiny dentiny force-pushed the hjiang/gcs-job-metrics branch 4 times, most recently from 9437c3e to 6934d0e Compare September 23, 2024 09:00
Signed-off-by: dentiny <[email protected]>
@rynewang (Contributor) commented:

This PR does not consider the HA situation. "State is re-constructed at restart", yes, but that needs GcsJobManager's own code. In GcsJobManager::Initialize you receive a Redis snapshot of the previous state. Since you added a new member running_job_ids_, you need to initialize it in that method.

We also have job_id in the metrics. This means each job creates a metric stream, which gives us cardinality problems. I am thinking we may only need a single {State} label without the job ID.

The metric is currently updated whenever a job starts/ends. I guess we can instead expose it periodically, in line with the other managers' RecordMetrics method, as in

gcs_task_manager_->RecordMetrics();
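A rough sketch of that periodic-reporting pattern, with illustrative names rather than the actual Ray GCS code: state transitions only update in-memory counts, and a RecordMetrics() method called on a periodic tick is the only place that emits stats.

#include <cstdint>
#include <functional>
#include <utility>

class JobMetricsReporter {
 public:
  explicit JobMetricsReporter(std::function<void(int64_t running, int64_t finished)> emit)
      : emit_(std::move(emit)) {}

  // Cheap bookkeeping on job start/end; no metric emission here.
  void SetCounts(int64_t running, int64_t finished) {
    running_ = running;
    finished_ = finished;
  }

  // Invoked on the periodic metrics interval, similar in spirit to how
  // gcs_task_manager_->RecordMetrics() is called.
  void RecordMetrics() const { emit_(running_, finished_); }

 private:
  std::function<void(int64_t, int64_t)> emit_;
  int64_t running_ = 0;
  int64_t finished_ = 0;
};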

src/ray/stats/metric_defs.cc (outdated review thread, resolved)
DEFINE_stats(jobs,
"Current number of jobs currently in a particular state.",
// State: latest state for the particular job.
// JobId: ID in hex format for this job.
A collaborator commented:

No need to have the JobId label: we just want two time series, one for running and one for finished.

@dentiny (Contributor Author) replied on Sep 24, 2024:

Updated.

Edit: I split the job status into two parts:

  • Gauge for running jobs;
  • Counter for finished jobs.

The collaborator replied:

Oh, sorry for the confusion. What I meant is that we still have one stat called jobs, but with a State label of cardinality 2 (RUNNING and FINISHED). This will be stored as two time series in Prometheus.
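A self-contained sketch of that shape (not the Ray stats API): one metric name, one State label with two possible values, hence exactly two time series.

#include <cstdint>
#include <iostream>
#include <map>
#include <string>

int main() {
  std::map<std::string, int64_t> jobs_by_state = {{"RUNNING", 0}, {"FINISHED", 0}};

  // A job starts: the RUNNING series goes up.
  jobs_by_state["RUNNING"] += 1;

  // The job finishes: RUNNING goes down, FINISHED goes up.
  jobs_by_state["RUNNING"] -= 1;
  jobs_by_state["FINISHED"] += 1;

  for (const auto &[state, value] : jobs_by_state) {
    std::cout << "jobs{State=\"" << state << "\"} = " << value << "\n";
  }
  return 0;
}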

A contributor commented:

Note: maybe you don't need to save all running_job_ids, since you can keep both as counters; when the counter resets on GCS restart, Prometheus can take care of the reset.

@dentiny (Contributor Author) replied on Sep 25, 2024:

maybe you don't need to save all running_job_ids since you can keep both as a counter

Well, you mentioned that HandleAddJob needs to be idempotent (I guess it could be invoked repeatedly due to reasons like retries), so if we used a plain integer here, incrementing it directly would lead to mis-reporting.

The reason I use a set for the running job count (gauge) and an integer for the finished job count (counter) is (see the sketch after this list):

  • A gauge is queried for its current value, so it is better to be precise (i.e. having 1 running job vs. 0 running jobs differs a lot).
  • A counter is queried via rate (as you mentioned above), where we care more about the rate of change than the absolute value.
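A minimal sketch of this split, with hypothetical names rather than the actual GcsJobManager members: the set keeps the running-job gauge exact even if the add-job handler is retried, while the finished-job counter only needs to increase monotonically.

#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_set>

class JobStateTracker {
 public:
  void OnJobStarted(const std::string &job_id) {
    // insert() is a no-op for an already-present key, so a retried
    // "add job" call does not inflate the running-job gauge.
    running_job_ids_.insert(job_id);
  }

  void OnJobFinished(const std::string &job_id) {
    // Count the RUNNING -> FINISHED transition only once, even if retried.
    if (running_job_ids_.erase(job_id) > 0) {
      finished_jobs_count_ += 1;
    }
  }

  size_t RunningJobs() const { return running_job_ids_.size(); }  // gauge value
  int64_t FinishedJobs() const { return finished_jobs_count_; }   // counter value

 private:
  std::unordered_set<std::string> running_job_ids_;
  int64_t finished_jobs_count_ = 0;
};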

@dentiny (Contributor Author) commented Sep 24, 2024

This PR does not consider the HA situation. "State is re-constructed at restart", yes, but that needs GcsJobManager's own code. In GcsJobManager::Initialize you receive a Redis snapshot of the previous state. Since you added a new member running_job_ids_, you need to initialize it in that method.

Thanks for the code pointer! State recovery logic added (see the sketch after this comment).

We also have job_id in the metrics. This means each job creates a metric stream, which gives us cardinality problems. I am thinking we may only need a single {State} label without the job ID.

Sounds good. I'm curious, what cardinality do you expect for jobs?

Terminology-wise, a job is a collection of tasks.
In my personal experience at an AV company, it's not uncommon to have tens of thousands of tasks running concurrently, but the number of jobs is usually 100-200 per day.

But yeah, removing the job id sounds good.

The metric is currently updated whenever a job starts/ends. I guess we can instead expose it periodically, in line with the other managers' RecordMetrics method.

Thanks for the code pointer! Updated to unify metrics reporting.
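As a rough sketch of that recovery step (JobRecord and is_dead are illustrative stand-ins for the actual job table schema, not the Ray protobuf types):

#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>

struct JobRecord {
  std::string job_id;
  bool is_dead = false;  // set to true once the job has finished
};

class JobMetricsState {
 public:
  // On GCS restart, Initialize receives a snapshot of the job table and the
  // in-memory running-job set is rebuilt from it.
  void InitializeFromSnapshot(
      const std::unordered_map<std::string, JobRecord> &snapshot) {
    running_job_ids_.clear();
    for (const auto &[job_id, record] : snapshot) {
      if (!record.is_dead) {
        running_job_ids_.insert(job_id);  // job was still running before restart
      }
    }
  }

  size_t RunningJobs() const { return running_job_ids_.size(); }

 private:
  std::unordered_set<std::string> running_job_ids_;
};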

@dentiny dentiny force-pushed the hjiang/gcs-job-metrics branch 4 times, most recently from d66d066 to 44be6d5 Compare September 24, 2024 10:19
@dentiny dentiny requested a review from jjyao September 24, 2024 10:21
@anyscalesam added labels on Sep 24, 2024: triage (Needs triage, e.g. priority, bug/not-bug, and owning component), core (Issues that should be addressed in Ray Core), observability (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling)
src/ray/stats/metric_defs.cc (outdated review thread, resolved)
src/ray/gcs/gcs_server/gcs_job_manager.h (outdated review thread, resolved)
Signed-off-by: dentiny <[email protected]>
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
This PR adds metrics for job states within the job manager.

In detail, a gauge stat is sent via the OpenCensus exporter, so running Ray
jobs can be tracked and alerts can be created later on.

Fault tolerance is not considered; according to the
[doc](https://docs.ray.io/en/latest/ray-core/fault_tolerance/gcs.html),
state is re-constructed at restart.

On testing, the best way is to observe via an OpenCensus backend (e.g. a
Google Cloud Monitoring dashboard), but that is not easy for open-source
contributors; the alternative is a mock / fake exporter implementation,
which I don't find in the code base.

Signed-off-by: dentiny <[email protected]>
Co-authored-by: Ruiyang Wang <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
This PR is a follow-up to comment
ray-project#47793 (comment),
and adds thread checking to the GCS job manager callbacks to make sure there
is no concurrent access to data members.

Signed-off-by: dentiny <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
@alanwguo (Contributor) commented:

@dentiny @jjyao is there a ticket for the follow-up of having a metric to track job state changes? We have a user who would like to alert whenever a Ray job fails.

Labels
  • core: Issues that should be addressed in Ray Core
  • go: add ONLY when ready to merge, run all tests
  • observability: Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling
  • P1: Issue that should be fixed within a few weeks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Core, Observability] Create metric for running ray jobs
5 participants