[core] Add metrics for gcs jobs #47793
Conversation
@@ -342,8 +342,8 @@ GcsActorManager::GcsActorManager(
      actor_gc_delay_(RayConfig::instance().gcs_actor_table_min_duration_ms()) {
  RAY_CHECK(worker_client_factory_);
  RAY_CHECK(destroy_owned_placement_group_if_needed_);
  actor_state_counter_.reset(
Creating a shared pointer with `new` leads to two allocations, while `std::make_shared` needs only one.
Ref: Effective Modern C++, Item 21.
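For illustration, a minimal sketch of the difference (the counter type name here is a placeholder, not the actual class in this file):

```cpp
#include <memory>

struct StateCounter {};  // placeholder type for illustration

int main() {
  // Two heap allocations: one for the object, one for the control block.
  std::shared_ptr<StateCounter> a(new StateCounter());

  // One heap allocation holding both the object and the control block
  // (Effective Modern C++, Item 21).
  auto b = std::make_shared<StateCounter>();
  return 0;
}
```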
@@ -104,11 +117,6 @@ class GcsJobManager : public rpc::JobInfoHandler {

  /// The cached core worker clients which are used to communicate with workers.
  rpc::CoreWorkerClientPool core_worker_clients_;

  void ClearJobInfos(const rpc::JobTableData &job_data);
The Google C++ style guide has a recommendation for declaration order: https://google.github.io/styleguide/cppguide.html#Declaration_Order
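For reference, a small illustrative sketch of that ordering (placeholder names, not the actual GcsJobManager members):

```cpp
// Illustrative only: Google style orders sections public -> protected ->
// private, and within each section puts functions before data members.
class ExampleManager {
 public:
  ExampleManager() = default;
  void HandlePublicRequest();

 private:
  // Private helper functions first...
  void ClearInternalState();

  // ...data members last.
  int cached_value_ = 0;
};
```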
dentiny force-pushed from 9437c3e to 6934d0e.
dentiny force-pushed from 6934d0e to 9e1f214.
This PR does not consider the HA situation. "State is re-constructed at restart" is true, but that needs GcsJobManager's own recovery code. Also, the metric is currently updated only when a job starts or ends; I suggest we instead expose it periodically, in line with how the other managers record their metrics (ray/src/ray/gcs/gcs_server/gcs_server.cc, line 792 at dde49e4).
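As a rough illustration of the periodic-reporting idea (a generic sketch only, not the actual gcs_server.cc scheduling code, which uses its own runner and interval config):

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

// Generic sketch: instead of updating the metric only on job start/finish
// events, snapshot and report the current counts on a fixed interval.
class PeriodicJobMetricsReporter {
 public:
  explicit PeriodicJobMetricsReporter(std::chrono::milliseconds interval)
      : interval_(interval) {}

  void Start() {
    worker_ = std::thread([this] {
      while (running_.load()) {
        ReportOnce();  // placeholder for the actual stats-record call
        std::this_thread::sleep_for(interval_);
      }
    });
  }

  void Stop() {
    running_.store(false);
    if (worker_.joinable()) worker_.join();
  }

 private:
  void ReportOnce() { std::printf("reporting job metrics snapshot\n"); }

  std::chrono::milliseconds interval_;
  std::atomic<bool> running_{true};
  std::thread worker_;
};
```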
src/ray/stats/metric_defs.cc (outdated diff)
DEFINE_stats(jobs,
             "Current number of jobs currently in a particular state.",
             // State: latest state for the particular job.
             // JobId: ID in hex format for this job.
No need to have JobId label: we just want two time series: one for running and one for finished.
Updated.
Edit: I split the job status into two parts:
- Gauge for running jobs;
- Counter for finished jobs.
Oh, sorry for the confusion. What I meant is that we still have one stat called `jobs`, but with a `State` label of cardinality 2 (RUNNING and FINISHED). This will be stored as two time series in Prometheus.
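Something along these lines (a sketch only; the exact trailing arguments are assumptions based on the DEFINE_stats pattern used elsewhere in metric_defs.cc):

```cpp
// One `jobs` stat with a State label of cardinality 2 (RUNNING / FINISHED),
// which Prometheus stores as two time series.
DEFINE_stats(jobs,
             "Current number of jobs in a particular state.",
             ("State"),  // label values: RUNNING, FINISHED
             (),         // no histogram buckets
             ray::stats::GAUGE);
```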
Note: maybe you don't need to save all running_job_ids, since you could keep both as counters; when a counter resets on GCS restart, Prometheus can take care of the reset.
> maybe you don't need to save all running_job_ids since you can keep both as a counter

Well, you mentioned that HandleAddJob needs to be idempotent (I guess it could be invoked repeatedly for reasons like retries), so if we used a plain integer here, incrementing it directly would lead to mis-reporting.
The reason I use a set for the running job count (gauge) and an integer for finished jobs (counter) is:
- A gauge is queried for its current value, so it is better to be precise (i.e. having 1 running job vs. 0 running jobs differs a lot);
- A counter is queried via rate() (as you mentioned above), where we care more about the rate of change than the absolute value (sketched below).

Thanks for the code pointer! State recovery logic added.
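A minimal sketch of that set-plus-counter approach (the class and method names here are placeholders for illustration, not the actual GcsJobManager handlers):

```cpp
#include <cstdint>
#include <string>
#include <unordered_set>

// Sketch: a set keeps the running-jobs gauge idempotent even if the
// add-job handler is retried for the same job id, while finished jobs
// are tracked as a monotonically increasing counter.
class JobStateTrackerSketch {
 public:
  void OnJobStarted(const std::string &job_id) {
    running_job_ids_.insert(job_id);  // duplicate inserts are no-ops
  }

  void OnJobFinished(const std::string &job_id) {
    if (running_job_ids_.erase(job_id) > 0) {
      finished_jobs_count_ += 1;  // counter; rate() tolerates restarts/resets
    }
  }

  size_t NumRunningJobs() const { return running_job_ids_.size(); }
  int64_t NumFinishedJobs() const { return finished_jobs_count_; }

 private:
  std::unordered_set<std::string> running_job_ids_;
  int64_t finished_jobs_count_ = 0;
};
```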
Sounds good. I'm curious about the cardinality for jobs? Terminology-wise, a job is a collection of tasks. But yeah, removing the job id sounds good.
Thanks for the code pointer! Updated to unify metrics reporting.
dentiny force-pushed from d66d066 to 44be6d5.
dentiny force-pushed from 44be6d5 to 208482a.
This PR adds metrics for job states within the job manager. In detail, a gauge stat is sent via the OpenCensus exporter, so running Ray jobs can be tracked and alerts can be created later on. Fault tolerance is not considered; according to the [doc](https://docs.ray.io/en/latest/ray-core/fault_tolerance/gcs.html), state is reconstructed at restart. On testing, the best option is to observe via an OpenCensus backend (e.g. a Google Monitoring dashboard), which is not easy for open-source contributors; another option would be a mock/fake exporter implementation, which I don't find in the code base. Signed-off-by: dentiny <[email protected]> Co-authored-by: Ruiyang Wang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
This PR follows up on comment ray-project#47793 (comment), and adds a thread check to the GCS job manager callbacks to make sure there is no concurrent access to data members. Signed-off-by: dentiny <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Why are these changes needed?
This PR adds metrics for job states within the job manager.
In detail, a gauge stat is sent via the OpenCensus exporter, so running Ray jobs can be tracked and alerts can be created later on.
Fault tolerance is not considered; according to the doc, state is reconstructed at restart.
On testing, the best option is to observe via an OpenCensus backend (e.g. a Google Monitoring dashboard), which is not easy for open-source contributors; another option would be a mock/fake exporter implementation, which I don't find in the code base.
Related issue number
Closes #47438
Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.