Skip to content

Commit

Permalink
[core] Add metrics for gcs jobs (ray-project#47793)
Browse files Browse the repository at this point in the history
This PR adds metrics for job states within job manager.

In detail, a gauge stats is sent via opencensus exporter, so running ray
jobs could be tracked and alerts could be created later on.

Fault tolerance is not considered, according to
[doc](https://docs.ray.io/en/latest/ray-core/fault_tolerance/gcs.html),
state is re-constructed at restart.

On testing, the best way is to observe via opencensus backend (i.e.
google monitoring dashboard), but not easy for open-source contributors;
or to have a mock / fake exporter implementation, which I don't find in
the code base.

Signed-off-by: dentiny <[email protected]>
Co-authored-by: Ruiyang Wang <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
  • Loading branch information
2 people authored and ujjawal-khare committed Oct 15, 2024
1 parent 7eadf76 commit 43ce25b
Show file tree
Hide file tree
Showing 2 changed files with 13 additions and 9 deletions.
13 changes: 8 additions & 5 deletions src/ray/gcs/gcs_server/gcs_job_manager.cc
Original file line number Diff line number Diff line change
Expand Up @@ -96,8 +96,6 @@ void GcsJobManager::HandleAddJob(rpc::AddJobRequest request,
reply,
send_reply_callback =
std::move(send_reply_callback)](const Status &status) {
RAY_CHECK(thread_checker_.IsOnSameThread());

if (!status.ok()) {
RAY_LOG(ERROR) << "Failed to add job, job id = " << job_id
<< ", driver pid = " << job_table_data.driver_pid();
Expand All @@ -117,7 +115,7 @@ void GcsJobManager::HandleAddJob(rpc::AddJobRequest request,
// multiple times and requires idempotency (i.e. due to retry).
running_job_ids_.insert(job_id);
}
WriteDriverJobExportEvent(mutable_job_table_data);
WriteDriverJobExportEvent(job_table_data);
GCS_RPC_SEND_REPLY(send_reply_callback, reply, status);
};

Expand All @@ -138,8 +136,6 @@ void GcsJobManager::MarkJobAsFinished(rpc::JobTableData job_table_data,
job_table_data.set_is_dead(true);
auto on_done = [this, job_id, job_table_data, done_callback = std::move(done_callback)](
const Status &status) {
RAY_CHECK(thread_checker_.IsOnSameThread());

if (!status.ok()) {
RAY_LOG(ERROR) << "Failed to mark job state, job id = " << job_id;
} else {
Expand All @@ -150,6 +146,13 @@ void GcsJobManager::MarkJobAsFinished(rpc::JobTableData job_table_data,
}
function_manager_.RemoveJobReference(job_id);
WriteDriverJobExportEvent(job_table_data);

// Update running job status.
auto iter = running_job_ids_.find(job_id);
RAY_CHECK(iter != running_job_ids_.end());
running_job_ids_.erase(iter);
++finished_jobs_count_;

done_callback(status);
};

Expand Down
9 changes: 5 additions & 4 deletions src/ray/gcs/gcs_server/gcs_job_manager.h
Original file line number Diff line number Diff line change
Expand Up @@ -96,16 +96,17 @@ class GcsJobManager : public rpc::JobInfoHandler {

void WriteDriverJobExportEvent(rpc::JobTableData job_data) const;

/// Record metrics.
/// For job manager, (1) running jobs count gauge and (2) new finished jobs (whether
/// succeed or fail) will be reported periodically.
void RecordMetrics();

private:
void ClearJobInfos(const rpc::JobTableData &job_data);

void MarkJobAsFinished(rpc::JobTableData job_table_data,
std::function<void(Status)> done_callback);

// Used to validate invariants for threading; for example, all callbacks are executed on
// the same thread.
ThreadChecker thread_checker_;

// Running Job IDs, used to report metrics.
absl::flat_hash_set<JobID> running_job_ids_;

Expand Down

0 comments on commit 43ce25b

Please sign in to comment.