Skip to content

Commit

Permalink
[core] Add metrics for gcs jobs (ray-project#47793)
Browse files Browse the repository at this point in the history
This PR adds metrics for job states within job manager.

In detail, a gauge stats is sent via opencensus exporter, so running ray
jobs could be tracked and alerts could be created later on.

Fault tolerance is not considered, according to
[doc](https://docs.ray.io/en/latest/ray-core/fault_tolerance/gcs.html),
state is re-constructed at restart.

On testing, the best way is to observe via opencensus backend (i.e.
google monitoring dashboard), but not easy for open-source contributors;
or to have a mock / fake exporter implementation, which I don't find in
the code base.

Signed-off-by: dentiny <[email protected]>
Co-authored-by: Ruiyang Wang <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
  • Loading branch information
2 people authored and ujjawal-khare committed Oct 15, 2024
1 parent 8c82c96 commit 319330f
Show file tree
Hide file tree
Showing 2 changed files with 13 additions and 1 deletion.
9 changes: 8 additions & 1 deletion src/ray/gcs/gcs_server/gcs_job_manager.cc
Original file line number Diff line number Diff line change
Expand Up @@ -115,7 +115,7 @@ void GcsJobManager::HandleAddJob(rpc::AddJobRequest request,
// multiple times and requires idempotency (i.e. due to retry).
running_job_ids_.insert(job_id);
}
WriteDriverJobExportEvent(mutable_job_table_data);
WriteDriverJobExportEvent(job_table_data);
GCS_RPC_SEND_REPLY(send_reply_callback, reply, status);
};

Expand Down Expand Up @@ -146,6 +146,13 @@ void GcsJobManager::MarkJobAsFinished(rpc::JobTableData job_table_data,
}
function_manager_.RemoveJobReference(job_id);
WriteDriverJobExportEvent(job_table_data);

// Update running job status.
auto iter = running_job_ids_.find(job_id);
RAY_CHECK(iter != running_job_ids_.end());
running_job_ids_.erase(iter);
++finished_jobs_count_;

done_callback(status);
};

Expand Down
5 changes: 5 additions & 0 deletions src/ray/gcs/gcs_server/gcs_job_manager.h
Original file line number Diff line number Diff line change
Expand Up @@ -96,6 +96,11 @@ class GcsJobManager : public rpc::JobInfoHandler {

void WriteDriverJobExportEvent(rpc::JobTableData job_data) const;

/// Record metrics.
/// For job manager, (1) running jobs count gauge and (2) new finished jobs (whether
/// succeed or fail) will be reported periodically.
void RecordMetrics();

private:
void ClearJobInfos(const rpc::JobTableData &job_data);

Expand Down

0 comments on commit 319330f

Please sign in to comment.