[metrics] Force export census metrics on worker death #28547
Signed-off-by: Eric Liang <[email protected]>
@@ -460,7 +460,7 @@ RAY_CONFIG(int64_t, idle_worker_killing_time_threshold_ms, 1000)
 RAY_CONFIG(int64_t, num_workers_soft_limit, -1)

 // The interval where metrics are exported in milliseconds.
-RAY_CONFIG(uint64_t, metrics_report_interval_ms, 10000)
+RAY_CONFIG(uint64_t, metrics_report_interval_ms, 5000)
Unrelatedly, it seems 5 seconds is probably safe for now.
@@ -590,6 +590,8 @@ void CoreWorker::Disconnect(
     const rpc::WorkerExitType &exit_type,
     const std::string &exit_detail,
     const std::shared_ptr<LocalMemoryBuffer> &creation_task_exception_pb_bytes) {
+  // Force stats export before exiting the worker.
+  opencensus::stats::StatsExporter::ExportNow();
Is this method thread safe?
Also, technically this can lose data because ExportNow() is not a blocking call: if metrics_agent_io_service stops before we send the RPC, the data is lost. I guess the probability is low, and it might be okay.
Yes, I checked here and OCL acquires a mutex internally.
Regarding the RPC safety, I think at least the RPC initiation is a blocking call. I'm not sure if we wait for a reply though.
Ah, I see. If the initiation is a blocking call, I think it is pretty safe although we are not waiting for a reply...
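The distinction discussed in this thread, between an RPC whose initiation is synchronous and one whose reply is actually awaited, can be illustrated with a small self-contained sketch. Everything here is hypothetical: `FakeChannel`, `Send`, `DeliverReplies`, and `WaitForReply` are made-up names, not Ray or OpenCensus APIs, and the thread does not establish which behavior the real metrics agent client has.

```cpp
#include <chrono>
#include <cstddef>
#include <future>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory channel (not a Ray or gRPC type).
class FakeChannel {
 public:
  // Initiation is synchronous: the request is queued before Send() returns,
  // but the reply has not arrived yet.
  std::future<bool> Send(std::string payload) {
    outbox_.emplace_back(std::move(payload), std::promise<bool>());
    return outbox_.back().second.get_future();
  }

  // Number of requests that were initiated but not yet acknowledged.
  std::size_t queued() const { return outbox_.size(); }

  // Simulates the remote side acknowledging every queued request.
  void DeliverReplies() {
    for (auto &entry : outbox_) {
      entry.second.set_value(true);
    }
    outbox_.clear();
  }

 private:
  std::vector<std::pair<std::string, std::promise<bool>>> outbox_;
};

// Returns true only if a successful reply arrives within the timeout.
inline bool WaitForReply(std::future<bool> &reply,
                         std::chrono::milliseconds timeout) {
  return reply.wait_for(timeout) == std::future_status::ready && reply.get();
}
```

In this sketch, a caller that exits right after `Send()` has handed the request off but cannot know whether it was received. A caller that also uses `WaitForReply` with a bounded timeout trades a slightly slower shutdown for that confirmation.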
This reverts commit 54136e8.
Signed-off-by: PaulFenton <[email protected]>
Why are these changes needed?
Census metrics are periodically pushed from individual workers to per-node metrics agent processes for aggregation. However, a worker that exits between periodic exports never pushes its latest metrics (e.g., those for the last few tasks that finished). Forcing an export during core worker shutdown fixes this.
This involves patching OCL to allow access to the ExportNow() API.
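The idea can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `FakeMetricsAgent`, `SketchExporter`, and `SketchWorker` are hypothetical names, the periodic timer is omitted, and the mutex only mirrors the review comment that the export path takes a lock internally.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

using Measurement = std::pair<std::string, double>;

// Hypothetical stand-in for the per-node metrics agent; it only stores
// whatever is pushed to it.
struct FakeMetricsAgent {
  std::vector<Measurement> received;
};

// Hypothetical exporter: buffers measurements until an export is triggered.
class SketchExporter {
 public:
  explicit SketchExporter(FakeMetricsAgent *agent) : agent_(agent) {}

  void Record(const std::string &name, double value) {
    std::lock_guard<std::mutex> lock(mu_);
    pending_.emplace_back(name, value);
  }

  // Pushes everything buffered so far and returns how many points were sent.
  // A periodic timer would normally call this; shutdown calls it too.
  std::size_t ExportNow() {
    std::lock_guard<std::mutex> lock(mu_);
    const std::size_t sent = pending_.size();
    for (auto &m : pending_) {
      agent_->received.push_back(std::move(m));
    }
    pending_.clear();
    return sent;
  }

 private:
  FakeMetricsAgent *agent_;
  std::mutex mu_;
  std::vector<Measurement> pending_;
};

// Hypothetical worker that flushes metrics before it disconnects.
class SketchWorker {
 public:
  explicit SketchWorker(SketchExporter *exporter) : exporter_(exporter) {}

  void FinishTask(const std::string &task) {
    exporter_->Record("task_finished:" + task, 1.0);
  }

  void Disconnect() {
    // Force an export so the last few tasks are not lost on exit.
    exporter_->ExportNow();
    connected_ = false;
  }

  bool connected() const { return connected_; }

 private:
  SketchExporter *exporter_;
  bool connected_ = true;
};
```

Without the `ExportNow()` call in `Disconnect()`, anything recorded after the last periodic export would still be sitting in `pending_` when the worker exits.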