Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[metrics] Force export census metrics on worker death #28547

Merged
merged 20 commits into from
Sep 16, 2022

Conversation

ericl
Copy link
Contributor

@ericl ericl commented Sep 15, 2022

Why are these changes needed?

Census metrics are periodically pushed from individual workers to per-node metrics agent processes for aggregation. However, this can result in missing the latest metrics from workers (e.g., last few tasks that finished). Force an export during core worker shutdown to fix this.

This involves patch OCL to allow access to the ExportNow() API.

Signed-off-by: Eric Liang <[email protected]>
Signed-off-by: Eric Liang <[email protected]>
Signed-off-by: Eric Liang <[email protected]>
Signed-off-by: Eric Liang <[email protected]>
Signed-off-by: Eric Liang <[email protected]>
Signed-off-by: Eric Liang <[email protected]>
Signed-off-by: Eric Liang <[email protected]>
Signed-off-by: Eric Liang <[email protected]>
Signed-off-by: Eric Liang <[email protected]>
Signed-off-by: Eric Liang <[email protected]>
Signed-off-by: Eric Liang <[email protected]>
Signed-off-by: Eric Liang <[email protected]>
Signed-off-by: Eric Liang <[email protected]>
Signed-off-by: Eric Liang <[email protected]>
Signed-off-by: Eric Liang <[email protected]>
@@ -460,7 +460,7 @@ RAY_CONFIG(int64_t, idle_worker_killing_time_threshold_ms, 1000)
RAY_CONFIG(int64_t, num_workers_soft_limit, -1)

// The interval where metrics are exported in milliseconds.
RAY_CONFIG(uint64_t, metrics_report_interval_ms, 10000)
RAY_CONFIG(uint64_t, metrics_report_interval_ms, 5000)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelatedly, it seems 5 seconds is probably safe for now.

python/ray/tests/test_task_metrics.py Outdated Show resolved Hide resolved
@@ -590,6 +590,8 @@ void CoreWorker::Disconnect(
const rpc::WorkerExitType &exit_type,
const std::string &exit_detail,
const std::shared_ptr<LocalMemoryBuffer> &creation_task_exception_pb_bytes) {
// Force stats export before exiting the worker.
opencensus::stats::StatsExporter::ExportNow();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this method thread safe?

Also, technically this can lose data because ExportNow() is not a blocking call (so if metrics_agent_io_service stops before we send a RPC, it can lose data). I guess the probably is low, and it might be okay.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I checked here and OCL acquires a mutex internally.

Regarding the RPC safety, I think at least the RPC initiation is a blocking call. I'm not sure if we wait for a reply though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. If the initiation is a blocking call, I think it is pretty safe although we are not waiting for a reply...

@ericl ericl merged commit 54136e8 into ray-project:master Sep 16, 2022
stephanie-wang added a commit that referenced this pull request Sep 19, 2022
stephanie-wang added a commit that referenced this pull request Sep 19, 2022
ericl added a commit to ericl/ray that referenced this pull request Sep 19, 2022
PaulFenton pushed a commit to PaulFenton/ray that referenced this pull request Sep 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants