
config/prometheus: add metrics exporter for workers #469

Merged

Conversation

@ulfox (Contributor) commented Aug 13, 2022

Why are these changes needed?

Sample configuration for exporting metrics from Ray cluster workers. It works with autoscaling: it covers newly created workers and drops destroyed worker pods as well.

The PodMonitor CRD works in a similar way to ServiceMonitor, but instead of targeting Services it targets Pods.
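
For reference, a minimal PodMonitor along these lines looks roughly like the sketch below. The namespace, label selector, and port name are illustrative assumptions and may differ from the exact manifest added in this PR:

# Sketch of a PodMonitor for Ray pods (Prometheus Operator CRD).
# The namespace, matchLabels, and port name are assumptions for illustration;
# the manifest in this PR may use different values.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: ray-workers-monitor
  namespace: prometheus-system
  labels:
    release: prometheus                   # must match the Prometheus Operator's podMonitorSelector
spec:
  namespaceSelector:
    matchNames:
      - ray-system                        # namespace where the Ray cluster pods run
  selector:
    matchLabels:
      ray.io/cluster: ray-cluster-main    # replace with a label your Ray pods actually carry
  podMetricsEndpoints:
    - port: metrics                       # name of the container port exposing Ray metrics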

Example Prometheus output after applying this manifest:

ray_raylet_mem{..., container="ray-head", ...} | ...
...
ray_raylet_mem{..., container="ray-worker", ..., pod="ray-cluster-main-worker-generic-group-h2nhg",...} | ...
...

@ulfox force-pushed the configs/prometheus-workers-metrics-exporter branch from 57b9126 to afa4849 on August 13, 2022 00:22
@Jeffwan (Collaborator) commented Aug 15, 2022

@ulfox This is great. BTW, how do you use pod-level (Ray worker) metrics in your case? We considered monitoring workers in our downstream but felt there wasn't much value in it. I am trying to learn how you leverage those metrics. /cc @scarlet25151

@ulfox (Contributor, Author) commented Aug 17, 2022

> @ulfox This is great. BTW, how do you use pod-level (Ray worker) metrics in your case? We considered monitoring workers in our downstream but felt there wasn't much value in it. I am trying to learn how you leverage those metrics. /cc @scarlet25151

We currently use the workers' metrics for observability via Grafana panels.

We check:

  • active workers per node group, to detect activity spikes on the Ray cluster
  • scheduling status of Ray workers, for example unschedulable tasks

For example, with the following query:

sum(ray_scheduler_unscheduleable_tasks{ray_io_cluster="$RayCluster"}) by (Reason, pod)

We can detect tasks waiting for resources or plasma memory spikes, and then check whether:

  • it was infrastructure related
  • it was activity related (a spike)
  • the client is running non-optimized code

Some additional examples of worker metrics we observe:

sum(rate(ray_scheduler_failed_worker_startup_total{ray_io_cluster="$RayCluster"}[$__range])) by (Reason, pod)
sum(rate(ray_operation_run_time_ms{ray_io_cluster="$RayCluster"}[$__range])) by (Method, pod)
sum(rate(ray_operation_queue_time_ms{ray_io_cluster="$RayCluster"}[$__range])) by (Method, pod)
sum(rate(ray_grpc_server_req_handling_total{ray_io_cluster="$RayCluster"}[$__range])) by (Method, pod)
ray_object_directory_lookups{ray_io_cluster="$RayCluster"}

Ratio metrics

# Ratio of GRPC new / finished requests
sum(rate(ray_grpc_server_req_new_total{ray_io_cluster="$RayCluster"}[$__range])) by (Method, pod) / sum(rate(ray_grpc_server_req_finished_total{ray_io_cluster="$RayCluster"}[$__range])) by (Method, pod)

# Ratio of directory objects added / removed
ray_object_directory_added_locations{ray_io_cluster="$RayCluster"} / ray_object_directory_removed_locations{ray_io_cluster="$RayCluster"}

# Workers memory util
(1 - (ray_node_mem_available{container="ray-worker", ray_io_cluster="$RayCluster"} / ray_node_mem_total{container="ray-worker", ray_io_cluster="$RayCluster"})) * 100

Availability metrics

# [99.9] Percentile of Workers register latency (For our cluster, this is within the 10 s bucket)
100 * (sum(rate(ray_worker_register_time_ms_bucket{le="10000.0", ray_io_cluster="$RayCluster"}[$__range])) by (pod) / sum(rate(ray_worker_register_time_ms_count{ray_io_cluster="$RayCluster"}[$__range])) by (pod))

# [99.9] Percentile of Process startup latency (For our cluster, this is within the 100ms bucket)
100 * (sum(rate(ray_process_startup_time_ms_bucket{le="100.0", ray_io_cluster="$RayCluster"}[$__range])) by (pod) / sum(rate(ray_process_startup_time_ms_count{ray_io_cluster="$RayCluster"}[$__range])) by (pod))

* Also update and rename serviceMonitor example
@ulfox force-pushed the configs/prometheus-workers-metrics-exporter branch from afa4849 to 12df3f7 on August 18, 2022 00:28
@ulfox requested a review from Jeffwan on August 18, 2022 00:30
@Jeffwan (Collaborator) commented Aug 18, 2022

@ulfox This is awesome guidance! We export the control-plane Grafana dashboard here: https://github.com/ray-project/kuberay/tree/master/config/grafana. If the one for workers can be open sourced on your side, I think people would love it.

@Jeffwan merged commit dc5c2cd into ray-project:master on Aug 18, 2022
@ulfox (Contributor, Author) commented Aug 18, 2022

@Jeffwan I will provide a Grafana panel for workers as well!

lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
* config/prometheus: add metrics exporter for workers

* Also update and rename serviceMonitor example

* config/prometheus/rules: add custom rules example

* update: docs/guidance/observability