
config/prometheus: add metrics exporter for workers #469

Merged

Conversation

@ulfox (Contributor) commented Aug 13, 2022

Why are these changes needed?

Sample configuration for exporting metrics from Ray cluster workers. It works with autoscaling: it covers newly created workers and drops destroyed worker pods as well.

The PodMonitor CRD works in a similar way to ServiceMonitor, but instead of targeting Services it targets Pods.
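
For reference, a minimal PodMonitor along these lines looks roughly like the sketch below. The namespace, label selector, and port name are illustrative assumptions and may differ from the exact manifest added in this PR:

# Sketch of a PodMonitor for Ray pods (Prometheus Operator CRD).
# The namespace, matchLabels, and port name are assumptions for illustration;
# the manifest in this PR may use different values.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: ray-workers-monitor
  namespace: prometheus-system
  labels:
    release: prometheus                   # must match the Prometheus Operator's podMonitorSelector
spec:
  namespaceSelector:
    matchNames:
      - ray-system                        # namespace where the Ray cluster pods run
  selector:
    matchLabels:
      ray.io/cluster: ray-cluster-main    # replace with a label your Ray pods actually carry
  podMetricsEndpoints:
    - port: metrics                       # name of the container port exposing Ray metrics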

Example Prometheus output after applying this manifest:

ray_raylet_mem{..., container="ray-head", ...} | ...
...
ray_raylet_mem{..., container="ray-worker", ..., pod="ray-cluster-main-worker-generic-group-h2nhg",...} | ...
...

@ulfox force-pushed the configs/prometheus-workers-metrics-exporter branch from 57b9126 to afa4849 on August 13, 2022 00:22
@Jeffwan (Collaborator) commented Aug 15, 2022

@ulfox This is great. BTW, how do you use pod-level (Ray worker) metrics in your case? We considered monitoring workers in our downstream but felt there wasn't much value in it. I am trying to learn how you leverage those metrics. /cc @scarlet25151

@ulfox (Contributor, Author) commented Aug 17, 2022

> @ulfox This is great. BTW, how do you use pod-level (Ray worker) metrics in your case? We considered monitoring workers in our downstream but felt there wasn't much value in it. I am trying to learn how you leverage those metrics. /cc @scarlet25151

We currently use the workers' metrics for observability via Grafana panels.

We check:

  • active workers per node group, to detect activity spikes on the Ray cluster
  • scheduling status of Ray workers, for example unschedulable tasks

For example, with the following query:

sum(ray_scheduler_unscheduleable_tasks{ray_io_cluster="$RayCluster"}) by (Reason, pod)

We can detect tasks waiting for resources or plasma memory spikes, and then check whether:

  • it was infrastructure related
  • it was activity related (a spike)
  • the client is running non-optimized code

Some additional examples of worker metrics we observe:

sum(rate(ray_scheduler_failed_worker_startup_total{ray_io_cluster="$RayCluster"}[$__range])) by (Reason, pod)
sum(rate(ray_operation_run_time_ms{ray_io_cluster="$RayCluster"}[$__range])) by (Method, pod)
sum(rate(ray_operation_queue_time_ms{ray_io_cluster="$RayCluster"}[$__range])) by (Method, pod)
sum(rate(ray_grpc_server_req_handling_total{ray_io_cluster="$RayCluster"}[$__range])) by (Method, pod)
ray_object_directory_lookups{ray_io_cluster="$RayCluster"}

Ratio metrics

# Ratio of GRPC new / finished requests
sum(rate(ray_grpc_server_req_new_total{ray_io_cluster="$RayCluster"}[$__range])) by (Method, pod) / sum(rate(ray_grpc_server_req_finished_total{ray_io_cluster="$RayCluster"}[$__range])) by (Method, pod)

# Ratio of directory objects added / removed
ray_object_directory_added_locations{ray_io_cluster="$RayCluster"} / ray_object_directory_removed_locations{ray_io_cluster="$RayCluster"}

# Workers memory util
(1 - (ray_node_mem_available{container="ray-worker", ray_io_cluster="$RayCluster"} / ray_node_mem_total{container="ray-worker", ray_io_cluster="$RayCluster"})) * 100

Availability metrics

# [99.9] Percentile of Workers register latency (For our cluster, this is within the 10 s bucket)
100 * (sum(rate(ray_worker_register_time_ms_bucket{le="10000.0", ray_io_cluster="$RayCluster"}[$__range])) by (pod) / sum(rate(ray_worker_register_time_ms_count{ray_io_cluster="$RayCluster"}[$__range])) by (pod))

# [99.9] Percentile of Process startup latency (For our cluster, this is within the 100ms bucket)
100 * (sum(rate(ray_process_startup_time_ms_bucket{le="100.0", ray_io_cluster="$RayCluster"}[$__range])) by (pod) / sum(rate(ray_process_startup_time_ms_count{ray_io_cluster="$RayCluster"}[$__range])) by (pod))

* Also update and rename serviceMonitor example
@ulfox force-pushed the configs/prometheus-workers-metrics-exporter branch from afa4849 to 12df3f7 on August 18, 2022 00:28
@ulfox requested a review from Jeffwan on August 18, 2022 00:30
@Jeffwan (Collaborator) commented Aug 18, 2022

@ulfox This is awesome guidance! We export the control-plane Grafana dashboard here: https://github.com/ray-project/kuberay/tree/master/config/grafana. If the one for workers can be open sourced on your side, I think people would love it.

@Jeffwan merged commit dc5c2cd into ray-project:master on Aug 18, 2022
@ulfox (Contributor, Author) commented Aug 18, 2022

@Jeffwan I will provide a Grafana panel for workers as well!

lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
* config/prometheus: add metrics exporter for workers

* Also update and rename serviceMonitor example

* config/prometheus/rules: add custom rules example

* update: docs/guidance/observability