Profiling with KubeRay

py-spy is a sampling profiler for Python programs. It lets you visualize what your Python program is spending time on without restarting the program or modifying the code in any way.

This document describes how to configure a RayCluster YAML file to enable py-spy, and how to view the Stack Trace and CPU Flame Graph in the Ray dashboard.

Theory

py-spy requires the SYS_PTRACE capability to read process memory, but Kubernetes does not grant this capability to containers by default. To enable profiling, add the following to each container under template.spec.containers for both the head and the workers.

securityContext:
  capabilities:
    add:
    - SYS_PTRACE

Notes:

  • Adding SYS_PTRACE is forbidden under baseline and restricted Pod Security Standards. See Pod Security Standards for more details.
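As a rough illustration of where the snippet above lands, the following sketch patches a RayCluster manifest held as a plain Python dictionary. It assumes the manifest uses the `spec.headGroupSpec` and `spec.workerGroupSpecs` fields for the head and worker Pod templates. The function name `add_sys_ptrace` and the tiny example manifest are hypothetical and are not part of KubeRay.

```python
import copy


def add_sys_ptrace(ray_cluster: dict) -> dict:
    """Return a copy of a RayCluster manifest with SYS_PTRACE added to every container."""
    patched = copy.deepcopy(ray_cluster)
    spec = patched.get("spec", {})
    # The head group and each worker group carry their own Pod template.
    groups = [spec.get("headGroupSpec", {})] + list(spec.get("workerGroupSpecs", []))
    for group in groups:
        containers = group.get("template", {}).get("spec", {}).get("containers", [])
        for container in containers:
            caps = container.setdefault("securityContext", {}).setdefault("capabilities", {})
            add_list = caps.setdefault("add", [])
            if "SYS_PTRACE" not in add_list:
                add_list.append("SYS_PTRACE")
    return patched


# Hypothetical minimal manifest, for illustration only.
example = {
    "spec": {
        "headGroupSpec": {"template": {"spec": {"containers": [{"name": "ray-head"}]}}},
        "workerGroupSpecs": [
            {"template": {"spec": {"containers": [{"name": "ray-worker"}]}}}
        ],
    }
}
patched = add_sys_ptrace(example)
```

The helper copies the input rather than mutating it, and it is idempotent, so running it on an already patched manifest changes nothing.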

Steps to deploy and test the RayCluster with SYS_PTRACE capability

  1. Create a KinD cluster:

    kind create cluster

  2. Install the KubeRay operator:

    Follow the steps in Installation Guide.

  3. Create a RayCluster with SYS_PTRACE capability:

    # Path: kuberay/ray-operator/config/samples
    kubectl apply -f ray-cluster.profiling.yaml

  4. Forward the dashboard port:

    kubectl port-forward --address 0.0.0.0 svc/raycluster-profiling-head-svc 8265:8265

  5. Run a sample job within the head Pod:

    # Log in to the head Pod
    kubectl exec -it ${YOUR_HEAD_POD} -- bash
    
    # (Head Pod) Run a sample job in the Pod
    # `long_running_task` contains a `while True` loop, so the task keeps running indefinitely.
    # This gives you ample time to view the Stack Trace and CPU Flame Graph in the Ray dashboard.
    python3 samples/long_running_task.py

    Notes:

    • If you run your own examples and see the error Failed to write flamegraph: I/O error: No stack counts found when viewing the CPU Flame Graph, the process is probably idle, for example because it is blocked in a sleep call. py-spy filters out idle stack traces, so no samples remain to draw. Refer to this issue for more information.

  6. Profile using the Ray dashboard:

    Open http://localhost:8265 (the port forwarded in step 4) and view the Stack Trace and CPU Flame Graph for the running task.

  7. Clean up the RayCluster:

    kubectl delete -f ray-cluster.profiling.yaml
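The idle-process note in step 5 can be illustrated with a minimal sketch that contrasts CPU-bound work with sleeping. This is not the content of samples/long_running_task.py. The names `busy_work` and `idle_wait` are hypothetical, and the loop is bounded by a deadline so the example terminates, whereas the sample job loops indefinitely. In a Ray job, a function like `busy_work` would typically be decorated with `@ray.remote`.

```python
import time


def busy_work(duration_s: float) -> int:
    """Spin on arithmetic for roughly duration_s seconds and return the iteration count."""
    deadline = time.monotonic() + duration_s
    iterations = 0
    total = 0
    while time.monotonic() < deadline:
        # CPU-bound work: the process is on-CPU, so a sampling profiler sees active stacks.
        total += iterations * iterations
        iterations += 1
    return iterations


def idle_wait(duration_s: float) -> None:
    """Block in sleep; the process is idle, so samples may be filtered out as idle."""
    time.sleep(duration_s)


if __name__ == "__main__":
    print("iterations:", busy_work(1.0))
    idle_wait(1.0)
```

A task shaped like `busy_work` should produce stack counts for the CPU Flame Graph, while one that spends its time in `idle_wait` is the kind of workload that can trigger the "No stack counts found" error.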