[Telemetry] Inject env identifying KubeRay. #562
Merged
DmitriGekhtman merged 1 commit into ray-project:master from DmitriGekhtman:dmitri/telemetry-env on Sep 14, 2022
Conversation
Signed-off-by: Dmitri Gekhtman <[email protected]>
jjyao approved these changes on Sep 14, 2022
I'll add an e2e test for this in the Ray CI.
DmitriGekhtman added a commit to ray-project/ray that referenced this pull request on Sep 14, 2022
Right now, Ray telemetry indicates that the majority of Ray's CPU hour usage comes from Ray running within a Kubernetes cluster. However, we have no data on what method is used to deploy Ray on Kubernetes. This PR enables Ray telemetry to distinguish between three methods of deploying Ray on Kubernetes:

* KubeRay >= 0.4.0
* Legacy Ray Operator with Ray >= 2.1.0
* All other methods

The strategy is to have the operators inject an env variable into the Ray container's environment; the variable identifies the deployment method. This PR also modifies the legacy Ray operator to inject the relevant env variable. A follow-up KubeRay PR changes the KubeRay operator to do the same thing: ray-project/kuberay#562

Signed-off-by: Dmitri Gekhtman <[email protected]>
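To make the three-way distinction concrete, here is a minimal Go sketch of what the consuming side of this scheme could look like: classifying the deployment method from the injected environment variables. The variable names and the function below are illustrative assumptions, not the names used by Ray, whose actual telemetry logic lives in its Python usage-stats code.

```go
package main

import (
	"fmt"
	"os"
)

// Illustrative variable names; the real names are defined in the Ray and
// KubeRay sources referenced by this PR.
const (
	kubeRayEnv        = "RAY_USAGE_STATS_KUBERAY_IN_USE"
	legacyOperatorEnv = "RAY_USAGE_STATS_LEGACY_OPERATOR_IN_USE"
)

// deploymentMethod mirrors the three-way distinction described above:
// KubeRay, the legacy Ray operator, or any other deployment method.
func deploymentMethod() string {
	switch {
	case os.Getenv(kubeRayEnv) != "":
		return "kuberay"
	case os.Getenv(legacyOperatorEnv) != "":
		return "legacy-ray-operator"
	default:
		return "other"
	}
}

func main() {
	fmt.Println("detected deployment method:", deploymentMethod())
}
```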
ArturNiederfahrenhorst added a commit to ray-project/ray that referenced this pull request on Sep 15, 2022
…tests (#28535)

* [core/ci] Disallow protobuf 3.19.5 (#28504)
  This leads to hangs in Ray client (e.g. test_dataclient_disconnect).
  Signed-off-by: Kai Fricke <[email protected]>
* [tune] Fix trial checkpoint syncing after recovery from other node (#28470)
  On restore from a different IP, the SyncerCallback currently still tries to sync from a stale node IP, because `trial.last_result` has not been updated yet. Instead, the syncer callback should keep its own map of trials to IPs, and only act on this.
  Signed-off-by: Kai Fricke <[email protected]>
* [air] minor example fix. (#28379)
  Signed-off-by: xwjiang2010 <[email protected]>
* [cleanup] Remove memory unit conversion (#28396)
  The internal memory unit was switched back to bytes years ago; there's no point in keeping confusing conversion code around anymore. Recommendation: review #28394 first, since this is stacked on top of it.
  Co-authored-by: Alex <[email protected]>
* [RLlib] Sync policy specs from local_worker_for_synching while recovering rollout/eval workers. (#28422)
* Cast rewards as tf.float32 to fix error in DQN in tf2 (#28384)
  Also adds a test case for DQN with integer rewards.
  Signed-off-by: mgerstgrasser <[email protected]>
* [doc] [Datasets] Improve docstring and doctest for read_parquet (#28488)
  This addresses some of the issues brought up in #28484.
* [ci] Increase timeout on test_metrics (#28508)
  10 milliseconds is ambitious for the CI to do anything.
  Co-authored-by: Alex <[email protected]>
* [air/tune] Catch empty hyperopt search space, raise better Tuner error message (#28503)
* Add imports to object-spilling.rst Python code (#28507)
  Also adjust a couple of descriptions, retaining the same general information; fix doc build / keep note formatting; another tiny fix.
  Signed-off-by: Jake <[email protected]>
  Signed-off-by: Philipp Moritz <[email protected]>
  Co-authored-by: Philipp Moritz <[email protected]>
* [AIR] Make PathPartitionScheme a dataclass (#28390)
  Signed-off-by: Balaji Veeramani <[email protected]>
* [Telemetry][Kubernetes] Distinguish Kubernetes deployment stacks (#28490)
  Right now, Ray telemetry indicates the majority of Ray's CPU hour usage comes from Ray running within a Kubernetes cluster. However, we have no data on what method is used to deploy Ray on Kubernetes. This PR enables Ray telemetry to distinguish between three methods of deploying Ray on Kubernetes: KubeRay >= 0.4.0; Legacy Ray Operator with Ray >= 2.1.0; all other methods. The strategy is to have the operators inject an env variable into the Ray container's environment; the variable identifies the deployment method. This PR also modifies the legacy Ray operator to inject the relevant env variable. A follow-up KubeRay PR changes the KubeRay operator to do the same thing: ray-project/kuberay#562
  Signed-off-by: Dmitri Gekhtman <[email protected]>
* [autoscaler][observability] Experimental verbose mode (#28392)
  This PR introduces a super secret hidden verbose mode for ray status, which we can keep hidden while collecting feedback before going through the process of officially declaring it part of the public API.
  Example output:

      ======== Autoscaler status: 2020-12-28 01:02:03 ========
      GCS request time: 3.141500s
      Node Provider non_terminated_nodes time: 1.618000s

      Node status
      --------------------------------------------------------
      Healthy:
       2 p3.2xlarge
       20 m4.4xlarge
      Pending:
       m4.4xlarge, 2 launching
       1.2.3.4: m4.4xlarge, waiting-for-ssh
       1.2.3.5: m4.4xlarge, waiting-for-ssh
      Recent failures:
       p3.2xlarge: RayletUnexpectedlyDied (ip: 1.2.3.6)

      Resources
      --------------------------------------------------------
      Total Usage:
       1/2 AcceleratorType:V100
       530.0/544.0 CPU
       2/2 GPU
       2.00/8.000 GiB memory
       3.14/16.000 GiB object_store_memory

      Total Demands:
       {'CPU': 1}: 150+ pending tasks/actors
       {'CPU': 4} * 5 (PACK): 420+ pending placement groups
       {'CPU': 16}: 100+ from request_resources()

      Node: 192.168.1.1
       Usage:
        0.1/1 AcceleratorType:V100
        5.0/20.0 CPU
        0.7/1 GPU
        1.00/4.000 GiB memory
        3.14/4.000 GiB object_store_memory

      Node: 192.168.1.2
       Usage:
        0.9/1 AcceleratorType:V100
        15.0/20.0 CPU
        0.3/1 GPU
        1.00/12.000 GiB memory
        0.00/4.000 GiB object_store_memory

  Co-authored-by: Alex <[email protected]>
* [doc/tune] fix tune stopper attribute name (#28517)
* [doc] Fix tune stopper doctests (#28531)
* [air] Use self-hosted mirror for CIFAR10 dataset (#28480)
  The CIFAR10 website host has been unreliable in the past. This PR injects our own mirror into our CI packages for testing.
  Signed-off-by: Kai Fricke <[email protected]>
* draft
  Signed-off-by: Artur Niederfahrenhorst <[email protected]>

Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: xwjiang2010 <[email protected]>
Signed-off-by: mgerstgrasser <[email protected]>
Signed-off-by: Jake <[email protected]>
Signed-off-by: Philipp Moritz <[email protected]>
Signed-off-by: Balaji Veeramani <[email protected]>
Signed-off-by: Dmitri Gekhtman <[email protected]>
Signed-off-by: Artur Niederfahrenhorst <[email protected]>
Co-authored-by: Kai Fricke <[email protected]>
Co-authored-by: xwjiang2010 <[email protected]>
Co-authored-by: Alex Wu <[email protected]>
Co-authored-by: Alex <[email protected]>
Co-authored-by: Jun Gong <[email protected]>
Co-authored-by: mgerstgrasser <[email protected]>
Co-authored-by: Philipp Moritz <[email protected]>
Co-authored-by: Jake <[email protected]>
Co-authored-by: Balaji Veeramani <[email protected]>
Co-authored-by: Dmitri Gekhtman <[email protected]>
Co-authored-by: Árpád Rózsás <[email protected]>
PaulFenton pushed a commit to PaulFenton/ray that referenced this pull request on Sep 19, 2022
…-project#28490)

Right now, Ray telemetry indicates the majority of Ray's CPU hour usage comes from Ray running within a Kubernetes cluster. However, we have no data on what method is used to deploy Ray on Kubernetes. This PR enables Ray telemetry to distinguish between three methods of deploying Ray on Kubernetes:

* KubeRay >= 0.4.0
* Legacy Ray Operator with Ray >= 2.1.0
* All other methods

The strategy is to have the operators inject an env variable into the Ray container's environment; the variable identifies the deployment method. This PR also modifies the legacy Ray operator to inject the relevant env variable. A follow-up KubeRay PR changes the KubeRay operator to do the same thing: ray-project/kuberay#562

Signed-off-by: Dmitri Gekhtman <[email protected]>
Signed-off-by: PaulFenton <[email protected]>
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request on Sep 24, 2023
Signed-off-by: Dmitri Gekhtman <[email protected]>

This change has the KubeRay operator inject an env var identifying a Ray container as having been deployed using KubeRay. The purpose is to refine Ray usage stats collected by Ray's telemetry service (https://docs.ray.io/en/latest/cluster/usage-stats.html). See ray-project/ray#28490
Signed-off-by: Dmitri Gekhtman <[email protected]>
Why are these changes needed?
This change has the KubeRay operator inject an env var identifying a Ray container as having been deployed using KubeRay.
The purpose is to refine Ray usage stats collected by Ray's telemetry service (https://docs.ray.io/en/latest/cluster/usage-stats.html).
See ray-project/ray#28490
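As a rough sketch of the operator-side change described above (not the operator's actual code), the following Go snippet shows how an identifying env var could be appended to a Ray container spec, skipping injection when the user has already set the variable. The constant and function names here are assumptions for illustration; the real names are defined in the KubeRay operator and Ray sources.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// Illustrative name for the marker variable; the real name is defined in the
// KubeRay operator and Ray sources.
const kubeRayTelemetryEnv = "RAY_USAGE_STATS_KUBERAY_IN_USE"

// injectTelemetryEnv adds the identifying env var to a Ray container unless
// the user has already set it explicitly in the pod template.
func injectTelemetryEnv(container *corev1.Container) {
	for _, envVar := range container.Env {
		if envVar.Name == kubeRayTelemetryEnv {
			return // respect a user-provided value
		}
	}
	container.Env = append(container.Env, corev1.EnvVar{
		Name:  kubeRayTelemetryEnv,
		Value: "1",
	})
}

func main() {
	rayContainer := corev1.Container{Name: "ray-head", Image: "rayproject/ray:2.0.0"}
	injectTelemetryEnv(&rayContainer)
	fmt.Printf("ray-head env after injection: %+v\n", rayContainer.Env)
}
```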
Related issue number
Checks