
[Kuberay] Ray Autoscaler integration with Kuberay (MVP) #21086

Merged: 60 commits into ray-project:master on Jan 20, 2022

Conversation

@pcmoritz (Contributor) commented on Dec 14, 2021

Why are these changes needed?

This is a minimum viable product for Ray Autoscaler integration with Kuberay. It is not ready for prime time/general use, but should be enough for interested parties to get started (see the documentation in kuberay.md).
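
For readers skimming the PR, here is a rough sketch of the central idea; the API group, resource path, and function below are illustrative assumptions, not the actual contents of `node_provider.py`. The autoscaler talks to the Kubernetes API server directly and scales the cluster by patching the RayCluster custom resource, which the KubeRay operator then reconciles into pods.

```python
# Hypothetical sketch only -- not the code added in this PR.
import requests

# In-cluster service account credentials (standard Kubernetes paths).
TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"
CA_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
# The group/version, namespace, and cluster name below are assumptions.
RAYCLUSTER_URL = (
    "https://kubernetes.default:443/apis/ray.io/v1alpha1/"
    "namespaces/default/rayclusters/example-cluster"
)


def scale_worker_group(group_index: int, replicas: int) -> None:
    """Request `replicas` workers by JSON-patching the RayCluster CR;
    the KubeRay operator then creates or deletes the worker pods."""
    with open(TOKEN_PATH) as f:
        token = f.read()
    patch = [{
        "op": "replace",
        "path": f"/spec/workerGroupSpecs/{group_index}/replicas",
        "value": replicas,
    }]
    resp = requests.patch(
        RAYCLUSTER_URL,
        json=patch,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json-patch+json",
        },
        verify=CA_PATH,
    )
    resp.raise_for_status()
```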

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@wuisawesome self-assigned this on Jan 4, 2022
Resolved review threads (outdated): docker/autoscaler/Dockerfile, docker/autoscaler/run_autoscaler.py
@@ -0,0 +1,26 @@
import os
Contributor:
I think this would be helpful when a user runs this in a non-container environment? If the user runs this in Kubernetes, I feel we can leverage pod restarts to achieve the same goal.

Contributor Author:
Yes, I will try that before merging the PR :)

Contributor:
Repeating the sentiment that we should avoid crash-looping when possible.

Contributor Author:
Btw, Dmitri, do you remember where the crash-looping was coming from? We might be able to avoid it by doing retries at a lower level (e.g. when connecting to Redis), which might lead to a more robust solution overall.

Contributor:
Right, the retries are for connecting to the head node. Waiting for the Ray cluster to be ready in a more targeted / principled way sounds better.
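
For context on the alternative being discussed, here is a minimal sketch of what a lower-level retry might look like; the function name and the plain `ray.init` probe are assumptions for illustration, not what run_autoscaler.py ended up doing. The idea is to block until the head node is reachable before starting the monitor, rather than crashing and relying on pod restarts.

```python
# Hypothetical sketch only -- not the retry logic ultimately used in this PR.
import time

import ray


def wait_for_head_node(address: str, timeout_s: float = 300.0) -> None:
    """Poll until the Ray head node accepts connections, so the autoscaler
    container does not crash-loop while the head pod is still starting."""
    deadline = time.time() + timeout_s
    while True:
        try:
            ray.init(address=address)  # fails while the head is not reachable yet
            ray.shutdown()
            return
        except Exception:
            if time.time() >= deadline:
                raise
            time.sleep(5)  # back off before trying again
```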

Resolved review threads (outdated): python/ray/autoscaler/_private/kuberay/node_provider.py, .bazelrc
@pcmoritz changed the title from "[WIP] Autoscaler integration with Kuberay talking to k8s API server directly" to "[Kuberay] Ray Autoscaler integration with Kuberay" on Jan 18, 2022
@pcmoritz changed the title from "[Kuberay] Ray Autoscaler integration with Kuberay" to "[Kuberay] Ray Autoscaler integration with Kuberay (MVP)" on Jan 18, 2022
@DmitriGekhtman (Contributor) left a comment:

Looks super, assorted minor comments.

Resolved review threads: docker/kuberay-autoscaler/run_autoscaler.py, doc/source/cluster/kuberay.md, python/ray/autoscaler/_private/kuberay/node_provider.py, python/ray/autoscaler/kuberay/init-config.sh
@DmitriGekhtman (Contributor) left a comment:

lgtm!!

@DmitriGekhtman (Contributor):

test_autoscaler_yaml probably needs to be modified to exclude some files added in this PR.
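
A hypothetical sketch of the kind of exclusion that might be needed; the variable names and path layout are assumptions, not the actual test code. The new `kuberay` directory contains Kubernetes manifests (operator config, RayCluster CRs) rather than autoscaler cluster configs, so the YAML validation test would skip it when collecting examples.

```python
# Hypothetical sketch only -- the actual test_autoscaler_yaml change may differ.
import glob
import os

# Assumed repository layout relative to the test file.
RAY_PATH = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

CONFIG_PATHS = [
    path
    for path in glob.glob(
        os.path.join(RAY_PATH, "autoscaler", "**", "*.yaml"), recursive=True
    )
    # Skip the KubeRay manifests added in this PR; they are Kubernetes
    # resources, not autoscaler cluster configs.
    if os.sep + "kuberay" + os.sep not in path
]
```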

- newName: kuberay/operator
- newTag: nightly
+ newName: rayproject/kuberay-operator
+ newTag: latest
Contributor:
How do we want to maintain the image in the future?

Contributor Author:
Ideally, we have versioned releases of the image in kuberay/operator, either with proper version numbers or, if we want to keep the overhead low, just by git commit hash. Then we can pin them here and update them whenever there is a new release :)

@Jeffwan (Contributor) left a comment:

Thanks @pcmoritz for the great work! The revised version looks good to me now!

@pcmoritz merged commit fbc51d6 into ray-project:master on Jan 20, 2022
@pcmoritz deleted the kuberay-autoscaler-2 branch on January 20, 2022 03:42
@zhe-thoughts (Collaborator) left a comment:

Actually, let me submit a follow-up PR for the minor comments.

@@ -13,6 +13,7 @@ Ray with Cluster Managers
:maxdepth: 2

kubernetes.rst
kuberay.md
Collaborator:

@pcmoritz Actually any reason we are using .md for this one vs. .rst?

Contributor:

maybe because .rst is awful :)

For consistency, we should probably switch back to .rst as we solidify the integration and its documentation.

[Kuberay](https://github.com/ray-project/kuberay) is a set of tools for running Ray on Kubernetes.
It has been used by some larger corporations to deploy Ray on their infrastructure.
Going forward, we would like to make this way of deployment accessible and seamless for
all Ray users and standardize Ray deployment on Kubernetes around Kuberay's operator.
Collaborator:

"Kuberay" -> "KubeRay"

It has been used by some larger corporations to deploy Ray on their infrastructure.
Going forward, we would like to make this way of deployment accessible and seamless for
all Ray users and standardize Ray deployment on Kubernetes around Kuberay's operator.
Presently you should consider this integration a minimal viable product that is not polished
Collaborator:

I think we should label this integration as "Experimental" or "Alpha", to be consistent with how we set expectations for other Ray features.

Contributor:

The header does say "Experimental." Wouldn't hurt to repeat that word in this sentence though.

all Ray users and standardize Ray deployment on Kubernetes around Kuberay's operator.
Presently you should consider this integration a minimal viable product that is not polished
enough for general use and prefer the [Kubernetes integration](kubernetes.rst) for running
Ray on Kubernetes. If you are brave enough to try the Kuberay integration out, this documentation
Collaborator:

"Kuberay" -> "KubeRay"

enough for general use and prefer the [Kubernetes integration](kubernetes.rst) for running
Ray on Kubernetes. If you are brave enough to try the Kuberay integration out, this documentation
is for you! We would love your feedback as a [Github issue](https://github.com/ray-project/ray/issues)
including `[Kuberay]` in the title.
Collaborator:

"Kuberay" -> "KubeRay"

including `[Kuberay]` in the title.

Here we describe how you can deploy a Ray cluster on Kuberay. The following instructions are for
Collaborator:

"Kuberay" -> "KubeRay"
