[k8s] Kubernetes support #2096
# Conflicts: # sky/backends/backend_utils.py # sky/backends/cloud_vm_ray_backend.py # sky/registry.py # sky/utils/ux_utils.py
# Conflicts: # sky/__init__.py # sky/authentication.py # sky/backends/backend_utils.py # sky/backends/cloud_vm_ray_backend.py # sky/clouds/__init__.py # sky/clouds/service_catalog/__init__.py # sky/setup_files/MANIFEST.in # sky/utils/ux_utils.py
# Conflicts: # sky/backends/cloud_vm_ray_backend.py
'namespace' exists under 'context' key.
Just tried this on GKE clusters with and without autoscaling. It works smoothly, except that on the cluster with autoscaling, the first `sky launch` that triggers autoscaling times out. Is that because GKE takes longer to bring up the new nodes?
tests/kubernetes/README.md
## Running a GKE cluster
1. Make sure ports 30000-32767 are open in your node pool VPC's firewall.
2. Create a GKE cluster with at least 1 node.
It seems that we cannot connect to a GKE cluster created in Autopilot mode. Should we note that down?
Error:
kubernetes.client.exceptions.ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Audit-Id': 'f5bd48cc-e0dd-4a9a-82a0-70b432915ed6', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Warning': '299 - "autopilot-default-resources-mutator:Autopilot updated Pod default/sky-b72f-zhwu-ray-head: adjusted resources to meet requirements for containers [ray-node] (see http://g.co/gke/autopilot-resources)"', 'X-Kubernetes-Pf-Flowschema-Uid': '79ebecb8-a530-44e8-8ae7-df0fa0c2c903', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'a21e38a1-7fd8-445a-924f-2d47a4e26c50', 'Date': 'Tue, 25 Jul 2023 04:37:22 GMT', 'Content-Length': '664'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"admission webhook \"warden-validating.common-webhooks.networking.gke.io\" denied the request: GKE Warden rejected the request because it violates one or more constraints.\nViolations details: {\"[denied by autogke-disallow-privilege]\":[\"container ray-node is privileged; not allowed in Autopilot\"],\"[denied by autogke-no-write-mode-hostpath]\":[\"hostPath volume dev-fuse in container ray-node is accessed in write mode; disallowed in Autopilot.\"]}\nRequested by user: '[email protected]', groups: 'system:authenticated'.","reason":"GKE Warden constraints violations","code":400}
That's right, Autopilot clusters are not currently supported (I haven't tested them yet, and from the logs you posted it looks like FUSE mounting is not supported). I've added it as a note here.
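For anyone who wants to catch this before submitting to an Autopilot cluster, here is a rough sketch of a local pre-check for the two violations named in the Warden response above (a privileged container and a writable hostPath mount). It works on a pod manifest loaded as a plain dict. The function name `autopilot_violations` is hypothetical and not part of SkyPilot, and the check is only a heuristic based on this one log, not a reimplementation of GKE Warden's rules.

```python
from typing import Any, Dict, List


def autopilot_violations(pod_manifest: Dict[str, Any]) -> List[str]:
    """Hypothetical helper: flag the two patterns rejected in the log above.

    Only looks at `securityContext.privileged` and writable mounts of
    hostPath volumes; Autopilot enforces other constraints not covered here.
    """
    violations: List[str] = []
    spec = pod_manifest.get("spec", {})

    # Names of volumes that are backed by a hostPath on the node.
    host_path_volumes = {
        volume["name"]
        for volume in spec.get("volumes", [])
        if "hostPath" in volume
    }

    for container in spec.get("containers", []):
        name = container.get("name", "<unnamed>")

        if container.get("securityContext", {}).get("privileged"):
            violations.append(f"container {name} is privileged")

        for mount in container.get("volumeMounts", []):
            # volumeMounts are writable unless readOnly is set to true.
            writable = not mount.get("readOnly", False)
            if mount.get("name") in host_path_volumes and writable:
                violations.append(
                    f"hostPath volume {mount['name']} in container {name} "
                    "is mounted writable")

    return violations
```

An empty list only means these two patterns were not found; the cluster may still reject the pod for other reasons.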
sky/authentication.py
logger.warning(
    f'Key {key_label} already exists in the cluster, using it...')
Do we need to print this out? It seems a bit confusing
Ahh good point. I used it for debugging. Changed the warning to debug.
* Set build_image.sh to be executable.
* Use TAG to easily switch between registries.
I've removed ingress creation since it was no longer required, and confirmed it now works with older client library ( Tested:
That's right, autoscaling takes time, and that's why we want to make the provisioning timeout configurable in the near future so users can set it according to their cluster. In the medium to long term, we should look into interfacing with the underlying autoscaler(s) to know when the cluster is scaling up, and automatically adjust the timeout as needed.
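To make the idea of a configurable provisioning timeout concrete, here is a minimal sketch of a polling loop whose deadline is supplied by the caller. Everything in it is an assumption for illustration: `wait_until_ready`, `is_ready`, and the parameter names are hypothetical and do not reflect how SkyPilot's provisioner is actually written.

```python
import time
from typing import Callable


def wait_until_ready(
    is_ready: Callable[[], bool],
    timeout_s: float,
    poll_interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `is_ready` until it returns True or `timeout_s` elapses.

    Hypothetical helper: `is_ready` stands in for whatever check reports
    that the pods have been scheduled and are running.
    """
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            # Give up; the caller decides whether to fail over or retry.
            return False
        sleep(poll_interval_s)
```

A user on an autoscaling cluster could then pass a larger `timeout_s` than a user on a fixed-size cluster, without any change to the polling logic.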
Thanks for the great effort @romilbhardwaj! LGTM. Very excited to see the k8s support merged.
Thanks for the reviews @Michaelvll! All smoke tests pass, merging now.
Adds Kubernetes support to SkyPilot. Still largely a WIP, expect rough edges and please provide feedback :)
To try it out, create a Kubernetes cluster (either locally with Kind or on a hosted service like GKE, instructions here), make sure ports 30000-32767 are accessible, place the kubeconfig in `~/.kube/config`, and run `sky check`.
Current status, TODOs, and roadmap can be found in this doc.
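If it is unclear whether the 30000-32767 port range is actually reachable from your machine, a quick TCP probe can help. The sketch below uses only the Python standard library; the helper names `is_port_open` and `in_nodeport_range` are made up for this example, and the node address you pass in depends on your own cluster.

```python
import socket

# Port range quoted in this PR description.
NODEPORT_RANGE = (30000, 32767)


def in_nodeport_range(port: int) -> bool:
    """Return True if `port` falls inside the range quoted above."""
    low, high = NODEPORT_RANGE
    return low <= port <= high


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds in time."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0
```

Note that a port in the range only answers if some service is listening on it, so a failed probe can mean either a firewall block or simply that nothing is exposed there yet.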
Also adds two new CLI commands:
* `sky local up` creates a local Kubernetes cluster to run tasks locally. Requires docker and kind installed.
* `sky local down` tears down the local cluster.
What's not supported in this PR:
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py --kubernetes