[k8s] Kubernetes support #2096

Merged
merged 114 commits into from
Aug 2, 2023

@romilbhardwaj (Collaborator) commented Jun 16, 2023

Adds Kubernetes support to SkyPilot. Still largely a WIP, expect rough edges and please provide feedback :)

To try it out, create a Kubernetes cluster (either locally with Kind or on a hosted service such as GKE; instructions here), make sure ports 30000-32767 are accessible, place the kubeconfig at ~/.kube/config, and run sky check.

Current status, TODOs, and roadmap can be found in this doc.

Also adds two new CLI commands:

  • sky local up creates a local Kubernetes cluster to run tasks locally. Requires docker and kind to be installed.
  • sky local down tears down the local cluster.
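
Since sky local up depends on docker and kind, a quick pre-flight check can avoid a confusing failure. The following is a minimal sketch and not part of this PR: `missing_tools` is a hypothetical helper that only looks for the executables on PATH; it does not verify that the Docker daemon is actually running.

```python
import shutil
from typing import Iterable, List


def missing_tools(tools: Iterable[str]) -> List[str]:
    """Return the names in `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == '__main__':
    # `docker` and `kind` are the prerequisites named in this PR.
    missing = missing_tools(['docker', 'kind'])
    if missing:
        print(f'Install before running `sky local up`: {", ".join(missing)}')
    else:
        print('docker and kind found on PATH.')
```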

What's not supported in this PR:

  • Multi-node
  • GPU support
  • Supporting multiple Kubernetes clusters
  • Docs (both for admins and users)
  • Stopping clusters
  • ...
(base) ➜  ~ SKYPILOT_DEBUG=0 sky launch --cpus 8+
== Optimizer ==
Target: minimizing cost
Estimated cost: $0.0 / hour

Considered resources (1 node):
---------------------------------------------------------------------------------------------------
 CLOUD        INSTANCE          vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN
---------------------------------------------------------------------------------------------------
 Kubernetes   8vCPU-8GB         8       8         -              kubernetes    0.00          ✔
 AWS          m6i.2xlarge       8       32        -              us-east-1     0.38
 Azure        Standard_D8s_v5   8       32        -              eastus        0.38
 IBM          bx2-8x32          8       32        -              us-east       0.38
 GCP          n2-standard-8     8       32        -              us-central1   0.39
 Lambda       gpu_1x_a10        30      200       A10:1          us-east-1     0.60
---------------------------------------------------------------------------------------------------

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • All smoke tests: pytest tests/test_smoke.py --kubernetes

@Michaelvll (Collaborator) left a comment

Just tried this on a GKE cluster with and without autoscaling. It works smoothly, except that on the cluster with autoscaling, the first sky launch that triggers autoscaling times out, presumably because GKE takes longer to set up the cluster?


## Running a GKE cluster
1. Make sure ports 30000-32767 are open in your node pool VPC's firewall.
2. Create a GKE cluster with at least 1 node.
Collaborator

It seems that we cannot connect to the GKE cluster created with autopilot mode. Should we note that down?

Error:

kubernetes.client.exceptions.ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Audit-Id': 'f5bd48cc-e0dd-4a9a-82a0-70b432915ed6', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Warning': '299 - "autopilot-default-resources-mutator:Autopilot updated Pod default/sky-b72f-zhwu-ray-head: adjusted resources to meet requirements for containers [ray-node] (see http://g.co/gke/autopilot-resources)"', 'X-Kubernetes-Pf-Flowschema-Uid': '79ebecb8-a530-44e8-8ae7-df0fa0c2c903', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'a21e38a1-7fd8-445a-924f-2d47a4e26c50', 'Date': 'Tue, 25 Jul 2023 04:37:22 GMT', 'Content-Length': '664'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"admission webhook \"warden-validating.common-webhooks.networking.gke.io\" denied the request: GKE Warden rejected the request because it violates one or more constraints.\nViolations details: {\"[denied by autogke-disallow-privilege]\":[\"container ray-node is privileged; not allowed in Autopilot\"],\"[denied by autogke-no-write-mode-hostpath]\":[\"hostPath volume dev-fuse in container ray-node is accessed in write mode; disallowed in Autopilot.\"]}\nRequested by user: '[email protected]', groups: 'system:authenticated'.","reason":"GKE Warden constraints violations","code":400}
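
For anyone debugging similar denials: the violations are embedded as JSON inside the message field of the response body. Below is a rough sketch of pulling them out, assuming the message keeps the `Violations details: ` prefix seen above. `parse_warden_violations` is a hypothetical helper, not part of SkyPilot or the Kubernetes client.

```python
import json
from typing import Dict, List

# Prefix that precedes the embedded JSON in the message shown above.
_MARKER = 'Violations details: '


def parse_warden_violations(response_body: str) -> Dict[str, List[str]]:
    """Extract the per-constraint violations from a GKE Warden denial body.

    Returns an empty dict if the message carries no violations marker.
    """
    status = json.loads(response_body)
    message = status.get('message', '')
    start = message.find(_MARKER)
    if start == -1:
        return {}
    # raw_decode parses one JSON value from the start of the string and
    # ignores the trailing text ("\nRequested by user: ...").
    details, _ = json.JSONDecoder().raw_decode(message[start + len(_MARKER):])
    return details
```

With the Kubernetes Python client, the body is available as the `body` attribute of the caught ApiException. For the response above, this should yield two entries, one for autogke-disallow-privilege and one for autogke-no-write-mode-hostpath.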

Collaborator Author

That's right. Autopilot clusters are not currently supported (I haven't tested them yet, and from the logs you posted it looks like FUSE mounting is not allowed). I've added it as a note here.

Comment on lines 391 to 392
logger.warning(
    f'Key {key_label} already exists in the cluster, using it...')
Collaborator

Do we need to print this out? It seems a bit confusing.

Collaborator Author

Ahh good point. I used it for debugging. Changed the warning to debug.
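
For reference, the change amounts to something like the following sketch. The logger name and the `report_existing_key` wrapper are made up for illustration; only the message text comes from the diff above.

```python
import logging

# Hypothetical logger name; the PR's module uses its own `logger` object.
logger = logging.getLogger('sky_k8s_sketch')


def report_existing_key(key_label: str) -> None:
    """Log the reused-key message at DEBUG so INFO-level runs stay quiet."""
    logger.debug(f'Key {key_label} already exists in the cluster, using it...')
```

With the logger at INFO, the message is dropped; it only shows up when the level is lowered to DEBUG.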

@romilbhardwaj (Collaborator Author) commented

I've removed ingress creation since it is no longer required, and confirmed that it now works with the older client library (kubernetes==17.17.0) forced by PyYAML<=5.3.1.

Tested:

  • pytest tests/test_smoke.py --kubernetes

> It works smoothly, except that on the cluster with autoscaling, the first sky launch that triggers autoscaling times out, presumably because GKE takes longer to set up the cluster?

That's right. Autoscaling takes time, which is why we want to make the provisioning timeout configurable in the near future so users can set it according to their cluster. In the medium to long term, we should look into interfacing with the underlying autoscaler(s) to detect when the cluster is scaling up and adjust the timeout automatically.
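
As a rough illustration of what a configurable provisioning timeout could look like, here is a generic polling sketch. It is not the implementation planned for SkyPilot; `wait_until_ready` and its parameters are hypothetical names.

```python
import time
from typing import Callable


def wait_until_ready(is_ready: Callable[[], bool],
                     timeout_seconds: float,
                     poll_interval_seconds: float = 1.0) -> bool:
    """Poll `is_ready` until it returns True or `timeout_seconds` elapses.

    Returns True if the resource became ready in time, False on timeout.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        if is_ready():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(poll_interval_seconds, remaining))
```

A cluster with a slow autoscaler could then pass a larger `timeout_seconds`, while `is_ready` would wrap whatever pod status check the provisioner uses.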

@romilbhardwaj mentioned this pull request Jul 29, 2023

@Michaelvll (Collaborator) left a comment

Thanks for the great effort @romilbhardwaj! LGTM. Very excited to see k8s support merged.

@romilbhardwaj (Collaborator Author) commented

Thanks for the reviews @Michaelvll! All smoke tests pass, merging now.
