[k8s] Kubernetes support #2096

romilbhardwaj · 2023-06-16T16:07:04Z

Adds Kubernetes support to SkyPilot. This is still largely a WIP, so expect rough edges and please provide feedback :)

To try it out, create a Kubernetes cluster (either locally with Kind or on a hosted service like GKE, instructions here), make sure ports 30000-32767 are accessible, place the kubeconfig in ~/.kube/config, and run sky check.
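The two prerequisites above (a kubeconfig at ~/.kube/config and reachable ports in 30000-32767) can be sanity-checked before running sky check. The following is a minimal sketch using only the Python standard library; the helper names `kubeconfig_exists`, `in_nodeport_range`, and `port_is_reachable` are hypothetical and are not part of SkyPilot.

```python
import os
import socket

# Port range the PR description asks to keep accessible
# (this is also the default Kubernetes NodePort range).
NODEPORT_MIN = 30000
NODEPORT_MAX = 32767


def kubeconfig_exists(path: str = '~/.kube/config') -> bool:
    """Return True if a kubeconfig file is present at the given path."""
    return os.path.isfile(os.path.expanduser(path))


def in_nodeport_range(port: int) -> bool:
    """Return True if the port falls inside 30000-32767 (inclusive)."""
    return NODEPORT_MIN <= port <= NODEPORT_MAX


def port_is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Try a TCP connection; True only if something accepts it."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Note that `port_is_reachable` returns False both when a firewall blocks the port and when nothing is listening on it yet, so a False result is only a hint that the firewall rules need checking.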

Current status, TODOs, and roadmap can be found in this doc.

Also adds two new CLI commands:

sky local up: creates a local Kubernetes cluster to run tasks locally. Requires docker and kind to be installed.
sky local down: tears down the local cluster.

What's not supported in this PR:

Multi-node
GPU support
Supporting multiple Kubernetes clusters
Docs (both for admins and users)
Stopping clusters
...

(base) ➜  ~ SKYPILOT_DEBUG=0 sky launch --cpus 8+
== Optimizer ==
Target: minimizing cost
Estimated cost: $0.0 / hour

Considered resources (1 node):
---------------------------------------------------------------------------------------------------
 CLOUD        INSTANCE          vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN
---------------------------------------------------------------------------------------------------
 Kubernetes   8vCPU-8GB         8       8         -              kubernetes    0.00          ✔
 AWS          m6i.2xlarge       8       32        -              us-east-1     0.38
 Azure        Standard_D8s_v5   8       32        -              eastus        0.38
 IBM          bx2-8x32          8       32        -              us-east       0.38
 GCP          n2-standard-8     8       32        -              us-central1   0.39
 Lambda       gpu_1x_a10        30      200       A10:1          us-east-1     0.60
---------------------------------------------------------------------------------------------------

Tested (run the relevant ones):

Code formatting: bash format.sh
All smoke tests: pytest tests/test_smoke.py --kubernetes

# Conflicts: # sky/__init__.py # sky/authentication.py # sky/backends/backend_utils.py # sky/backends/cloud_vm_ray_backend.py # sky/clouds/__init__.py # sky/clouds/service_catalog/__init__.py # sky/setup_files/MANIFEST.in # sky/utils/ux_utils.py

Michaelvll

Just tried this on GKE clusters with and without autoscaling. It works smoothly, except that on the one with autoscaling, the first sky launch that triggers the autoscaling will time out, because GKE takes longer to set up the cluster?

Michaelvll · 2023-07-25T04:41:19Z

tests/kubernetes/README.md

+
+## Running a GKE cluster
+1. Make sure ports 30000-32767 are open in your node pool VPC's firewall.
+2. Create a GKE cluster with at least 1 node.


It seems that we cannot connect to a GKE cluster created in Autopilot mode. Should we note that down?

Error:

kubernetes.client.exceptions.ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Audit-Id': 'f5bd48cc-e0dd-4a9a-82a0-70b432915ed6', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Warning': '299 - "autopilot-default-resources-mutator:Autopilot updated Pod default/sky-b72f-zhwu-ray-head: adjusted resources to meet requirements for containers [ray-node] (see http://g.co/gke/autopilot-resources)"', 'X-Kubernetes-Pf-Flowschema-Uid': '79ebecb8-a530-44e8-8ae7-df0fa0c2c903', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'a21e38a1-7fd8-445a-924f-2d47a4e26c50', 'Date': 'Tue, 25 Jul 2023 04:37:22 GMT', 'Content-Length': '664'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"admission webhook \"warden-validating.common-webhooks.networking.gke.io\" denied the request: GKE Warden rejected the request because it violates one or more constraints.\nViolations details: {\"[denied by autogke-disallow-privilege]\":[\"container ray-node is privileged; not allowed in Autopilot\"],\"[denied by autogke-no-write-mode-hostpath]\":[\"hostPath volume dev-fuse in container ray-node is accessed in write mode; disallowed in Autopilot.\"]}\nRequested by user: '[email protected]', groups: 'system:authenticated'.","reason":"GKE Warden constraints violations","code":400}

That's right, Autopilot clusters are not currently supported (I haven't tested them yet, and from the logs you posted it looks like FUSE mounting is not supported). I've added it as a note here.
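The two violations in the error above (a privileged container and a hostPath volume accessed in write mode) can be spotted in a pod manifest before it is submitted. The sketch below is a hypothetical pre-flight check, not something this PR implements: `autopilot_violations` is an invented name, it works on a manifest given as a plain dict, and it covers only the two constraints that appear in the log. The example manifest is illustrative, and its mount path is an assumption.

```python
from typing import Any, Dict, List


def autopilot_violations(pod: Dict[str, Any]) -> List[str]:
    """List reasons a pod manifest would hit the two constraints in the log.

    Hypothetical helper: it inspects a manifest given as a plain dict and
    checks only for privileged containers and writable hostPath mounts.
    """
    spec = pod.get('spec', {})
    # Names of volumes backed by hostPath.
    host_path_volumes = {
        v['name'] for v in spec.get('volumes', []) if 'hostPath' in v
    }
    problems = []
    for container in spec.get('containers', []):
        name = container.get('name', '<unnamed>')
        if container.get('securityContext', {}).get('privileged', False):
            problems.append(f'container {name} is privileged')
        for mount in container.get('volumeMounts', []):
            # A volume mount is writable unless readOnly is set to True.
            writable = not mount.get('readOnly', False)
            if mount.get('name') in host_path_volumes and writable:
                problems.append(
                    f'hostPath volume {mount["name"]} in container {name} '
                    'is mounted writable')
    return problems


# Illustrative manifest modelled on the names in the log; the mount path
# is an assumption, not taken from the PR.
example_pod = {
    'spec': {
        'containers': [{
            'name': 'ray-node',
            'securityContext': {'privileged': True},
            'volumeMounts': [{'name': 'dev-fuse', 'mountPath': '/dev/fuse'}],
        }],
        'volumes': [{'name': 'dev-fuse', 'hostPath': {'path': '/dev/fuse'}}],
    }
}

for problem in autopilot_violations(example_pod):
    print(problem)
```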

Michaelvll · 2023-07-26T04:43:06Z

sky/authentication.py

+            logger.warning(
+                f'Key {key_label} already exists in the cluster, using it...')


Do we need to print this out? It seems a bit confusing.

Ahh good point. I used it for debugging. Changed the warning to debug.

romilbhardwaj · 2023-07-26T17:12:17Z

I've removed ingress creation since it is no longer required, and confirmed that it now works with the older client library (kubernetes==17.17.0) forced by PyYAML<=5.3.1.

Tested:

pytest tests/test_smoke.py --kubernetes

> It works smoothly, except that on the one with autoscaling, the first sky launch that triggers the autoscaling will time out, because GKE takes longer to set up the cluster?

That's right, autoscaling takes time, which is why we want to make the provisioning timeout configurable in the near future so users can set it according to their cluster. In the medium to long term, we should look into interfacing with the underlying autoscaler(s) to know when the cluster is scaling up, and automatically adjust the timeout as needed.
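A configurable provisioning timeout of the kind described above could look roughly like the following polling loop. This is a sketch under assumptions rather than SkyPilot's implementation: `wait_until_ready` and its parameters are hypothetical names, and `is_ready` stands in for whatever check reports that the pods are ready. The clock and sleep functions are injectable only so the loop can be exercised without real waiting.

```python
import time
from typing import Callable


def wait_until_ready(is_ready: Callable[[], bool],
                     timeout_seconds: float,
                     poll_interval_seconds: float = 1.0,
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll is_ready() until it returns True or the timeout elapses.

    Returns True if readiness was observed, False on timeout.
    """
    deadline = clock() + timeout_seconds
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_seconds)
```

On an autoscaling cluster, a user would pass a larger `timeout_seconds` so that the first launch can wait for a new node to come up.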

Michaelvll

Thanks for the great effort @romilbhardwaj! LGTM. Very excited to see the k8s support merged.

sky/clouds/kubernetes.py

romilbhardwaj · 2023-08-02T10:56:45Z

Thanks for the reviews @Michaelvll! All smoke tests pass, merging now.

romilbhardwaj added 30 commits February 3, 2023 16:47

Working Ray K8s node provider based on SSH

0431f96

Merge branch 'master' into k8s_cloud

5f715e8

wip

197acea

working provisioning with SkyPilot and ssh config

f06b22d

working provisioning with SkyPilot and ssh config

cf1ddec

Merge branch 'master' into k8s_cloud

0937cc3

# Conflicts: # sky/backends/backend_utils.py # sky/backends/cloud_vm_ray_backend.py # sky/registry.py # sky/utils/ux_utils.py

Updates to master

40aad6d

ray2.3

47d0953

Clean up docs

9f59467

multiarch build

07f9bcb

hacking around ray start

bd12014

more port fixes

4baf0b6

fix up default instance selection

7ed02eb

fix resource selection

898a851

Add provisioning timeout by checking if pods are ready

fcb51d1

Working mounting

13eb198

Remove catalog

428f143

fixes

ebf9d83

fixes

da570fc

Fix ssh-key auth to create unique secrets

1bea866

Fix for ContainerCreating timeout

9def756

Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_cloud

8f9cafe

# Conflicts: # sky/backends/cloud_vm_ray_backend.py

Fix head node ssh port caching

65366eb

mypy

b984ead

lint

3bca8a9

fix ports

61df297

typo

036eaf9

cleanup

95e160c

cleanup

301a914

aviweit and others added 3 commits July 25, 2023 17:30

[k8s_cloud] Ray pod not created under current context namespace. (#2302)

b36fba4

'namespace' exists under 'context' key.

Merge branch 'k8s_cloud' of github.com:skypilot-org/skypilot into k8s…

c137360

…_cloud

head ssh port namespace fix

a806b39
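The namespace fix in commit b36fba4 above relies on the kubeconfig layout, where each entry in `contexts` carries its namespace under the nested `context` key. A minimal sketch of that lookup follows, operating on an already-parsed kubeconfig dict. The function name `current_context_namespace` is hypothetical and is not taken from the PR.

```python
from typing import Any, Dict


def current_context_namespace(kubeconfig: Dict[str, Any],
                              default: str = 'default') -> str:
    """Return the namespace of the current context in a parsed kubeconfig.

    Falls back to `default` when the context sets no namespace.
    """
    current = kubeconfig.get('current-context')
    for entry in kubeconfig.get('contexts', []):
        if entry.get('name') == current:
            # The namespace is nested under the 'context' key,
            # not stored at the top level of the entry.
            return entry.get('context', {}).get('namespace', default)
    return default


# Illustrative kubeconfig contents; all names here are made up.
example_kubeconfig = {
    'current-context': 'my-context',
    'contexts': [
        {'name': 'other', 'context': {'cluster': 'c1', 'user': 'u1'}},
        {'name': 'my-context',
         'context': {'cluster': 'c2', 'user': 'u2', 'namespace': 'team-a'}},
    ],
}

print(current_context_namespace(example_kubeconfig))
```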

Michaelvll reviewed Jul 26, 2023

aviweit and others added 10 commits July 26, 2023 08:35

[k8s-cloud] Typo in sky local --help. (#2308)

a9b9636

Typo.

[k8s-cloud] Set build_image.sh to be executable. (#2307)

7903339

* Set build_image.sh to be executable. * Use TAG to easily switch between registries.

remove ingress

4ab5329

remove debug statements

4b49241

UX and readme updates

83aecd3

lint

bdeb7d5

Merge branch 'k8s_cloud' of github.com:skypilot-org/skypilot into k8s…

993f736

…_cloud

fix logging for 409 retry

4fb1d94

lint

02e3415

lint

c1b7438

romilbhardwaj requested a review from Michaelvll July 28, 2023 16:22

romilbhardwaj mentioned this pull request Jul 29, 2023

[k8s] Kubernetes Docs #2324

Merged

1 task

Michaelvll approved these changes Jul 31, 2023

sky/clouds/kubernetes.py (outdated, resolved)

romilbhardwaj added 2 commits August 1, 2023 21:27

comments

6eae8bd

remove k8s from default clouds to run

57a37b3

romilbhardwaj merged commit 4045cf3 into master Aug 2, 2023
15 checks passed

romilbhardwaj deleted the k8s_cloud branch August 2, 2023 10:56

Michaelvll mentioned this pull request Aug 4, 2023

fix tpu bug #2350

Merged

6 tasks

romilbhardwaj mentioned this pull request Aug 12, 2023

Adding K8s support for Sky #1008

Closed

romilbhardwaj mentioned this pull request Sep 29, 2023

[k8s] Multi-node support for Kubernetes #2609

Merged

7 tasks
