[k8s] Kubernetes support #2096

Merged
merged 114 commits into from
Aug 2, 2023
Changes from 98 commits
Commits
0431f96
Working Ray K8s node provider based on SSH
romilbhardwaj Feb 4, 2023
5f715e8
Merge branch 'master' into k8s_cloud
romilbhardwaj Feb 4, 2023
197acea
wip
romilbhardwaj Feb 5, 2023
f06b22d
working provisioning with SkyPilot and ssh config
romilbhardwaj Feb 7, 2023
cf1ddec
working provisioning with SkyPilot and ssh config
romilbhardwaj Feb 8, 2023
0937cc3
Merge branch 'master' into k8s_cloud
romilbhardwaj Mar 16, 2023
40aad6d
Updates to master
romilbhardwaj Mar 16, 2023
47d0953
ray2.3
romilbhardwaj Mar 21, 2023
9f59467
Clean up docs
romilbhardwaj Mar 29, 2023
07f9bcb
multiarch build
romilbhardwaj Mar 31, 2023
bd12014
hacking around ray start
romilbhardwaj Mar 31, 2023
4baf0b6
more port fixes
romilbhardwaj Apr 3, 2023
b08eb1b
Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_cloud
romilbhardwaj Jun 8, 2023
7ed02eb
fix up default instance selection
romilbhardwaj Jun 8, 2023
898a851
fix resource selection
romilbhardwaj Jun 8, 2023
fcb51d1
Add provisioning timeout by checking if pods are ready
romilbhardwaj Jun 9, 2023
13eb198
Working mounting
romilbhardwaj Jun 9, 2023
428f143
Remove catalog
romilbhardwaj Jun 13, 2023
ebf9d83
fixes
romilbhardwaj Jun 14, 2023
da570fc
fixes
romilbhardwaj Jun 15, 2023
1bea866
Fix ssh-key auth to create unique secrets
romilbhardwaj Jun 15, 2023
9def756
Fix for ContainerCreating timeout
romilbhardwaj Jun 15, 2023
8f9cafe
Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_cloud
romilbhardwaj Jun 15, 2023
65366eb
Fix head node ssh port caching
romilbhardwaj Jun 15, 2023
b984ead
mypy
romilbhardwaj Jun 15, 2023
3bca8a9
lint
romilbhardwaj Jun 16, 2023
61df297
fix ports
romilbhardwaj Jun 16, 2023
036eaf9
typo
romilbhardwaj Jun 16, 2023
95e160c
cleanup
romilbhardwaj Jun 16, 2023
301a914
cleanup
romilbhardwaj Jun 16, 2023
2c88daf
wip
romilbhardwaj Jun 16, 2023
7ece7f7
Update setup
romilbhardwaj Jun 16, 2023
cc85f94
readme updates
romilbhardwaj Jun 16, 2023
0450cee
lint
romilbhardwaj Jun 16, 2023
f3f0578
Fix failover
romilbhardwaj Jun 16, 2023
574a9c6
Fix failover
romilbhardwaj Jun 16, 2023
0632b48
optimize setup
romilbhardwaj Jun 16, 2023
05508d3
Fix sync down logs for k8s
romilbhardwaj Jun 16, 2023
fb36a40
test wip
romilbhardwaj Jun 18, 2023
7db4027
instance name parsing wip
romilbhardwaj Jun 19, 2023
632ed30
Fix instance name parsing
romilbhardwaj Jun 20, 2023
d7bd766
Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_cloud
romilbhardwaj Jun 20, 2023
1a444d1
Merge fixes for query_status
romilbhardwaj Jun 20, 2023
da9cba2
[k8s_cloud] Delete k8s service resources. (#2105)
aviweit Jun 20, 2023
81871ac
Status refresh WIP
romilbhardwaj Jun 20, 2023
0d1c4ac
refactor to kubernetes adaptor
romilbhardwaj Jun 20, 2023
8017020
tests wip
romilbhardwaj Jun 21, 2023
5d7f8e8
clean up auth
romilbhardwaj Jun 22, 2023
aa787f8
wip tests
romilbhardwaj Jun 22, 2023
c026559
cli
romilbhardwaj Jun 22, 2023
3dc80d2
cli
romilbhardwaj Jun 23, 2023
63ce29b
sky local up/down cli
romilbhardwaj Jun 23, 2023
f9d5b73
cli
romilbhardwaj Jun 23, 2023
b81647a
lint
romilbhardwaj Jun 23, 2023
050cfc2
lint
romilbhardwaj Jun 23, 2023
d64c394
lint
romilbhardwaj Jun 23, 2023
7367b4a
Speed up kind cluster creation
romilbhardwaj Jun 23, 2023
756c56c
tests
romilbhardwaj Jun 23, 2023
d4c0990
lint
romilbhardwaj Jun 23, 2023
b64dd19
tests
romilbhardwaj Jun 24, 2023
10333d7
handling for non-reachable clusters
romilbhardwaj Jun 25, 2023
b07fc58
Invalid kubeconfig handling
romilbhardwaj Jun 26, 2023
5af58aa
Timeout for sky check
romilbhardwaj Jun 26, 2023
4d6710f
code cleanup
romilbhardwaj Jun 27, 2023
c057c88
lint
romilbhardwaj Jun 27, 2023
b8e414e
Do not raise error if GPUs requested, return empty list
romilbhardwaj Jul 3, 2023
c2ebfe7
Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_cloud
romilbhardwaj Jul 3, 2023
1fc857b
Address comments
romilbhardwaj Jul 5, 2023
0ae92eb
comments
romilbhardwaj Jul 5, 2023
10f302f
lint
romilbhardwaj Jul 5, 2023
2a4caac
Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_cloud
romilbhardwaj Jul 13, 2023
54b2b28
Remove public key upload
romilbhardwaj Jul 13, 2023
5ee821d
add shebang
romilbhardwaj Jul 15, 2023
d6ca85a
comments
romilbhardwaj Jul 16, 2023
fbae4bf
change permissions
romilbhardwaj Jul 16, 2023
6e9e6ba
remove chmod
romilbhardwaj Jul 16, 2023
7fa9d7e
Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_cloud
romilbhardwaj Jul 16, 2023
a3f827e
merge 2241
romilbhardwaj Jul 16, 2023
9687ea8
add todo
romilbhardwaj Jul 16, 2023
4b54555
Handle kube config management for sky local commands (#2253)
hemildesai Jul 19, 2023
f73f1b2
Switch context in create_cluster if cluster already exists.
romilbhardwaj Jul 19, 2023
0c45b9a
Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_cloud
romilbhardwaj Jul 20, 2023
a69df01
fix typo
romilbhardwaj Jul 20, 2023
ff1d832
Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_cloud
romilbhardwaj Jul 20, 2023
6a931e2
update sky check error msg after sky local down
romilbhardwaj Jul 20, 2023
662e4b9
lint
romilbhardwaj Jul 20, 2023
4046749
update timeout check
romilbhardwaj Jul 21, 2023
92d588d
fix import error
romilbhardwaj Jul 21, 2023
9ff1662
Fix kube API access from within cluster (load_incluster_auth)
romilbhardwaj Jul 21, 2023
364b03f
lint
romilbhardwaj Jul 21, 2023
691f6b7
lint
romilbhardwaj Jul 21, 2023
ed0741f
working autodown and sky status -r
romilbhardwaj Jul 21, 2023
3fe9bfb
lint
romilbhardwaj Jul 21, 2023
b98ced3
add test_kubernetes_autodown
romilbhardwaj Jul 21, 2023
07ea97d
lint
romilbhardwaj Jul 24, 2023
73ee737
address comments
romilbhardwaj Jul 24, 2023
7726850
address comments
romilbhardwaj Jul 24, 2023
2ee4833
lint
romilbhardwaj Jul 24, 2023
9e0f5b6
deletion timeouts wip
romilbhardwaj Jul 25, 2023
b36fba4
[k8s_cloud] Ray pod not created under current context namespace. (#2302)
aviweit Jul 26, 2023
c137360
Merge branch 'k8s_cloud' of github.com:skypilot-org/skypilot into k8s…
romilbhardwaj Jul 26, 2023
a806b39
head ssh port namespace fix
romilbhardwaj Jul 26, 2023
a9b9636
[k8s-cloud] Typo in sky local --help. (#2308)
aviweit Jul 26, 2023
7903339
[k8s-cloud] Set build_image.sh to be executable. (#2307)
aviweit Jul 26, 2023
4ab5329
remove ingress
romilbhardwaj Jul 26, 2023
4b49241
remove debug statements
romilbhardwaj Jul 26, 2023
83aecd3
UX and readme updates
romilbhardwaj Jul 26, 2023
bdeb7d5
lint
romilbhardwaj Jul 26, 2023
993f736
Merge branch 'k8s_cloud' of github.com:skypilot-org/skypilot into k8s…
romilbhardwaj Jul 26, 2023
4fb1d94
fix logging for 409 retry
romilbhardwaj Jul 26, 2023
02e3415
lint
romilbhardwaj Jul 26, 2023
c1b7438
lint
romilbhardwaj Jul 26, 2023
6eae8bd
comments
romilbhardwaj Aug 1, 2023
57a37b3
remove k8s from default clouds to run
romilbhardwaj Aug 2, 2023
50 changes: 50 additions & 0 deletions Dockerfile_k8s
@@ -0,0 +1,50 @@
FROM continuumio/miniconda3:22.11.1

# TODO(romilb): Investigate if this image can be consolidated with the skypilot
# client image (`Dockerfile`)

# Initialize conda for root user, install ssh and other local dependencies
RUN apt update -y && \
    apt install gcc rsync sudo patch openssh-server pciutils nano fuse -y && \
    rm -rf /var/lib/apt/lists/* && \
    apt remove -y python3 && \
    conda init

# Setup SSH and generate hostkeys
RUN mkdir -p /var/run/sshd && \
    sed -i 's/PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config && \
    sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd && \
    cd /etc/ssh/ && \
    ssh-keygen -A

# Setup new user named sky and add to sudoers. Also add /opt/conda/bin to sudo path.
RUN useradd -m -s /bin/bash sky && \
    echo "sky ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers && \
    echo 'Defaults secure_path="/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"' > /etc/sudoers.d/sky

# Switch to sky user
USER sky

# Install SkyPilot pip dependencies
RUN pip install wheel Click colorama cryptography jinja2 jsonschema && \
    pip install networkx oauth2client pandas pendulum PrettyTable && \
    pip install ray==2.4.0 rich tabulate filelock && \
    pip install packaging 'protobuf<4.0.0' pulp && \
    pip install awscli boto3 pycryptodome==3.12.0 && \
    pip install docker kubernetes

# Add /home/sky/.local/bin/ to PATH
RUN echo 'export PATH="$PATH:$HOME/.local/bin"' >> ~/.bashrc

# Install SkyPilot. This is purposely separate from installing SkyPilot
# dependencies to optimize rebuild time
COPY --chown=sky . /skypilot/sky/

# TODO(romilb): Installing SkyPilot may not be necessary since ray up will do it
RUN cd /skypilot/ && \
    sudo mv -v sky/setup_files/* . && \
    pip install ".[aws]"

# Set WORKDIR and initialize conda for sky user
WORKDIR /home/sky
RUN conda init
6 changes: 4 additions & 2 deletions sky/__init__.py
@@ -32,19 +32,21 @@
 Lambda = clouds.Lambda
 SCP = clouds.SCP
 Local = clouds.Local
+Kubernetes = clouds.Kubernetes
 OCI = clouds.OCI
 optimize = Optimizer.optimize

 __all__ = [
     '__version__',
-    'IBM',
     'AWS',
     'Azure',
     'GCP',
+    'IBM',
+    'Kubernetes',
     'Lambda',
-    'SCP',
     'Local',
     'OCI',
+    'SCP',
     'Optimizer',
     'OptimizeTarget',
     'backends',
140 changes: 140 additions & 0 deletions sky/adaptors/kubernetes.py
@@ -0,0 +1,140 @@
"""Kubernetes adaptors"""

# pylint: disable=import-outside-toplevel

import functools
import os

from sky.utils import ux_utils, env_options

kubernetes = None
urllib3 = None

_configured = False
_core_api = None
_auth_api = None
_networking_api = None
_custom_objects_api = None

# Timeout to use for API calls
API_TIMEOUT = 5


def import_package(func):

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        global kubernetes
        global urllib3
        if kubernetes is None:
            try:
                import kubernetes as _kubernetes
                import urllib3 as _urllib3
            except ImportError:
                # TODO(romilb): Update this message to point to installation
                # docs when they are ready.
                raise ImportError('Fail to import dependencies for Kubernetes. '
                                  'Run `pip install kubernetes` to '
                                  'install them.') from None
            kubernetes = _kubernetes
            urllib3 = _urllib3
        return func(*args, **kwargs)

    return wrapper


@import_package
def get_kubernetes():
    return kubernetes


@import_package
def _load_config():
    global _configured
    if _configured:
        return
    try:
        # Load in-cluster config if running in a pod
        # Kubernetes set environment variables for service discovery do not
        # show up in SkyPilot tasks. For now, we work around by using
        # DNS name instead of environment variables.
        # See issue: https://github.com/skypilot-org/skypilot/issues/2287
        os.environ['KUBERNETES_SERVICE_HOST'] = 'kubernetes.default.svc'
        os.environ['KUBERNETES_SERVICE_PORT'] = '443'
        kubernetes.config.load_incluster_config()
    except kubernetes.config.config_exception.ConfigException:
        try:
            kubernetes.config.load_kube_config()
        except kubernetes.config.config_exception.ConfigException as e:
            suffix = ''
            if env_options.Options.SHOW_DEBUG_INFO.get():
                suffix += f' Error: {str(e)}'
            # Check if exception was due to no current-context
            if 'Expected key current-context' in str(e):
                err_str = ('Failed to load Kubernetes configuration. '
                           'Kubeconfig does not contain any valid context(s).'
                           f'{suffix}\n'
                           ' If you were running a local Kubernetes '
                           'cluster, run `sky local up` to start the cluster.')
            else:
                err_str = (
                    'Failed to load Kubernetes configuration. '
                    f'Please check if your kubeconfig file is valid.{suffix}')
            with ux_utils.print_exception_no_traceback():
                raise ValueError(err_str) from None
    _configured = True


@import_package
def core_api():
    global _core_api
    if _core_api is None:
        _load_config()
        _core_api = kubernetes.client.CoreV1Api()

    return _core_api


@import_package
def auth_api():
    global _auth_api
    if _auth_api is None:
        _load_config()
        _auth_api = kubernetes.client.RbacAuthorizationV1Api()

    return _auth_api


@import_package
def networking_api():
    global _networking_api
    if _networking_api is None:
        _load_config()
        _networking_api = kubernetes.client.NetworkingV1Api()

    return _networking_api


@import_package
def custom_objects_api():
    global _custom_objects_api
    if _custom_objects_api is None:
        _load_config()
        _custom_objects_api = kubernetes.client.CustomObjectsApi()

    return _custom_objects_api


@import_package
def api_exception():
    return kubernetes.client.rest.ApiException


@import_package
def config_exception():
    return kubernetes.config.config_exception.ConfigException


@import_package
def max_retry_error():
    return urllib3.exceptions.MaxRetryError
30 changes: 30 additions & 0 deletions sky/authentication.py
@@ -372,3 +372,33 @@ def setup_scp_authentication(config: Dict[str, Any]) -> Dict[str, Any]:
    with open(public_key_path, 'r') as f:
        public_key = f.read()
    return _replace_ssh_info_in_config(config, public_key)


def setup_kubernetes_authentication(config: Dict[str, Any]) -> Dict[str, Any]:
    get_or_generate_keys()

    # Run kubectl command to add the public key to the cluster.
    public_key_path = os.path.expanduser(PUBLIC_SSH_KEY_PATH)
    key_label = clouds.Kubernetes.SKY_SSH_KEY_SECRET_NAME
    cmd = f'kubectl create secret generic {key_label} ' \
        f'--from-file=ssh-publickey={public_key_path}'
    try:
        subprocess.check_output(cmd, stderr=subprocess.STDOUT, shell=True)
    except subprocess.CalledProcessError as e:
        output = e.output.decode('utf-8')
        suffix = f'\nError message: {output}'
        if 'already exists' in output:
            logger.warning(
                f'Key {key_label} already exists in the cluster, using it...')
Collaborator

Do we need to print this out? It seems a bit confusing

Collaborator Author

Ahh good point. I used it for debugging. Changed the warning to debug.

        elif any(err in output for err in ['connection refused', 'timeout']):
            with ux_utils.print_exception_no_traceback():
                raise ConnectionError(
                    'Failed to connect to the cluster. Check if your '
                    'cluster is running, your kubeconfig is correct '
                    'and you can connect to it using: '
                    f'kubectl get namespaces.{suffix}') from e
        else:
            logger.error(suffix)
            raise

    return config
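
The error handling in `setup_kubernetes_authentication` branches on substrings of kubectl's combined output. The sketch below isolates that decision into a pure function so it can be checked without a cluster, and shows one possible way to build the same `kubectl create secret generic` call as an argument list instead of a shell string. Both helper names (`classify_kubectl_failure`, `create_secret_argv`) and the category labels are hypothetical and do not appear in the PR.

```python
from typing import List


def classify_kubectl_failure(output: str) -> str:
    """Map kubectl's combined stdout/stderr to a coarse failure category.

    Uses the same substring checks as the diff above; the returned labels
    are illustrative only.
    """
    if 'already exists' in output:
        return 'exists'        # secret is already present, safe to reuse
    if any(err in output for err in ('connection refused', 'timeout')):
        return 'unreachable'   # cluster or kubeconfig problem
    return 'other'             # unknown failure, caller should re-raise


def create_secret_argv(key_label: str, public_key_path: str) -> List[str]:
    """Build the kubectl call from the diff as an argv list."""
    return [
        'kubectl', 'create', 'secret', 'generic', key_label,
        f'--from-file=ssh-publickey={public_key_path}',
    ]
```

An argv list like this could be passed to `subprocess.check_output(argv, stderr=subprocess.STDOUT)` without `shell=True`, which sidesteps quoting issues if the key path contains spaces. Whether that trade-off suits the project is a design choice left to the maintainers.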