[k8s] Kubernetes support (#2096)
* Working Ray K8s node provider based on SSH

* wip

* working provisioning with SkyPilot and ssh config

* working provisioning with SkyPilot and ssh config

* Updates to master

* ray2.3

* Clean up docs

* multiarch build

* hacking around ray start

* more port fixes

* fix up default instance selection

* fix resource selection

* Add provisioning timeout by checking if pods are ready

* Working mounting

* Remove catalog

* fixes

* fixes

* Fix ssh-key auth to create unique secrets

* Fix for ContainerCreating timeout

* Fix head node ssh port caching

* mypy

* lint

* fix ports

* typo

* cleanup

* cleanup

* wip

* Update setup

* readme updates

* lint

* Fix failover

* Fix failover

* optimize setup

* Fix sync down logs for k8s

* test wip

* instance name parsing wip

* Fix instance name parsing

* Merge fixes for query_status

* [k8s_cloud] Delete k8s service resources. (#2105)

Delete k8s service resources.

- 'sky down' for Kubernetes cloud to remove cluster service resources.

* Status refresh WIP

* refactor to kubernetes adaptor

* tests wip

* clean up auth

* wip tests

* cli

* cli

* sky local up/down cli

* cli

* lint

* lint

* lint

* Speed up kind cluster creation

* tests

* lint

* tests

* handling for non-reachable clusters

* Invalid kubeconfig handling

* Timeout for sky check

* code cleanup

* lint

* Do not raise error if GPUs requested, return empty list

* Address comments

* comments

* lint

* Remove public key upload

* add shebang

* comments

* change permissions

* remove chmod

* merge 2241

* add todo

* Handle kube config management for sky local commands (#2253)

* Set current-context (if available) after sky local down and remove incorrect prompt in sky local up

* Warn user of kubeconfig context switch during sky local up

* Use Optional instead of Union

* Switch context in create_cluster if cluster already exists.

* fix typo

* update sky check error msg after sky local down

* lint

* update timeout check

* fix import error

* Fix kube API access from within cluster (load_incluster_auth)

* lint

* lint

* working autodown and sky status -r

* lint

* add test_kubernetes_autodown

* lint

* address comments

* address comments

* lint

* deletion timeouts wip

* [k8s_cloud] Ray pod not created under current context namespace. (#2302)

'namespace' exists under 'context' key.

* head ssh port namespace fix

* [k8s-cloud] Typo in sky local --help. (#2308)

Typo.

* [k8s-cloud] Set build_image.sh to be executable. (#2307)

* Set build_image.sh to be executable.

* Use TAG to easily switch between registries.

* remove ingress

* remove debug statements

* UX and readme updates

* lint

* fix logging for 409 retry

* lint

* lint

* comments

* remove k8s from default clouds to run

---------

Co-authored-by: Avi Weit <[email protected]>
Co-authored-by: Hemil Desai <[email protected]>
3 people authored Aug 2, 2023
1 parent 4d51a89 commit 4045cf3
Showing 37 changed files with 3,000 additions and 63 deletions.
50 changes: 50 additions & 0 deletions Dockerfile_k8s
@@ -0,0 +1,50 @@
FROM continuumio/miniconda3:22.11.1

# TODO(romilb): Investigate if this image can be consolidated with the skypilot
# client image (`Dockerfile`)

# Initialize conda for root user, install ssh and other local dependencies
RUN apt update -y && \
    apt install gcc rsync sudo patch openssh-server pciutils nano fuse -y && \
    rm -rf /var/lib/apt/lists/* && \
    apt remove -y python3 && \
    conda init

# Setup SSH and generate hostkeys
RUN mkdir -p /var/run/sshd && \
    sed -i 's/PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config && \
    sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd && \
    cd /etc/ssh/ && \
    ssh-keygen -A

# Setup new user named sky and add to sudoers. Also add /opt/conda/bin to sudo path.
RUN useradd -m -s /bin/bash sky && \
    echo "sky ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers && \
    echo 'Defaults secure_path="/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"' > /etc/sudoers.d/sky

# Switch to sky user
USER sky

# Install SkyPilot pip dependencies
RUN pip install wheel Click colorama cryptography jinja2 jsonschema && \
    pip install networkx oauth2client pandas pendulum PrettyTable && \
    pip install ray==2.4.0 rich tabulate filelock && \
    pip install packaging 'protobuf<4.0.0' pulp && \
    pip install awscli boto3 pycryptodome==3.12.0 && \
    pip install docker kubernetes

# Add /home/sky/.local/bin/ to PATH
RUN echo 'export PATH="$PATH:$HOME/.local/bin"' >> ~/.bashrc

# Install SkyPilot. This is purposely separate from installing SkyPilot
# dependencies to optimize rebuild time
COPY --chown=sky . /skypilot/sky/

# TODO(romilb): Installing SkyPilot may not be necessary since ray up will do it
RUN cd /skypilot/ && \
    sudo mv -v sky/setup_files/* . && \
    pip install ".[aws]"

# Set WORKDIR and initialize conda for sky user
WORKDIR /home/sky
RUN conda init
6 changes: 4 additions & 2 deletions sky/__init__.py
@@ -32,19 +32,21 @@
 Lambda = clouds.Lambda
 SCP = clouds.SCP
 Local = clouds.Local
+Kubernetes = clouds.Kubernetes
 OCI = clouds.OCI
 optimize = Optimizer.optimize

 __all__ = [
     '__version__',
-    'IBM',
     'AWS',
     'Azure',
     'GCP',
+    'IBM',
+    'Kubernetes',
     'Lambda',
-    'SCP',
     'Local',
     'OCI',
+    'SCP',
     'Optimizer',
     'OptimizeTarget',
     'backends',
140 changes: 140 additions & 0 deletions sky/adaptors/kubernetes.py
@@ -0,0 +1,140 @@
"""Kubernetes adaptors"""

# pylint: disable=import-outside-toplevel

import functools
import os

from sky.utils import ux_utils, env_options

kubernetes = None
urllib3 = None

_configured = False
_core_api = None
_auth_api = None
_networking_api = None
_custom_objects_api = None

# Timeout to use for API calls
API_TIMEOUT = 5


def import_package(func):

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        global kubernetes
        global urllib3
        if kubernetes is None:
            try:
                import kubernetes as _kubernetes
                import urllib3 as _urllib3
            except ImportError:
                # TODO(romilb): Update this message to point to installation
                # docs when they are ready.
                raise ImportError('Fail to import dependencies for Kubernetes. '
                                  'Run `pip install kubernetes` to '
                                  'install them.') from None
            kubernetes = _kubernetes
            urllib3 = _urllib3
        return func(*args, **kwargs)

    return wrapper


@import_package
def get_kubernetes():
    return kubernetes


@import_package
def _load_config():
    global _configured
    if _configured:
        return
    try:
        # Load in-cluster config if running in a pod
        # Kubernetes set environment variables for service discovery do not
        # show up in SkyPilot tasks. For now, we work around by using
        # DNS name instead of environment variables.
        # See issue: https://github.com/skypilot-org/skypilot/issues/2287
        os.environ['KUBERNETES_SERVICE_HOST'] = 'kubernetes.default.svc'
        os.environ['KUBERNETES_SERVICE_PORT'] = '443'
        kubernetes.config.load_incluster_config()
    except kubernetes.config.config_exception.ConfigException:
        try:
            kubernetes.config.load_kube_config()
        except kubernetes.config.config_exception.ConfigException as e:
            suffix = ''
            if env_options.Options.SHOW_DEBUG_INFO.get():
                suffix += f' Error: {str(e)}'
            # Check if exception was due to no current-context
            if 'Expected key current-context' in str(e):
                err_str = ('Failed to load Kubernetes configuration. '
                           'Kubeconfig does not contain any valid context(s).'
                           f'{suffix}\n'
                           ' If you were running a local Kubernetes '
                           'cluster, run `sky local up` to start the cluster.')
            else:
                err_str = (
                    'Failed to load Kubernetes configuration. '
                    f'Please check if your kubeconfig file is valid.{suffix}')
            with ux_utils.print_exception_no_traceback():
                raise ValueError(err_str) from None
    _configured = True


@import_package
def core_api():
    global _core_api
    if _core_api is None:
        _load_config()
        _core_api = kubernetes.client.CoreV1Api()

    return _core_api


@import_package
def auth_api():
    global _auth_api
    if _auth_api is None:
        _load_config()
        _auth_api = kubernetes.client.RbacAuthorizationV1Api()

    return _auth_api


@import_package
def networking_api():
    global _networking_api
    if _networking_api is None:
        _load_config()
        _networking_api = kubernetes.client.NetworkingV1Api()

    return _networking_api


@import_package
def custom_objects_api():
    global _custom_objects_api
    if _custom_objects_api is None:
        _load_config()
        _custom_objects_api = kubernetes.client.CustomObjectsApi()

    return _custom_objects_api


@import_package
def api_exception():
    return kubernetes.client.rest.ApiException


@import_package
def config_exception():
    return kubernetes.config.config_exception.ConfigException


@import_package
def max_retry_error():
    return urllib3.exceptions.MaxRetryError
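
Note on the adaptor pattern: `sky/adaptors/kubernetes.py` defers importing the optional `kubernetes` package until a decorated function is first called, and caches each API client in a module-level global. The following is a minimal, self-contained sketch of that pattern only. It uses the standard-library `json` module as a stand-in for the optional dependency, and the names `lazy_import`, `get_module`, and `get_decoder` are hypothetical and not part of SkyPilot.

```python
import functools
import importlib

_module = None  # Filled in on first use, like `kubernetes = None` in the adaptor.
_decoder = None  # Cached client-like object, like `_core_api` in the adaptor.


def lazy_import(module_name):
    """Decorator factory: import `module_name` on the first decorated call.

    Assumption: one optional module per adaptor file, so a single global
    is enough to hold it.
    """

    def decorator(func):

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            global _module
            if _module is None:
                try:
                    _module = importlib.import_module(module_name)
                except ImportError:
                    raise ImportError(
                        f'Failed to import {module_name}. '
                        f'Run `pip install {module_name}` to install it.'
                    ) from None
            return func(*args, **kwargs)

        return wrapper

    return decorator


@lazy_import('json')
def get_module():
    # The module is guaranteed to be loaded by the time the body runs.
    return _module


@lazy_import('json')
def get_decoder():
    # Build the object once and reuse it on later calls.
    global _decoder
    if _decoder is None:
        _decoder = _module.JSONDecoder()
    return _decoder
```

With this shape, importing the adaptor module never fails when the optional package is absent; the `ImportError` surfaces only when a caller actually requests it.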
30 changes: 30 additions & 0 deletions sky/authentication.py
@@ -373,3 +373,33 @@ def setup_scp_authentication(config: Dict[str, Any]) -> Dict[str, Any]:
     with open(public_key_path, 'r') as f:
         public_key = f.read().strip()
     return _replace_ssh_info_in_config(config, public_key)
+
+
+def setup_kubernetes_authentication(config: Dict[str, Any]) -> Dict[str, Any]:
+    get_or_generate_keys()
+
+    # Run kubectl command to add the public key to the cluster.
+    public_key_path = os.path.expanduser(PUBLIC_SSH_KEY_PATH)
+    key_label = clouds.Kubernetes.SKY_SSH_KEY_SECRET_NAME
+    cmd = f'kubectl create secret generic {key_label} ' \
+        f'--from-file=ssh-publickey={public_key_path}'
+    try:
+        subprocess.check_output(cmd, stderr=subprocess.STDOUT, shell=True)
+    except subprocess.CalledProcessError as e:
+        output = e.output.decode('utf-8')
+        suffix = f'\nError message: {output}'
+        if 'already exists' in output:
+            logger.debug(
+                f'Key {key_label} already exists in the cluster, using it...')
+        elif any(err in output for err in ['connection refused', 'timeout']):
+            with ux_utils.print_exception_no_traceback():
+                raise ConnectionError(
+                    'Failed to connect to the cluster. Check if your '
+                    'cluster is running, your kubeconfig is correct '
+                    'and you can connect to it using: '
+                    f'kubectl get namespaces.{suffix}') from e
+        else:
+            logger.error(suffix)
+            raise
+
+    return config
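
Note on `setup_kubernetes_authentication`: the function builds a shell string and then branches on substrings of the `kubectl` output. As a hedged sketch only, the same command could be built as an argument list, which avoids shell quoting problems with unusual key paths, and the output classification could be pulled into a pure function that is easy to unit-test. `build_secret_cmd` and `classify_kubectl_error` are hypothetical helpers, not SkyPilot functions; the matched substrings are the ones used in the diff above.

```python
import os
from typing import List


def build_secret_cmd(key_label: str, public_key_path: str) -> List[str]:
    """Return the kubectl invocation as an argv list (no shell needed)."""
    path = os.path.expanduser(public_key_path)
    return [
        'kubectl', 'create', 'secret', 'generic', key_label,
        f'--from-file=ssh-publickey={path}'
    ]


def classify_kubectl_error(output: str) -> str:
    """Map kubectl error output to 'exists', 'unreachable', or 'other'."""
    if 'already exists' in output:
        return 'exists'
    if any(err in output for err in ('connection refused', 'timeout')):
        return 'unreachable'
    return 'other'
```

The list returned by `build_secret_cmd` could be passed to `subprocess.check_output(cmd, stderr=subprocess.STDOUT)` without `shell=True`, and the caller would then log, raise `ConnectionError`, or re-raise depending on the classification.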