[k8s] Kubernetes support #2096
Merged
Changes from 98 commits
Commits
0431f96 Working Ray K8s node provider based on SSH (romilbhardwaj)
5f715e8 Merge branch 'master' into k8s_cloud (romilbhardwaj)
197acea wip (romilbhardwaj)
f06b22d working provisioning with SkyPilot and ssh config (romilbhardwaj)
cf1ddec working provisioning with SkyPilot and ssh config (romilbhardwaj)
0937cc3 Merge branch 'master' into k8s_cloud (romilbhardwaj)
40aad6d Updates to master (romilbhardwaj)
47d0953 ray2.3 (romilbhardwaj)
9f59467 Clean up docs (romilbhardwaj)
07f9bcb multiarch build (romilbhardwaj)
bd12014 hacking around ray start (romilbhardwaj)
4baf0b6 more port fixes (romilbhardwaj)
b08eb1b Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_cloud (romilbhardwaj)
7ed02eb fix up default instance selection (romilbhardwaj)
898a851 fix resource selection (romilbhardwaj)
fcb51d1 Add provisioning timeout by checking if pods are ready (romilbhardwaj)
13eb198 Working mounting (romilbhardwaj)
428f143 Remove catalog (romilbhardwaj)
ebf9d83 fixes (romilbhardwaj)
da570fc fixes (romilbhardwaj)
1bea866 Fix ssh-key auth to create unique secrets (romilbhardwaj)
9def756 Fix for ContainerCreating timeout (romilbhardwaj)
8f9cafe Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_cloud (romilbhardwaj)
65366eb Fix head node ssh port caching (romilbhardwaj)
b984ead mypy (romilbhardwaj)
3bca8a9 lint (romilbhardwaj)
61df297 fix ports (romilbhardwaj)
036eaf9 typo (romilbhardwaj)
95e160c cleanup (romilbhardwaj)
301a914 cleanup (romilbhardwaj)
2c88daf wip (romilbhardwaj)
7ece7f7 Update setup (romilbhardwaj)
cc85f94 readme updates (romilbhardwaj)
0450cee lint (romilbhardwaj)
f3f0578 Fix failover (romilbhardwaj)
574a9c6 Fix failover (romilbhardwaj)
0632b48 optimize setup (romilbhardwaj)
05508d3 Fix sync down logs for k8s (romilbhardwaj)
fb36a40 test wip (romilbhardwaj)
7db4027 instance name parsing wip (romilbhardwaj)
632ed30 Fix instance name parsing (romilbhardwaj)
d7bd766 Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_cloud (romilbhardwaj)
1a444d1 Merge fixes for query_status (romilbhardwaj)
da9cba2 [k8s_cloud] Delete k8s service resources. (#2105) (aviweit)
81871ac Status refresh WIP (romilbhardwaj)
0d1c4ac refactor to kubernetes adaptor (romilbhardwaj)
8017020 tests wip (romilbhardwaj)
5d7f8e8 clean up auth (romilbhardwaj)
aa787f8 wip tests (romilbhardwaj)
c026559 cli (romilbhardwaj)
3dc80d2 cli (romilbhardwaj)
63ce29b sky local up/down cli (romilbhardwaj)
f9d5b73 cli (romilbhardwaj)
b81647a lint (romilbhardwaj)
050cfc2 lint (romilbhardwaj)
d64c394 lint (romilbhardwaj)
7367b4a Speed up kind cluster creation (romilbhardwaj)
756c56c tests (romilbhardwaj)
d4c0990 lint (romilbhardwaj)
b64dd19 tests (romilbhardwaj)
10333d7 handling for non-reachable clusters (romilbhardwaj)
b07fc58 Invalid kubeconfig handling (romilbhardwaj)
5af58aa Timeout for sky check (romilbhardwaj)
4d6710f code cleanup (romilbhardwaj)
c057c88 lint (romilbhardwaj)
b8e414e Do not raise error if GPUs requested, return empty list (romilbhardwaj)
c2ebfe7 Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_cloud (romilbhardwaj)
1fc857b Address comments (romilbhardwaj)
0ae92eb comments (romilbhardwaj)
10f302f lint (romilbhardwaj)
2a4caac Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_cloud (romilbhardwaj)
54b2b28 Remove public key upload (romilbhardwaj)
5ee821d add shebang (romilbhardwaj)
d6ca85a comments (romilbhardwaj)
fbae4bf change permissions (romilbhardwaj)
6e9e6ba remove chmod (romilbhardwaj)
7fa9d7e Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_cloud (romilbhardwaj)
a3f827e merge 2241 (romilbhardwaj)
9687ea8 add todo (romilbhardwaj)
4b54555 Handle kube config management for sky local commands (#2253) (hemildesai)
f73f1b2 Switch context in create_cluster if cluster already exists. (romilbhardwaj)
0c45b9a Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_cloud (romilbhardwaj)
a69df01 fix typo (romilbhardwaj)
ff1d832 Merge branch 'master' of github.com:skypilot-org/skypilot into k8s_cloud (romilbhardwaj)
6a931e2 update sky check error msg after sky local down (romilbhardwaj)
662e4b9 lint (romilbhardwaj)
4046749 update timeout check (romilbhardwaj)
92d588d fix import error (romilbhardwaj)
9ff1662 Fix kube API access from within cluster (load_incluster_auth) (romilbhardwaj)
364b03f lint (romilbhardwaj)
691f6b7 lint (romilbhardwaj)
ed0741f working autodown and sky status -r (romilbhardwaj)
3fe9bfb lint (romilbhardwaj)
b98ced3 add test_kubernetes_autodown (romilbhardwaj)
07ea97d lint (romilbhardwaj)
73ee737 address comments (romilbhardwaj)
7726850 address comments (romilbhardwaj)
2ee4833 lint (romilbhardwaj)
9e0f5b6 deletion timeouts wip (romilbhardwaj)
b36fba4 [k8s_cloud] Ray pod not created under current context namespace. (#2302) (aviweit)
c137360 Merge branch 'k8s_cloud' of github.com:skypilot-org/skypilot into k8s… (romilbhardwaj)
a806b39 head ssh port namespace fix (romilbhardwaj)
a9b9636 [k8s-cloud] Typo in sky local --help. (#2308) (aviweit)
7903339 [k8s-cloud] Set build_image.sh to be executable. (#2307) (aviweit)
4ab5329 remove ingress (romilbhardwaj)
4b49241 remove debug statements (romilbhardwaj)
83aecd3 UX and readme updates (romilbhardwaj)
bdeb7d5 lint (romilbhardwaj)
993f736 Merge branch 'k8s_cloud' of github.com:skypilot-org/skypilot into k8s… (romilbhardwaj)
4fb1d94 fix logging for 409 retry (romilbhardwaj)
02e3415 lint (romilbhardwaj)
c1b7438 lint (romilbhardwaj)
6eae8bd comments (romilbhardwaj)
57a37b3 remove k8s from default clouds to run (romilbhardwaj)
@@ -0,0 +1,50 @@
FROM continuumio/miniconda3:22.11.1

# TODO(romilb): Investigate if this image can be consolidated with the skypilot
# client image (`Dockerfile`)

# Initialize conda for root user, install ssh and other local dependencies
RUN apt update -y && \
    apt install gcc rsync sudo patch openssh-server pciutils nano fuse -y && \
    rm -rf /var/lib/apt/lists/* && \
    apt remove -y python3 && \
    conda init

# Setup SSH and generate hostkeys
RUN mkdir -p /var/run/sshd && \
    sed -i 's/PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config && \
    sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd && \
    cd /etc/ssh/ && \
    ssh-keygen -A

# Setup new user named sky and add to sudoers. Also add /opt/conda/bin to sudo path.
RUN useradd -m -s /bin/bash sky && \
    echo "sky ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers && \
    echo 'Defaults secure_path="/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"' > /etc/sudoers.d/sky

# Switch to sky user
USER sky

# Install SkyPilot pip dependencies
RUN pip install wheel Click colorama cryptography jinja2 jsonschema && \
    pip install networkx oauth2client pandas pendulum PrettyTable && \
    pip install ray==2.4.0 rich tabulate filelock && \
    pip install packaging 'protobuf<4.0.0' pulp && \
    pip install awscli boto3 pycryptodome==3.12.0 && \
    pip install docker kubernetes

# Add /home/sky/.local/bin/ to PATH
RUN echo 'export PATH="$PATH:$HOME/.local/bin"' >> ~/.bashrc

# Install SkyPilot. This is purposely separate from installing SkyPilot
# dependencies to optimize rebuild time
COPY --chown=sky . /skypilot/sky/

# TODO(romilb): Installing SkyPilot may not be necessary since ray up will do it
RUN cd /skypilot/ && \
    sudo mv -v sky/setup_files/* . && \
    pip install ".[aws]"

# Set WORKDIR and initialize conda for sky user
WORKDIR /home/sky
RUN conda init
@@ -0,0 +1,140 @@
"""Kubernetes adaptors"""

# pylint: disable=import-outside-toplevel

import functools
import os

from sky.utils import ux_utils, env_options

kubernetes = None
urllib3 = None

_configured = False
_core_api = None
_auth_api = None
_networking_api = None
_custom_objects_api = None

# Timeout to use for API calls
API_TIMEOUT = 5


def import_package(func):

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        global kubernetes
        global urllib3
        if kubernetes is None:
            try:
                import kubernetes as _kubernetes
                import urllib3 as _urllib3
            except ImportError:
                # TODO(romilb): Update this message to point to installation
                # docs when they are ready.
                raise ImportError('Fail to import dependencies for Kubernetes. '
                                  'Run `pip install kubernetes` to '
                                  'install them.') from None
            kubernetes = _kubernetes
            urllib3 = _urllib3
        return func(*args, **kwargs)

    return wrapper


@import_package
def get_kubernetes():
    return kubernetes


@import_package
def _load_config():
    global _configured
    if _configured:
        return
    try:
        # Load in-cluster config if running in a pod
        # Kubernetes set environment variables for service discovery do not
        # show up in SkyPilot tasks. For now, we work around by using
        # DNS name instead of environment variables.
        # See issue: https://github.com/skypilot-org/skypilot/issues/2287
        os.environ['KUBERNETES_SERVICE_HOST'] = 'kubernetes.default.svc'
        os.environ['KUBERNETES_SERVICE_PORT'] = '443'
        kubernetes.config.load_incluster_config()
    except kubernetes.config.config_exception.ConfigException:
        try:
            kubernetes.config.load_kube_config()
        except kubernetes.config.config_exception.ConfigException as e:
            suffix = ''
            if env_options.Options.SHOW_DEBUG_INFO.get():
                suffix += f' Error: {str(e)}'
            # Check if exception was due to no current-context
            if 'Expected key current-context' in str(e):
                err_str = ('Failed to load Kubernetes configuration. '
                           'Kubeconfig does not contain any valid context(s).'
                           f'{suffix}\n'
                           ' If you were running a local Kubernetes '
                           'cluster, run `sky local up` to start the cluster.')
            else:
                err_str = (
                    'Failed to load Kubernetes configuration. '
                    f'Please check if your kubeconfig file is valid.{suffix}')
            with ux_utils.print_exception_no_traceback():
                raise ValueError(err_str) from None
    _configured = True


@import_package
def core_api():
    global _core_api
    if _core_api is None:
        _load_config()
        _core_api = kubernetes.client.CoreV1Api()

    return _core_api


@import_package
def auth_api():
    global _auth_api
    if _auth_api is None:
        _load_config()
        _auth_api = kubernetes.client.RbacAuthorizationV1Api()

    return _auth_api


@import_package
def networking_api():
    global _networking_api
    if _networking_api is None:
        _load_config()
        _networking_api = kubernetes.client.NetworkingV1Api()

    return _networking_api


@import_package
def custom_objects_api():
    global _custom_objects_api
    if _custom_objects_api is None:
        _load_config()
        _custom_objects_api = kubernetes.client.CustomObjectsApi()

    return _custom_objects_api


@import_package
def api_exception():
    return kubernetes.client.rest.ApiException


@import_package
def config_exception():
    return kubernetes.config.config_exception.ConfigException


@import_package
def max_retry_error():
    return urllib3.exceptions.MaxRetryError
Do we need to print this out? It seems a bit confusing
Ahh good point. I used it for debugging. Changed the `warning` to `debug`.
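
For context on the exchange above, the effect of moving a message from `warning` to `debug` can be sketched with Python's standard `logging` module. This is a minimal illustration under assumptions: the logger name and message text are hypothetical, and the page does not show which log call was changed or whether SkyPilot's own logging helper was involved.

```python
import io
import logging

# Capture log output in memory so the effect of each level is easy to inspect.
stream = io.StringIO()
logger = logging.getLogger('sketch.k8s')  # hypothetical logger name
logger.setLevel(logging.INFO)  # a typical user-facing threshold
logger.propagate = False
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter('%(levelname)s: %(message)s'))
logger.addHandler(handler)

logger.warning('shown to every user')    # passes the INFO threshold
logger.debug('hidden at INFO')           # filtered out, so users never see it

logger.setLevel(logging.DEBUG)           # e.g. when debugging is enabled
logger.debug('shown when debugging')

output = stream.getvalue()
```

With the threshold at `INFO`, only the warning is written; the debug message appears only after the level is lowered, which is why a message useful mainly for debugging is less confusing at `debug`.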