[GCP] Adopt new provisioner to stop/down clusters #2199
Merged
Changes from all commits (18 commits):
7d8b2b2 WIP (Michaelvll)
0dc3bf4 Add instance handler for tpu node (Michaelvll)
04cfa1a format (Michaelvll)
8c8384c format (Michaelvll)
c886cb8 wip (Michaelvll)
0812b55 typo (Michaelvll)
76a8c67 remove log (Michaelvll)
5874d59 typo (Michaelvll)
5b32455 add comment (Michaelvll)
53fabad fix typo (Michaelvll)
a8afc14 format (Michaelvll)
d23f7ed larger wait time for GCP (Michaelvll)
2655a64 terminate all instance states (Michaelvll)
022f29d Merge branch 'master' of github.com:skypilot-org/skypilot into gcp-ne… (Michaelvll)
3eb38c0 address comment (Michaelvll)
eaf1db8 Merge branch 'gcp-new-termination' of github.com:skypilot-org/skypilo… (Michaelvll)
19f3438 Avoid caching (Michaelvll)
20cbe02 Merge branch 'master' of github.com:skypilot-org/skypilot into gcp-ne… (Michaelvll)
New file (+3 lines):

```python
"""GCP provisioner for SkyPilot."""

from sky.provision.gcp.instance import stop_instances, terminate_instances
```
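For orientation, here is a minimal sketch of how these re-exported entry points could be called. It assumes this file is the `sky.provision.gcp` package `__init__` (as the re-export suggests); the `provider_config` keys (`availability_zone`, `project_id`, `_has_tpus`) are the ones read by the implementation below, while the cluster name, zone, and project values are made up for illustration. The real caller is SkyPilot's provisioner layer, not user code.

```python
# Assumption: this file is the sky.provision.gcp package __init__, so the
# re-exported functions are importable as below. Values are hypothetical.
from sky.provision.gcp import stop_instances, terminate_instances

config = {
    'availability_zone': 'us-central1-a',  # hypothetical zone
    'project_id': 'my-gcp-project',        # hypothetical project
    '_has_tpus': False,
}

# Stop every instance labeled with the cluster name, then poll until GCP
# reports them all as stopped (see stop_instances below).
stop_instances('my-cluster', provider_config=config)

# Or tear the cluster down; terminate_instances issues the delete requests
# but does not wait for deletion to finish.
terminate_instances('my-cluster', provider_config=config)
```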
New file (+153 lines):

```python
"""GCP instance provisioning."""
import collections
import time
from typing import Any, Callable, Dict, Iterable, List, Optional, Type

from sky import sky_logging
from sky.provision.gcp import instance_utils

logger = sky_logging.init_logger(__name__)

MAX_POLLS = 12
# Stopping instances can take several minutes, so we increase the timeout
MAX_POLLS_STOP = MAX_POLLS * 8
POLL_INTERVAL = 5

# Tag uniquely identifying all nodes of a cluster
TAG_RAY_CLUSTER_NAME = 'ray-cluster-name'


def _filter_instances(
    handlers: Iterable[Type[instance_utils.GCPInstance]],
    project_id: str,
    zone: str,
    label_filters: Dict[str, str],
    status_filters_fn: Callable[[Type[instance_utils.GCPInstance]],
                                Optional[List[str]]],
    included_instances: Optional[List[str]] = None,
    excluded_instances: Optional[List[str]] = None,
) -> Dict[Type[instance_utils.GCPInstance], List[str]]:
    """Filter instances using all instance handlers."""
    instances = set()
    logger.debug(f'handlers: {handlers}')
    for instance_handler in handlers:
        instances |= set(
            instance_handler.filter(project_id, zone, label_filters,
                                    status_filters_fn(instance_handler),
                                    included_instances, excluded_instances))
    handler_to_instances = collections.defaultdict(list)
    for instance in instances:
        handler = instance_utils.instance_to_handler(instance)
        handler_to_instances[handler].append(instance)
    logger.debug(f'handler_to_instances: {handler_to_instances}')
    return handler_to_instances


def _wait_for_operations(
    handlers_to_operations: Dict[Type[instance_utils.GCPInstance], List[dict]],
    project_id: str,
    zone: str,
) -> None:
    """Poll for compute zone operation until finished."""
    total_polls = 0
    for handler, operations in handlers_to_operations.items():
        for operation in operations:
            logger.debug(
                'wait_for_compute_zone_operation: '
                f'Waiting for operation {operation["name"]} to finish...')
            while total_polls < MAX_POLLS:
                if handler.wait_for_operation(operation, project_id, zone):
                    break
                time.sleep(POLL_INTERVAL)
                total_polls += 1


def stop_instances(
    cluster_name: str,
    provider_config: Optional[Dict[str, Any]] = None,
    included_instances: Optional[List[str]] = None,
    excluded_instances: Optional[List[str]] = None,
) -> None:
    assert provider_config is not None, cluster_name
    zone = provider_config['availability_zone']
    project_id = provider_config['project_id']
    name_filter = {TAG_RAY_CLUSTER_NAME: cluster_name}

    handlers: List[Type[instance_utils.GCPInstance]] = [
        instance_utils.GCPComputeInstance
    ]
    use_tpu_vms = provider_config.get('_has_tpus', False)
    if use_tpu_vms:
        handlers.append(instance_utils.GCPTPUVMInstance)

    handler_to_instances = _filter_instances(
        handlers,
        project_id,
        zone,
        name_filter,
        lambda handler: handler.NEED_TO_STOP_STATES,
        included_instances,
        excluded_instances,
    )
    all_instances = [
        i for instances in handler_to_instances.values() for i in instances
    ]

    operations = collections.defaultdict(list)
    for handler, instances in handler_to_instances.items():
        for instance in instances:
            operations[handler].append(handler.stop(project_id, zone, instance))
    _wait_for_operations(operations, project_id, zone)
    # Check if the instance is actually stopped.
    # GCP does not fully stop an instance even after
    # the stop operation is finished.
    for _ in range(MAX_POLLS_STOP):
        handler_to_instances = _filter_instances(
            handler_to_instances.keys(),
            project_id,
            zone,
            name_filter,
            lambda handler: handler.NON_STOPPED_STATES,
            included_instances=all_instances,
        )
        if not handler_to_instances:
            break
        time.sleep(POLL_INTERVAL)
    else:
        raise RuntimeError(f'Maximum number of polls: '
                           f'{MAX_POLLS_STOP} reached. '
                           f'Instance {all_instances} is still not in '
                           'STOPPED status.')


def terminate_instances(
    cluster_name: str,
    provider_config: Optional[Dict[str, Any]] = None,
    included_instances: Optional[List[str]] = None,
    excluded_instances: Optional[List[str]] = None,
) -> None:
    """See sky/provision/__init__.py"""
    assert provider_config is not None, cluster_name
    zone = provider_config['availability_zone']
    project_id = provider_config['project_id']
    use_tpu_vms = provider_config.get('_has_tpus', False)

    name_filter = {TAG_RAY_CLUSTER_NAME: cluster_name}
    handlers: List[Type[instance_utils.GCPInstance]] = [
        instance_utils.GCPComputeInstance
    ]
    if use_tpu_vms:
        handlers.append(instance_utils.GCPTPUVMInstance)

    handler_to_instances = _filter_instances(handlers, project_id, zone,
                                             name_filter, lambda _: None,
                                             included_instances,
                                             excluded_instances)
    operations = collections.defaultdict(list)
    for handler, instances in handler_to_instances.items():
        for instance in instances:
            operations[handler].append(
                handler.terminate(project_id, zone, instance))
    _wait_for_operations(operations, project_id, zone)
    # We don't wait for the instances to be terminated, as it can take a long
    # time (same as what we did in ray's node_provider).
```
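Both entry points program against `instance_utils.GCPInstance` handlers (`GCPComputeInstance`, plus `GCPTPUVMInstance` when `_has_tpus` is set), whose implementation is not part of this excerpt. The sketch below is an inferred outline of the interface those handlers are assumed to expose, based only on how they are called above; the concrete state lists and signatures are illustrative, not the actual `instance_utils` code. Note also that the stop path above polls for at most MAX_POLLS_STOP * POLL_INTERVAL = 96 * 5 = 480 seconds before raising.

```python
from typing import Dict, List, Optional


class GCPInstance:
    """Assumed shape of a handler, inferred from how it is used above."""

    # States that should trigger a stop, and states that do not yet count as
    # stopped. The values here are illustrative guesses, not the real lists.
    NEED_TO_STOP_STATES: Optional[List[str]] = ['PROVISIONING', 'RUNNING']
    NON_STOPPED_STATES: Optional[List[str]] = ['PROVISIONING', 'RUNNING',
                                               'STOPPING']

    @classmethod
    def filter(cls, project_id: str, zone: str, label_filters: Dict[str, str],
               status_filters: Optional[List[str]],
               included_instances: Optional[List[str]],
               excluded_instances: Optional[List[str]]) -> List[str]:
        """Return the names of instances matching the labels and states."""
        raise NotImplementedError

    @classmethod
    def stop(cls, project_id: str, zone: str, instance: str) -> dict:
        """Issue a stop request and return the zone operation."""
        raise NotImplementedError

    @classmethod
    def terminate(cls, project_id: str, zone: str, instance: str) -> dict:
        """Issue a delete request and return the zone operation."""
        raise NotImplementedError

    @classmethod
    def wait_for_operation(cls, operation: dict, project_id: str,
                           zone: str) -> bool:
        """Return True once the given operation has finished."""
        raise NotImplementedError
```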
Reviewer comment: Why does this need to change?
Reply: This is a known issue: 120 seconds is not enough when launching multiple nodes on GCP. We need to increase it to make sure `sky launch --cloud gcp --num-nodes 8` works.
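The file and constant changed by this part of the PR are not visible in this excerpt; the snippet below is a purely hypothetical illustration of the kind of adjustment being discussed (name and value invented).

```python
# Hypothetical illustration only: the real constant, file, and new value
# changed by this PR are not shown in this excerpt.
# Idea: a 120 s provisioning wait is too short when GCP launches many nodes
# at once (e.g. `sky launch --cloud gcp --num-nodes 8`), so the cap is raised.
GCP_MULTI_NODE_PROVISION_TIMEOUT_SECONDS = 4 * 120
```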