Fast removal of S3 Storage buckets with 10k-1 million objects (#893)
* Fixes for document and README (#360)

* Update the installation and tutorial

* fix

* Add description in README

* fix

* Add link to examples

* Polish readme

* Update readme

* readme

* Fix comment

* Update links in documentation and README (#366)

* Improve adaptors (#359)

* gather adaptors

* give instructions

* Limit grpcio version for #369 (#370)

* Limit grpcio version

* rename disk_size

* format

* Remove typo from README (#371)

* Fix Azure credential errors on AWS (#372)

* fix typo (#376)

* Restore README to the original place (#374)

* restore README to the original place

* Support 'name:cnt' accelerators spec in YAML (#396)

* Support 'name:cnt' accelerators spec in YAML

* Fixes #373: 'sky start/down' should error out

* README example: change `K80:4` to `V100:1` (#399)

Reasons:
- `K80:4` is not available on AWS, arguably the most common cloud our target users use (so they will hit resource unavailable)
- `V100:1` is available on all three clouds and is a popular GPU

* Workdir Docs (#405)

* Ok

* Addressed all comments

* Changed to new git link

* ok

* Fix `check_local_gpus` (#411)

* Fix check_local_gpus

* Break a line to meet 80 char constraint

* Address the review comments

* Sky Storage CLI (#338)

* Initial Draft

* Delete bad file

* ?

* ???

* format

* sky storage status; sky storage down

* Fixed

* Done

* Addressed Comments

* Addressed Comments, TODO: Documentation

* Documentation added

* ok

* Sky Storage CLI - Polishing (#414)

* Initial Draft

* Delete bad file

* ?

* ???

* format

* sky storage status; sky storage down

* Fixed

* Done

* Addressed Comments

* Addressed Comments, TODO: Documentation

* Documentation added

* ok

* Addressed Zhanghao's comments

* Fix

* SGTM

* Decouple `ray up` and user's file_mounts / setup (#407)

* wip: Add setup in provision pipeline

* Fix gcp/azure

* remove useless variables

* minor fix

* Add some TODOs

* Fix comments

* Fix comments

* Fix gcp/azure initialization_commands

* Remove setup from template

* Fix setup directory

* Change rsync back to -Pavz

* Remove unused argument

* fix file_mount/dir_mount

* Minor fix (#419)

* Revert to using file_mounts for object mounting (#412)

* WIP Debug

* revert file_mounts using storage mounts and update docs

* remove print

* Fix credential cmd

* lint

Co-authored-by: Zongheng Yang <[email protected]>

* [docker] Quick fix to remove already-dead docker containers from `sky status` (#427)

* A quick fix in killing sky docker containers

* Add a comment

* Update documentation with provisioning quickstart (#415)

* Add improved progress meters + log suppression (#425)

* Add cross-cloud failover to docs. (#433)

* Add docs for the problem that `bash script.sh` cannot run `conda activate` (#422)

* Add conda activate support to bashrc

* Add doc and make sure conda activate works

* bring back conda activate command for GCP

* Move comment to quickstart

* format

* Fix comments

* Add test/example of using user_script

* Fix indents

* bash -i only for conda activate

* Fix SKY_NODE_IPS failing to pass to the shell script

* Update readme

* update env_check

* Fix comments

* Change to head -n1

* Remove `-i` for `bash user_script.sh` (#436)

* Remove -i option

* Fix docs

* Fix run.sh

* add comments

* fix comment

* format

* Fix Azure key generation (#429)

* Fix az key generation

* Add private keys to retval

* [Storage/File-Mounting] Fix Symlink Issue + Fix File Mounting (#431)

* ok

* Shorten YAML

* Ok

* Done

* Nit added

* Romil's changes

* Fix error message when azure-cli is not installed (#424)

* Azure import

* Add docs.

* Fixing Relative Directory for Workdir and File Mounts (`~/sky_workdir/...` not `~/sky_workdir/workdir/...`) (#443)

* Fix

* Resnet Example working

* Fix

* Fix gitignore for rsync (#448)

* Fix gitignore in file_mounts

* Add test

* Update j2 for filter

* format

* Remove tracebacks for exceptions to improve UX (#441)

* Remove tracebacks

* Fix job fail color

* fix comments

* Hide tracebacks

* Fix #442

* fix `workdir` becomes `~/sky_workdir/workdir` #442

* add logging error for job_id problem

* format

* update error message for retry

* Update docs

* Fix login

* Add more checks

* format

* fix return type

* format

* refactor returncode handling

* Update return handling

* Fix filemount testing

* Fix delayed logs of `sky logs` locally vs `tail -f` remotely (#454)

* Switch to ray job logs

* Optimize log tailing

* format

* Fix exec logging

* Add comment

* Move back to run_with_log

* Bring back our tailing function for progress bar

* format

* Fix comments

* Remove check argument from run_with_log

* Add comment

* Add comment

* lint

* [Storage/File Mount] Check `workdir` and Filemount `src` Size (#440)

* Check workdir size

* Temp

* Done

* Addressed comments

* Fix assertion on Azure's A100/V100 and fuzzy resource search (#368)

* Fix assertion on Azure's A100/V100

* Fix azure catalog

* Add simple fuzzy search for accelerators

* Fix issues and add colors

* Update azure.csv to fix bug

* Only keep one candidate, cleaner msg

* yapf

* Make one line msg

* refactor cloud check

* Address comments and add more hints

* yapf

* fix space

* Address comments

* Fix using_file_mounts (#465)

* Improve docs / logging messages. (#466)

* Minor: improve setup logging.

* CLI messages fix.

* Update docs on code sync, artifacts

* Minor: remove a new line from a message

* Update docs

* Fix `sky logs` with interactive ssh causing job status wrongly set if ctrl-c'ed (#473)

* Add warning for ctrl-c

* Make `sky logs` read only

* Fix logs

* add comment

* Pricing information for `sky show-gpus` (#472)

* [Azure] Downgrade the image version for K80 instances (#460)

* Downgrade the image version for Azure K80 instances

* Remove a reference link

* Minor refactoring

* Consider non-gpu instances

* Add clouds=azure

* Minor fix

* Remove unnecessary if statement

* Minor fix

* Minor fix in var name

* Minor fix

* Address comments

* Centralize sky_logs, polish logging, and update `sky start` documentation (#432)

* Gcloud Authentication Bug Fix (#437)

* testing

* Fix

* No retry

* fix tail_logs (#476)

* Mention .gitignore for workdir sync in docs (#470)

* Suppress rsync output (#459)

* Suppress rsync output

* Fix spaces

* fix path

* fix

* Remove rsync logs

* fix logging

* Change order

* Small fix for the cloud_stores cli installation (#477)

* Refactor cloud storage

* Test aws before pip install

* refactor

* Add test for s3 bucket

* fix comment

* Revert "Small fix for the cloud_stores cli installation" (#479)

* Revert "Small fix for the cloud_stores cli installation (#477)"

This reverts commit 1a0bcc141c5cf3892ffa116eaf24e7fdb6bd7c5a.

* Add back the aws cli check

* Add back s3:// check in file_mounts

* format

* Fix hint for multiple instance candidates (#475)

* Add confirmation prompt for cluster management operations (#471)

* Minor fix for sky launch --gpus tpu-* (#481)

* Docs: polish installation & quickstart (#478)

* Update docs installation / README

* quickstart polish

* Polish initial messages

* Spelling

* Updates

* Fix comments

* Fix prompting for launching on a stopped cluster. (#487)

* Fix credential mounting (#483)

* Fix credential mounting

* format

* Change back to ~/.config/gcloud

* Add exclude files in gcp credential mounting

* format

* Change using_file_mounts to multinodes to let it check more things

* Fix docs

* format

* Fix gcp installation hints

* Parallel Setup + Filemounting (#458)

* Parallel Setup

* Done

* Done

* Fix color

* Fix

* wow

* Better parallel solution

* Better

* using imap

* Tested with failed setup

* format

* Update the workdir uploading logic

* format

* Fix indent

* Fix comments

* Update log

* Remove context manager

* Change logging

* Fix doc

* Use context manager

* Add exception

* Remove num_threads limit

* Update comment

* Format

* lint

* Add timing

* format

Co-authored-by: Zhanghao Wu <[email protected]>

* Fix Azure Promo Instances (#485)

* Fix azure data fetcher

* Slightly safer

* Add safeguard for missing price

* Fix

* Simplify

* Better impl

* Polish workdir/file_mounts validation and logging. (#495)

* Polish workdir/file_mounts validation and logging.

* Fix cloud URIs being displayed with an extra slash

Previous:
   gs://cloud-tpu-test-datasets/fake_imagenet/train-00001-of-01024/ -> /train-00001-of-01024
Now:
   gs://cloud-tpu-test-datasets/fake_imagenet/train-00001-of-01024 -> /train-00001-of-01024
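A minimal sketch of the kind of normalization this fix describes, not the actual change in this commit. The helper name `format_mount` is hypothetical; it only assumes that the displayed source should lose its trailing slash while the `://` separator stays intact.

```python
def format_mount(src: str, dst: str) -> str:
    """Render a 'source -> destination' line for display (hypothetical helper)."""
    scheme, sep, rest = src.partition('://')
    if sep:
        # Cloud URI: strip trailing slashes from the path part only, so the
        # '://' separator itself is never touched.
        src = f'{scheme}://{rest.rstrip("/")}'
    # Local paths are left unchanged in this sketch.
    return f'{src} -> {dst}'


print(format_mount(
    'gs://cloud-tpu-test-datasets/fake_imagenet/train-00001-of-01024/',
    '/train-00001-of-01024'))
```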

* Fail early for non-existent local file mount sources.

* !r

* Docs: polish quickstart, interactive-nodes (#491)

* Polish interactive-nodes.rst

* Polish quickstart, interactive-nodes

* Address comments

* Address comments

* Fix rsync_exclude for gcp credential file_mounts (#496)

* Fix rsync_exclude for gcp credential file_mounts

* fix comment

* address comment

* Fix: return when there are no matched clusters (#500)

* Make GPU/TPU names case-insensitive (#463)

* task.py: validate workdir by expanding full path (#505)

* Revamp docs (getting started; use cases). (#503)

* Revamp docs (getting started; use cases).

* task.py: validate workdir by expanding full path

* romil comments

* fixes

* comments

* zhanghao's comment

Co-authored-by: Romil <[email protected]>
Co-authored-by: Zhanghao Wu <[email protected]>

* Fix ulimit for provision (#512)

* fix ulimit for aws

* update templates for ulimit

* Add comment

* README: remove quick example.  (#521)

* README: remove quick example.  Added notes for developers.

* Remove pip install azure-cli==2.30.0 (already in setup.py)

* Reword

* [docker] Change remote working directory into `/sky_workdir` (#497)

* Fix docker workdir name

* Revert the change in create_dockerfile

* Minor fix

* Fix Azure resource leak (#501)

* fix

* comment and lint

* Timeout for multi-node launching (#504)

* add timeout for waiting cluster ready

* Timeout on launching nodes

* Fix gcp azure upscaling speed

* Timeout only checks node launching

* fix logging

* format

* Redirect the ray status to log file as well

* format

* replace ray exec

* fix comment

* add comment

* Fix ssh_credential

* Fix ip fetching

* Fix count _worker logic

* Refactor

* fix

* format

* bug with yapf

* add comment

* remove unused import

* wording

* Add retry for head ip fetching at the first time

* Fix get head ip

* fix

* fix

* fix assert

* address comments

* format

* change back to info mode

* address comments

* fix color

* [docker] Fix a bug in updating image list (#525)

* Fix help str for aws/azure credential check (#532)

* Fix help str for azure credential check

* Fix aws check

* format

* Fix TPU logging when failed to launch (#526)

* Fix tpu logging info

* format

* fix index

* fix

* change to zone_str

* Add progress for TPU launching

* change back to original logging

* Fix console

* Add clear

* move console back

* Indent

* restore unnecessary changes

* format

* Fix aws apt install (#529)

* Add kill dpkg

* Add comment

* optimize a bit

* Add tree back to using_file_mounts

* Docs: more polishing. (#523)

* Polish grid-search.rst

* Polish more; address #503 comments

* Revamp - Syncing Code and Artifacts

* Address comments

* Polish

* Add copy button to code blocks in documentation (#534)

* Fix non-interactive SSH commands in task setup (#533)

* Simple SSH into Multi-node Workers from Local (#469)

* Done

* format.sh

* temp

* Done

* Remote SSH

* Fix

* Addressed comments

* Addressed Zhanghao's edge case

* ok

* Fix

* Docs: significantly de-ink quickstart (#539)

* Docs: significantly de-ink quickstart

* More polishing.

* deleted:    source/reference/iterative-development.rst

* Address comments.

* Revert "Simple SSH into Multi-node Workers from Local (#469)" (#552)

This reverts commit d2774206e1eedb1b79a22cab06b2ccc1df807691.

* Fix the dpkg lock for apt install (#554)

* Disable unattended-upgrade

* minor: set ray version constraint

* Disable unattended-upgrades for azure as well

* Address comments

* fix comments

* Added FileLock on cluster launch and teardown (#510)

* added FileLock for cluster launch

* Run yapf and pylint

* made suggested changes to locking during provisioning and teardown

* run format

* removed use of pathlib

* test to see if updating filelock version works

* updated version to >=

* disabled pylint warning for filelock & left notes to upgrade when python version is upgraded

* applied formatting

* moved pylint disable to block level

* fixed what pylint ignored

* use handle.cluster_name

* Polish quickstart. (#560)

* Fix the lock logic for provision (#562)

* Fix the lock logic for provision.

* format

* lint

* Improve instance hints by moving messages to optimizer & speedup sky commands (#493)

* Improve instance hints by moving to optimizer

* Fix

* Fix

* Fix cpunode

* Remove traceback message

* Merge master and improve speed

* Fix tests

* Fix test and uppercase TPU

* case-insensitive check for string matching

* yapf

* Fix

* simplify

* Address comments

* Refactor

* Fix message and types

* AWS Ray Setup - Alias Python=Python3 if Alias does not exist (#558)

* Alias python for python3

* ok

* Pip

* ok

* Improve optimizer message (#571)

* Fix sky behavior when `ray status` behaves unexpectedly on GCP (#574)

* Fix Weird behavior of `ray status` for GCP #573

* fix comment

* address comment

* Allowed forced teardown when called within locked code (#582)

* Allowed forced teardown when called within locked code

* hopefully fixed pylint issue

* removed forced down from try catch for timeout

* Set $TPU_NAME during provisioning (#587)

* Sky Installation for Mac <1.15 Warnings + Doc (#569)

* Mac fix

* Fix

* Fix

* done

* Ok

* quotes

* Fix prompt, add fallback retry for INIT cluster (#559)

* Fix loggings and fallback logic

* Fix fallback

* Fix loggings

* format

* Rename the variable

* format

* Add assert

* Fix dryrun

* remove assert

* address comments and handle UP status

* format

* Add comments

* lint

* fix lock for cluster status change

* format

* address comment

* address comment

* fix smoke tests

* Add a TODO

* fix stopped multi-node being terminated

* format

* fix function name

* stop/terminate for head_failed as well

* Query ip error

* format

* format

* add comment

* status back to up

* format

* add back assert

* fix tpu merge issue

* Fix CLI not installed for `sky check` (#592)

* add hint for installing gcloud

* fix indent

* Hint in sky check as well

* fix subprocess output

* Fix azure/gcp check

* fix import

* Fix azure check

* Remove output

* Address comment

* update info

* remove tpu gcloud dependency, due to the fix of `sky check`

* update optimizer info

* format

* Upgrade aws ami and add back us-west-1 (#564)

* Upgrade aws ami

* fallback to lower nvidia driver version for K80

* remove print

* add back us-west-1

* Fix order of setup (#593)

* unique cluster name list for sky down (#596)

* SSH into Worker Nodes from Local (#557)

* Done

* format.sh

* temp

* Done

* Remote SSH

* Fix

* Addressed comments

* Addressed Zhanghao's edge case

* ok

* Fix

* Fix

* Fix

* Doc changes

* Comments incorporated

* Glob cluster name for sky down (#598)

* glob search for `sky down`

* add doc

* format

* address comments
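For illustration, glob matching of cluster names combined with the de-duplication from #596 could look roughly like the sketch below. It uses the standard-library `fnmatch` module; the function name `match_clusters` and the sample cluster names are hypothetical, and this is not necessarily how `sky down` implements it.

```python
import fnmatch
from typing import List, Sequence


def match_clusters(patterns: Sequence[str],
                   existing: Sequence[str]) -> List[str]:
    """Expand glob patterns against known cluster names, without duplicates."""
    matched: List[str] = []
    for pattern in patterns:
        # fnmatch.filter keeps the order of `existing`.
        for name in fnmatch.filter(existing, pattern):
            if name not in matched:
                matched.append(name)
    return matched


clusters = ['test-a', 'test-b', 'prod']
print(match_clusters(['test-*', 'test-a'], clusters))
```

A pattern that matches nothing simply contributes no names, so the caller can decide whether an empty result should be reported as an error.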

* Fix cuda version for tf training (#603)

* Add `conda init` to AWS setup commands (#604)

* Add conda init to aws setup cmds

* Move the line upward

* Fix typo

* Created and handled teardown success bool (#581)

* Created and handled teardown success bool

* better error messages + return true + include stop

* incorporate _force

* formatting

* minor fixes

* Add instructions in document for quota increase (#588)

* Add instructions for quota increase

* Add hint for azure subscriptions

* Address comments

* update

* Add fine-grained logging during provisioning (#565)

* Remove job_submit.log & clarify `sky exec --help` (#611)

* Remove job_submit.log

* Clarify `sky exec --help`

* Update comments

* Parallel runner of smoke tests. (#584)

* run_smoke_tests.py: parallel runner of smoke tests.

* gcp-tpu-delete.sh.j2: remove --async to avoid race conditions.

* Update.

* `sky logs --status`: exit with appropriate exit code.

* test script: echo trick; query job statuses; cluster names

* Use pytest for smoke_test

* ignore smoke for github

* Add file lock for the wheel building

* Fix file_mounts for ubuntu 20

* format

* Fix job_status

* Update readme

* Print logs while tests are running.

* Update test script

* git rm examples/run_smoke_tests.py

* Minor

* Fix 'sky down non_exist_cluster' message.

* move wheel lock to temp folder

* Update logs for testing

Co-authored-by: Zhanghao Wu <[email protected]>

* Update GCP image for OS consistency and fix GCP's spot CPU (#614)

* Update GCP image and Fix spot CPU

* Fix worker nodes and use same image for K80/V100

* Fix worker

* Fix cpunode

* Optimize `sky job queue` when a large number of jobs are running (#616)

* Only update job status during provision

* format

* Only update job_status when previously stopped

* address comments

* fix comment

* fix merge error

* longer controlpersist

* apt update for aws (#620)

* add apt update for aws

* address comment

* address comment

* [Docker] Install sudo in docker (#595)

* Alias sudo for docker

* lint

* fixes

* Install sudo instead

* Refactor JobLibCodeGen and fix stale job for restarting cluster (#621)

* Refactor JobLibCodeGen and fix job status update logic

* Fix job_lib

* fix job lib again

* Add comment

* fix job_lib

* fix update_status

* Fix INIT status update

* format

* fix

* Add comment

* add assertion

* address comments

* Address comment

* Make azure-start-stop use 1 node to speed up tests. (#619)

* Make azure-start-stop use 1 node to speed up tests.

* Add --num_nodes to both launch and exec.

Tested:
- sky launch --num_nodes=2 'echo $(hostname)' -c test
- sky exec test 'echo $(hostname)' --num_nodes=2
- sky exec test 'echo $(hostname)'

* Support test_azure_start_stop_two_nodes() under --runslow.

* Support passing a test name.

* Make Task.num_nodes a property and validate it.

* Fix hint msg (#634)

* Make test cluster names unique for (user, mac address). (#639)

* Make test cluster names unique for (user, mac address).

* Make sky logs --status print job id.

Ex:

Job 1 SUCCEEDED
Job 2 SUCCEEDED
Job 3 SUCCEEDED
Job 4 SUCCEEDED
Job 5 SUCCEEDED
Job 6 SUCCEEDED
Job 7 SUCCEEDED
Job 8 SUCCEEDED
Job 9 SUCCEEDED
Job 10 SUCCEEDED
Job 11 SUCCEEDED
Job 12 SUCCEEDED
Job 13 SUCCEEDED
Job 14 SUCCEEDED
Job 15 SUCCEEDED
Job 16 SUCCEEDED
Terminating cluster test-multi-echo-zongheng-fe6d...done.

* Roll back to debian-based image (#636)

* Fix the second `sky launch` assertion and ray job status for failed job (#638)

* Fix job fail status in ray job

* Fix second `sky launch`

* format

* add launch again in the smoke test

* address comments

* Wait all the logs

* Fix get_status

* fix print

* format

* Add disk_size to YAML ref; other minor cleanups. (#643)

Closes #546.

* Add a skylet daemon and fix job status problem (#623)

* Refactor JobLibCodeGen and fix job status update logic

* Fix job_lib

* fix job lib again

* Add comment

* fix job_lib

* fix update_status

* Fix INIT status update

* format

* add daemon

* fix

* Add comment

* start skylet

* format

* pylint

* fix skylet start in template

* Fix job fail status in ray job

* Fix second `sky launch`

* format

* fix skylet checking in the test

* fix skylet launching

* remove -v for ray up

* format

* address comments

* Only teardown the cluster when test succeeded

* fix space

* Align underscore / dash for num-nodes cli option (#654)

* align underscore with dash for num-nodes

* fix num_nodes

* Optimize backend_utils.get_node_ips(), esp. for Azure. (#630)

Azure's APIs are extremely slow; as a result, ray get-head-ip and the
like are very slow for Azure clusters.

The below is for a 1-node Azure cluster.

Before: takes 1min 14sec

  » sky exec b2 --workdir=. -- <cmd>
  I 03-22 15:45:54 cloud_vm_ray_backend.py:1296] In sync_workdir
  I 03-22 15:47:08 cloud_vm_ray_backend.py:1301] Done get_node_ips()

After: instant

  » sky exec b2 --workdir=. -- <cmd>
  I 03-22 15:54:59 cloud_vm_ray_backend.py:1296] In sync_workdir
  I 03-22 15:54:59 cloud_vm_ray_backend.py:1302] Done get_node_ips()

* Minor touches on docs + improve install UX (#660)

* Minor touches on docs.

* Remove awscli pinning to not download a bunch of boto3 versions.

* Remove awscli pinning in cloud_stores

* Refactor type_checking (#655)

* refactor type_checking

* Address comments

* Fix resources_lib

* Build Sky local wheel in a unique tempdir per launch. (#657)

* Build Sky local wheel in a unique tempdir per launch.

* Refactor wheel cleanup

* reorg statements

* Fix caller.

* Tear down head node even for HEAD_FAILED. (#661)

* Added sky down --purge (#635)

* added sky down --purge

* made suggested edits

* minor formatting and changes

* fixed force

* output formatting fix

* Parallel sky down (#659)

* fix multi-thread

* refactor

* Address comment

* format

* hidden variable

* Progress bar for termination

* fix

* format

* mitigate logging problem

* rename

* rsync: --filter on .git/info/exclude (#652)

* rsync: --filter on .git/info/exclude

* Update docs.

* Use --exclude-from, and check if git exclude exists

* Update docs
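As a rough sketch of "use `--exclude-from`, and check if git exclude exists": the helper below builds an rsync argument list and only adds the flag when `.git/info/exclude` is present in the source directory. The function name `build_rsync_command` and the destination string are hypothetical; the `-Pavz` flags are taken from an earlier entry in this log, and the real command in the backend may differ.

```python
import os
from typing import List


def build_rsync_command(src_dir: str, dst: str) -> List[str]:
    """Build an rsync argv that honors .git/info/exclude when it exists."""
    cmd = ['rsync', '-Pavz']
    exclude_file = os.path.join(src_dir, '.git', 'info', 'exclude')
    if os.path.isfile(exclude_file):
        # Only pass the flag when the file exists; rsync would otherwise
        # fail on a missing --exclude-from file.
        cmd.append(f'--exclude-from={exclude_file}')
    # A trailing slash on the source syncs the directory's contents.
    cmd += [src_dir.rstrip('/') + '/', dst]
    return cmd


print(build_rsync_command('.', 'user@host:~/sky_workdir'))
```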

* Fix repeating IP Address bug (#663)

* Fix output for parallel down (#666)

* Fix output for parallel down

* format

* linting

* fix import

* Auto stop for cluster (#653)

* refactor skylet

* implement autostop event without cluster stopping

* wip

* Remove autostop from yaml file

* fix naming

* fix config

* fix skylet

* add autostop to status

* fix state and name match

* Replace min_workers/max_workers for gcp

* using ray up / ray down process

* fix stopping

* set autostop in global user state

* update sky status

* format

* Add refresh to sky status

* address comments

* comment

* address comments

* Fix logging

* update help

* remove ssh config and bring cursor back

* Fix exec on stopped instance

* address comment

* format

* fix

* Add test for autostop

* Fix cancel

* address comment

* address comment

* Fix `sky launch` changing autostop to -1

* format

* Add docs

* update

* Refactor DAG Optimizer (#628)

* Refactor optimizer

* Remove unnecessary import

* yapf

* Minor fix

* Add NotImplementedError

* Minor

* Rename vars & Annotate types

* Minor fix

* Minor

* Minor fix

* Fix type annotation

* yapf

* [Minor] Address comment

* Add type alias & enhance comments

* yapf

* Fix minor error in dag_lib.Dag

* Add is_chain to Dag

* Address comments

* yapf

* yapf

* Address comments

* Add total in optimizer msg

* Add a comment in is_chain

* Address reviews & Fix egress msg

* yapf

* Minor fix

* Fix egress msg

* yapf

* obj -> objective

* pass yapf

* cost -> cost/time

* Improve UX for autostopping (#676)

* Add progress bar for status refreshing

* Keep autostop after refreshing

* Add glob for start

* Fix message for autostop

* Fix messages for autostop

* Improve logging in error conditions & update auto-stop.rst (#675)

* Log error for HEAD_FAILED; don't duplicate logging for no_retry=True.

* Minor touches on auto-stop.rst

* Revert to only printing errors on GANG_FAILED

* Add GLOB for sky queue (#678)

* add glob for sky queue and start

* format

* Added price to sky status (#561)

* Added price to sky status

* put region and hourly price behind -a in sky status

* removed whitespace

* cache cluster region

* some touches + added computation to constructor

* forgot one fix

* formatting

* Add line processor abstraction and fix gitignored path size (#615)

* ILP-based DAG Optimizer (#637)

* Refactor optimizer

* Remove unnecessary import

* yapf

* Minor fix

* Add NotImplementedError

* ILP-based optimization

* yapf

* Add pulp in setup.py

* Minor

* Rename vars & Annotate types

* Minor fix

* Minor

* Minor fix

* yapf

* Fix type annotation

* yapf

* [Minor] Address comment

* Add type alias & enhance comments

* yapf

* Fix minor error in dag_lib.Dag

* Add is_chain to Dag

* Address comments

* yapf

* yapf

* Address comments

* Add total in optimizer msg

* Add a comment in is_chain

* Address reviews & Fix egress msg

* yapf

* Minor fix

* Fix egress msg

* yapf

* obj -> objective

* pass yapf

* cost -> cost/time

* Add random DAG generator

* Add random DAG generator

* Change variable names

* Minor fix

* yapf on test_random_dag.py

* Add docstring

* Rename

* _optimize_cost -> _optimize_objective

* Minor

* Default num_tasks to 10

* Add docstrings & Fix variable names

* yapf

* Minor

* Improve test_optimizer_random_dag

* yapf

* Fix optimizer

* Add docstring about ILP objective

* fix typo

* yapf

* Minor

* Add monkeypatch

* Fix docstring

* yapf

* Touches on docs. (#684)

* Touches on docs.

* Touches

* touches on yaml-spec

* update --gpus=all

* extend underline

* Storage mounting (#658)

* squash

* fix

* yapf workaround

* Update artifact syncing docs (#689)

* Update docs

* comments

* Docker example and fix goofys-docker mounting (#686)

* Fix docker killing

* Add docker example

* Fix docker example

* Fix

* Fix the docker example for pytorch installation

* Use model caching

* Mount output folder

* Permission issue

* remove useless lines

* fix license

* Add storage mounting for output and fix the goofys mounting

* Minor touches

* examples/docker_app.yaml -> examples/detectron2_docker.yaml

* Minor

* Fix gcp fuse.conf

* simplify file_mount options

* remove wait

Co-authored-by: Zongheng Yang <[email protected]>

* Add faq.rst; move CLI section to the bottom. (#690)

* Speedup ci/cd with parallelism (#693)

* Testing for different os and python version

* downgrade python

* speedup testing

* Remove 3.9

* generic workflow

* remove mac and add caching

* Verify accelerators and support float for inline GPU requirement V100:0.5 (#698)

* Verify accelerators and support float number for inline GPU specification V100:0.5

* format

* Add testing for the cli

* Remove unused function

* fix test

* Case insensitive gpu checking
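A hedged sketch of parsing an inline `name:cnt` spec such as `V100:0.5` with case-insensitive name checking. The function name `parse_accelerators`, the tiny `KNOWN_ACCELERATORS` tuple, and the default count of 1 when no count is given are all assumptions made for illustration; the real verification goes against the cloud catalogs.

```python
from typing import Dict, Union

# Hypothetical catalog for illustration only.
KNOWN_ACCELERATORS = ('K80', 'V100', 'A100', 'tpu-v3-8')


def parse_accelerators(spec: str) -> Dict[str, Union[int, float]]:
    """Parse 'name' or 'name:cnt' into {canonical_name: count}."""
    name, sep, count_str = spec.partition(':')
    # Case-insensitive lookup that returns the catalog's canonical spelling.
    canonical = {a.lower(): a for a in KNOWN_ACCELERATORS}.get(name.lower())
    if canonical is None:
        raise ValueError(f'Unknown accelerator: {name!r}')
    if not sep:
        return {canonical: 1}  # Assumed default when no count is given.
    try:
        count = float(count_str)
    except ValueError:
        raise ValueError(f'Invalid accelerator count: {count_str!r}') from None
    if not 0 < count < float('inf'):
        raise ValueError(f'Accelerator count must be positive: {count_str!r}')
    # Keep whole numbers as int so 'K80:4' yields 4 rather than 4.0.
    return {canonical: int(count) if count.is_integer() else count}


print(parse_accelerators('v100:0.5'))
print(parse_accelerators('K80:4'))
```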

* Storage Subdirectory Fix (#709)

* fix

* Ok

* Minor: suggest using a conda env when installing Sky. (#710)

* Fix optimizer messages (#711)

* Fix optimizer msg

* yapf

* Fix optimizer msg

* any -> all

* Delete hourly

* Minor

* Prompt Before Storage Initialization for `sky launch` (#701)

* Fix pushed

* Fix

* Fix sky start when cluster is autostopped without `sky status --refresh` (#713)

* Fix sky start when cluster is autostopped and `sky status --refresh` is not called

* format

* lint

* lint

* Rename status function

* Fix the hardcoded gcp project id (#692)

* Fix gcp project id

* Refactor

* Move azure subscription id to auth

* project id back to backend_utils

* Add todo

* Fix azure subscription id (#720)

* Skip cloud when instance type is provided and make resources immutable (#714)

* cloud is not required when instance_type set

* format

* Make Resources immutable to avoid verification problems

* update

* Fix version

* lint

* fix test

* Address comment

* Fix version

* oops

* Move accelerator_args setting to the _set_accelerators

* Remove default

* Instruction for file_mounts trick (#694)

* instruction for file_mounts trick

* address comment

* address comments

* Specify region in resources (#722)

* cloud is not required when instance_type set

* format

* Make Resources immutable to avoid verification problems

* update

* Fix version

* lint

* fix test

* wip

* Address comment

* Fix version

* Add region_limit to resources

* Add dryrun region test

* Add test_region

* Fix region

* format

* Add case-insensitive check for region

* fix region in config_dict

* fix test

* address comments

* Update doc

* address comment

* Ship AWS cloud provider (#725)

* ship aws

* adjust

* update template

* update LICENCE

* Goofys memory optimizations (#726)

* comments

* yapf

* Make sky exec submit jobs for inline commands; guard Azure disabled subscription error. (#727)

* Guard against Azure disabled sub error; tested with: sky gpunode

* Make `sky exec` submit jobs for inline commands as well.

* YAPF

* Fix exec examples + check empty entrypoint

* Enable empty string entrypoint for 'sky launch'.

* Add job duration and resources (#729)

* Add job duration and resources

* address comments

* Fix job status terminal -> non-terminal

* Change end_at to null if status is not terminal

* Default value to null for end_at

* Address comments

* UX improvement for sky stop and down (#734)

* UX improvement for sky stop and down

* Change skipped to be yellow

* yapf

* Fix AWS multi-node failure (#736)

* Fix TPU naming issue (#737)

* Format duration for jobs (#744)

* Format duration

* format

* Address comments

* Shared default security group for AWS (#731)

* shared default security group for AWS

* update

* update name

* Add TPU pods (#739)

* Init fix

* Fix

* quick fix

* Add notes

* Add us-east1 region

* Add assertion on multi-node TPU

* Fix

* Fix small nits in code (#748)

* Fix race condition for skylet daemon (#747)

* Fix race condition

* update comment

* Fix submitted_at

* Add retry for sky logs

* format

* Fix retry for job log

* Fix the setup progress bar and conda confirmation message (#746)

* Fix setup progress bar and confirmation of conda

* minor fix

* Address comments

* Hack to remove bash warning

* rm

* Fix the pipe output after ctrl-c

* Get rid of cloud dependencies in the config template (#749)

* Add W&B setup in FAQ (#751)

* Add W&B setup in FAQ

* Reflect comments

* Fix typo

* Small refactor of CLI and cloud (#752)

* Fix resources check

* Automatic cloud registry and task_option override

* fix test

* provide option to reset the setting

* Rename the option adding function

* Fix dummy cloud

* fix

* cloud register

* address comments

* address comments

* Fix descendant processes termination for sky cancel (#758)

* fix children processes termination

* fix comment

* fix sig

* fix PIPE kill

* Fix PIPE kill

* format

* Minor logging fix. (#760)

* Add test_cancel() to smoke. (#761)

* Support glob patterns for jobs for `sky logs -s` (#685)

* allow globbing when calling sky logs -s

* formatting

* removed whitespace

* small fixes

* final fixes

* Minor: update README (#762)

* Reformat `sky status` codepath and minor fixes (#721)

* refactor sky status codepath and minor fixes

* moved things over to status_utils

* fix conflicts and styling errors

* final changes

* added inits

* A quick fix for CLOUD_REGISTRY.from_str for None (#766)

* Fix killing the whole session problem (#767)

* Fix kill tmux issue

* Add comments

* Fix gpu issue

* Add test and fix

* format

* Add comment

* Add doc for gcloud 400 error (#764)

* Add gcloud 400 error hint

* add command

* Fix sky cancel (#770)

* Fix format all (#768)

* fix format all

* Managed spot (alpha) (#759)

* wip

* Fix resources check

* Automatic cloud registry and task_option override

* fix test

* provide option to reset the setting

* Rename the option adding function

* Fix dummy cloud

* fix

* wip

* Add Spot CLI support

* fix spot controller logic

* Add spot_recovery example

* fix strategy

* add todo

* Fix status

* add spot status

* fix todo

* Fix merge error

* Fix status

* fix signal

* format

* wip: integrate sky spot launch

* wip: spot launch integration

* Add autostop for the task controller

* Add spot_status cli

* Fix spot status

* fix spot status

* wip: add spot cancel

* fix spot status

* Fix

* disable autoscaling for spot instance

* Add tests and fix yaml specs

* fix tests and make controller resources unspecified

* fix test

* format

* Fix test spot

* format

* Fix resources setstate

* Fix empty run

* format

* Skip empty run section

* Add network check when doing status refresh

* Fix logging

* Remove buggy job not submitted yet

* Fix status refresh

* Add field check for task

* Align job ids

* Fix failover when recovering

* Address part of the comments

* Address part of the comments

* address comment in backend_utils

* Address part of the comments

* Fix cancelled job duration

* address comments

* Rename status.submit

* Fix cancel

* logging info

* Allow sky spot status to be run when a spot job is launching

* Add status cache and show job status after launch

* Fix failover

* Remove spot-controller from status

* Disable azure use_spot

* Enable gcp

* Fix optimizer dag dummy node

* Fix setup and recovery

* Merge branch 'master' of github.com:concretevitamin/sky-experiments into managed-spot

* format

* Fix optimizer test

* Address comments

* address comments

* Fix exception catch for job cancel

* Handle unexpected failure

* address comments

* address comments

* Default to use_spot for spot_launch

* Fix smoke test

* Fix smoke test

* fix the message after all retry fails

* Add back status check to smoke test

* format

* fix managed spot test

* format

Co-authored-by: Wei-Lin Chiang <[email protected]>

* Reuse sky wheels for efficiency (#769)

* cache sky wheels

* Add spot status --all, fix default value for use_spot and fix GCP dependency (#772)

* Add spot status --all and fix default value for use_spot

* Remove useless TODO

* make gcloud available interactively

* Fix gcp dependency installation and sky check

* Add test for managed spot instance

* format

* Fix security group mismatch issue with autostop (#780)

* fix

* delete unused global var

* restore the original Ray implementation

* Spot cancel -a and spot status --refresh (#776)

* add cancel -a

* add sky spot cancel -a and sky spot status --refresh

* fix

* fix refresh

* fix return

* fix refresh

* address comments

* Disallow long cluster names. (#781)

* Disallow long cluster names.

This fixed a smoke test failure.

* Fix another test name.

* Remove storage_demo.yaml from smoke tests (#782)

* Remove storage_demo.yaml from smoke

* add todo

* Fix smoke test for accelerators (#785)

* Fix yaml_spec test for accelerators

* format

* Fixing some errors encountered in smoke tests. (#787)

* spot_state: fix a SQL syntax error.

Previously:

» sky spot cancel -y -n test-managed-spot-zongheng-fe6d-1
E 05-02 14:37:02 backend_utils.py:989] Traceback (most recent call last):
E 05-02 14:37:02 backend_utils.py:989]   File "<string>", line 1, in <module>
E 05-02 14:37:02 backend_utils.py:989]   File "/home/ubuntu/.local/lib/python3.9/site-packages/sky/spot/spot_utils.py", line 68, in cancel_job_by_name
E 05-02 14:37:02 backend_utils.py:989]     job_ids = spot_state.get_nonterminal_job_ids_by_name(job_name)
E 05-02 14:37:02 backend_utils.py:989]   File "/home/ubuntu/.local/lib/python3.9/site-packages/sky/spot/spot_state.py", line 177, in get_nonterminal_job_ids_by_name
E 05-02 14:37:02 backend_utils.py:989]     rows = _CURSOR.execute(
E 05-02 14:37:02 backend_utils.py:989] sqlite3.OperationalError: near "job_name": syntax error
E 05-02 14:37:02 backend_utils.py:989]
E 05-02 14:37:02 backend_utils.py:994] Command failed with code 1: python3 -u -c 'from sky.spot import spot_utils; result = spot_utils.cancel_job_by_name('"'"'test-managed-spot-zongheng-fe6d-1'"'"'); print(result, end="", flush=True)'
E 05-02 14:37:02 backend_utils.py:995] Failed to cancel managed spot job

* Guard against `sky status --refresh` race

* Increase check_network_connection() timeout to 3s.

* Fixes

* Fix smoke

* Add a comment

* remove filter in aws command and fix spot failover (#789)

* Add status check when cancelling managed spot job and disable ambiguous termination of reserved clusters (#784)

* add cancel -a

* add sky spot cancel -a and sky spot status --refresh

* fix

* fix refresh

* fix return

* fix refresh

* address comments

* Fix cli prompt for cancel -a

* Add status check for job and controller on spot cancelling

* Remove output from setup for spot controller

* Disable `sky stop --all` to stop sky-spot-controller

* Fix operation str and remove controller from --all

* Check reserved cluster for termination operations

* fix name check

* Update message

* Add repr for name

* fix comment

* Disallow canceling on reserved clusters

* Add tests

* format

* address comments

* format

* fix storage dump/load

* fix smoke test for autostop

* Address comments

* format

* add assertion

* Fix output for cancel

* fix cancel handle

* Fix storage tests failing with parallel runners (#794)

* Change storage id to time_ns

* replace time_ns with time

* File mounts for managed spot jobs (#788)

* Fix storage from_yaml_config

* Add sky storage support for managed spot jobs

* add example for storage in managed spot job

* Fix testing for the spot storage

* format

* format

* Add todos

* Fix test

* Fix comments

* format

* Fix retry cnt

* Fix test name

* Fix test by adding flush

* delete storage after spot task finish

* persistent=false

* Add new line at the end

* address comments

* Minor: logging polishes. (#804)

* Minor: logging polishes.

* Revert storage.py to master; except StorageBucketGetError msg

* Minor fix to test_storage.py

* Logging fixes for `sky spot cancel` and `sky logs -s`.

* Fix copy_mount_str (#806)

* Fix TPU resource leak (#797)

* Fix tpu leak

* Add test

* TPU fixes: record `sky status` before provisioning tpu.

* Switch to v2 to save cost

Co-authored-by: Zongheng Yang <[email protected]>

* Add BERT and Resnet spot examples (#792)

* Add BERT and Resnet spot examples

* Fix

* Add lightning example

* Fix

* Update spot examples comments

* file mount storage

* Add resnet spot codes for version control

* yapf

* remove comments

Co-authored-by: Zongheng Yang <[email protected]>

* Add env option for sky launch and exec (#803)

* Add env option

* Fix task env config

* Add env doc and for spot_launch

* format

* add test for env vars

* address comments

* fix help str rendering

* Fail for unset env var

* format

* Fix race condition between job set_state and update_status (#805)

* Fix race condition between job set_state and update_status

* address comments

* address comments

* Fix bert qa example (#808)

* Fix `tpu_rc` unassigned bug; minor fix on task resources_str. (#810)

* Fix `tpu_rc` not assigned error, seen in `sky down`

* Fix bug: "resources={'K80': n}" not included in generated program

* Support streaming logs from the spot cluster through spot controller (#798)

* Streaming logs from the spot cluster through spot controller

* fix comment

* format

* Wait for job running for spot logs

* Fix job lib status check

* Add support for `sky spot logs` showing the latest log

* sleep for sky spot status

* remove unnecessary argument from execution

* Refactor

* Fix the keyboard interruption of sky spot logs

* format

* Add comments

* fix comments

* Add wait for the controller and job to be started for spot logs

* address part of the comments

* Fix job id logging

* update type hint

* Fix race condition between job set_state and update_status

* format

* address comment

* format

* Fix copy_mount_str

* Address comments

* fix logs

* address comments

* format

* Fix logs

* Fix repeat logs

* address comments

* Fix spot logs

* Fix logging

* refactor spot logs loop

* fix log

* Fix log_lib infinite loop printing "Job finished".

This appears to be an accidentally deleted line.

* UX: Remove a logging message

* fix spot_status

* fix comment

Co-authored-by: Zongheng Yang <[email protected]>

* Fix a rare failure of wheels build (#809)

There is a very narrow window in https://github.com/sky-proj/sky/blob/1fd81ef00884780cb476405e3f9da84fd05fbd47/sky/backends/wheel_utils.py#L43-L71
where ctrl+c would prevent cleaning up files. This causes a failure when the wheel is built again. This PR fixes this.

- [x] Unit tests
- [x] https://github.com/sky-proj/sky/issues/656
- [x] smoke tests

* Avoid ray messing up terminal with misaligned outputs (#813)

* Fix ray mess up terminal output

* fix comment

* Fix tail_logs by remove stdin

* add ray's implementation link

* Remove input for subprocess daemon

* Install ray only when it is not installed on the remote instance (#811)

* Not install ray if ray already exists

* longer sleep time for cancel_pytorch

* Fix autoscaler benign assertion by patching (#815)

* Patch resource_demand_scheduler.py from Ray 1.10.0.

* Make multi_echo test autoscaler bug.

* Fix LICENSE, comments, test

* Change examples/multi_echo.py to use thread pool

* Make wheel paths stable to avoid disrupting certain running Sky tasks (#819)

* Make wheel paths stable to enable concurrent launch on same cluster.

* Message fixes in cli.py

* Make ray_patches real patch files (#821)

* WIP

* Make ray_patches/worker.py use Ray 1.10 formatting (but keep our changes)

* Make ray_patches real patch files.

* Fix logging

* Fix GCP project ID (#824)

* Fix GCP project ID

* yapf

* Move the STARTED column to sky spot status -a (#823)

* save and load spot status --all caches

* format

* swap path

* Removed different cached table for -a

* Fix signal handling with stdin=NULL (#818)

* fix signal handling with stdin=NULL

* Add ctrl-c message

* Handle ctrl-c and ctrl-z

* refactor

* revert refactor

* Fix spot logs with ctrl-c/ctrl-z

* Fix status showing

* remove catch

* format

* fix indent

* Disable process_stream by killing children processes

* Fix comment

* Add sleep for spot test to wait for status to be updated

* format

* address comments

* [Breaks existing AWS clusters!] Change AWS security group name (#826)

* [Breaks existing AWS clusters] Change AWS security group names

* typo

* Fix back incompat descriptions

* Error msg fix

* Fix sky spot status 'ago'

* Remove undesired autoscaling (#830)

* disable autoscale when upscaling_speed is 0

* fix patch file

* Fix the upscaling=0

* remove output

* fix patch

* Fixed multi_echo

* format

* Use real job time in the `sky spot status` (#827)

* use real job time

* fix

* address comments

* nit

* Fix compatibility of the patch (#838)

* Fix compatibility of the patch

* Add comment

* Fix file existence test

* Fix patching

* Fix comment

* patch again

* Remove unuseful comments

* Spot controller UX improvements (#839)

* Fix recovery_strategy.py missing import / var shadowing

* Change controller autostop to 30 mins

Useful for large scale launch debugging

* cli: better messages for downing controller

* Show in-progress counts in sky spot status

* Add a TODO about duplicate keys in task.py.

* yapf

* fix test

* UX for job duration (#840)

* Make job start/end/submit time to be float for accurate time

* Fix microseconds

* format

* Fix < 1 second

* format

* fix

* fix None problem (#842)

* Docs for spot jobs (#844)

* docs for spot jobs

* mention code/files sync

* typo

* typo

* Update docs

* fix duration

Co-authored-by: Zongheng Yang <[email protected]>

* Add cli doc for spot (#846)

* Add cli reference for spot

* address comments

* Fix spot price for GCP (#847)

* fix spot price for GCP

* format

* Fix sky cancel for on-prem mode (#775)

* Replace ps forest with pstree to handle on-prem

* remove pgid

* Fix

* Add tests for all clouds

* Python based subprocess daemon

* yapf

* rm subprocess_daemon.sh

* comment

* add setup and fix test

* replace workdir with git clone for test_distributed_tf

* Fix sky spot issue

* Needs more time for sky cancel to work

* Fix bug

* address comments

* Fix subprocess daemon bug (#850)

* Fix

* Simplify

* Fix A100 provisioning on GCP (#829)

* Fix A100 on GCP

* yapf

* eof

* Fix 16x A100 and spot

* fix name

* Clean codes

* Fix pylint error

* Fix

* fix gcp return code

* Address comments

* comment

* Add notes

* update note

* Fix bug

* Cluster status meaning in `sky status --help` (#843)

* Better optimizer plan logs (#860)

* Better optimizer plan logs

* Add minimize logging option

* Change MINIMIZE_LOGGING to False by default

* address comments

* update output

* print plan in topo order

* Fix docs build warnings and remove code search (#834)

* Add Jupyter notebook tutorial to docs (#841)

* Fix sqlite3 rename problem with older version (#868)

* Fix on-demand price (#866)

* Fix on-demand price

* format

* Change the default cpu instance for aws

* address comments

* Enable timeline recording for Sky (#833)

* timeline

* Fix Ray autoscaler's failure of gpu auto detection (#848)

* Fix Ray autoscaler failure of detecting gpu

* rename

* ensure echo only once

* Add retry-until-up feature for launch and start (#863)

* Add retry until up

* fix message

* fix message

* fix

* Add exponential backoff

* Add backoff for spot recovery

* Address comments

* format

* Fix merge error

* Fix message

* Fix message

* Add comment

* Sky docker image (#869)

* Sky docker image

* docs

* Address comments

* optimize build order and remove deps

* add .sky mount

* fix docs

* fix docs

* Upload only necessary credentials and add gpu to cpu mapping for GCP (#853)

* upload only necessary credentials and add gpu to cpu mapping for GCP

* Fix comments

* Fix api

* rename

* refactor

* hide variables

* format

* fix test

* Add n2 instances

* Fix power of two

* format

* Fix azure cancel test

* fix azure smoke

* specify credential files

* Address comments

* format

* Address comment

* Fix default image (#874)

* add check before ping (#876)

* Minor: reformat `sky show-gpus` output. (#877)

* Distinguish spot failure for user code and cluster/controller failure. (#862)

* Distinguish controller failure and user failure

* Add hints for getting error messages

* Fix

* update message

* rename to cluster failure

* message for cluster failed as well

* Fix failing

* address comments

* Add id for end of logs

* Split resource failure and controller failure

* Fix terminal state

* Address comments

* fix typo

* Add YAML schema check (#680)

* Add explanations on spot docs (#852)

* Add some docs

* update

* fix

* fix

* update

* address comments

* reorg

* reorg and add fig

* Add imgs

* fix

* update

* Fix typo (#880)

* Make `sky launch` prompting consistent with interactive nodes (#867)

* Make the spot job pending as soon as the job is submitted (#870)

* Distinguish controller failure and user failure

* Add hints for getting error messages

* Fix

* update message

* rename to cluster failure

* message for cluster failed as well

* Fix failing

* Add pending state for spot jobs

* Fix job id

* format

* address comments

* Add id for end of logs

* fix pending

* Add name and resources

* format

* Add failed status check for spot state

* Refactor the backend interface

* address comments

* fix status

* address comment

* Fix comment

* remove azureTokens.json from the credential list (#883)

* Fast removal of buckets

* Fast removal of buckets

* Replace os.system with subprocess

* Replace os.system with subprocess

* fix syntax

* fix syntax

* Addressed Romil's comments

* fix comment

* Fix comments 2

Co-authored-by: Zhanghao Wu <[email protected]>
Co-authored-by: Gautam Mittal <[email protected]>
Co-authored-by: Siyuan (Ryans) Zhuang <[email protected]>
Co-authored-by: Zongheng Yang <[email protected]>
Co-authored-by: Woosuk Kwon <[email protected]>
Co-authored-by: Woosuk Kwon <[email protected]>
Co-authored-by: Romil Bhardwaj <[email protected]>
Co-authored-by: Zongheng Yang <[email protected]>
Co-authored-by: Wei-Lin Chiang <[email protected]>
Co-authored-by: Mehul Raheja <[email protected]>
Co-authored-by: Wei-Lin Chiang <[email protected]>
12 people authored Jun 6, 2022
1 parent 8fe7d32 commit 6ac9687
Showing 3 changed files with 24 additions and 12 deletions.
4 changes: 2 additions & 2 deletions sky/cli.py
@@ -2010,7 +2010,7 @@ def storage_delete(all: bool, name: str): # pylint: disable=redefined-builtin
         sky storage delete -a
     """
     if all:
-        click.echo('Deleting all storage objects...')
+        click.echo('Deleting all storage objects.')
         storages = global_user_state.get_storage()
         for row in storages:
             store_object = data.Storage(name=row['name'],
@@ -2023,7 +2023,7 @@ def storage_delete(all: bool, name: str): # pylint: disable=redefined-builtin
             if handle is None:
                 click.echo(f'Storage name {n} not found.')
             else:
-                click.echo(f'Deleting storage object {n}...')
+                click.echo(f'Deleting storage object {n}.')
                 store_object = data.Storage(name=handle.storage_name,
                                             source=handle.source,
                                             sync_on_reconstruction=False)
27 changes: 17 additions & 10 deletions sky/data/storage.py
@@ -737,8 +737,8 @@ def upload(self):
                 f'Upload failed for store {self.name}') from e

     def delete(self) -> None:
-        logger.info(f'Deleting S3 Bucket {self.name}')
-        return self._delete_s3_bucket(self.name)
+        self._delete_s3_bucket(self.name)
+        logger.info(f'Deleted S3 bucket {self.name}.')

     def get_handle(self) -> StorageHandle:
         return aws.resource('s3').Bucket(self.name)
@@ -941,15 +941,22 @@ def _delete_s3_bucket(self, bucket_name: str) -> None:
         Args:
           bucket_name: str; Name of bucket
         """
+        # Deleting objects is very slow programatically
+        # (i.e. bucket.objects.all().delete() is slow).
+        # In addition, standard delete operations (i.e. via `aws s3 rm`)
+        # are slow, since AWS puts deletion markers.
+        # https://stackoverflow.com/questions/49239351/why-is-it-so-much-slower-to-delete-objects-in-aws-s3-than-it-is-to-create-them
+        # The fastest way to delete is to run `aws s3 rb --force`,
+        # which removes the bucket by force.
+        remove_command = f'aws s3 rb s3://{bucket_name} --force'
         try:
-            s3 = aws.resource('s3')
-            bucket = s3.Bucket(bucket_name)
-            bucket.objects.all().delete()
-            bucket.delete()
-        except aws.client_exception() as e:
-            logger.error(f'Unable to delete S3 bucket {self.name}')
-            logger.error(e)
-            raise e
+            with backend_utils.safe_console_status(
+                    f'[bold cyan]Deleting [green]bucket {bucket_name}'):
+                subprocess.check_output(remove_command.split(' '))
+        except subprocess.CalledProcessError as e:
+            logger.error(e.output)
+            raise exceptions.StorageBucketDeleteError(
+                f'Failed to delete S3 bucket {bucket_name}.')


 class GcsStore(AbstractStore):
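The change to `_delete_s3_bucket` above swaps per-object deletion through boto3 for a single `aws s3 rb --force` call. A minimal standalone sketch of that pattern follows, assuming the AWS CLI is installed and configured. The names `build_remove_command`, `delete_bucket`, `BucketDeleteError`, and the injectable `run` parameter are hypothetical and are not part of the commit; they exist only so the sketch can be exercised without touching a real bucket.

```python
import subprocess
from typing import Callable, List


class BucketDeleteError(Exception):
    """Hypothetical stand-in for sky.exceptions.StorageBucketDeleteError."""


def build_remove_command(bucket_name: str) -> List[str]:
    # Build the argument list directly instead of splitting a string,
    # so the command does not depend on how whitespace is tokenized.
    return ['aws', 's3', 'rb', f's3://{bucket_name}', '--force']


def delete_bucket(
        bucket_name: str,
        run: Callable[..., bytes] = subprocess.check_output) -> None:
    """Removes a bucket and its contents with one CLI invocation."""
    try:
        # stderr is merged into stdout so that e.output carries the
        # CLI's error text if the command fails.
        run(build_remove_command(bucket_name), stderr=subprocess.STDOUT)
    except subprocess.CalledProcessError as e:
        raise BucketDeleteError(
            f'Failed to delete S3 bucket {bucket_name}.') from e
```

Calling `delete_bucket('my-bucket')` would run the real CLI. Passing a fake `run` callable is one way to check the error path in a unit test.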
5 changes: 5 additions & 0 deletions sky/exceptions.py
@@ -59,6 +59,11 @@ class StorageBucketGetError(StorageInitError):
     pass


+class StorageBucketDeleteError(StorageError):
+    # Error raised if attempt to delete an existing bucket fails.
+    pass
+
+
 class StorageUploadError(StorageError):
     # Error raised when bucket is successfully initialized, but upload fails,
     # either due to permissions, ctrl-c, or other reasons.
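Because the new `StorageBucketDeleteError` derives from `StorageError`, callers that already catch `StorageError` also see delete failures, while callers that care specifically about deletion can catch the narrower class. The sketch below illustrates this with local stand-in classes. It assumes `StorageError` ultimately derives from `Exception`, which the diff does not show, and `classify` is a hypothetical helper.

```python
class StorageError(Exception):
    """Assumed base class; the diff only shows that subclasses derive from it."""


class StorageBucketDeleteError(StorageError):
    """Raised if an attempt to delete an existing bucket fails."""


class StorageUploadError(StorageError):
    """Raised when the bucket is initialized but the upload fails."""


def classify(err: Exception) -> str:
    # Check the most specific class first; the base class check is the
    # fallback for every other storage failure.
    if isinstance(err, StorageBucketDeleteError):
        return 'delete-failed'
    if isinstance(err, StorageError):
        return 'storage-error'
    return 'unrelated'
```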
