Fast removal of S3 Storage buckets with 10k-1 million objects (#893)
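The title does not say how the removal is sped up, and the body below does not show it either. As a minimal sketch of one common approach (an assumption, not necessarily what this commit does): S3's DeleteObjects API accepts at most 1000 keys per request, so a bucket with 10k to 1 million objects can be emptied in batches rather than one request per object. The helper names `batched` and `delete_all`, and the `delete_batch` callback, are hypothetical.

```python
from typing import Callable, Iterable, Iterator, List

# S3's DeleteObjects API accepts at most 1000 keys per request.
MAX_KEYS_PER_DELETE = 1000


def batched(keys: Iterable[str],
            size: int = MAX_KEYS_PER_DELETE) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` keys."""
    batch: List[str] = []
    for key in keys:
        batch.append(key)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # Emit the final, possibly shorter, batch.
        yield batch


def delete_all(keys: Iterable[str],
               delete_batch: Callable[[List[str]], None]) -> int:
    """Delete `keys` batch by batch; return the number of requests issued.

    `delete_batch` is a hypothetical callback that removes one batch,
    e.g. a thin wrapper around a single DeleteObjects request.
    """
    requests = 0
    for batch in batched(keys):
        delete_batch(batch)
        requests += 1
    return requests
```

With boto3, `delete_batch` could wrap `client.delete_objects(Bucket=name, Delete={'Objects': [{'Key': k} for k in batch]})`; 1 million objects then take about 1000 requests instead of 1 million.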
* Fixes for document and README (#360) * Update the installation and tutorial * fix * Add description in README * fix * Add link to examples * Polish readme * Update readme * readme * Fix comment * Update links in documentation and README (#366) * Improve adaptors (#359) * gather adaptors * give instructions * Limit grpcio version for #369 (#370) * Limit grpcio version * rename disk_size * format * Remove typo from README (#371) * Fix Azure credential errors on AWS (#372) * fix typo (#376) * Restore README to the original place (#374) * restore README to the original place * Support 'name:cnt' accelerators spec in YAML (#396) * Support 'name:cnt' accelerators spec in YAML * Fixes #373: 'sky start/down' should error out * README example: change `K80:4` to `V100:1` (#399) Reasons: - `K80:4` is not available on AWS, arguably the most common cloud our target users use (so they will hit resource unavailable) - `V100:1` is available on all three clouds and is a popular GPU * Workdir Docs (#405) * Ok * Addressed all comments * Changed to new git link * ok * Fix `check_local_gpus` (#411) * Fix check_local_gpus * Break a line to meet 80 char constraint * Address the review comments * Sky Storage CLI (#338) * Initial Draft * Delete bad file * ? * ??? * format * sky storage status; sky storage down * Fixed * Done * Addressed Comments * Addressed Comments, TODO: Documentation * Documentation added * ok * Sky Storage CLI - Polishing (#414) * Initial Draft * Delete bad file * ? * ??? 
* format * sky storage status; sky storage down * Fixed * Done * Addressed Comments * Addressed Comments, TODO: Documentation * Documentation added * ok * Addressed Zhanghao's comments * Fix * SGTM * Decouple `ray up` and user's file_mounts / setup (#407) * wip: Add setup in provision pipeline * Fix gcp/azure * remove useless variables * minor fix * Add some TODOs * Fix comments * Fix comments * Fix gcp/azure initialization_commands * Remove setup from template * Fix setup directory * Change rsync back to -Pavz * Remove unused argument * fix file_mount/dir_mount * Minor fix (#419) * Revert to using file_mounts for object mounting (#412) * WIP Debug * revert file_mounts using storage mounts and update docs * remove print * Fix credential cmd * lint Co-authored-by: Zongheng Yang <[email protected]> * [docker] Quick fix to remove already-dead docker containers from `sky status` (#427) * A quick fix in killing sky docker containers * Add a comment * Update documentation with provisioning quickstart (#415) * Add improved progress meters + log suppression (#425) * Add cross-cloud failover to docs. 
(#433) * Add document for `bash script.sh` cannot do `conda activate` problem (#422) * Add conda activate support to bashrc * Add doc and make sure conda activate works * bring back conda activate command for GCP * Move comment to quickstart * format * Fix comments * Add test/example of using user_script * Fix indents * bash -i only for conda activate * Fix the SKY_NODE_IPS fail to pass to the shell script * Update readme * update env_check * Fix comments * Change to head -n1 * Remove `-i` for `bash user_script.sh` (#436) * Remove -i option * Fix docs * Fix run.sh * add comments * fix comment * format * Fix Azure key generation (#429) * Fix az key generation * Add private keys to retval * [Storage/File-Mounting] Fix Symlink Issue + Fix File Mounting (#431) * ok * Shorten YAML * Ok * Done * Nit added * Romil's changes * Fix error message when azure-cli is not installed (#424) * Azure import * Add docs. * Fixing Relative Directory for Workdir and File Mounts (`~/sky_workdir/...` not `~/sky_workdir/workdir/...`) (#443) * Fix * Resnet Example working * Fix * Fix gitignore for rsync (#448) * Fix gitignore in file_mounts * Add test * Update j2 for filter * format * Remove tracebacks for exceptions to improve UX (#441) * Remove tracebacks * Fix job fail color * fix comments * Hide tracebacks * Fix #442 * fix `workdir` becomes `~/sky_workdir/workdir` #442 * add logging error for job_id problem * format * update error message for retry * Update docs * Fix login * Add more checks * format * fix return type * format * refactor returncode handling * Update return handling * Fix filemount testing * Fix delayed logs of `sky logs` locally vs `tail -f` remotely (#454) * Switch to ray job logs * Optimize log tailing * format * Fix exec logging * Add comment * Move back to run_with_log * Bring back our tailing function for progress bar * format * Fix comments * Remove check argument from run_with_log * Add comment * Add comment * lint * [Storage/File Mount] Check `workdir` and 
Filemount `src` Size (#440) * Check Workdir size * Temp * Done * Addressed comments * Fix assertion on Azure's A100/V100 and fuzzy resource search (#368) * Fix assertion on Azure's A100/V100 * Fix azure catalog * Add simple fuzzy search for accelerators * Fix issues and add colors * Update azure.csv to fix bug * Only keep one candidate, cleaner msg * yapf * Make one line msg * refactor cloud check * Address comments and add more hints * yapf * fix space * Address comments * Fix using_file_mounts (#465) * Improve docs / logging messages. (#466) * Minor: improve setup logging. * CLI messages fix. * Update docs on code sync, artifacts * Minor: remove a new line from a message * Update docs * Fix `sky logs` with interactive ssh causing job status wrongly set if ctrl-c'ed (#473) * Add warning for ctrl-c * Make `sky logs` read only * Fix logs * add comment * Pricing information for `sky show-gpus` (#472) * [Azure] Downgrade the image version for K80 instances (#460) * Downgrade the image version for Azure K80 instances * Remove a reference link * Minor refactoring * Consider non-gpu instances * Add clouds=azure * Minor fix * Remove unnecessary if statement * Minor fix * Minor fix in var name * Minor fix * Address comments * Centralize sky_logs, polish logging, and update `sky start` documentation (#432) * Gcloud Authentication Bug Fix (#437) * testing * Fix * No retry * fix tail_logs (#476) * Mention .gitignore for workdir sync in docs (#470) * Suppress rsync output (#459) * Suppress rsync output * Fix spaces * fix path * fix * Remove rsync logs * fix logging * Change order * Small fix for the cloud_stores cli installation (#477) * Refactor cloud storage * Test aws before pip install * refactor * Add test for s3 bucket * fix comment * Revert "Small fix for the cloud_stores cli installation" (#479) * Revert "Small fix for the cloud_stores cli installation (#477)" This reverts commit 1a0bcc141c5cf3892ffa116eaf24e7fdb6bd7c5a. 
* Add back the aws cli check * Add back s3:// check in file_mounts * format * Fix hint for multiple instance candidates (#475) * Add confirmation prompt for cluster management operations (#471) * Minor fix for sky launch --gpus tpu-* (#481) * Docs: polish installation & quickstart (#478) * Update docs installation / README * quickstart polish * Polish initial messages * Spelling * Updates * Fix comments * Fix prompting for launching on a stopped cluster. (#487) * Fix credential mounting (#483) * Fix credential mounting * format * Chanage back to ~/.config/gcloud * Add exclude files in gcp credential mounting * format * Change using_file_mounts to multinodes to let it check more things * Fix docs * format * Fix gcp installation hints * Parallel Setup + Filemounting (#458) * Parallel Setup * Done * Done * Fix color * Fix * wow * Better parallel solution * Better * using imap * Tested with failed setup * format * Update the workdir uploading logic * format * Fix indent * Fix comments * Update log * Remove context manager * Change logging * Fix doc * Use context manager * Add exception * Remove num_threads limit * Update comment * Format * lint * Add timing * format Co-authored-by: Zhanghao Wu <[email protected]> * Fix Azure Promo Instances (#485) * Fix azure data fetcher * Slightly safer * Add safeguard for missing price * Fix * Simplify * Better impl * Polish workdir/file_mounts validation and logging. (#495) * Polish workdir/file_mounts validation and logging. * Fix cloud URIs being displayed with an extra slash Previous: gs://cloud-tpu-test-datasets/fake_imagenet/train-00001-of-01024/ -> /train-00001-of-01024 Now: gs://cloud-tpu-test-datasets/fake_imagenet/train-00001-of-01024 -> /train-00001-of-01024 * Fail early for non-existent local file mount sources. 
* !r * Docs: polish quickstart, interactive-nodes (#491) * Polish interactive-nodes.rst * Polish quickstart, interactive-nodes * Address comments * Address comments * Fix rsync_exclude for gcp credential file_mounts (#496) * Fix rsync_exclude for gcp credential file_mounts * fix comment * address comment * Fix: return when there are no matched clusters (#500) * Make GPU/TPU names case-insensitive (#463) * task.py: validate workdir by expanding full path (#505) * Revamp docs (getting started; use cases). (#503) * Revamp docs (getting started; use cases). * task.py: validate workdir by expanding full path * romil comments * fixes * comments * zhanghao's comment Co-authored-by: Romil <[email protected]> Co-authored-by: Zhanghao Wu <[email protected]> * Fix ulimit for provision (#512) * fix ulimit for aws * update templates for ulimit * Add comment * README: remove quick example. (#521) * README: remove quick example. Added notes for developers. * Remove pip install azure-cli==2.30.0 (already in setup.py) * Reword * [docker] Change remote working directory into `/sky_workdir` (#497) * Fix docker workdir name * Revert the change in create_dockerfile * Minor fix * Fix Azure resource leak (#501) * fix * comment and lint * Timeout for multi-node launching (#504) * add timeout for waiting cluster ready * Timeout on launching nodes * Fix gcp azure upscaling speed * Timeout only checks node launching * fix logging * format * Redirect the ray status to log file as well * format * replace ray exec * fix comment * add comment * Fix ssh_credential * Fix ip fetching * Fix count _worker logic * Refactor * fix * format * bug with yapf * add comment * remove unused import * wording * Add retry for head ip fetching at the first time * Fix get head ip * fix * fix * fix assert * address comments * format * change back to info mode * address comments * fix color * [docker] Fix a bug in updating image list (#525) * Fix help str for aws/azure credential check (#532) * Fix help str for 
azure credential check * Fix aws check * format * Fix TPU logging when failed to launch (#526) * Fix tpu logging info * format * fix index * fix * change to zone_str * Add progress for TPU launching * change back to original logging * Fix console * Add clear * move console back * Indent * restore uneccessary changes * format * Fix aws apt install (#529) * Add kill dpkg * Add comment * optimize a bit * Add tree back to using_file_mounts * Docs: more polishing. (#523) * Polish grid-search.rst * Polish more; address #503 comments * Revamp - Syncing Code and Artifacts * Address comments * Polish * Add copy button to code blocks in documentation (#534) * Fix non-interactive SSH commands in task setup (#533) * Simple SSH into Multi-node Workers from Local (#469) * Done * format.sh * temp * Done * Remote SSH * Fix * Addressed comments * Addressed Zhanghao's edge case * ok * Fix * Docs: significantly de-ink quickstart (#539) * Docs: significantly de-ink quickstart * More polishing. * deleted: source/reference/iterative-development.rst * Address comments. * Revert "Simple SSH into Multi-node Workers from Local (#469)" (#552) This reverts commit d2774206e1eedb1b79a22cab06b2ccc1df807691. * Fix the dpkg lock for apt install (#554) * Disable unattended-upgrade * minor: set ray version constraint * Disable unattended-upgrades for azure as well * Address comments * fix comments * Added FileLock on cluster launch and teardown (#510) * added FileLock for cluster launch * Run yapf and pylint * made suggested changes to locking during provisioning and teardown * run format * removed use of pathlib * test to see if updating filelock version works * updated version to >= * disabled pylint warning for filelock & left notes to upgrade when python version is upgraded * applied formatting * moved pylint disable to block level * fixed what pylint ignored * use handle.cluster_name * Polish quickstart. (#560) * Fix the lock logic for provision (#562) * Fix the lock logic for provision. 
* format * lint * Improve instance hints by moving messages to optimizer & speedup sky commands (#493) * Improve instance hints by moving to optimizer * Fix * Fix * Fix cpunode * Remove traceback message * Merge master and improve speed * Fix tests * Fix test and uppercase TPU * case-insensitive check for string matching * yapf * Fix * simplify * Address comments * Refactor * Fix message and types * AWS Ray Setup - Alias Python=Python3 if Alias does not exist (#558) * Alias python for python3 * ok * Pip * ok * Improve optimizer message (#571) * Fix sky behavior when weird behavior of `ray status` for GCP occur (#574) * Fix Weird behavior of `ray status` for GCP #573 * fix comment * address comment * Allowed forced teardown when called within locked code (#582) * Allowed forced teardown when called within locked code * hopefully fixed pylint issue * removed forced down from try catch for timeout * Set $TPU_NAME during provisioning (#587) * Sky Installation for Mac <1.15 Warnings + Doc (#569) * Mac fix * Fix * Fix * done * Ok * quotes * Fix prompt, add fallback retry for INIT cluster (#559) * Fix loggings and fallback logic * Fix fallback * Fix loggings * format * Rename the variable * format * Add assert * Fix dryrun * remove assert * address comments and handle UP status * format * Add comments * lint * fix lock for cluster status change * format * address comment * address comment * fix smoke tests * Add a TODO * fix stopped multi-node being terminated * format * fix function name * stop/terminate for head_failed as well * Query ip error * format * format * add comment * status back to up * format * add back assert * fix tpu merge issue * Fix CLI not installed for `sky check` (#592) * add hint for installing gcloud * fix indent * Hint in sky check as well * fix subprocess output * Fix azure/gcp check * fix import * Fix azure check * Remove output * Address comment * update info * remove tpu gcloud dependency, due to the fix of `sky check` * update optimizer info * 
format * Upgrade aws ami and add back us-west-1 (#564) * Upgrade aws ami * fallback to lower nvidia driver version for K80 * remove print * add back us-west-1 * Fix order of setup (#593) * unique cluster name list for sky down (#596) * SSH into Worker Nodes from Local (#557) * Done * format.sh * temp * Done * Remote SSH * Fix * Addressed comments * Addressed Zhanghao's edge case * ok * Fix * Fix * Fix * Doc changes * Comments incorporated * Glob cluster name for sky down (#598) * glob search for `sky down` * add doc * format * address comments * Fix cuda version for tf training (#603) * Add `conda init` to AWS setup commands (#604) * Add conda init to aws setup cmds * Move the line upward * Fix typo * Created and handled teardown success bool (#581) * Created and handled teardown success bool * better error messages + return true + include stop * incorporate _force * formatting * minor fixes * Add instructions in document for quota increase (#588) * Add instructions for quota increase * Add hint for azure subscriptions * Address comments * update * Add fine-grained logging during provisioning (#565) * Remove job_submit.log & clarify `sky exec --help` (#611) * Remove job_submit.log * Clarify `sky exec --help` * Update comments * Parallel runner of smoke tests. (#584) * run_smoke_tests.py: parallel runner of smoke tests. * gcp-tpu-delete.sh.j2: remove --async to avoid race conditions. * Update. * `sky logs --status`: exit with appropriate exit code. * test script: echo trick; query job statuses; cluster names * Use pytest for smoke_test * ignore smoke for github * Add file lock fo the wheel building * Fix file_mounts for ubuntu 20 * format * Fix job_status * Update readme * Print logs while tests are running. * Update test script * git rm examples/run_smoke_tests.py * Minor * Fix 'sky down non_exist_cluster' message. 
* move wheel lock to temp folder * Update logs for testing Co-authored-by: Zhanghao Wu <[email protected]> * Update GCP image for OS consistency and fix GCP's spot CPU (#614) * Update GCP image and Fix spot CPU * Fix worker nodes and use same image for K80/V100 * Fix worker * Fix cpunode * Optimize the `sky job queue` when large amount of jobs running (#616) * Only update job status during provision * format * Only update job_status when previously stopped * address comments * fix comment * fix merge error * longer controlpersist * apt update for aws (#620) * add apt update for aws * address comment * address comment * [Docker] Install sudo in docker (#595) * Alias sudo for docker * lint * fixes * Install sudo instead * Refactor JobLibCodeGen and fix stale job for restarting cluster (#621) * Refactor JobLibCodeGen and fix job status update logic * Fix job_lib * fix job lib again * Add comment * fix job_lib * fix update_status * Fix INIT status update * format * fix * Add comment * add assertion * address comments * Address comment * Make azure-start-stop use 1 node to speed up tests. (#619) * Make azure-start-stop use 1 node to speed up tests. * Add --num_nodes to both launch and exec. Tested: - sky launch --num_nodes=2 'echo $(hostname)' -c test - sky exec test 'echo $(hostname)' --num_nodes=2 - sky exec test 'echo $(hostname)' * Support test_azure_start_stop_two_nodes() under --runslow. * Support passing a test name. * Make Task.num_nodes a property and validate it. * Fix hint msg (#634) * Make test cluster names unique for (user, mac address). (#639) * Make test cluster names unique for (user, mac address). * Make sky logs --status print job id. 
Ex:
  Job 1 SUCCEEDED
  Job 2 SUCCEEDED
  Job 3 SUCCEEDED
  Job 4 SUCCEEDED
  Job 5 SUCCEEDED
  Job 6 SUCCEEDED
  Job 7 SUCCEEDED
  Job 8 SUCCEEDED
  Job 9 SUCCEEDED
  Job 10 SUCCEEDED
  Job 11 SUCCEEDED
  Job 12 SUCCEEDED
  Job 13 SUCCEEDED
  Job 14 SUCCEEDED
  Job 15 SUCCEEDED
  Job 16 SUCCEEDED
  Terminating cluster test-multi-echo-zongheng-fe6d...done.
* Roll back to debian-based image (#636)
* Fix the second `sky launch` assertion and ray job status for failed job (#638)
* Fix job fail status in ray job
* Fix second `sky launch`
* format
* add launch again in the smoke test
* address comments
* Wait all the logs
* Fix get_status
* fix print
* format
* Add disk_size to YAML ref; other minor cleanups. (#643) Closes #546.
* Add a skylet daemon and fix job status problem (#623)
* Refactor JobLibCodeGen and fix job status update logic
* Fix job_lib
* fix job lib again
* Add comment
* fix job_lib
* fix update_status
* Fix INIT status update
* format
* add daemon
* fix
* Add comment
* start skylet
* format
* pylint
* fix skylet start in template
* Fix job fail status in ray job
* Fix second `sky launch`
* format
* fix skylet checking in the test
* fix skylet launching
* remove -v for ray up
* format
* address comments
* Only teardown the cluster when test succeeded
* fix space
* Align underscore / dash for num-nodes cli option (#654)
* align underscore with dash for num-nodes
* fix num_nodes
* Optimize backend_utils.get_node_ips(), esp. for Azure. (#630)
  Azure's APIs are extremely slow; as a result, ray get-head-ip and the like are very slow for Azure clusters. The below is for a 1-node Azure cluster.
  Before: takes 1min 14sec
    » sky exec b2 --workdir=. -- <cmd>
    I 03-22 15:45:54 cloud_vm_ray_backend.py:1296] In sync_workdir
    I 03-22 15:47:08 cloud_vm_ray_backend.py:1301] Done get_node_ips()
  After: instant
    » sky exec b2 --workdir=. -- <cmd>
    I 03-22 15:54:59 cloud_vm_ray_backend.py:1296] In sync_workdir
    I 03-22 15:54:59 cloud_vm_ray_backend.py:1302] Done get_node_ips()
* Minor touches on docs + improve install UX (#660)
* Minor touches on docs.
* Remove awscli pinning to not download a bunch of boto3 versions.
* Remove awscli pinning in cloud_stores
* Refactor type_checking (#655)
* refactor type_checking
* Address comments
* Fix resources_lib
* Build Sky local wheel in a unique tempdir per launch. (#657)
* Build Sky local wheel in a unique tempdir per launch.
* Refactor wheel cleanup
* reorg statements
* Fix caller.
* Tear down head node even for HEAD_FAILED. (#661)
* Added sky down --purge (#635)
* added sky down --purge
* made suggested edits
* minor formatting and changes
* fixed force
* output formatting fix
* Parallel sky down (#659)
* fix multi-thread
* refactor
* Address comment
* format
* hidden variable
* Progress bar for termination
* fix
* format
* mitigate logging problem
* rename
* rsync: --filter on .git/info/exclude (#652)
* rsync: --filter on .git/info/exclude
* Update docs.
* Use --exclude-from, and check if git exclude exists * Update docs * Fix repeating IP Address bug (#663) * Fix output for parallel down (#666) * Fix output for parallel down * format * linting * fix import * Auto stop for cluster (#653) * refactorize skylet * implement autostop event without cluster stopping * wip * Remove autostop from yaml file * fix naming * fix config * fix skylet * add autostop to status * fix state and name match * Replace min_workers/max_workers for gcp * using ray up / ray down process * fix stopping * set autostop in globle user state * update sky status * format * Add refresh to sky status * address comments * comment * address comments * Fix logging * update help * remove ssh config and bring cursor back * Fix exec on stopped instance * address comment * format * fix * Add test for autostop * Fix cancel * address comment * address comment * Fix sky launch will change autostop to -1 * format * Add docs * update * Refactor DAG Optimizer (#628) * Refactor optimizer * Remove unnecessary import * yapf * Minor fix * Add NotImplementedError * Minor * Rename vars & Annotate types * Minor fix * Minor * Minor fix * Fix type annotation * yapf * [Minor] Address comment * Add type alias & enhance comments * yapf * Fix minor error in dag_lib.Dag * Add is_chain to Dag * Address comments * yapf * yapf * Address comments * Add total in optimizer msg * Add a comment in is_chain * Address reviews & Fix egress msg * yapf * Minor fix * Fix egress msg * yapf * obj -> objective * pass yapf * cost -> cost/time * Improve UX for autostopping (#676) * Add progress bar for status refreshing * Keep autostop after refreshing * Add glob for start * Fix message for autostop * Fix messages for autostop * Improve logging in error conditions & update auto-stop.rst (#675) * Log error for HEAD_FAILED; don't duplicate logging for no_retry=True. 
* Minor touches on auto-stop.rst * Revert to only printing errors on GANG_FAILED * Add GLOB for sky queue (#678) * add glob for sky queue and start * format * Added price to sky status (#561) * Added price to sky status * put region and hourly price behind -a in sky status * removed whitespace * cache cluster region * some touches + added computation to constructor * forgot one fix * formatting * Add line processor abstraction and fix gitignored path size (#615) * ILP-based DAG Optimizer (#637) * Refactor optimizer * Remove unnecessary import * yapf * Minor fix * Add NotImplementedError * ILP-based optimization * yapf * Add pulp in setup.py * Minor * Rename vars & Annotate types * Minor fix * Minor * Minor fix * yapf * Fix type annotation * yapf * [Minor] Address comment * Add type alias & enhance comments * yapf * Fix minor error in dag_lib.Dag * Add is_chain to Dag * Address comments * yapf * yapf * Address comments * Add total in optimizer msg * Add a comment in is_chain * Address reviews & Fix egress msg * yapf * Minor fix * Fix egress msg * yapf * obj -> objective * pass yapf * cost -> cost/time * Add random DAG generator * Add random DAG generator * Change variable names * Minor fix * yapf on test_random_dag.py * Add docstring * Rename * _optimize_cost -> _optimize_objective * Minor * Default num_tasks to 10 * Add docstrings & Fix variable names * yapf * Minor * Improve test_optimizer_random_dag * yapf * Fix optimizer * Add docstring about ILP objective * fix typo * yapf * Minor * Add monkeyptach * Fix docstring * yapf * Touches on docs. (#684) * Touches on docs. 
* Touches * touches on yaml-spec * update --gpus=all * extend underline * Storage mounting (#658) * squash * fix * yapf workaround * Update artifact syncing docs (#689) * Update docs * comments * Docker example and fix goofys-docker mounting (#686) * Fix docker killing * Add docker example * Fix docker example * Fix * Fix the docker example for pytorch installation * Use model caching * Mount output folder * Permission issue * remove useless lines * fix license * Add storage mounting for output and fix the goofys mounting * Minor touches * examples/docker_app.yaml -> examples/detectron2_docker.yaml * Minor * Fix gcp fuse.conf * simplify file_mount options * remove wait Co-authored-by: Zongheng Yang <[email protected]> * Add faq.rst; move CLI section to the bottom. (#690) * Speedup ci/cd with parallelism (#693) * Testing for different os and python version * downgrade python * speedup testing * Remove 3.9 * generic workflow * remove mac and add caching * Verify acclerators and support float for inline GPU requirement V100:0.5 (#698) * Verify acclerators and support float number for inline GPU specification V100:0.5 * format * Add testing for the cli * Remove unused function * fix test * Case insensitive gpu checking * Storage Subdirectory Fix (#709) * fix * Ok * Minor: suggest using a conda env when installing Sky. 
(#710) * Fix optimizer messages (#711) * Fix optimizer msg * yapf * Fix optimizer msg * any -> all * Delete hourly * Minor * Prompt Before Storage Initialization for `sky launch` (#701) * Fix pushed * Fix * Fix sky start when cluster is autostopped without`sky status --refresh` (#713) * Fix sky start when cluster is autostopped and `sky status --refresh` is not called * format * lint * lint * Rename status function * Fix the hardcoded gcp project id (#692) * Fix gcp project id * Refactor * Move azure subscription id to auth * project id back to backend_utils * Add todo * Fix azure subscription id (#720) * Skip cloud when instance type is provided and make resources immutable (#714) * cloud is not required when instance_type set * format * Make Resources immutable to avoid verification problems * update * Fix version * lint * fix test * Address comment * Fix version * oops * Move accelerator_args setting to the _set_accelerators * Remove default * Instruction for file_mounts trick (#694) * instruction for file_mounts trick * address comment * address comments * Specify region in resources (#722) * cloud is not required when instance_type set * format * Make Resources immutable to avoid verification problems * update * Fix version * lint * fix test * wip * Address comment * Fix version * Add region_limit to resources * Add dryrun region test * Add test_region * Fix region * format * Add case insensity check for region * fix region in config_dict * fix test * address comments * Update doc * address comment * Ship AWS cloud provider (#725) * ship aws * adjust * update template * update LICENCE * Goofys memory optimizations (#726) * comments * yapf * Make sky exec submit jobs for inline commands; guard Azure disabled subscription error. (#727) * Guard against Azure disabled sub error; tested with: sky gpunode * Make `sky exec` submit jobs for inline commands as well. * YAPF * Fix exec examples + check empty entrypoint * Enable empty string entrypoint for 'sky launch'. 
* Add job duration and resources (#729) * Add job duration and resources * address comments * Fix job status terminal -> non-terminal * Change end_at to null if status is not terminal * Default value to null for end_at * Address comments * UX improvement for sky stop and down (#734) * UX improvement for sky stop and down * Change skipped to be yellow * yapf * Fix AWS multi-node failure (#736) * Fix TPU naming issue (#737) * Format duration for jobs (#744) * Format duration * format * Address comments * Shared default security group for AWS (#731) * shared default security group for AWS * update * update name * Add TPU pods (#739) * Init fix * Fix * quick fix * Add notes * Add us-east1 region * Add assertion on multi-node TPU * Fix * Fix small nits in code (#748) * Fix race condition for skylet daemon (#747) * Fix race condition * update comment * Fix submitted_at * Add retry for sky logs * format * Fix retry for job log * Fix the setup progress bar and conda confirmation message (#746) * Fix setup progress bar and confirmation of conda * minor fix * Address comments * Hack to remove bash warning * rm * Fix the pipe output after ctrl-c * Get rid of cloud dependencies in the config template (#749) * Add W&B setup in FAQ (#751) * Add W&B setup in FAQ * Reflect comments * Fix typo * Small refactor of CLI and cloud (#752) * Fix resources check * Automatic cloud registry and task_option override * fix test * provide option to reset the setting * Rename the option adding function * Fix dummy cloud * fix * cloud register * address comments * address comments * Fix descendant processes termination for sky cancel (#758) * fix children processes termination * fix comment * fix sig * fix PIPE kill * Fix PIPE kill * format * Minor logging fix. (#760) * Add test_cancel() to smoke. 
(#761) * Support glob patterns for jobs for `sky logs -s` (#685) * allow globbing when calling sky logs -s * formatting * removed whitespace * small fixes * final fixes * Minor: update README (#762) * Reformat `sky status` codepath and minor fixes (#721) * refactor sky status codepath and minor fixes * moved things over to status_utils * fix conflicts and styling errors * final changes * added inits * A quick fix for CLOUD_REGISTRY.from_str for None (#766) * Fix killing the whole session problem (#767) * Fix kill tmux issue * Add comments * Fix gpu issue * Add test and fix * format * Add comment * Add doc for gcloud 400 error (#764) * Add gcloud 400 error hint * add command * Fix sky cancel (#770) * Fix format all (#768) * fix format all * Managed spot (alpha) (#759) * wip * Fix resources check * Automatic cloud registry and task_option override * fix test * provide option to reset the setting * Rename the option adding function * Fix dummy cloud * fix * wip * Add Spot CLI support * fix spot controller logic * Add spot_recovery example * fix strategy * add todo * Fix status * add spot status * fix todo * Fix merge error * Fix status * fix signal * format * wip: integrate sky spot launch * wip: spot launch integration * Add autostop for the task controller * Add spot_status cli * Fix spot status * fix spot status * wip: add spot cancel * fix spot status * Fix * disable autoscaling for spot instance * Add tests and fix yaml specs * fix tests and make controller resources unspecified * fix test * format * Fix test spot * format * Fix resources setstate * Fix empty run * format * Skip empty run section * Add network check when doing status refresh * Fix logging * Remove buggy job not submitted yet * Fix status refresh * Add field check for task * Align job ids * Fix failover when recovering * Address part of the comments * Address part of the comments * address comment in backend_utils * Adress part of the comments * Fix cancelled job duration * address comments * 
Rename status.submit * Fix cancel * logging info * Allow sky spot status to be run when a spot job is launching * Add status cache and show job status after launch * Fix failover * Remove spot-controller from status * Disable azure use_spot * Enable gcp * Fix optimizer dag dummy node * Fix setup and recovery * Merge branch 'master' of github.com:concretevitamin/sky-experiments into managed-spot * format * Fix optimizer test * Address comments * address comments * Fix exception catch for job cancel * Handle unexpected failure * address comments * address comments * Default to use_spot for spot_launch * Fix smoke test * Fix smoke test * fix the message after all retry fails * Add back status check to smoke test * format * fix managed spot test * format Co-authored-by: Wei-Lin Chiang <[email protected]> * Reuse sky wheels for efficiency (#769) * cache sky wheels * Add spot status --all, fix default value for use_spot and fix GCP dependency (#772) * Add spot status --all and fix default value for use_spot * Remove useless TODO * make gcloud available interactively * Fix gcp dependency installation and sky check * Add test for managed spot instance * format * Fix security group mismatch issue with autostop (#780) * fix * delete unused global var * restore the original Ray implementation * Spot cancel -a and spot status --refresh (#776) * add cancel -a * add sky spot cancel -a and sky spot status --refresh * fix * fix refresh * fix return * fix refresh * address comments * Disallow long cluster names. (#781) * Disallow long cluster names. This fixed a smoke test failure. * Fix another test name. * Remove storage_demo.yaml from smoke tests (#782) * Remove storage_demo.yaml from smoke * add todo * Fix smoke test for accelerators (#785) * Fix yaml_spec test for accelerators * format * Fixing some errors encountered in smoke tests. (#787) * spot_state: fix a SQL syntax error. 
Previously:
» sky spot cancel -y -n test-managed-spot-zongheng-fe6d-1
E 05-02 14:37:02 backend_utils.py:989] Traceback (most recent call last):
E 05-02 14:37:02 backend_utils.py:989] File "<string>", line 1, in <module>
E 05-02 14:37:02 backend_utils.py:989] File "/home/ubuntu/.local/lib/python3.9/site-packages/sky/spot/spot_utils.py", line 68, in cancel_job_by_name
E 05-02 14:37:02 backend_utils.py:989] job_ids = spot_state.get_nonterminal_job_ids_by_name(job_name)
E 05-02 14:37:02 backend_utils.py:989] File "/home/ubuntu/.local/lib/python3.9/site-packages/sky/spot/spot_state.py", line 177, in get_nonterminal_job_ids_by_name
E 05-02 14:37:02 backend_utils.py:989] rows = _CURSOR.execute(
E 05-02 14:37:02 backend_utils.py:989] sqlite3.OperationalError: near "job_name": syntax error
E 05-02 14:37:02 backend_utils.py:989]
E 05-02 14:37:02 backend_utils.py:994] Command failed with code 1: python3 -u -c 'from sky.spot import spot_utils; result = spot_utils.cancel_job_by_name('"'"'test-managed-spot-zongheng-fe6d-1'"'"'); print(result, end="", flush=True)'
E 05-02 14:37:02 backend_utils.py:995] Failed to cancel managed spot job
* Guard against `sky status --refresh` race
* Increase check_network_connection() timeout to 3s.
* Fixes * Fix smoke * Add a comment * remove filter in aws command and fix spot failover (#789) * Add status check when cancelling managed spot job and disable ambiguous termination of reserved clusters (#784) * add cancel -a * add sky spot cancel -a and sky spot status --refresh * fix * fix refresh * fix return * fix refresh * address comments * Fix cli prompt for cancel -a * Add status check for job and controller on spot cancelling * Remove output from setup for spot controller * Disable `sky stop --all` to stop sky-spot-controller * Fix operation str and remove controller from --all * Check reserved cluster for termination operations * fix name check * Update massage * Add repr for name * fix comment * Disallow canceling on reserved clusters * Add tests * format * address comments * format * fix storage dump/load * fix smoke test for autostop * Address comments * format * add assertion * Fix output for cancel * fix cancel handle * Fix storage tests failing with parallel runners (#794) * Change storage id to time_ns * replace time_ns with time * File mounts for managed spot jobs (#788) * Fix storage from_yaml_config * Add sky storage support for managed spot jobs * add example for storage in managed spot job * Fix testing for the spot storage * format * format * Add todos * Fix test * Fix comments * format * Fix retry cnt * Fix test name * Fix test by adding flush * delete storage after spot task finish * persistent=false * Add new line at the end * address comments * Minor: logging polishes. (#804) * Minor: logging polishes. * Revert storage.py to master; except StorageBucketGetError msg * Minor fix to test_storage.py * Logging fixes for `sky spot cancel` and `sky logs -s`. * Fix copy_mount_str (#806) * Fix TPU resource leak (#797) * Fix tpu leak * Add test * TPU fixes: record `sky status` before provisioning tpu. 
* Switch to v2 to save cost Co-authored-by: Zongheng Yang <[email protected]> * Add BERT and Resnet spot examples (#792) * Add BERT and Resnet spot examples * Fix * Add lightning example * Fix * Update spot examples comments * file mount storage * Add resnet spot codes for version control * yapf * remove comments Co-authored-by: Zongheng Yang <[email protected]> * Add env option for sky launch and exec (#803) * Add env option * Fix task env config * Add env doc and for spot_launch * format * add test for env vars * address comments * fix help str rendering * Fail for unset env var * format * Fix race condition between job set_state and update_status (#805) * Fix race condition when between job and update_status * address comments * address comments * Fix bert qa example (#808) * Fix `tpu_rc` unassigned bug; minor fix on task resources_str. (#810) * Fix `tpu_rc` not assigned error, seen in `sky down` * Fix bug: "resources={'K80': n}" not included in generated program * Support streaming logs from the spot cluster through spot controller (#798) * Streaming logs from the spot cluster through spot controller * fix comment * format * Wait for job running for spot logs * Fix job lib status check * Add support for `sky spot logs` showing the latest log * sleep for sky spot status * remove uneccessary argument from execution * Refactor * Fix the keyboard interruption of sky spot logs * format * Add comments * fix comments * Add wait for the controller and job to be started for spot logs * address part of the comments * Fix job id logging * update type hint * Fix race condition when between job and update_status * format * address comment * format * Fix copy_mount_str * Address comments * fix logs * address comments * format * Fix logs * Fix repeat logs * address comments * Fix spot logs * Fix logging * refactor spot logs loop * fix log * Fix log_lib infinite loop printing "Job finished". This appears to be an accidentally deleted line. 
* UX: Remove a logging message * fix spot_status * fix comment Co-authored-by: Zongheng Yang <[email protected]> * Fix a rare failure of wheels build (#809) There is a very narrow window in https://github.com/sky-proj/sky/blob/1fd81ef00884780cb476405e3f9da84fd05fbd47/sky/backends/wheel_utils.py#L43-L71 where ctrl+c would prevent cleaning up files. This causes a failure when the wheel is built again. This PR fixes this. - [x] Unit tests - [x] https://github.com/sky-proj/sky/issues/656 - [x] smoke tests * Avoid ray messing up terminal with misaligned outputs (#813) * Fix ray mess up terminal output * fix comment * Fix tail_logs by remove stdin * add ray's implementation link * Remove input for subprocess daemon * Install ray only when it is not installed on the remote instance (#811) * Not install ray if ray already exists * longer sleep time for cancel_pytorch * Fix autoscaler benign assertion by patching (#815) * Patch resource_demand_scheduler.py from Ray 1.10.0. * Make multi_echo test autoscaler bug. * Fix LICENSE, comments, test * Change examples/multi_echo.py to use thread pool * Make wheel paths stable to avoid disrupting certain running Sky tasks (#819) * Make wheel paths stable to enable concurrent launch on same cluster. * Message fixes in cli.py * Make ray_patches real patch files (#821) * WIP * Make ray_patches/worker.py use Ray 1.10 formatting (but keep our changes) * Make ray_patches real patch files.
* Fix logging * Fix GCP project ID (#824) * Fix GCP project ID * yapf * Move the STARTED column to sky spot status -a (#823) * save and load spot status --all caches * format * swap path * Removed different cached table for -a * Fix signal handling with stdin=NULL (#818) * fix signal handling with stdin=NULL * Add ctrl-c message * Handle ctrl-c and ctrl-z * refactor * revert refactor * Fix spot logs with ctrl-c/ctrl-z * Fix status showing * remove catch * format * fix indent * Disable process_stream by killing children processes * Fix comment * Add sleep for spot test to wait for status to be updated * format * address comments * [Breaks existing AWS clusters!] Change AWS security group name (#826) * [Breaks existing AWS clusters] Change AWS security group names * typo * Fix back incompat descriptions * Error msg fix * Fix sky spot status 'ago' * Remove undesired autoscaling (#830) * disable autoscale when upscaling_speed is 0 * fix patch file * Fix the upscaling=0 * remove output * fix patch * Fixed multi_echo * format * Use real job time in the `sky spot status` (#827) * use real job time * fix * address comments * nit * Fix compatibility of the patch (#838) * Fix compatibility of the patch * Add comment * Fix file existance test * Fix patching * Fix comment * patch again * Remove unuseful comments * Spot controller UX improvements (#839) * Fix recovery_strategy.py missing import / var shadowing * Change controller autostop to 30 mins Useful for large scale launch debugging * cli: better messages for downing controller * Show in-progress counts in sky spot status * Add a TODO about duplicate keys in task.py. 
* yapf * fix test * UX for job duration (#840) * Make job start/end/submit time to be float for accurate time * Fix microseconds * format * Fix < 1 second * format * fix * fix None problem (#842) * Docs for spot jobs (#844) * docs for spot jobs * mention code/files sync * typo * typo * Update docs * fix duration Co-authored-by: Zongheng Yang <[email protected]> * Add cli doc for spot (#846) * Add cli reference for spot * address comments * Fix spot price for GCP (#847) * fix spot price for GCP * format * Fix sky cancel for on-prem mode (#775) * Replace ps forest with pstree to handle on-prem * remove pgid * Fix * Add tests for all clouds * Python based subprocess daemon * yapf * rm subprocess_daemon.sh * comment * add setup and fix test * replace workdir with git clone for test_distributed_tf * Fix sky spot issue * Needs more time for sky cancel to work * Fix bug * address comments * Fix subprocess daemon bug (#850) * Fix * Simplify * Fix A100 provisioning on GCP (#829) * Fix A100 on GCP * yapf * eof * Fix 16x A100 and spot * fix name * Clean codes * Fix pylint error * Fix * fix gcp return code * Address comments * comment * Add notes * update note * Fix bug * Cluster status meaning in `sky status --help` (#843) * Better optimizer plan logs (#860) * Better optimizer plan logs * Add minimize logging option * Change MINIMIZE_LOGGING to False by default * address comments * update output * print plan in topo order * Fix docs build warnings and remove code search (#834) * Add Jupyter notebook tutorial to docs (#841) * Fix sqlite3 rename problem with older version (#868) * Fix on-demand price (#866) * Fix on-demand price * format * Change the default cpu instance for aws * address comments * Enable timeline recording for Sky (#833) * timeline * Fix Ray autoscaler's failure of gpu auto detection (#848) * Fix Ray autoscaler failure of detecting gpu * rename * ensure echo only once * Add retry-until-up feature for launch and start (#863) * Add retry until up * fix message 
* fix message * fix * Add exponential backoff * Add backoff for spot recovery * Address comments * format * Fix merge error * Fix message * Fix message * Add comment * Sky docker image (#869) * Sky docker image * docs * Address comments * optimize build order and remove deps * add .sky mount * fix docs * fix docs * Upload only necessary credentials and add gpu to cpu mapping for GCP (#853) * upload only necessary credentials and add gpu to cpu mapping for GCP * Fix comments * Fix api * rename * refactor * hide variables * format * fix test * Add n2 instances * Fix power of two * format * Fix azure cancel test * fix azure smoke * specify credential files * Address comments * format * Address comment * Fix default image (#874) * add check before ping (#876) * Minor: reformat `sky show-gpus` output. (#877) * Distinguish spot failure for user code and cluster/controller failure. (#862) * Distinguish controller failure and user failure * Add hints for getting error messages * Fix * update message * rename to cluster failure * message for cluster failed as well * Fix failing * address comments * Add id for end of logs * Split resource failure and controller failure * Fix terminal state * Address comments * fix typo * Add YAML schema check (#680) * Add explanations on spot docs (#852) * Add some docs * update * fix * fix * update * address comments * reorg * reorg and add fig * Add imgs * fix * update * Fix typo (#880) * Make `sky launch` prompting consistent with interactive nodes (#867) * Make the spot job pending as soon as the job is submitted (#870) * Distinguish controller failure and user failure * Add hints for getting error messages * Fix * update message * rename to cluster failure * message for cluster failed as well * Fix failing * Add pending state for spot jobs * Fix job id * format * address comments * Add id for end of logs * fix pending * Add name and resources * format * Add failed status check for spot state * Refactor the backend interface * address 
comments * fix status * address comment * Fix comment * remove azureTokens.json from the credential list (#883) * Fast removal of buckets * Fast removal of buckets * Replace os.system with subprocess * Replace os.system with subprocess * fix syntax * fix syntax * Addressed Romil's comments * fix comment * Fix comments 2 Co-authored-by: Zhanghao Wu <[email protected]> Co-authored-by: Gautam Mittal <[email protected]> Co-authored-by: Siyuan (Ryans) Zhuang <[email protected]> Co-authored-by: Zongheng Yang <[email protected]> Co-authored-by: Woosuk Kwon <[email protected]> Co-authored-by: Woosuk Kwon <[email protected]> Co-authored-by: Romil Bhardwaj <[email protected]> Co-authored-by: Zongheng Yang <[email protected]> Co-authored-by: Wei-Lin Chiang <[email protected]> Co-authored-by: Mehul Raheja <[email protected]> Co-authored-by: Wei-Lin Chiang <[email protected]>