[Core] Upgrade ray to 2.4.0 #1734
Conversation
Thanks @Michaelvll! I tried it out with some manual test cases and it works nicely. Backward compatibility also seems to be working (started a cluster with 2.0.1, upgraded to 2.4, tried exec and launch with various YAMLs). Haven't tried spot yet, but if the tests pass I assume it should be good.
@@ -4,7 +4,8 @@
import logging
import os
import time
from typing import Any, Dict, List, Tuple, Union
from enum import Enum
I assume changes to this file are from https://github.com/ray-project/ray/blob/ray-2.4.0/python/ray/autoscaler/_private/aws/cloudwatch/cloudwatch_helper.py
Ahh, yes, it was directly copied from that file, with the change at L12 where we changed it to use aws.utils in sky.skylet.providers instead of the ray one.
from ray.dashboard.modules.job import job_manager
_run_patch(job_manager.__file__, _to_absolute('job_manager.py.patch'))
Do we know if this has been fixed in Ray 2.4?
Ahh, this is from #1618 (comment). I will test the large-scale ray job submit again:
- for i in {1..1000}; do ray job submit --job-id $i-gcpuser-2 --address http://127.0.0.1:8265 --no-wait 'echo hi; sleep 800; echo bye'; done
  (ray 2.4 is more robust against the OOM caused by too many submitted jobs: the job will correctly fail when OOM happens instead of hanging)
- for i in {1..1000}; do ray job submit --job-id $i-gcpuser-2 --address http://127.0.0.1:8265 --no-wait 'echo hi; sleep 1; echo bye'; sleep 1; done
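For context, here is a minimal sketch of how a runtime patch helper like the _run_patch call quoted above could be implemented, assuming the GNU patch utility is available on the node; this is illustrative, not necessarily SkyPilot's actual implementation:

import os
import subprocess


def _to_absolute(patch_name: str) -> str:
    """Resolve a patch file shipped next to this module (hypothetical layout)."""
    return os.path.join(os.path.dirname(os.path.abspath(__file__)), patch_name)


def _run_patch(target_file: str, patch_file: str) -> None:
    """Patch target_file in place with the `patch` CLI.

    --forward turns the call into a no-op if the patch was already applied,
    so running this on every setup is safe.
    """
    subprocess.run(
        ['patch', '--forward', '--quiet', target_file, patch_file],
        check=False,  # a non-zero exit here usually just means "already applied"
    )

In a scheme like this, the job_manager patch gets applied (or skipped) each time ray is set up on the cluster, which is why it matters whether the underlying issue is fixed upstream in Ray 2.4.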
0a1,2
> # From https://github.com/ray-project/ray/blob/ray-2.4.0/python/ray/autoscaler/_private/command_runner.py
>
140c142
< "ControlPersist": "10s",
---
> "ControlPersist": "300s",
Do we still need this? Considering we override it for the default case here: https://github.com/skypilot-org/skypilot/blob/9dca02b2c6b0290d46d1ab03829879cee9949d8c/sky/utils/command_runner.py#LL80C22-L80C22
I might be missing something though
Our ray's node_provider is still using the command_runner from ray instead of our own command_runner, so I think it might still be needed?
Makes sense
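As a side note on what that setting actually does, here is a hedged sketch of how persistent-connection SSH options like ControlPersist are typically assembled; the function name and socket path below are illustrative, not the actual API of ray's or SkyPilot's command_runner:

import hashlib
import os
from typing import List


def ssh_control_options(host: str, control_persist: str = '300s') -> List[str]:
    """Build -o flags that let consecutive SSH commands reuse one connection."""
    # One control socket per host; hash the host so the unix-socket path
    # stays under the OS length limit.
    digest = hashlib.md5(host.encode()).hexdigest()[:10]
    control_path = os.path.expanduser(f'~/.ssh/cm-{digest}.sock')
    return [
        '-o', 'ControlMaster=auto',
        '-o', f'ControlPath={control_path}',
        '-o', f'ControlPersist={control_persist}',
    ]


# Example: prepend the flags to an ssh invocation.
print(' '.join(['ssh'] + ssh_control_options('1.2.3.4') + ['ubuntu@1.2.3.4', 'echo hi']))

A longer ControlPersist (300s vs 10s) keeps the master connection alive between commands, so the override only matters for code paths that still go through ray's copied command_runner rather than sky/utils/command_runner.py, which is the point of the exchange above.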
tests/test_smoke.py (outdated)
@@ -269,6 +269,7 @@ def test_azure_region():
f'sky status --all | grep {name} | grep eastus2', # Ensure the region is correct.
],
f'sky down -y {name}',
timeout=30 * 60, # 30 mins
Since this just launches the minimal example, it should not take too long. Should we reduce this to something more reasonable (say 10 min)? :)
Just reverted it. I am testing it again. : )
Thanks for the review @romilbhardwaj! I fixed most of them as well as the comments from @concretevitamin in #1618. I am running the tests again:
- pytest tests/test_smoke.py
- tests in the comments.
Note: I updated the local ray version to 2.2.0 to resolve the package version conflicts for click/grpcio/protobuf. Wdyt @romilbhardwaj @concretevitamin?
Thanks @Michaelvll! Code looks good to go. Since we are now also bumping local ray to 2.2, I still need to do some backward compatibility tests (e.g., upgrading from Ray 1.9).
sky/backends/cloud_vm_ray_backend.py (outdated)
@@ -200,8 +200,7 @@ def add_prologue(self,
self.job_id = job_id
# Should use 'auto' or 'ray://<internal_head_ip>:10001' rather than
# 'ray://localhost:10001', or 'ray://127.0.0.1:10001', for public cloud.
# Otherwise, it will a bug of ray job failed to get the placement group
# in ray <= 2.4.0.
# Otherwise, it will a bug of ray job failed to get the placement group.
nit: This sentence seems malformed - does it mean "Otherwise, ray will fail to get the placement group because of a bug in ray job"?
Fixed. Thanks!
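To make the intent of that comment concrete, here is a minimal sketch of the two addressing modes it contrasts, assuming the driver runs on the cluster itself; this is illustrative and not the exact code generated by add_prologue:

import ray

# Recommended for a driver running on the cluster: 'auto' attaches to the
# local GCS directly, avoiding the placement-group lookup issue mentioned
# in the thread.
ray.init(address='auto')

# The form the comment warns against on public clouds: going through the
# Ray Client server on localhost.
# ray.init(address='ray://127.0.0.1:10001')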
Sounds good. Thanks for the great work!
Awesome work @Michaelvll! Ran some quick backcompat tests upgrading from Ray 1.13 (logs). Should be good to go if smoke tests pass.
This is a routine upgrade of the remote ray version we are using, from 2.0.1 to 2.4.0, to unblock #1586.
Azure: existing Azure clusters might be affected by this PR, due to the changes to the VM naming in upstream ray to avoid name conflicts (ref). It should be fine for now, as we don't have many active Azure users.
Tested (run the relevant ones):
- sky launch: check all the patches work as expected.
- pytest tests/test_smoke.py
- pytest tests/test_smoke.py --aws
- bash tests/backward_compatibility_tests.sh
- pip install . in a clean Python 3.10 environment with ray==2.0.1 installed.