[Core] Upgrade ray to 2.4.0 #1734
Conversation
Thanks @Michaelvll! I tried it out with some manual test cases and it works nicely. Backward compatibility also seems to be working (started a cluster with 2.0.1, upgraded to 2.4, tried exec and launch with various YAMLs). Haven't tried spot yet, but if the tests pass I assume it should be good.
@@ -4,7 +4,8 @@
import logging
import os
import time
from typing import Any, Dict, List, Tuple, Union
from enum import Enum
I assume changes to this file are from https://github.com/ray-project/ray/blob/ray-2.4.0/python/ray/autoscaler/_private/aws/cloudwatch/cloudwatch_helper.py
Ahh, yes, it was directly copied from that file, with the change at L12 where we changed it to use aws.utils in sky.skylet.providers instead of the ray one.
from ray.dashboard.modules.job import job_manager
_run_patch(job_manager.__file__, _to_absolute('job_manager.py.patch'))
Do we know if this has been fixed in Ray 2.4?
Ahh, this is from #1618 (comment). I will test the large-scale ray job submit again:
- for i in {1..1000}; do ray job submit --job-id $i-gcpuser-2 --address http://127.0.0.1:8265 --no-wait 'echo hi; sleep 800; echo bye'; done
  (ray 2.4 is more robust against the OOM caused by too many submitted jobs: the job will correctly fail when OOM happens instead of hanging)
- for i in {1..1000}; do ray job submit --job-id $i-gcpuser-2 --address http://127.0.0.1:8265 --no-wait 'echo hi; sleep 1; echo bye'; sleep 1; done
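For context, here is a minimal sketch of how a runtime patch helper like the _run_patch call quoted above could be implemented, assuming the GNU patch utility is available on the node; this is illustrative, not necessarily SkyPilot's actual implementation:

import os
import subprocess


def _to_absolute(patch_name: str) -> str:
    """Resolve a patch file shipped next to this module (hypothetical layout)."""
    return os.path.join(os.path.dirname(os.path.abspath(__file__)), patch_name)


def _run_patch(target_file: str, patch_file: str) -> None:
    """Patch target_file in place with the `patch` CLI.

    --forward turns the call into a no-op if the patch was already applied,
    so running this on every setup is safe.
    """
    subprocess.run(
        ['patch', '--forward', '--quiet', target_file, patch_file],
        check=False,  # a non-zero exit here usually just means "already applied"
    )

In a scheme like this, the job_manager patch gets applied (or skipped) each time ray is set up on the cluster, which is why it matters whether the underlying issue is fixed upstream in Ray 2.4.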
0a1,2
> # From https://github.com/ray-project/ray/blob/ray-2.4.0/python/ray/autoscaler/_private/command_runner.py
>
140c142
< "ControlPersist": "10s",
---
> "ControlPersist": "300s",
Do we still need this? Considering we override it for the default case here: https://github.com/skypilot-org/skypilot/blob/9dca02b2c6b0290d46d1ab03829879cee9949d8c/sky/utils/command_runner.py#LL80C22-L80C22
I might be missing something though
Our ray's node_provider is still using the command_runner from ray instead of our own command_runner, so I think it might still be needed?
Makes sense
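As a side note on what that setting actually does, here is a hedged sketch of how persistent-connection SSH options like ControlPersist are typically assembled; the function name and socket path below are illustrative, not the actual API of ray's or SkyPilot's command_runner:

import hashlib
import os
from typing import List


def ssh_control_options(host: str, control_persist: str = '300s') -> List[str]:
    """Build -o flags that let consecutive SSH commands reuse one connection."""
    # One control socket per host; hash the host so the unix-socket path
    # stays under the OS length limit.
    digest = hashlib.md5(host.encode()).hexdigest()[:10]
    control_path = os.path.expanduser(f'~/.ssh/cm-{digest}.sock')
    return [
        '-o', 'ControlMaster=auto',
        '-o', f'ControlPath={control_path}',
        '-o', f'ControlPersist={control_persist}',
    ]


# Example: prepend the flags to an ssh invocation.
print(' '.join(['ssh'] + ssh_control_options('1.2.3.4') + ['ubuntu@1.2.3.4', 'echo hi']))

A longer ControlPersist (300s vs 10s) keeps the master connection alive between commands, so the override only matters for code paths that still go through ray's copied command_runner rather than sky/utils/command_runner.py, which is the point of the exchange above.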
tests/test_smoke.py (outdated)
@@ -269,6 +269,7 @@ def test_azure_region():
f'sky status --all | grep {name} | grep eastus2', # Ensure the region is correct.
],
f'sky down -y {name}',
timeout=30 * 60, # 30 mins
Since this just launches the minimal example, it should not take too long. Should we reduce this to something more reasonable (say 10 min)? :)
Just reverted it. I am testing it again. : )
Thanks for the review @romilbhardwaj! I fixed most of them as well as the comments from @concretevitamin in #1618. I am running the tests again:
- pytest tests/test_smoke.py
- tests in the comments.
Note: I updated the local ray version to 2.2.0 to resolve the package version conflicts for click/grpcio/protobuf. Wdyt @romilbhardwaj @concretevitamin?
Thanks @Michaelvll! Code looks good to go. Since we are now also bumping local ray to 2.2, I still need to do some backward compatibility tests (e.g., upgrading from Ray 1.9).
sky/backends/cloud_vm_ray_backend.py (outdated)
@@ -200,8 +200,7 @@ def add_prologue(self,
self.job_id = job_id
# Should use 'auto' or 'ray://<internal_head_ip>:10001' rather than
# 'ray://localhost:10001', or 'ray://127.0.0.1:10001', for public cloud.
# Otherwise, it will a bug of ray job failed to get the placement group
# in ray <= 2.4.0.
# Otherwise, it will a bug of ray job failed to get the placement group.
nit: This sentence seems malformed - does it mean "Otherwise, ray will fail to get the placement group because of a bug in ray job"?
Fixed. Thanks!
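To make the intent of that comment concrete, here is a minimal sketch of the two addressing modes it contrasts, assuming the driver runs on the cluster itself; this is illustrative and not the exact code generated by add_prologue:

import ray

# Recommended for a driver running on the cluster: 'auto' attaches to the
# local GCS directly, avoiding the placement-group lookup issue mentioned
# in the thread.
ray.init(address='auto')

# The form the comment warns against on public clouds: going through the
# Ray Client server on localhost.
# ray.init(address='ray://127.0.0.1:10001')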
Sounds good. Thanks for the great work!
Awesome work @Michaelvll! Ran some quick backcompat tests upgrading from Ray 1.13 (logs). Should be good to go if smoke tests pass.
This is a routine upgrade of the remote ray version we are using, from 2.0.1 to 2.4.0, to unblock #1586.
Azure: existing Azure clusters might be affected by this PR, due to the changes to the VM naming in upstream ray to avoid name conflicts (ref). It should be fine for now, as we don't have many active Azure users.
Tested (run the relevant ones):
- sky launch: check all the patches work as expected.
- pytest tests/test_smoke.py
- pytest tests/test_smoke.py --aws
- bash tests/backward_compatibility_tests.sh
- pip install . in a clean Python 3.10 environment with ray==2.0.1 installed.