[Core] Upgrade ray to 2.3.0 #1618
Conversation
from ray.dashboard.modules.job import job_manager
_run_patch(job_manager.__file__, _to_absolute('job_manager.py.patch'))
Reverted this, as our patch makes `ray job submit` get stuck randomly on the latest ray, and all tests pass without the patch. It is possible that the problem of `await ray_func.remote()` not raising exceptions has been fixed.
TODO:
- Find out which ray PR solves the problem (optional; possibly [Serve][Core] Fix serve_long_running memory leak by fixing GCS pubsub asyncio task leak, ray-project/ray#29187)
- Try large-scale job submission
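For reference, a minimal sketch of the kind of repro involved here (not the snippet actually used when developing the patch; the task and its error are illustrative):

```python
# Illustrative only -- not the actual snippet used when developing the patch.
# Checks whether an exception raised inside a remote task propagates through
# `await` in an asyncio driver, which is the behavior the old patch worked around.
import asyncio

import ray


@ray.remote
def failing_task():
    raise ValueError("boom")


async def main():
    try:
        await failing_task.remote()  # should raise RayTaskError wrapping ValueError
    except Exception as exc:
        print(f"exception propagated as expected: {type(exc).__name__}")
    else:
        print("no exception raised -- the behavior the patch worked around")


if __name__ == "__main__":
    ray.init()
    asyncio.run(main())
```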
Nice catch! Do we have the snippet used when developing the patch?
Reminder
Thanks @Michaelvll for upgrading this. This will pave the way for Graviton. Did a pass.
- Shall we run the smoke tests with `--aws` as well, since the node provider is being updated?
- The backward compatibility on the 3 clouds may need to be tested.
sky/backends/cloud_vm_ray_backend.py
Outdated
@@ -193,7 +193,7 @@ def add_prologue(self,
 # Should use 'auto' or 'ray://<internal_head_ip>:10001' rather than
 # 'ray://localhost:10001', or 'ray://127.0.0.1:10001', for public cloud.
 # Otherwise, it will a bug of ray job failed to get the placement group
-# in ray <= 2.0.1.
+# in ray <= 2.2.0.
This seems unneeded?
Reminder
Removed this in #1734.
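For reference, a minimal illustration of the address guidance in that comment (the namespace value below is made up):

```python
import ray

# Per the comment above: on a public cloud, prefer 'auto' (or the internal head
# IP with the Ray Client port) over a localhost address.
ray.init(address='auto', namespace='sky-example')  # 'sky-example' is illustrative

# Avoid this form -- it hits the placement-group lookup bug described above
# in ray <= 2.2.0:
# ray.init(address='ray://127.0.0.1:10001')
```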
sky/setup_files/setup.py
Outdated
@@ -64,7 +64,7 @@ def parse_readme(readme: str) -> str:

 install_requires = [
     'wheel',
-    # NOTE: ray 2.0.1 requires click<=8.0.4,>=7.0; We disable the
+    # NOTE: ray 2.2.0 requires click<=8.0.4,>=7.0; We disable the
Is this true for 2.2?
Ahh, good catch. They recently added support for click>8.0.4 in ray==2.2.0 (ray-project/ray#29574).
TODO:
- Test with the latest click (8.1.3): `pytest tests/test_smoke.py`
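A quick, informal way to sanity-check the resolved versions locally (the printed values are just what one would hope to see):

```python
# Informal sanity check of the resolved dependency versions after installing
# the new ray and an unpinned click.
import click
import ray

print('ray', ray.__version__)      # e.g. 2.2.0 (later bumped to 2.3.0 in this PR)
print('click', click.__version__)  # e.g. 8.1.3 once the <=8.0.4 pin is dropped
```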
-0a1,3
-> # Adapted from https://github.com/ray-project/ray/blob/ray-2.0.1/python/ray/_private/log_monitor.py
+0a1,4
+> # Adapted from https://github.com/ray-project/ray/blob/ray-2.2.0/python/ray/_private/log_monitor.py
+> # Fixed the problem for progress bar, as the latest version does not preserve \r for progress bar.
+> # The change is adapted from https://github.com/ray-project/ray/blob/ray-1.10.0/python/ray/_private/log_monitor.py#L299-L300
Do we want both L2 and L4?
Ahh, sorry for the confusion. I will fix the comment to:
# It is reverted to https://github.com/ray-project/ray/blob/ray-1.10.0/python/ray/_private/log_monitor.py#L299-L300
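For context, a simplified sketch of the behavior those reverted lines restore (not the actual patch code):

```python
# Simplified sketch, not the actual log_monitor patch: the newer log monitor
# strips line endings in a way that also drops the carriage return that
# progress bars (e.g. tqdm) rely on to redraw in place. The patch reverts to
# the ray-1.10.0 behavior of stripping only the trailing newline.
def strip_line_ending(next_line: str) -> str:
    if next_line.endswith("\n"):
        next_line = next_line[:-1]  # keep any '\r' so progress bars still render
    return next_line
```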
Thanks @Michaelvll. Some questions.
@@ -1,7 +1,7 @@
 0a1,4
 > # Adapted from https://github.com/ray-project/ray/blob/ray-2.2.0/python/ray/_private/log_monitor.py
 > # Fixed the problem for progress bar, as the latest version does not preserve \r for progress bar.
-> # The change is adapted from https://github.com/ray-project/ray/blob/ray-1.10.0/python/ray/_private/log_monitor.py#L299-L300
+> # It is reverted to https://github.com/ray-project/ray/blob/ray-1.10.0/python/ray/_private/log_monitor.py#L299-L300
A bit confused: what is reverted to what...? Not understanding this line given L2.
Fixed in #1734. PTAL. : )
@@ -124,8 +124,8 @@ setup_commands:
 (type -a pip | grep -q pip3) || echo 'alias pip=pip3' >> ~/.bashrc;
 which conda > /dev/null 2>&1 || (wget -nc https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && bash Miniconda3-latest-Linux-x86_64.sh -b && eval "$(/home/azureuser/miniconda3/bin/conda shell.bash hook)" && conda init && conda config --set auto_activate_base true);
 source ~/.bashrc;
 (pip3 list | grep ray | grep {{ray_version}} 2>&1 > /dev/null || pip3 install -U ray[default]=={{ray_version}}) && mkdir -p ~/sky_workdir && mkdir -p ~/.sky/sky_app && touch ~/.sudo_as_admin_successful;
 (pip3 list | grep skypilot && [ "$(cat {{sky_remote_path}}/current_sky_wheel_hash)" == "{{sky_wheel_hash}}" ]) || (pip3 uninstall skypilot -y; pip3 install "$(echo {{sky_remote_path}}/{{sky_wheel_hash}}/skypilot-{{sky_version}}*.whl)[azure]" && echo "{{sky_wheel_hash}}" > {{sky_remote_path}}/current_sky_wheel_hash || exit 1);
 (pip3 list | grep "ray " | grep {{ray_version}} 2>&1 > /dev/null || pip3 install --exists-action w -U ray[default]=={{ray_version}}) && mkdir -p ~/sky_workdir && mkdir -p ~/.sky/sky_app && touch ~/.sudo_as_admin_successful;
nit: spell out `--exists-action wipe` for clarity.
Why do we need it now? It seems like the grep would have ensured there is no existing package by the time we reach `pip3 install`?
This is needed because otherwise the original patched files would not be removed from the package, leaving the upgraded ray package corrupted by stale files. It makes sure that an existing VM upgraded with `sky launch` (which upgrades the ray version) will not have stale files in its ray package.

> nit: spell out `--exists-action wipe` for clarity

It seems only `w` is valid; `wipe` is not.
sky/setup_files/setup.py
Outdated
@@ -64,9 +64,9 @@ def parse_readme(readme: str) -> str:

 install_requires = [
     'wheel',
-    # NOTE: ray 2.0.1 requires click<=8.0.4,>=7.0; We disable the
+    # NOTE: ray 2.2.0 requires click>=7.0; We disable the
Local ray versions may be older than 2.2. Does that mean the `click<=8.0.4` constraint is still needed?
I decided to upgrade the local ray version to 2.2.0 due to a bunch of dependency conflicts in #1734. Wdyt?
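For context, a hypothetical sketch of how the pins could end up looking in setup.py once the click constraint is relaxed (the exact specifiers are settled in #1734, not here):

```python
# Hypothetical sketch only -- the real list lives in sky/setup_files/setup.py.
install_requires = [
    'wheel',
    # ray >= 2.2 no longer pins click<=8.0.4 (ray-project/ray#29574),
    # so a plain lower bound is assumed to be enough here.
    'click>=7.0',
    # Illustrative pin; the actual bounds may differ.
    'ray[default]==2.3.0',
]
```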
* update the patches
* upgrade node providers
* fix azure config.py
* print sky queue
* add back azure disk size
* fix job manager
* fix hash
* longer timeout
* fix test smoke
* Remove the patch for job_manager
* longer timeout for azure_region test
* address comments
* format
* fix templates
* pip install --exists-action
* Upgrade to 2.3 instead
* upgrade to ray 2.3
* update patches for 2.4
* adopt changes for azure providers: a777a028b8dbd7bbae9a7393c98f6cd65f98a5f5
* fix license
* fix patch for log monitor
* sleep longer for the multi-echo
* longer waiting time
* longer wait time
* fix click dependencies
* update setup.py
* Fix #1618 (comment)
* fix #1618 (comment)
* revert test_smoke
* fix comment
* revert to w instead of wipe
* rewording
* minor fix
This is a routine upgrade of the ray version we use from 2.0.1 to 2.3.0.
TODO:
Tested (run the relevant ones):
- `sky launch`: check all the patches work as expected.
- `pytest tests/test_smoke.py`
- `pytest tests/test_smoke.py --aws`
- `bash tests/backward_comaptibility_tests.sh`
- `ray job` commands not responding (#1088): `for i in {1..1000}; do ray job submit --job-id $i-gcpuser-2 --address http://127.0.0.1:8265 --no-wait 'echo hi; sleep 800; echo bye'; sleep 1; done` (later jobs after 339 keep pending, which seems to be a ray issue; asked in https://ray-distributed.slack.com/archives/C03JK4S65GS/p1676311951428969)