[Core] Upgrade ray to 2.3.0 #1618

Closed
wants to merge 23 commits into from

Collaborator

@Michaelvll Michaelvll commented Jan 23, 2023

This is a routine upgrade of the ray version we are using from 2.0.1 to 2.3.0.

TODO:

  • Figure out the backward compatibility

Tested (run the relevant ones):

Comment on lines -72 to -74
from ray.dashboard.modules.job import job_manager
_run_patch(job_manager.__file__, _to_absolute('job_manager.py.patch'))

Collaborator Author

@Michaelvll Michaelvll Jan 26, 2023

Reverted this: with the latest ray, our patch makes ray job submit hang randomly, and all the tests pass without the patch. It is possible that the problem of await ray_func.remote() not raising an exception has been fixed.

TODO:

Member

Nice catch! Do we have the snippet used when developing the patch?

Member

Reminder

sky/skylet/ray_patches/__init__.py (resolved)
@Michaelvll Michaelvll marked this pull request as ready for review January 27, 2023 07:24
Member

@concretevitamin concretevitamin left a comment

Thanks @Michaelvll for upgrading this. This will pave the way for Graviton. Did a pass.

  1. Shall we run smoke tests --aws as well, since the node provider is being updated?
  2. Backward compatibility on all 3 clouds may need to be tested.

@@ -193,7 +193,7 @@ def add_prologue(self,
 # Should use 'auto' or 'ray://<internal_head_ip>:10001' rather than
 # 'ray://localhost:10001', or 'ray://127.0.0.1:10001', for public cloud.
 # Otherwise, it will a bug of ray job failed to get the placement group
-# in ray <= 2.0.1.
+# in ray <= 2.2.0.

Member

This seems unneeded?

Member

Reminder

Collaborator Author

Removed this in #1734.

@@ -64,7 +64,7 @@ def parse_readme(readme: str) -> str:

 install_requires = [
 'wheel',
-# NOTE: ray 2.0.1 requires click<=8.0.4,>=7.0; We disable the
+# NOTE: ray 2.2.0 requires click<=8.0.4,>=7.0; We disable the

Member

Is this true for 2.2?

Collaborator Author

@Michaelvll Michaelvll Feb 1, 2023

Ahh, good catch. They recently added support for click>8.0.4 in ray==2.2.0 (ray-project/ray#29574).

TODO:

  • Test with the latest click 8.1.3 pytest tests/test_smoke.py
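
The click constraint discussed in this thread depends on the ray version. A minimal sketch of that decision, assuming (per the comments above) that ray releases before 2.2.0 need click<=8.0.4,>=7.0 and that 2.2.0 and later only need click>=7.0. The function name click_requirement is hypothetical and is not part of setup.py:

```python
def _parse_version(version: str) -> tuple:
    # Turn a plain "X.Y.Z" version string into a tuple of ints.
    return tuple(int(part) for part in version.split("."))


def click_requirement(ray_version: str) -> str:
    # Hypothetical helper: the 2.2.0 cutoff is taken from the discussion
    # in this thread (ray-project/ray#29574), not verified independently.
    if _parse_version(ray_version) < (2, 2, 0):
        return "click<=8.0.4,>=7.0"
    return "click>=7.0"
```

This only handles plain numeric versions; pre-release or local version strings would need a real version parser.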

sky/skylet/providers/aws/cloudwatch/cloudwatch_helper.py (outdated, resolved)
sky/skylet/providers/aws/node_provider.py (outdated, resolved)
sky/skylet/providers/aws/node_provider.py (resolved)
0a1,3
> # Adapted from https://github.com/ray-project/ray/blob/ray-2.0.1/python/ray/_private/log_monitor.py
0a1,4
> # Adapted from https://github.com/ray-project/ray/blob/ray-2.2.0/python/ray/_private/log_monitor.py
> # Fixed the problem for progress bar, as the latest version does not preserve \r for progress bar.
> # The change is adapted from https://github.com/ray-project/ray/blob/ray-1.10.0/python/ray/_private/log_monitor.py#L299-L300

Member

Do we want both L2 and L4?

Collaborator Author

Ahh, sorry for the confusion. Fixed the comment to:

# It is reverted to https://github.com/ray-project/ray/blob/ray-1.10.0/python/ray/_private/log_monitor.py#L299-L300
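
To illustrate why the handling of \r matters for progress bars, here is a minimal sketch. It assumes the difference between the two log_monitor.py versions comes down to whether a trailing carriage return is stripped along with the newline. The two helper functions are hypothetical and are not copied from ray:

```python
def strip_newline_only(line: str) -> str:
    # Drop one trailing newline but keep a preceding carriage return.
    return line[:-1] if line.endswith("\n") else line


def strip_all_line_endings(line: str) -> str:
    # Drop every trailing carriage return and newline character.
    return line.rstrip("\r\n")


# A made-up progress bar update that ends with carriage return + newline.
progress_update = "50%|#####     |\r\n"
kept = strip_newline_only(progress_update)         # still ends with "\r"
dropped = strip_all_line_endings(progress_update)  # the "\r" is gone
```

With the carriage return preserved, a terminal can redraw the bar in place; once it is stripped, each update is printed as a separate line.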

Member

@concretevitamin concretevitamin left a comment

Thanks @Michaelvll. Some questions.

@@ -1,7 +1,7 @@
0a1,4
> # Adapted from https://github.com/ray-project/ray/blob/ray-2.2.0/python/ray/_private/log_monitor.py
> # Fixed the problem for progress bar, as the latest version does not preserve \r for progress bar.
> # The change is adapted from https://github.com/ray-project/ray/blob/ray-1.10.0/python/ray/_private/log_monitor.py#L299-L300
> # It is reverted to https://github.com/ray-project/ray/blob/ray-1.10.0/python/ray/_private/log_monitor.py#L299-L300

Member

A bit confused: what is reverted to what...? Not understanding this line given L2.

Collaborator Author

Fixed in #1734. PTAL. : )

@@ -124,8 +124,8 @@ setup_commands:
(type -a pip | grep -q pip3) || echo 'alias pip=pip3' >> ~/.bashrc;
which conda > /dev/null 2>&1 || (wget -nc https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && bash Miniconda3-latest-Linux-x86_64.sh -b && eval "$(/home/azureuser/miniconda3/bin/conda shell.bash hook)" && conda init && conda config --set auto_activate_base true);
source ~/.bashrc;
(pip3 list | grep ray | grep {{ray_version}} 2>&1 > /dev/null || pip3 install -U ray[default]=={{ray_version}}) && mkdir -p ~/sky_workdir && mkdir -p ~/.sky/sky_app && touch ~/.sudo_as_admin_successful;
(pip3 list | grep skypilot && [ "$(cat {{sky_remote_path}}/current_sky_wheel_hash)" == "{{sky_wheel_hash}}" ]) || (pip3 uninstall skypilot -y; pip3 install "$(echo {{sky_remote_path}}/{{sky_wheel_hash}}/skypilot-{{sky_version}}*.whl)[azure]" && echo "{{sky_wheel_hash}}" > {{sky_remote_path}}/current_sky_wheel_hash || exit 1);
(pip3 list | grep "ray " | grep {{ray_version}} 2>&1 > /dev/null || pip3 install --exists-action w -U ray[default]=={{ray_version}}) && mkdir -p ~/sky_workdir && mkdir -p ~/.sky/sky_app && touch ~/.sudo_as_admin_successful;

Member

nit: spell out --exists-action wipe for clarity

Why do we need it now? It seems like the grep would've ensured when we reach pip3 install there's no existing package?
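
The switch from grep ray to grep "ray " in the template changes which lines of pip3 list can satisfy the check. A rough Python sketch of the two filters, using a made-up pip3 list output. The package ray-extra and all versions below are illustrative assumptions, and the sketch treats both patterns as plain substrings, whereas grep treats them as regular expressions:

```python
def matches(pip_list_output: str, name_pattern: str, version: str) -> bool:
    # Mimic `pip3 list | grep <name_pattern> | grep <version>` with
    # simple substring checks applied line by line.
    return any(
        name_pattern in line and version in line
        for line in pip_list_output.splitlines()
    )


# Hypothetical output: ray itself is NOT at 2.3.0, but another package
# whose name starts with "ray" happens to be.
SAMPLE = "\n".join([
    "Package    Version",
    "---------- -------",
    "ray        2.0.1",
    "ray-extra  2.3.0",
])

loose = matches(SAMPLE, "ray", "2.3.0")    # satisfied by the ray-extra line
strict = matches(SAMPLE, "ray ", "2.3.0")  # only the real ray line can match
```

In this example the loose check would skip the install even though ray is outdated, while the stricter pattern would not.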

Collaborator Author

@Michaelvll Michaelvll May 23, 2023

This is needed because otherwise the original patched files will not be removed from the package, leaving the upgraded ray package corrupted by stale files. It makes sure that an existing VM upgraded with sky launch (which upgrades the ray version) will not have stale files in the ray package.

nit: spell out --exists-action wipe for clarity

It seems only w is valid, but wipe is not.
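
For reference, pip's --exists-action option takes single-letter values: (s)witch, (i)gnore, (w)ipe, (b)ackup, (a)bort. A small sketch that assembles the install command from the template as an argument list and rejects spelled-out values such as wipe. build_ray_install_cmd is a hypothetical helper, not SkyPilot code:

```python
# Single-letter actions accepted by pip's --exists-action option.
VALID_EXISTS_ACTIONS = {"s", "i", "w", "b", "a"}


def build_ray_install_cmd(ray_version: str, exists_action: str = "w") -> list:
    # Build the pip3 command from the template as an argument list,
    # refusing values that pip itself would reject.
    if exists_action not in VALID_EXISTS_ACTIONS:
        raise ValueError(
            f"invalid --exists-action value {exists_action!r}; "
            f"expected one of {sorted(VALID_EXISTS_ACTIONS)}"
        )
    return [
        "pip3", "install",
        "--exists-action", exists_action,
        "-U", f"ray[default]=={ray_version}",
    ]
```

The sketch only builds the command; it does not run pip.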

@@ -64,9 +64,9 @@ def parse_readme(readme: str) -> str:

 install_requires = [
 'wheel',
-# NOTE: ray 2.0.1 requires click<=8.0.4,>=7.0; We disable the
+# NOTE: ray 2.2.0 requires click>=7.0; We disable the

Member

Local ray versions may be older than 2.2. Does that mean the click<=8.0.4 constraint is still needed?

Collaborator Author

I decided to upgrade the local ray version to 2.2.0 due to a number of dependency conflicts in #1734. Wdyt?

@Michaelvll Michaelvll changed the title from "[Core] Upgrade ray to 2.2.0" to "[Core] Upgrade ray to 2.3.0" Feb 27, 2023
@Michaelvll Michaelvll closed this Feb 27, 2023
@Michaelvll Michaelvll deleted the upgrade-ray-2.2 branch February 27, 2023 23:31
@Michaelvll Michaelvll restored the upgrade-ray-2.2 branch February 27, 2023 23:31
@Michaelvll Michaelvll deleted the upgrade-ray-2.2 branch February 27, 2023 23:32
Michaelvll added a commit that referenced this pull request May 23, 2023
Michaelvll added a commit that referenced this pull request May 23, 2023
@Michaelvll Michaelvll mentioned this pull request May 23, 2023
8 tasks
Michaelvll added a commit that referenced this pull request May 26, 2023
* update the patches

* upgrade node providers

* fix azure config.py

* print sky queue

* add back azure disk size

* fix job manager

* fix hash

* longer timeout

* fix test smoke

* Remove the patch for job_manager

* longer timeout for azure_region test

* address comments

* format

* fix templates

* pip install --exists-action

* Upgrade to 2.3 instead

* upgrade to ray 2.3

* update patches for 2.4

* adopt changes for azure providers: a777a028b8dbd7bbae9a7393c98f6cd65f98a5f5

* fix license

* fix patch for log monitor

* sleep longer for the multi-echo

* longer waiting time

* longer wait time

* fix click dependencies

* update setup.py

* Fix #1618 (comment)

* fix #1618 (comment)

* revert test_smoke

* fix comment

* revert to w instead of wipe

* rewording

* minor fix