
[Core] Upgrade ray to 2.4.0 #1734

Merged: 44 commits merged into master from upgrade-ray-2.3 on May 26, 2023
Conversation

@Michaelvll Michaelvll (Collaborator) commented on Feb 27, 2023:

This is a routine upgrade of the remote Ray version we use from 2.0.1 to 2.4.0, to unblock #1586.

Azure: existing Azure clusters might be affected by this PR, due to upstream Ray's change of VM naming to avoid name conflicts (ref). It should be fine for now, as we don't have many active Azure users.

Tested (run the relevant ones):

  • Any manual or new tests for this PR (please specify below)
    • sky launch to check that all the patches work as expected.
  • All smoke tests: pytest tests/test_smoke.py
  • pytest tests/test_smoke.py --aws
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh
    • AWS
    • GCP
  • pip install . in a clean Python 3.10 environment with ray==2.0.1 installed.

@Michaelvll Michaelvll changed the title from "[Core] Upgrade ray to 2.3.0" to "[Core] Upgrade ray to 2.4.0" on May 9, 2023
@romilbhardwaj romilbhardwaj (Collaborator) left a comment:

Thanks @Michaelvll! I tried it out with some manual test cases and it works nicely. Backward compatibility also seemed to be working (started a cluster with 2.0.1, upgraded to 2.4, tried exec and launch with various YAMLs). Haven't tried spot yet, but if the tests pass I assume it should be good.

@@ -4,7 +4,8 @@
import logging
import os
import time
from typing import Any, Dict, List, Tuple, Union
from enum import Enum
Collaborator Author:

Ahh, yes, it was directly copied from that file, with the change at L12 where we import aws.utils from sky.skylet.providers instead of from ray.
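
For illustration, a minimal sketch of the kind of one-line import swap described (the upstream Ray module path and surrounding code are assumed, not copied from the patch):

```python
# Hypothetical before/after of the one-line change described above (line numbers
# and surrounding code omitted):
#
#   before:  from ray.autoscaler._private.aws import utils   # assumed upstream path
#   after:
from sky.skylet.providers.aws import utils  # vendored copy, as described in the comment above
```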

Comment on lines -72 to -73
from ray.dashboard.modules.job import job_manager
_run_patch(job_manager.__file__, _to_absolute('job_manager.py.patch'))
Collaborator:

Do we know if this has been fixed in Ray 2.4?

@Michaelvll Michaelvll (Collaborator Author) commented on May 23, 2023:

Ahh, this is from #1618 (comment). I will test the large-scale ray job submit again (see the Python sketch after the commands):

  • for i in {1..1000}; do ray job submit --job-id $i-gcpuser-2 --address http://127.0.0.1:8265 --no-wait 'echo hi; sleep 800; echo bye'; done (Ray 2.4 is more robust against OOM caused by too many submitted jobs: the job correctly fails when OOM happens instead of hanging)
  • for i in {1..1000}; do ray job submit --job-id $i-gcpuser-2 --address http://127.0.0.1:8265 --no-wait 'echo hi; sleep 1; echo bye'; sleep 1; done
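
A rough Python-SDK equivalent of the shell loops above, in case it is easier to tweak; the dashboard address and submission-ID scheme mirror the commands, everything else is illustrative:

```python
import time

from ray.job_submission import JobSubmissionClient

# Submit many small jobs against the head node's job server, mirroring the
# shell loops above (scale and sleep durations are illustrative).
client = JobSubmissionClient('http://127.0.0.1:8265')
for i in range(1, 1001):
    client.submit_job(
        entrypoint='echo hi; sleep 1; echo bye',
        submission_id=f'{i}-gcpuser-2',  # same ID scheme as the shell commands
    )
    time.sleep(1)  # matches the second variant; drop this to stress-test the OOM behavior
```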

0a1,2
> # From https://github.com/ray-project/ray/blob/ray-2.4.0/python/ray/autoscaler/_private/command_runner.py
>
140c142
< "ControlPersist": "10s",
---
> "ControlPersist": "300s",
Collaborator:

Do we still need this? Considering we override it for the default case here: https://github.com/skypilot-org/skypilot/blob/9dca02b2c6b0290d46d1ab03829879cee9949d8c/sky/utils/command_runner.py#LL80C22-L80C22

I might be missing something though

Collaborator Author:

The node_provider we use from Ray still uses Ray's command_runner instead of our own command_runner, so I think it is still needed.

Collaborator:

Makes sense
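
For context, a minimal sketch (not the actual Ray or SkyPilot command runner) of how an SSH ControlPersist setting like the one patched above typically ends up on the command line:

```python
# Assemble `-o key=value` flags the way an SSH command runner typically does.
# The 300s value is the one set by the patch above; Ray's default is 10s.
# The other options are common defaults, assumed here for illustration.
ssh_options = {
    'ControlMaster': 'auto',
    'ControlPersist': '300s',
    'StrictHostKeyChecking': 'no',
}
flags = [arg for k, v in ssh_options.items() for arg in ('-o', f'{k}={v}')]
ssh_cmd = ['ssh'] + flags + ['ubuntu@<head_ip>', 'uptime']  # <head_ip> is a placeholder
print(' '.join(ssh_cmd))
```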

@@ -269,6 +269,7 @@ def test_azure_region():
f'sky status --all | grep {name} | grep eastus2', # Ensure the region is correct.
],
f'sky down -y {name}',
timeout=30 * 60, # 30 mins
Collaborator:

Since this just launches the minimal example, it should not take too long. Should we reduce this to something more reasonable (say, 10 min)? :)

Collaborator Author:

Just reverted it. I am testing it again. : )
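
For reference, a self-contained sketch (field names and the launch command are assumed, not the real test harness) of the shape of the smoke-test entry being discussed, with the timeout under debate:

```python
from typing import List, NamedTuple, Optional


class Test(NamedTuple):
    """Rough stand-in for the smoke-test record in tests/test_smoke.py (names assumed)."""
    name: str
    commands: List[str]
    teardown: Optional[str] = None
    timeout: int = 15 * 60  # assumed default


cluster_name = 'test-azure-region'
azure_region_test = Test(
    name='azure_region',
    commands=[
        # The launch command is assumed; only the status check appears in the diff above.
        f'sky launch -y -c {cluster_name} --cloud azure --region eastus2 examples/minimal.yaml',
        f'sky status --all | grep {cluster_name} | grep eastus2',  # ensure the region is correct
    ],
    teardown=f'sky down -y {cluster_name}',
    timeout=30 * 60,  # 30 mins; the suggestion above is ~10 min for the minimal example
)
```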

@Michaelvll Michaelvll mentioned this pull request on May 23, 2023
@Michaelvll Michaelvll (Collaborator Author) left a comment:

Thanks for the review @romilbhardwaj! I fixed most of them, as well as the comments from @concretevitamin in #1618. I am running the tests again:

  • pytest tests/test_smoke.py
  • the tests in the comments above.

Note: I updated the local ray version to 2.2.0 to resolve the package version conflicts for click/grpcio/protobuf. Wdyt @romilbhardwaj @concretevitamin?
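
A hypothetical illustration of what the local-Ray bump implies for the dependency spec; the exact bounds and extras are assumed, not taken from this PR:

```python
# Sketch of a setup.py-style requirement list after bumping the local Ray
# floor to 2.2.0; the click/grpcio/protobuf handling described in the note
# above is summarized in the comment, and no exact pins are claimed here.
install_requires = [
    'ray[default] >= 2.2.0',
    # click, grpcio and protobuf constraints can then follow Ray 2.2's own
    # requirements instead of being pinned against the older Ray 2.0.1.
]
```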

@romilbhardwaj romilbhardwaj (Collaborator) left a comment:

Thanks @Michaelvll! Code looks good to go. Since we are now also bumping the local Ray to 2.2, I still need to do some backward compatibility tests (e.g., upgrading from Ray 1.9).

@@ -200,8 +200,7 @@ def add_prologue(self,
self.job_id = job_id
# Should use 'auto' or 'ray://<internal_head_ip>:10001' rather than
# 'ray://localhost:10001', or 'ray://127.0.0.1:10001', for public cloud.
# Otherwise, it will a bug of ray job failed to get the placement group
# in ray <= 2.4.0.
# Otherwise, it will a bug of ray job failed to get the placement group.
Collaborator:

nit: This sentence seems malformed - does it mean "Otherwise, ray will fail to get the placement group because of a bug in ray job"?

Collaborator Author:

Fixed. Thanks!
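
A minimal sketch of the point in this thread: on a multi-node cloud cluster, connect with 'auto' (or 'ray://<internal_head_ip>:10001') rather than a localhost client address, which can hit the placement-group bug mentioned in the diffed comment. The code below is illustrative, not SkyPilot's generated job code:

```python
import ray
from ray.util.placement_group import placement_group

# Connect to the existing cluster. 'auto' avoids the 'ray://localhost:10001'
# / 'ray://127.0.0.1:10001' client path that the comment above says triggers
# a placement-group bug in Ray <= 2.4.0.
ray.init(address='auto', ignore_reinit_error=True)

# Simplified stand-in for the placement group a job would reserve.
pg = placement_group([{'CPU': 1}], strategy='STRICT_SPREAD')
ray.get(pg.ready())
print('placement group ready:', pg.bundle_specs)
```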

@romilbhardwaj romilbhardwaj added this to the v0.3 milestone May 23, 2023
@concretevitamin (Member) commented:

> Note: I updated the local ray version to 2.2.0 to resolve the package version conflicts for click/grpcio/protobuf. Wdyt @romilbhardwaj @concretevitamin?

Sounds good. Thanks for the great work!

@Michaelvll Michaelvll closed this May 24, 2023
@Michaelvll Michaelvll deleted the upgrade-ray-2.3 branch May 24, 2023 06:24
@Michaelvll Michaelvll restored the upgrade-ray-2.3 branch May 24, 2023 06:25
@Michaelvll Michaelvll reopened this May 24, 2023
@romilbhardwaj romilbhardwaj (Collaborator) left a comment:

Awesome work @Michaelvll! Ran some quick backcompat tests upgrading from Ray 1.13 (logs). Should be good to go if smoke tests pass.

@Michaelvll Michaelvll merged commit ef08910 into master May 26, 2023
@Michaelvll Michaelvll deleted the upgrade-ray-2.3 branch May 26, 2023 18:11

Successfully merging this pull request may close these issues.

An older ray version (2.0.1) that does not have a Linux aarch64 wheel should probably be updated (2.2.0+)