[core] move function and actor importer away from pubsub #24132
@ericl could you help check the PR and let me know whether I missed something? Hopefully, it should work. If it does, I'll get rid of run_on_all_nodes, and then import_thread.py is no longer necessary.
    def init_func(worker_info):
        a = ray.get_actor("recorder", namespace="n")
        a.record.remote(worker_info['worker'].worker_id)
The old code has a race condition. Before the worker calls run in loop, we can't call any remote function.
If init_func is called before the worker calls run in loop, it'll crash.
If init_func is called after the worker executes tasks, it'll be wrong.
This PR also makes me aware that it's not that easy to support creating new workers when GCS is down.
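To make the ordering constraint concrete, here is a toy model (not Ray code; `ToyWorker`, `submit_remote`, and the hook mechanism are all hypothetical names for illustration). It assumes remote calls are rejected until the run loop has started, and runs startup hooks after the loop starts but before the first task:

```python
class ToyWorker:
    """Hypothetical stand-in for a worker; only models call ordering."""

    def __init__(self):
        self.loop_started = False
        self.log = []

    def submit_remote(self, name):
        # Assumption: remote calls are invalid before the run loop starts.
        if not self.loop_started:
            raise RuntimeError("remote call before the run loop started")
        self.log.append(name)

    def run(self, startup_hooks, tasks):
        self.loop_started = True
        # Hooks run inside the safe window: loop started, no task executed yet.
        for hook in startup_hooks:
            hook(self)
        for task in tasks:
            task(self)


def init_func(worker):
    # Mirrors the test's init_func: it issues a remote call on startup.
    worker.submit_remote("record")
```

Calling `init_func` on a fresh `ToyWorker` raises, while passing it as a startup hook to `run` records it before any task output.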
Have we run nightly tests on this? Making imports lazy may carry unknown risks.
    )
    if self.fetch_and_register_remote_function(key) is True:
        break
    elif not self._worker.actor_id.is_nil():
Just `else`?
        job_id,
        function_descriptor.function_id.binary(),
    )
    if self.fetch_and_register_remote_function(key) is True:
Does this busy loop if fetching fails?
It's ok since the worker will not make progress anyway before
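For readers following this thread, a minimal sketch of the fetch-until-available pattern under discussion. This is not Ray's implementation: `ToyFunctionTable`, the dict-backed store, and the `poll_interval_s` and `max_attempts` parameters are assumptions added for illustration (the sleep and attempt cap are one way to avoid a hot spin, not something the PR is known to do).

```python
import time


class ToyFunctionTable:
    """Hypothetical worker-side function table backed by a plain dict."""

    def __init__(self, kv_store):
        self._kv = kv_store        # assumed key -> callable store
        self._registered = {}

    def fetch_and_register(self, key):
        """Return True once the function for `key` is registered locally."""
        if key in self._registered:
            return True
        fn = self._kv.get(key)
        if fn is None:
            return False           # not exported yet
        self._registered[key] = fn
        return True

    def wait_for_function(self, key, poll_interval_s=0.001, max_attempts=None):
        """Poll until the function is available, sleeping between attempts."""
        attempts = 0
        while not self.fetch_and_register(key):
            attempts += 1
            if max_attempts is not None and attempts >= max_attempts:
                raise TimeoutError(f"function {key!r} was not exported in time")
            time.sleep(poll_interval_s)
        return self._registered[key]
```

With `max_attempts=None` the loop blocks until the export shows up, which matches the point that the worker cannot make progress without the function anyway.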
Great, really hope this can work!
@mwtian it's running here https://buildkite.com/ray-project/release-tests-pr/builds/1232
+1 to trigger nightly tests on this. I'm fine with merging this if nightly tests pass 3x in a row with no import-related issues or hangs, and if the unit tests are not flaky.
Windows tests are broken, but that is not related to the flow change. Most failures are because the CI is not written well. I'll fix that and run the nightly tests again.
Merged with master and triggered the third run here: https://buildkite.com/ray-project/release-tests-pr/builds/1232. I feel there might not be enough resources. Let's wait and see.
scheduling_test_many_0s_tasks_many_nodes also failed with master.
shuffle_1tb_5000_partitions_1650980795 also failed with master.
Looks good. One comment is that we are still importing all dependencies in their export order, by calling `_do_import()` on worker startup and when a dependency is not found. This PR is not changing to per-function lazy import yet, so there is no import ordering issue. Maybe it is useful to clarify in the PR description.
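A rough sketch of what "import everything in export order, triggered on startup and on a miss" could look like. This is a toy model under assumptions, not the PR's code: `ToyImporter` and the append-only `export_log` list are hypothetical, and only the method name `_do_import` is borrowed from the discussion.

```python
class ToyImporter:
    """Hypothetical importer that replays an append-only export log in order."""

    def __init__(self, export_log):
        self._export_log = export_log   # assumed list of (key, function) pairs
        self._next_index = 0            # cursor into the export log
        self.functions = {}

    def _do_import(self):
        # Import every export not seen yet, strictly in export order.
        while self._next_index < len(self._export_log):
            key, fn = self._export_log[self._next_index]
            self.functions[key] = fn
            self._next_index += 1

    def get(self, key):
        # A miss triggers a catch-up import of all pending exports,
        # not a targeted fetch of just the missing key.
        if key not in self.functions:
            self._do_import()
        return self.functions[key]
```

Because a miss replays all pending exports rather than fetching one key, a function's dependencies exported earlier are always imported before it.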
Chatted with Yi offline, the only purpose of
@ericl it seems the runs lack resources and queue forever. The first one has completely finished. The second and third ones have partially finished. Do you think it's OK to let it go?
We can give it a try and keep an eye out.
Why are these changes needed?
This PR makes function import lazy. Several benefits of this:
Related issue number
Checks
I've run `scripts/format.sh` to lint the changes in this PR.