[core] Task missing in a distributed cluster with multi-node synchronization using SignalActor #18465
Is the log message just missing because you exit the job immediately after? Try adding a sleep after the get. Btw, Ray uses bin packing as its load-balancing policy, so what you describe is expected.
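A minimal, self-contained illustration of that suggestion (the task function f and the 5-second pause are illustrative, not from the thread):

```python
import time
import ray

ray.init(address="auto")  # connect to the existing cluster from the head node

@ray.remote
def f(i):
    print(f"task {i}")  # printed on whichever node runs the task
    return i

results = ray.get([f.remote(i) for i in range(7)])
time.sleep(5)  # without a pause, output printed on worker nodes can be lost when the driver exits
```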
There are no more outputs with an additional sleep after the get:
2021-09-09 21:13:52,314 INFO worker.py:825 -- Connecting to existing Ray cluster at address: 192.168.250.135:6379
ready...
set..
(pid=100, ip=192.168.250.143) 2021-09-09 21:14:02.600329: 8.0
(pid=159) 2021-09-09 21:14:02.598194: 9.4
(pid=183) 2021-09-09 21:14:02.597647: 8.0
(pid=182) 2021-09-09 21:14:02.599355: 8.7
(pid=184) 2021-09-09 21:14:02.598192: 8.7
(pid=251) 2021-09-09 21:14:02.599427: 7.4
I tried it here using the cluster test utils.
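For context, a minimal sketch of how a two-node setup can be simulated locally with Ray's cluster test utilities (ray.cluster_utils); the exact snippet referenced above is not shown in the thread, and the CPU counts below simply mirror the reported setup:

```python
import ray
from ray.cluster_utils import Cluster

# Simulate a two-node cluster in a single process: a head node plus one worker, 3 CPUs each.
cluster = Cluster(
    initialize_head=True,
    head_node_args={"num_cpus": 3},
)
cluster.add_node(num_cpus=3)

ray.init(address=cluster.address)
print(ray.cluster_resources())  # expect 6 CPUs in total
```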
Thanks very much for the helpful 'cluster_utils'. There seems to be over-scheduling, and I wonder whether this is the expected behavior. In my opinion, the last task should have started only after the first 6 tasks finished; the cluster should support 6 concurrent tasks, not 7. I improved the snippet so that each task also prints its node ID, and got the following output: 2021-09-14 15:50:41.074432: ready...
2021-09-14 15:50:41.379844: wait_and_go 0 NodeID(21448a73eb30c4adb778f65f9a9863a40a56cf0f26c06f8129c092ff)
2021-09-14 15:50:41.679673: wait_and_go 1 NodeID(21448a73eb30c4adb778f65f9a9863a40a56cf0f26c06f8129c092ff)
2021-09-14 15:50:41.988779: wait_and_go 2 NodeID(bf28d069074c4b74ee298533bca5cae7f256b910cab5cf4a0855439e)
2021-09-14 15:50:42.289914: wait_and_go 4 NodeID(21448a73eb30c4adb778f65f9a9863a40a56cf0f26c06f8129c092ff)
2021-09-14 15:50:42.290024: wait_and_go 3 NodeID(bf28d069074c4b74ee298533bca5cae7f256b910cab5cf4a0855439e)
2021-09-14 15:50:42.290515: wait_and_go 5 NodeID(bf28d069074c4b74ee298533bca5cae7f256b910cab5cf4a0855439e)
2021-09-14 15:50:42.706980: wait_and_go 6 NodeID(21448a73eb30c4adb778f65f9a9863a40a56cf0f26c06f8129c092ff)
2021-09-14 15:50:43.074825: set..
2021-09-14 15:50:43.079578: wait
2021-09-14 15:50:43.079589: 1.1
2021-09-14 15:50:43.079858: 1.7
2021-09-14 15:50:43.079858: 1.4
2021-09-14 15:50:43.330041: 1.0
2021-09-14 15:50:43.330312: 1.0
2021-09-14 15:50:43.330696: 1.0
2021-09-14 15:50:43.714716: 1.0
2021-09-14 15:50:48.083772: done 2
2021-09-14 15:50:48.084798: done 0
2021-09-14 15:50:48.084951: done 1
2021-09-14 15:50:48.332319: done 4
2021-09-14 15:50:48.334935: done 5
2021-09-14 15:50:48.335287: done 3
2021-09-14 15:50:48.718917: done 6
2021-09-14 15:50:48.720804: END
Yep, the "over-scheduling" is due to the blocking wait inside each task: while a task is blocked in ray.get waiting on the signal, it temporarily releases its CPU, so an extra task can be scheduled on that node. If this is problematic, one workaround is to use actors instead of tasks. Actors never release their CPUs even when they are blocked on waiting.
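A sketch of that workaround, assuming the SignalActor from the docs example is already defined; the Waiter actor name is illustrative, not from the thread:

```python
import ray

@ray.remote(num_cpus=1)  # explicitly reserve 1 CPU for the actor's whole lifetime
class Waiter:
    """Actor version of the wait_and_go task: its CPU stays reserved while it blocks."""
    def wait_and_go(self, signal):
        # Unlike a task, the actor does not give up its CPU while blocked in ray.get,
        # so no extra work can be over-scheduled onto this slot.
        ray.get(signal.wait.remote())
        return "done"

signal = SignalActor.remote()                    # SignalActor as in the Ray docs example
waiters = [Waiter.remote() for _ in range(6)]    # only 6 actors fit on the 6 available CPUs
pending = [w.wait_and_go.remote(signal) for w in waiters]
ray.get(signal.send.remote())
print(ray.get(pending))
```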
Thank you for the hint! It is very helpful for making sense of the Ray runtime.
Hi, I'm a bot from the Ray team :) To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months. If there is no further activity within the next 14 days, the issue will be closed!
You can always ask for help on our discussion forum or Ray's public Slack channel.
Hi again! The issue will be closed because there has been no activity in the 14 days since the last message. Please feel free to reopen or open a new issue if you'd still like it to be addressed. Again, you can always ask for help on our discussion forum or Ray's public Slack channel. Thanks again for opening the issue!
What is the problem?
Tasks are delivered in an unbalanced manner, and one task goes missing in the meantime.
7 tasks are launched on 2 nodes, each initialized with 3 CPUs.
Ray simultaneously delivers 5 tasks to the head node and 1 task to the worker node; one task is missing.
The expected behavior would have been to deliver 3 tasks to each node at first, and the remaining task to either node after a relaxation.
Perhaps this is the same problem as #14863.
Reproduction (REQUIRED)
The runtime is initialized from the docker image rayproject/ray:1.6.0-py38-cpu, with Python 3.8.5 and Ray 1.6.0 installed. A cluster of two nodes is initialized by running the following command separately on each node.
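The exact start commands are not quoted above; a plausible sketch, assuming the head-node address from the logs (192.168.250.135:6379) and 3 CPUs per node as described:

```bash
# On the head node:
ray start --head --port=6379 --num-cpus=3

# On the worker node:
ray start --address=192.168.250.135:6379 --num-cpus=3
```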
The head node launches the script, which is modified from the v1.6.0 documentation example: time.sleep is inserted for relaxation, and print_msg is implemented to print messages with a timestamp. If the code snippet cannot be run by itself, the issue will be closed with "needs-repro-script".
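The reproduction script itself is not quoted in the thread. A sketch of what it plausibly looks like, based on the SignalActor example in the Ray v1.6.0 docs plus the modifications described above (print_msg and the sleeps); the per-task timing details are assumptions reconstructed from the logged output:

```python
import asyncio
import datetime
import time

import ray

ray.init(address="auto")  # run from the head node, connect to the existing cluster


def print_msg(msg):
    # Prefix every message with a timestamp, matching the reported output.
    print(f"{datetime.datetime.now()}: {msg}")


@ray.remote
class SignalActor:
    def __init__(self):
        self.ready_event = asyncio.Event()

    def send(self, clear=False):
        self.ready_event.set()
        if clear:
            self.ready_event.clear()

    async def wait(self, should_wait=True):
        if should_wait:
            await self.ready_event.wait()


@ray.remote
def wait_and_go(signal, i):
    print_msg(f"wait_and_go {i} {ray.get_runtime_context().node_id}")
    start = time.time()
    ray.get(signal.wait.remote())  # blocking get: the task's CPU is released while it waits
    print_msg(round(time.time() - start, 1))  # elapsed wait time, the bare floats in the logs
    time.sleep(5)                  # simulated work after the signal
    print_msg(f"done {i}")
    return i


signal = SignalActor.remote()
tasks = [wait_and_go.remote(signal, i) for i in range(7)]
print_msg("ready...")
time.sleep(1)   # relaxation, as described above
print_msg("set..")
ray.get(signal.send.remote())
print_msg(ray.get(tasks))
print_msg("END")
time.sleep(5)   # keep the driver alive so worker-node output reaches the console
```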