[aDAG] Support multi-read of the same shm channel #47311
QQ: does this approach work for this kind of case? (I believe it will not.)
from ray.dag import InputNode
# s1 and s2 are actor handles; both tasks read the same input channel.
with InputNode() as inp:
    out = s1.fwd.bind(inp)
    dag = s2.fwd.bind(inp, out)
I also have some concerns that a deepcopy of the first result can hurt performance.
I wonder whether it is viable to simply not allow the additional write until all downstream tasks finish reading (I think it may work well with the buffering input PR).
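A minimal sketch of the alternative floated above, where a slot refuses a new write until every registered reader has consumed the current value. `MultiReadSlot`, `try_write`, and `read` are hypothetical names for illustration; this is a single-threaded toy under those assumptions, not Ray's shared memory channel.

```python
from typing import Any


class MultiReadSlot:
    """Toy single-value slot that must be read num_readers times per write."""

    def __init__(self, num_readers: int) -> None:
        if num_readers < 1:
            raise ValueError("num_readers must be at least 1")
        self._num_readers = num_readers
        self._value: Any = None
        self._reads_left = 0  # reads still owed for the current value

    def try_write(self, value: Any) -> bool:
        """Store value if all readers are done; return False otherwise."""
        if self._reads_left > 0:
            return False
        self._value = value
        self._reads_left = self._num_readers
        return True

    def read(self) -> Any:
        """Return the current value and count one read against it."""
        if self._reads_left == 0:
            raise RuntimeError("no unread value in the slot")
        self._reads_left -= 1
        return self._value


slot = MultiReadSlot(num_readers=2)
first_write = slot.try_write("x0")   # accepted: the slot is empty
early_write = slot.try_write("x1")   # rejected: two reads are still pending
reads = [slot.read(), slot.read()]   # both downstream tasks read "x0"
late_write = slot.try_write("x1")    # accepted: all readers have finished
```

In this sketch the writer is told to retry rather than being blocked; a real channel would more likely wait on the readers instead.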
Discussed offline. Deep/shallow copy overhead is pretty big, so we will assume that the input won't be changed (which is not strictly correct). We can allow copying inputs behind some sort of flag.
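To illustrate the trade-off in this comment, here is a small sketch of what sharing one deserialized value between readers implies, and what an opt-in copy would change. `resolve_for_readers` and the `copy_inputs` flag are hypothetical names, not part of the PR.

```python
import copy
from typing import Any, List


def resolve_for_readers(
    value: Any, num_readers: int, copy_inputs: bool = False
) -> List[Any]:
    """Return one argument per reader task from a single deserialized value."""
    if copy_inputs:
        # Each reader gets its own copy: safe, but pays the deepcopy cost.
        return [copy.deepcopy(value) for _ in range(num_readers)]
    # Every reader gets the same object: cheap, but mutations are shared.
    return [value] * num_readers


shared = resolve_for_readers({"step": 0}, num_readers=2)
shared[0]["step"] = 1            # first task mutates its input in place
leaked = shared[1]["step"]       # second task observes the mutation

isolated = resolve_for_readers({"step": 0}, num_readers=2, copy_inputs=True)
isolated[0]["step"] = 1
untouched = isolated[1]["step"]  # second task still sees the original value
```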
Could you add a description to the PR?
QQ: is intraprocess channel still needed after this change?
Looks like a good start, but let's try not to introduce the new use_cached flag if possible (it seems to make the code more complicated), and ideally clean up the cache once we know no more tasks will need it.
One suggestion might be to have an args cache per task, instead of a cache per channel. Then we can also do a list lookup instead of dict. Like:
cache: List[List[Any]]
The outer list index is the task idx on that actor, and the inner list is the resolved args for that task. The first time we deserialize an arg, we put it into the inner lists for all reader tasks.
Sounds good.
Why is a list lookup preferred over a dict? Is it a performance concern (hash function overhead)?
How do we know which inner-list slots to fill? Don't we still need a dict to maintain that info? Otherwise we would need to go through all items of all inner lists and replace them.
Yes, I thought list lookup would be better for performance and I think it also makes the garbage collection simpler. But the new approach here also looks OK.
Still WIP after using a new approach, will clean up more!
@ruisearch42 plz remove author-action-required when it is ready!
Looking into some GPU test CI failures, but PR is ready for another look.
Looks very clean! Can you run a microbenchmark and address the last test comments? Let's merge it after that.
Kicked off microbenchmark: https://buildkite.com/ray-project/release/builds/21924#0191a221-42bb-4bd7-8f01-25adf035540e
@ruisearch42 plz comment if there's any change here! If not, let's just merge it.
Signed-off-by: Rui Qiao <[email protected]>
@rkooo567 The microbenchmark aligns with past runs. The latest code just added a TODO comment compared to the last version, which passed full CI.
@ruisearch42 Auto-merge enabled. So we should follow up by removing the intra-process channel, right?
Premerge failure. Maybe consider merging the latest master?
If the same method of the same actor is bound to the same node (i.e., reads from the same shared memory channel), aDAG execution hangs. This PR adds support for this case by caching results read from the channel.
Signed-off-by: ujjawal-khare <[email protected]>
Why are these changes needed?
If the same method of the same actor is bound to the same node (i.e., reads from the same shared memory channel), aDAG execution hangs. This PR adds support for this case by caching results read from the channel.
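As a rough illustration of the caching idea, the sketch below reads the underlying channel once per round and serves the remaining readers from a cache, so a second reader never goes back to the channel for a value that has already been consumed. `CachedReader`, `read_fn`, and `num_reads` are hypothetical names, and a plain list stands in for the shared memory channel; this is a sketch under those assumptions, not the code merged in this PR.

```python
from typing import Any, Callable, List


class CachedReader:
    """Read the wrapped channel once, then serve num_reads readers from a cache."""

    def __init__(self, read_fn: Callable[[], Any], num_reads: int) -> None:
        if num_reads < 1:
            raise ValueError("num_reads must be at least 1")
        self._read_fn = read_fn
        self._num_reads = num_reads
        self._reads_left = 0  # cached reads still owed for the current value
        self._cached: Any = None

    def read(self) -> Any:
        if self._reads_left == 0:
            # First reader of this round: hit the underlying channel.
            self._cached = self._read_fn()
            self._reads_left = self._num_reads
        self._reads_left -= 1
        value = self._cached
        if self._reads_left == 0:
            self._cached = None  # drop the reference after the last reader
        return value


pending: List[str] = ["v0", "v1"]  # stand-in for values written to the channel
channel_reads: List[str] = []      # records every read of the underlying channel


def read_channel() -> str:
    value = pending.pop(0)
    channel_reads.append(value)
    return value


reader = CachedReader(read_channel, num_reads=2)
round_one = [reader.read(), reader.read()]  # one channel read, two results
round_two = [reader.read(), reader.read()]  # next value, again read once
```

Counting reads down to zero also gives a natural point to clear the cache, which matches the cleanup request raised in review.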
Related issue number
Closes #47041
Checks
Sign off every commit (git commit -s) in this PR.
Run scripts/format.sh to lint the changes in this PR.
If a method is added in Tune, add it in doc/source/tune/api/ under the corresponding .rst file.