[aDAG] Support multi-read of the same shm channel #47311

Merged
merged 1 commit into ray-project:master Aug 30, 2024

Conversation

@ruisearch42 (Contributor) commented Aug 24, 2024

Why are these changes needed?

If the same method of the same actor is bound more than once to the same upstream node (i.e., the actor reads from the same shared memory channel multiple times), aDAG execution hangs. This PR adds support for this case by caching the results read from the channel.
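
For illustration, a hedged repro sketch of the pattern described above, assuming the standard aDAG API (InputNode, .bind, experimental_compile); the EchoActor class and the values are made up for this example and are not taken from the PR or the linked issue.

```python
import ray
from ray.dag import InputNode, MultiOutputNode


@ray.remote
class EchoActor:
    def fwd(self, x):
        return x


actor = EchoActor.remote()

with InputNode() as inp:
    # The same method of the same actor is bound twice to the same upstream
    # node, so both tasks read from the same shared memory channel.
    out1 = actor.fwd.bind(inp)
    out2 = actor.fwd.bind(inp)
    dag = MultiOutputNode([out1, out2])

compiled_dag = dag.experimental_compile()
# This is the multi-read pattern reported to hang before the fix; with
# result caching it should return [1, 1].
print(ray.get(compiled_dag.execute(1)))
```

Both fwd tasks consume the value written once to the input channel, which is the multi-read situation this PR handles by caching.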

Related issue number

Closes #47041

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ruisearch42 changed the title from "[Not ready for review] [aDAG] Support multi-read of the same shm channel" to "[aDAG] Support multi-read of the same shm channel" Aug 26, 2024
@ruisearch42 added the go label (add ONLY when ready to merge, run all tests) Aug 26, 2024
@rkooo567 (Contributor) left a comment

QQ: does this approach work for this kind of case? (I believe it will not.)

with InputNode() as inp:
    out = s1.fwd.bind(inp)
    dag = s2.fwd.bind(inp, out)

I also have some concerns that a deepcopy of the first result could negatively affect performance.

I actually wonder whether a viable approach would be to simply not allow an additional write until all downstream tasks finish reading (I think this may work well with the input-buffering PR?).

@anyscalesam added the P0 (Issues that should be fixed in short order), core (Issues that should be addressed in Ray Core), and compiled-graph labels Aug 26, 2024
@ruisearch42 marked this pull request as ready for review August 27, 2024 00:33
@rkooo567 (Contributor) left a comment

Discussed offline. The deep/shallow copy overhead is pretty big, so we will assume the input won't be mutated (which is not strictly correct). We can later allow copying inputs behind some sort of flag.
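
A minimal sketch of the trade-off being discussed; the copy_inputs flag and read_cached helper below are hypothetical illustrations, not an actual aDAG API.

```python
import copy
from typing import Any


def read_cached(value: Any, copy_inputs: bool = False) -> Any:
    # By default, return the cached value as-is and assume the downstream
    # task will not mutate it (fast, but not strictly safe). If the
    # hypothetical copy_inputs flag is set, pay the deepcopy cost for safety.
    return copy.deepcopy(value) if copy_inputs else value
```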

@stephanie-wang self-assigned this Aug 27, 2024
@stephanie-wang (Contributor)

Could you add a description to the PR?

@kevin85421 self-assigned this Aug 27, 2024
@rkooo567 self-assigned this Aug 27, 2024
@rkooo567 (Contributor) left a comment

QQ: is the intra-process channel still needed after this change?

@stephanie-wang (Contributor) left a comment

Looks like a good start but let's try not to introduce the new use_cached flag if possible (it seems to make the code more complicated) and ideally clean up the cache once we know no more tasks will need it.

One suggestion might be to have an args cache per task, instead of a cache per channel. Then we can also do a list lookup instead of dict. Like:

cache: List[List[Any]]

The outer list index is the task idx on that actor, and the inner list is the resolved args for that task. The first time we deserialize an arg, we put it into the inner lists for all reader tasks.
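
A rough sketch of that layout, with hypothetical names (num_tasks, channel_to_reader_tasks, resolve_arg_once) that are not from the actual compiled_dag_node.py implementation:

```python
from typing import Any, Dict, List

num_tasks = 3  # number of DAG tasks executing on this actor (example value)
# Which task indices on this actor read from a given channel.
channel_to_reader_tasks: Dict[int, List[int]] = {0: [1, 2]}

# cache[task_idx] holds the resolved args for that task, in arg order.
cache: List[List[Any]] = [[] for _ in range(num_tasks)]


def resolve_arg_once(channel_id: int, value: Any) -> None:
    # Deserialize the channel value once, then append it to the arg list
    # of every task on this actor that reads from the channel.
    for task_idx in channel_to_reader_tasks[channel_id]:
        cache[task_idx].append(value)
```

With this shape, once a task has consumed its inner list the entry can simply be cleared, which is what makes garbage collection straightforward.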

@ruisearch42 (Contributor, Author)

> Looks like a good start but let's try not to introduce the new use_cached flag if possible (it seems to make the code more complicated) and ideally clean up the cache once we know no more tasks will need it.

Sounds good.

> One suggestion might be to have an args cache per task, instead of a cache per channel. Then we can also do a list lookup instead of dict.

Why is a list lookup preferred over a dict? Is it a performance concern (hash function overhead)?

> The first time we deserialize an arg, we put it into the inner lists for all reader tasks.

How do we know where to put the value in the inner lists? Do we still need a dict to maintain that info? Otherwise we would need to go through all items of all inner lists and replace them?

@ruisearch42 removed the go label (add ONLY when ready to merge, run all tests) Aug 28, 2024
@stephanie-wang (Contributor)

> One suggestion might be to have an args cache per task, instead of a cache per channel. Then we can also do a list lookup instead of dict.

> Why is a list lookup preferred over a dict? Is it a performance concern (hash function overhead)?

> The first time we deserialize an arg, we put it into the inner lists for all reader tasks.

> How do we know where to put the value in the inner lists? Do we still need a dict to maintain that info? Otherwise we would need to go through all items of all inner lists and replace them?

Yes, I thought a list lookup would be better for performance, and I think it also makes the garbage collection simpler. But the new approach here also looks OK.

@ruisearch42 (Contributor, Author) commented Aug 28, 2024

Still WIP after using a new approach, will clean up more!

@rkooo567 added the @author-action-required label (The PR author is responsible for the next step. Remove tag to send back to the reviewer.) Aug 28, 2024
@rkooo567 (Contributor)

@ruisearch42 plz remove author-action-required when it is ready!

@ruisearch42 force-pushed the bind_multi branch 3 times, most recently from d007769 to a1dcb32, on August 29, 2024 17:21
@ruisearch42 added the go label and removed the @author-action-required label Aug 29, 2024
@ruisearch42 (Contributor, Author)

Looking into some GPU test CI failures, but the PR is ready for another look.

@rkooo567 (Contributor) left a comment

Looks very clean! Can you run a microbenchmark and address the last test comments? Let's merge it after that.

@rkooo567 (Contributor)

@ruisearch42 plz comment if there's any change here! if not, let's just merge it

@ruisearch42 (Contributor, Author)

@rkooo567 The microbenchmark aligns with past runs:
https://buildkite.com/ray-project/release/builds/21950#0191a40e-897c-44b1-b1f9-388ea3761ff6

The latest code only adds a TODO comment compared to the last version, which passed full CI.

@rkooo567 enabled auto-merge (squash) August 30, 2024 17:59
@rkooo567 (Contributor)

@ruisearch42 auto-merge enabled. So we should follow up with removing the intra-process channel, right?

@rkooo567 (Contributor)

Premerge failure. Maybe consider merging the latest master?

@rkooo567 merged commit c9c150a into ray-project:master Aug 30, 2024
6 checks passed
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 12, 2024
If the same method of the same actor is bound more than once to the same upstream node (i.e., the actor reads from the same shared memory channel multiple times), aDAG execution hangs. This PR adds support for this case by caching the results read from the channel.

Signed-off-by: ujjawal-khare <[email protected]>

Successfully merging this pull request may close these issues:

[core][aDAG] Hangs if same shared memory channel is read twice by one actor (#47041)