[Data] Optimize block prefetching #35568
Conversation
kicking off the release test here: https://buildkite.com/ray-project/release-tests-pr/builds/39284
with stats.iter_wait_s.timer() if stats else nullcontext():
    prefetcher.prefetch_blocks(list(sliding_window))  # before
    prefetcher.prefetch_blocks([next_block])          # after
why is this change required instead of triggering the fetch of the entire sliding window?
This was a bug previously. There is no need to redundantly prefetch the same blocks (this applies to both the wait-based and actor-based prefetchers).
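To make the redundancy concrete, here is a small hypothetical sketch (not Ray's actual code): a recording prefetcher counts how many block-fetch requests each strategy issues. Re-prefetching the whole sliding window on every step requests overlapping blocks repeatedly, while prefetching only the block that newly entered the window requests each block exactly once.

```python
from collections import deque

# Hypothetical stand-in for a block prefetcher: it only records which
# blocks it was asked to fetch, so we can count redundant requests.
class RecordingPrefetcher:
    def __init__(self):
        self.requests = []

    def prefetch_blocks(self, blocks):
        self.requests.extend(blocks)

blocks = list(range(6))
window_size = 3

# Old behavior: re-prefetch the entire sliding window on every step.
old = RecordingPrefetcher()
window = deque(blocks[:window_size], maxlen=window_size)
old.prefetch_blocks(list(window))
for b in blocks[window_size:]:
    window.append(b)
    old.prefetch_blocks(list(window))  # windows overlap between calls

# Fixed behavior: prefetch only the block that newly entered the window.
new = RecordingPrefetcher()
new.prefetch_blocks(blocks[:window_size])
for b in blocks[window_size:]:
    new.prefetch_blocks([b])

print(len(old.requests))  # 12 requests, most of them redundant
print(len(new.requests))  # 6, exactly one per block
```

With 6 blocks and a window of 3, the old pattern issues 12 requests versus 6 for the fixed pattern, and the gap grows with the dataset size.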
@amogkam How can I check the benchmark results from this?
Looks like we still see the regression. I would expect that with prefetching we should be down to 70 seconds. Running it again to confirm.
if len(blocks_to_wait) > 0:
    ray.wait(blocks_to_wait, num_returns=1, fetch_local=True)
else:
    self._condition.wait()
Catch and log exceptions from this loop?
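A minimal sketch of the reviewer's suggestion, with plain `threading` standing in for Ray (the `ray.wait` call is replaced by a sleep, and the class name is illustrative): the loop body is wrapped in `try/except` so an unexpected error is logged instead of silently killing the background thread.

```python
import logging
import threading
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("prefetcher")

# Sketch of a background wait loop with exception handling applied:
# errors are caught and logged so they don't kill the thread.
class BackgroundWaiter:
    def __init__(self):
        self._condition = threading.Condition()
        self._blocks_to_wait = []
        self._stopped = False
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            try:
                with self._condition:
                    if self._stopped:
                        return
                    if self._blocks_to_wait:
                        block = self._blocks_to_wait.pop(0)
                    else:
                        self._condition.wait()
                        continue
                # Stand-in for ray.wait([block], num_returns=1,
                # fetch_local=True); here we just simulate latency.
                time.sleep(0.01)
            except Exception:
                logger.exception("Error in prefetch wait loop")

    def prefetch(self, blocks):
        with self._condition:
            self._blocks_to_wait.extend(blocks)
            self._condition.notify()

    def stop(self):
        with self._condition:
            self._stopped = True
            self._condition.notify()
        self._thread.join()

w = BackgroundWaiter()
w.prefetch(["block-1", "block-2"])
w.stop()
print("stopped cleanly")
```

Note the lock is released before the simulated wait so `prefetch` and `stop` are never blocked behind a slow fetch.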
Signed-off-by: Hao Chen <[email protected]>
Fixed a few bugs. Now
Hmm, previously we waited for all blocks queued. That might turn out to be important. I think we should follow the previous call pattern of passing all blocks to be prefetched to the wait call, still using num_returns=1.
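The call pattern being suggested, passing all outstanding blocks to one wait call that returns as soon as the first is ready, has a close stdlib analogue in `concurrent.futures.wait` with `FIRST_COMPLETED`. This sketch uses that analogue in place of `ray.wait(blocks, num_returns=1)`; the helper name and delays are made up for illustration.

```python
import concurrent.futures
import time

# Simulated block fetch with a configurable delay.
def fetch_block(i, delay):
    time.sleep(delay)
    return f"block-{i}"

with concurrent.futures.ThreadPoolExecutor() as pool:
    # Pass ALL outstanding blocks to a single wait, with different
    # completion times; return as soon as the first one is ready
    # (analogous to ray.wait(blocks, num_returns=1)).
    futures = [pool.submit(fetch_block, i, d)
               for i, d in enumerate([0.3, 0.05, 0.2])]
    done, pending = concurrent.futures.wait(
        futures, return_when=concurrent.futures.FIRST_COMPLETED)
    first = next(iter(done)).result()
    print(first)         # the fastest block, "block-1"
    print(len(pending))  # 2 still outstanding
```

The point of the pattern: whichever block finishes first unblocks the caller, rather than the caller committing to one specific block up front.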
After reverting to prefetching the entire sliding window every time,
@ericl Another question: currently each prefetcher instance will create a new thread (I added a "Prefetcher.stop" method to make sure the thread stops as soon as possible). Do you think this will create too many threads in practice? If so, we may want to use a static thread pool. I think it should be fine, because users are unlikely to use multiple "iter_batches" calls simultaneously.
Interesting. It could be that we auto-cancel previous wait requests. Unfortunately, the code here is very old. Also, I agree a thread per active iterator is totally fine.
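A sketch of the "one thread per active iterator, stopped eagerly" lifecycle under discussion (names and structure are illustrative, not Ray's actual `iter_batches`): the prefetch thread is created when iteration starts and torn down in a `finally`, mirroring the `Prefetcher.stop` behavior described above.

```python
import queue
import threading

# Illustrative sketch: one background prefetch thread per iterator,
# stopped as soon as iteration finishes (or is abandoned).
def iter_batches(blocks):
    q = queue.Queue()
    stop = threading.Event()

    def prefetch_loop():
        for b in blocks:
            if stop.is_set():
                return
            q.put(b)       # stand-in for issuing a block fetch
        q.put(None)        # sentinel: no more blocks

    t = threading.Thread(target=prefetch_loop, daemon=True)
    t.start()
    try:
        while (b := q.get()) is not None:
            yield b
    finally:
        stop.set()         # like Prefetcher.stop: end the thread ASAP
        t.join()

print(list(iter_batches([1, 2, 3])))  # [1, 2, 3]
```

Since the thread's lifetime is bounded by its iterator, concurrent `iter_batches` calls cost one thread each, which is why a static pool is likely unnecessary here.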
awesome, thanks @raulchen!
Per offline discussions, some possible explanations are:
Either way, we can merge this PR first to fix the regression.
## Why are these changes needed?
`WaitBlockPrefetcher` will blockingly wait for the first block. When the prefetch size is small, this can cause latency on the critical path. This PR moves the wait to a background thread.
## Related issue number
Closes #35521
Signed-off-by: e428265 <[email protected]>
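A toy illustration of the latency claim above (delays and function names are made up): if the iterator blocks on a wait before yielding, time-to-first-batch includes the full wait; issuing the same wait on a background thread removes it from the critical path.

```python
import threading
import time

FETCH_DELAY = 0.2  # arbitrary stand-in for a blocking block wait

def foreground_first_batch():
    """Blocking wait sits on the critical path before the first yield."""
    start = time.perf_counter()
    time.sleep(FETCH_DELAY)
    return time.perf_counter() - start

def background_first_batch():
    """The same wait runs on a background thread; the caller proceeds."""
    start = time.perf_counter()
    t = threading.Thread(target=lambda: time.sleep(FETCH_DELAY))
    t.start()
    elapsed = time.perf_counter() - start
    t.join()
    return elapsed

fg = foreground_first_batch()
bg = background_first_batch()
print(bg < fg)  # True: the background version returns immediately
```

This only models the scheduling, not data availability: in the real prefetcher the first block may still need fetching, but subsequent waits overlap with consumption instead of stalling it.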
## Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.