
[Data] Optimize block prefetching #35568

Merged: 15 commits merged into ray-project:master on Jun 1, 2023
Conversation

raulchen (Contributor) commented May 19, 2023

Why are these changes needed?

`WaitBlockPrefetcher` blocks while waiting for the first block. When the prefetch size is small, this adds latency on the critical path. This PR moves the wait to a background thread.
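For illustration, here is a minimal sketch of the idea, assuming a condition-variable design; the class and attribute names are hypothetical, not the actual Ray Data implementation:

```python
import threading

import ray


class WaitPrefetcherSketch:
    """Sketch: run the blocking ray.wait() on a background thread."""

    def __init__(self):
        self._pending = []  # ObjectRefs most recently queued for prefetch.
        self._condition = threading.Condition()
        self._stopped = False
        self._thread = threading.Thread(
            target=self._wait_loop, name="prefetcher", daemon=True
        )
        self._thread.start()

    def prefetch_blocks(self, blocks):
        # Called on the iterator's critical path; returns immediately.
        with self._condition:
            self._pending = list(blocks)
            self._condition.notify()

    def _wait_loop(self):
        while True:
            with self._condition:
                while not self._pending and not self._stopped:
                    self._condition.wait()
                if self._stopped:
                    return
                blocks = self._pending
                self._pending = []
            # The blocking wait happens here, off the critical path.
            ray.wait(blocks, num_returns=1, fetch_local=True)

    def stop(self):
        # Wake the thread so it can exit promptly.
        with self._condition:
            self._stopped = True
            self._condition.notify()
```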

Related issue number

closes #35521

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

amogkam (Contributor) commented May 22, 2023

kicking off the release test here: https://buildkite.com/ray-project/release-tests-pr/builds/39284

with stats.iter_wait_s.timer() if stats else nullcontext():
-    prefetcher.prefetch_blocks(list(sliding_window))
+    prefetcher.prefetch_blocks([next_block])
A reviewer commented on this diff:
why is this change required instead of triggering the fetch of the entire sliding window?

raulchen (Contributor, Author) replied:

This was a bug previously; there is no need to redundantly prefetch the same blocks (for both the wait-based and actor-based prefetchers).
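As a toy illustration of the redundancy (purely illustrative Python, not Ray code): with a sliding window of size 4 over 10 blocks, re-prefetching the whole window each step issues far more requests than prefetching only the newly entering block.

```python
from collections import deque

# Count the requests issued under the two call patterns discussed above.
blocks = list(range(10))
window = deque(maxlen=4)

whole_window_requests = 0
new_block_requests = 0
for block in blocks:
    window.append(block)
    whole_window_requests += len(window)  # prefetch_blocks(list(sliding_window))
    new_block_requests += 1               # prefetch_blocks([next_block])

print(whole_window_requests, new_block_requests)  # 34 vs 10
```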

raulchen (Contributor, Author) commented:

> kicking off the release test here: https://buildkite.com/ray-project/release-tests-pr/builds/39284

@amogkam How can I check the benchmark results from this?

amogkam (Contributor) commented May 22, 2023

iter-torch-batches-bs-32-prefetch-0-shuffleNone = {'time': 92.82726407200005}
iter-torch-batches-bs-32-prefetch-0-shuffle64 = {'time': 94.301909722}
iter-torch-batches-bs-32-prefetch-1-shuffleNone = {'time': 92.12948181000002}
iter-torch-batches-bs-32-prefetch-1-shuffle64 = {'time': 91.936336773}
iter-torch-batches-bs-32-prefetch-4-shuffleNone = {'time': 92.30885751599999}
iter-torch-batches-bs-32-prefetch-4-shuffle64 = {'time': 92.08876677199999}

Looks like we still see the regression; with prefetching, I would expect us to be down to about 70 seconds. Running it again to confirm.

if len(blocks_to_wait) > 0:
    ray.wait(blocks_to_wait, num_returns=1, fetch_local=True)
else:
    self._condition.wait()
A reviewer commented on this diff:
Catch and log exceptions from this loop?
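Something along these lines would do it; this is only a sketch, and the `_pending`, `_stopped`, and `_condition` attribute names are assumptions rather than the exact names in the PR:

```python
import logging

import ray

logger = logging.getLogger(__name__)


def _wait_loop(self):
    """Sketch: the background wait loop with exceptions caught and logged."""
    while not self._stopped:
        try:
            with self._condition:
                if not self._pending:
                    # Nothing queued yet; sleep until prefetch_blocks() notifies.
                    self._condition.wait()
                    continue
                blocks_to_wait, self._pending = self._pending, []
            # Block off the critical path until at least one block is local.
            ray.wait(blocks_to_wait, num_returns=1, fetch_local=True)
        except Exception:
            # Log instead of letting an unexpected error silently kill the thread.
            logger.exception("Error in prefetcher wait loop")
```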

@ericl added the @author-action-required label (The PR author is responsible for the next step. Remove tag to send back to the reviewer.) on May 30, 2023
raulchen (Contributor, Author) commented Jun 1, 2023

Fixed a few bugs. Now prefetch_batches=1 is more performant. But weirdly, prefetch_batches=4 performs the same as no prefetching.

iter-torch-batches-bs-32-prefetch-0-shuffleNone = {'time': 94.05068665299996}
iter-torch-batches-bs-32-prefetch-0-shuffle64 = {'time': 88.09606919100008}
iter-torch-batches-bs-32-prefetch-1-shuffleNone = {'time': 65.197580313}
iter-torch-batches-bs-32-prefetch-1-shuffle64 = {'time': 64.72387075799998}
iter-torch-batches-bs-32-prefetch-4-shuffleNone = {'time': 89.3722670200001}
iter-torch-batches-bs-32-prefetch-4-shuffle64 = {'time': 86.70470300799991}

ericl (Contributor) commented Jun 1, 2023

Hmm, previously we waited for all queued blocks; that might turn out to be important. I think we should follow the previous call pattern of passing all blocks to be prefetched to the wait call, while still using num_returns=1.
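Roughly, the suggested call pattern; this is a sketch, and `queued_blocks` is a stand-in name for whatever ObjectRefs are currently queued for prefetch:

```python
import ray

# Hand ray.wait() every block queued for prefetch so the object manager is
# hinted to start fetching all of them, but only block until one is local.
ready, _ = ray.wait(
    queued_blocks,     # all queued block ObjectRefs, not just the newest one
    num_returns=1,     # return as soon as any single block is available
    fetch_local=True,  # pull the object data to this node
)
```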

raulchen (Contributor, Author) commented Jun 1, 2023

After reverting to prefetching the entire sliding window every time, prefetch_batches=4 is now effective as well. I haven't figured out why, though.

iter-torch-batches-bs-32-prefetch-0-shuffleNone = {'time': 86.87967626900002}
iter-torch-batches-bs-32-prefetch-0-shuffle64 = {'time': 86.52172062299996}
iter-torch-batches-bs-32-prefetch-1-shuffleNone = {'time': 63.789988204}
iter-torch-batches-bs-32-prefetch-1-shuffle64 = {'time': 63.34117246599999}
iter-torch-batches-bs-32-prefetch-4-shuffleNone = {'time': 60.843809012000065}
iter-torch-batches-bs-32-prefetch-4-shuffle64 = {'time': 59.433647494999946}

@ericl Another question: currently, each prefetcher instance creates a new thread (I added a `Prefetcher.stop` method to make sure the thread stops as soon as possible). Do you think this will create too many threads in practice? If so, we may want to use a static thread pool. I think it should be fine, because users are unlikely to run multiple `iter_batches` calls simultaneously.
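For comparison, the static-thread-pool alternative mentioned above could look roughly like this; an illustrative sketch, not the merged design:

```python
from concurrent.futures import ThreadPoolExecutor

import ray

# A single shared pool instead of one thread per prefetcher instance, so
# concurrent iterators share a bounded set of worker threads.
_PREFETCH_POOL = ThreadPoolExecutor(max_workers=4, thread_name_prefix="prefetch")


def submit_prefetch(blocks):
    # Each blocking ray.wait() runs on a pooled thread; the caller returns
    # immediately and can ignore or inspect the returned Future.
    return _PREFETCH_POOL.submit(
        ray.wait, blocks, num_returns=1, fetch_local=True
    )
```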

ericl (Contributor) commented Jun 1, 2023

Interesting. It could be that we auto-cancel previous wait requests. Unfortunately, the code here is very old.

Also agree a thread per active iterator is totally fine.

amogkam (Contributor) left a review comment:
awesome, thanks @raulchen!

raulchen (Contributor, Author) commented Jun 1, 2023

Per offline discussions, some possible explanations are:

  1. The blocks that have been prefetched get evicted again; the second `ray.wait` either prevents them from being evicted or fetches them again.
  2. A second `ray.wait` somehow cancels the previous `ray.wait`.

Either way, we can merge this PR first to fix the regression.

@raulchen merged commit aa0d07e into ray-project:master on Jun 1, 2023
@raulchen deleted the wait-prefetch branch on Jun 1, 2023 at 22:35
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
Labels: @author-action-required

Successfully merging this pull request may close these issues: [Data] Performance regression in iter_batches prefetching

3 participants