
[Datasets] Avoid unnecessary reads when truncating a dataset with ds.limit() #27343

Conversation

@clarkzinzow (Contributor) commented on Aug 1, 2022

Datasets currently kicks off all read tasks eagerly when a dataset is truncated immediately after a read via ray.data.read_*().limit(); this wastes a lot of computation and bloats the object store, especially when inspecting a very small subset of the data.

This PR avoids these unnecessary reads by truncating the block list to the minimum number of blocks needed to meet the row limit before doing the actual block splitting, so unneeded read tasks are never materialized in the common splitting path.
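For illustration, here is a minimal, hypothetical sketch of the block-level truncation idea described above. The function name, the (block_ref, num_rows) representation, and the handling of unknown row counts are assumptions made for this example, not Ray's actual internals.

```python
# Hypothetical sketch (not Ray's actual internals): truncate a list of
# (block_ref, num_rows) pairs to the minimum prefix that covers `limit` rows,
# without splitting any block and without materializing the remaining blocks.
from typing import List, Optional, Tuple


def truncate_blocks_by_rows(
    blocks: List[Tuple[str, Optional[int]]], limit: int
) -> List[Tuple[str, Optional[int]]]:
    out: List[Tuple[str, Optional[int]]] = []
    rows_so_far = 0
    for block_ref, num_rows in blocks:
        out.append((block_ref, num_rows))
        if num_rows is None:
            # Unknown row count: keep the block and keep scanning, since we
            # can't yet prove the limit has been met.
            continue
        rows_so_far += num_rows
        if rows_so_far >= limit:
            # Enough rows are covered; the remaining (lazy) blocks are
            # dropped, so their read tasks are never launched.
            break
    return out


# With 100-row blocks and limit=250, only the first 3 blocks are kept.
blocks = [(f"block_{i}", 100) for i in range(10)]
assert len(truncate_blocks_by_rows(blocks, 250)) == 3
```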

Related issue number

Closes #27340

Checks

  • I've signed off every commit (using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@jianoaix (Contributor) commented on Aug 2, 2022

Nice, we should make all transform APIs real transforms rather than running them as consumption APIs.

python/ray/data/dataset.py (review thread resolved)
@@ -112,6 +112,30 @@ def split_by_bytes(self, bytes_per_split: int) -> List["BlockList"]:
)
return output

def truncate_by_rows(self, limit: int) -> "BlockList":
Contributor: It looks like truncate_by_blocks, since this works at the block level, with the constraint of covering the desired number of rows.

Contributor (Author): We're truncating to the number of rows given, similar to split_by_bytes above, where we split by the number of bytes given.

Contributor: Ok. One important point is that we don't split blocks in order to make up the desired bytes or rows. I don't have a good naming suggestion, though (maybe just compensate with comments).

Contributor (Author): That's already indicated in the docstring, so I think it should be fine.
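To make the point from this thread concrete: block-level truncation keeps whole blocks, so it can cover slightly more rows than the limit, and trimming to the exact row count happens later in the normal split path. The sketch below is purely illustrative, using plain Python lists as stand-in blocks and a hypothetical exact_limit helper rather than Ray's actual splitting code.

```python
# Illustrative only: exact trimming of the truncated block prefix down to the
# requested row count, slicing at most the final block. Plain lists stand in
# for blocks; this is not Ray's actual split implementation.
from typing import List


def exact_limit(blocks: List[List[int]], limit: int) -> List[List[int]]:
    out: List[List[int]] = []
    remaining = limit
    for block in blocks:
        if remaining <= 0:
            break
        out.append(block[:remaining])  # Only the last kept block is ever sliced.
        remaining -= len(block)
    return out


# Block-level truncation keeps 3 whole 100-row blocks for limit=250 (300 rows);
# the exact-trim step then slices the last block down to 50 rows.
truncated = [list(range(100)) for _ in range(3)]
assert [len(b) for b in exact_limit(truncated, 250)] == [100, 100, 50]
```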

python/ray/data/datasource/datasource.py (outdated review threads, resolved)
@clarkzinzow force-pushed the datasets/fix/limit-no-redundant-reads branch from d3847a0 to 3dc9eee on August 4, 2022 23:08

@clarkzinzow merged commit 313d553 into ray-project:master on Aug 5, 2022
clarkzinzow added a commit to clarkzinzow/ray that referenced this pull request Aug 5, 2022
….limit()` (ray-project#27343)

scv119 pushed a commit that referenced this pull request Aug 6, 2022
…datasets with `.limit()` (#27585)

* [Datasets] Avoid unnecessary reads when truncating a dataset with `ds.limit()` (#27343)


* [Datasets] [Docs] Improve `.limit()` and `.take()` docstrings (#27367)

Improve docstrings for .limit() and .take(), making the distinction between the two clearer.

Signed-off-by: Clark Zinzow <[email protected]>
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
….limit()` (ray-project#27343)


Signed-off-by: Stefan van der Kleij <[email protected]>
Development

Successfully merging this pull request may close these issues.

[Datasets] Avoid unnecessary execution/read to improve limit() performance