[Datasets] Avoid unnecessary reads when truncating a dataset with ds.limit()
#27343
Conversation
Force-pushed from b18b99f to 7c0b974.
Nice. We should make all transform APIs real transforms rather than running them as consumption APIs.
@@ -112,6 +112,30 @@ def split_by_bytes(self, bytes_per_split: int) -> List["BlockList"]:
        )
        return output

    def truncate_by_rows(self, limit: int) -> "BlockList":
This looks more like truncate_by_blocks, since it works at the block level with the constraint of covering the desired number of rows.
We're truncating to the number of rows given, similar to split_by_bytes above, where we're splitting by the number of bytes given.
OK. I think one important thing is that we don't split blocks in order to make up the desired bytes or rows. I don't have a good naming suggestion though (maybe just compensate with comments).
That's already indicated in the docstring so I think it should be fine.
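To make the point above concrete, here is a minimal, standalone sketch of block-level truncation; the `(block, num_rows)` pair representation and the function name are illustrative assumptions, not the actual `BlockList` internals:

```python
from typing import List, Tuple, TypeVar

Block = TypeVar("Block")


def truncate_blocks_by_rows(
    blocks: List[Tuple[Block, int]], limit: int
) -> List[Tuple[Block, int]]:
    """Keep the minimal prefix of blocks whose row counts cover `limit`.

    Blocks are never split, so the last retained block may overshoot the
    limit; the exact row-level cut still happens later in the splitting path.
    """
    output: List[Tuple[Block, int]] = []
    rows_so_far = 0
    for block, num_rows in blocks:
        if rows_so_far >= limit:
            break
        output.append((block, num_rows))
        rows_so_far += num_rows
    return output
```

With limit=10 and three blocks of 6 rows each, this sketch keeps the first two blocks (12 rows) and drops the third, matching the "no block splitting" behavior discussed here.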
Force-pushed from d3847a0 to 3dc9eee.
Datasets currently eagerly kicks off all read tasks when truncating a dataset immediately after a read via ray.data.read_*().limit(); this results in a lot of wasted computation and unnecessary object store bloat, especially when trying to poke at a very small subset of the data.

This PR avoids these unnecessary reads by truncating the block list to the minimum number of blocks needed to meet the row limit before doing the actual block splitting, thereby avoiding materialization of unnecessary read tasks in the common splitting path.
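As a rough usage illustration of the scenario described above (the dataset size and `parallelism` value are hypothetical, and the read avoidance relies on per-block row-count metadata being available before execution):

```python
import ray

# A dataset backed by many read tasks (hypothetical sizes; `parallelism`
# matches the Ray Data API at the time of this PR).
ds = ray.data.range(100_000, parallelism=200)

# Only enough blocks to cover 5 rows need to be materialized; the
# remaining read tasks are no longer eagerly executed.
print(ds.limit(5).take(5))
```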
Related issue number
Closes #27340
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.