[Data] Implement limit physical operator #34705

raulchen · 2023-04-24T02:56:20Z

Why are these changes needed?

Implemented the Limit physical operator for streaming execution.
Added the LimitStage for legacy compatibility.

Note: currently, when the limit operator reaches its limit, the upstream operators do not stop producing data. This will be optimized in a follow-up PR.
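For readers skimming the thread, the behavior described here can be illustrated with a small, hedged sketch. It is not the code from this PR: `LimitSketch` is a hypothetical stand-in that uses plain Python lists as blocks instead of Ray object references. It only mirrors the `self._limit - self._consumed_rows` bookkeeping visible in the diff excerpts, and shows that input arriving after the limit is reached is ignored rather than stopped at the source.

```python
from collections import deque
from typing import Deque, List


class LimitSketch:
    """Toy limit operator; blocks are plain lists (an assumption for illustration)."""

    def __init__(self, limit: int) -> None:
        self._limit = limit
        self._consumed_rows = 0
        self._buffer: Deque[List[int]] = deque()

    def limit_reached(self) -> bool:
        return self._consumed_rows >= self._limit

    def add_input(self, block: List[int]) -> None:
        if self.limit_reached():
            # Upstream may keep producing; extra blocks are simply dropped here.
            return
        remaining = self._limit - self._consumed_rows
        if len(block) > remaining:
            # Slice the last block so the total never exceeds the limit.
            block = block[:remaining]
        self._consumed_rows += len(block)
        self._buffer.append(block)

    def has_next(self) -> bool:
        return len(self._buffer) > 0

    def get_next(self) -> List[int]:
        return self._buffer.popleft()


op = LimitSketch(limit=5)
for upstream_block in ([0, 1, 2], [3, 4, 5], [6, 7, 8]):
    op.add_input(upstream_block)

output = []
while op.has_next():
    output.append(op.get_next())
```

In this toy run the third block is still produced by the caller and then discarded, which is the inefficiency the note defers to the follow-up PR.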

Related issue number

#34234

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: Hao Chen <[email protected]>

ericl · 2023-04-24T20:01:01Z

python/ray/data/_internal/execution/operators/limit_operator.py

+            # If we don't know the number of rows in the input, try to
+            # split at the maximum number of rows we can consume
+            # (`self._limit - self._consumed_rows`).
+            blocks_splits, metadata_splits = _split_at_indices(


Hmm, this is a blocking call, right? I think we want to ensure all operators are streaming-compatible. Maybe there should be an op inserted before the Limit that ensures all bundles have num_rows set, fetching it if needed.

Actually, can we assume num_rows is always available? After the read, I think this should always be populated.
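The "op that ensures all bundles have num_rows set" idea can be sketched roughly as follows. This is only an illustration under stated assumptions: `MetaSketch`, `Bundle`, and `ensure_num_rows` are hypothetical names, blocks are plain lists, and `count_rows` stands in for whatever fetch would be needed in a real pipeline. It is not an API from Ray Data.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class MetaSketch:
    """Hypothetical, simplified block metadata with only a row count."""

    num_rows: Optional[int]


Bundle = Tuple[List[int], MetaSketch]


def ensure_num_rows(bundle: Bundle, count_rows: Callable[[List[int]], int]) -> Bundle:
    """Return a bundle whose metadata has num_rows set, counting only if needed."""
    block, meta = bundle
    if meta.num_rows is not None:
        # Already populated (e.g. by the read), so no extra work is done.
        return bundle
    return block, replace(meta, num_rows=count_rows(block))


fetch_calls = []


def counting_len(block: List[int]) -> int:
    # Stand-in for a potentially expensive fetch; records each invocation.
    fetch_calls.append(len(block))
    return len(block)


known = ([1, 2, 3], MetaSketch(num_rows=3))
unknown = ([4, 5], MetaSketch(num_rows=None))

known_out = ensure_num_rows(known, counting_len)
unknown_out = ensure_num_rows(unknown, counting_len)
```

If num_rows is always populated after the read, as the second comment suggests, the counting branch would never run and such an op would be unnecessary.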

raulchen · 2023-04-24T22:11:22Z

python/ray/data/_internal/execution/operators/limit_operator.py

+        return self._buffer.popleft()
+
+    def get_stats(self) -> StatsDict:
+        return {self._name: self._output_metadata}


Not sure if this is the correct way to handle stats. I just followed map_operator. Are there any docs?

Seems reasonable, probably we should improve the docstring in interfaces.py

ericl · 2023-04-24T22:19:42Z

python/ray/data/_internal/execution/operators/limit_operator.py

+                # Slice the last block.
+                num_rows_to_take = self._limit - self._consumed_rows
+                self._consumed_rows = self._limit
+                block = BlockAccessor.for_block(ray.get(block)).slice(


We should still run it in a remote task to avoid needing to fetch the data block locally. This avoids the ray.put() later on.

Is the purpose of doing this to avoid putting too much data on the driver side? Or is there any other reason why we shouldn't do ray.put?

Yeah, avoiding fetching large blocks to the driver is important, and it also avoids an extra data copy. If the data is split in a remote task, then only one put() happens, instead of two.

ericl · 2023-04-24T22:20:48Z

python/ray/data/_internal/logical/operators/all_to_all_operator.py

@@ -115,3 +115,18 @@ def __init__(
        )
        self._key = key
        self._aggs = aggs
+
+
+class Limit(AbstractAllToAll):


Should this extend AbstractOneToOne instead of AbstractAllToAll?

This is the logical operator. It's not used right now, but I added it for completeness of the Datastream.limit API.
I think limit is also a kind of all-to-all logical operator, so this should make sense? Or would you prefer to have it extend LogicalOperator directly?

I don't have a strong opinion, but it makes sense to me that it should extend LogicalOperator rather than AllToAll, since technically there isn't any all-to-all data dependency between rows.

Makes sense. Fixed.
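The outcome of this thread can be pictured with a short sketch. The classes below are simplified, hypothetical stand-ins (the `Sketch` suffix marks them as such) and their constructor signatures are assumptions, not the real Ray Data definitions. The point is only the inheritance choice: the limit operator derives from the base logical operator rather than from the all-to-all marker class.

```python
from typing import List


class LogicalOperatorSketch:
    """Simplified stand-in for a logical operator base class."""

    def __init__(self, name: str, input_dependencies: List["LogicalOperatorSketch"]) -> None:
        self.name = name
        self.input_dependencies = input_dependencies


class AbstractAllToAllSketch(LogicalOperatorSketch):
    """Marker for operators where any output row may depend on any input row."""


class LimitLogicalSketch(LogicalOperatorSketch):
    """Limit has no all-to-all row dependency, so it extends the base class."""

    def __init__(self, input_op: LogicalOperatorSketch, limit: int) -> None:
        super().__init__("Limit", [input_op])
        self.limit = limit


read_op = LogicalOperatorSketch("Read", [])
limit_op = LimitLogicalSketch(read_op, limit=10)
```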

ericl · 2023-04-24T22:21:19Z

python/ray/data/datastream.py

+        if logical_plan is not None:
+            op = Limit(logical_plan.dag, limit=limit)
+            logical_plan = LogicalPlan(op)
+        return Datastream(plan, self._epoch, self._lazy, logical_plan)


ericl · 2023-04-24T22:25:13Z

I tried this out locally and it seems we don't shut down the first operator after the limit is reached. I guess that would be in the second PR only?

RAY_DATA_VERBOSE_PROGRESS=1 python

ericl

LGTM assuming you want to do the early shutdown in the second PR. Existing tests will suffice for this refactor.

raulchen · 2023-04-25T00:10:56Z

> LGTM assuming you want to do the early shutdown in the second PR. Existing tests will suffice for this refactor.

Yep, I plan to do that in a second PR. There are multiple ways to do this. I'll comment on the original issue. Let's continue discussing there.

Signed-off-by: Hao Chen <[email protected]>

ericl · 2023-04-25T03:59:09Z

Some tests seem to be failing with "ValueError: The size in bytes of the block must be known: (ObjectRef(00ffffffffffffffffffffffffffffffffffffff0100000056010000), BlockMetadata(num_rows=10, size_bytes=None, schema=None, input_files=[], exec_stats=None))", but I think this is a bug in the test.

The new streaming executor assumes that at least size_bytes is always known for all datasources. So I think we should fill in size_bytes for these tests and/or disable these tests.

Previously, the tests passed because we bypassed operator execution entirely with the old limit() impl.
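The failure mode described in this comment can be reproduced in miniature. The sketch below is an assumption-laden illustration: `BlockMetadataSketch` and `require_size_bytes` are hypothetical names, and only the error message wording is taken from the failing test output quoted above. It shows why a test fixture that leaves size_bytes as None fails once the block actually flows through an operator, and that filling the field in resolves it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BlockMetadataSketch:
    """Hypothetical, trimmed-down metadata with just the fields discussed here."""

    num_rows: Optional[int]
    size_bytes: Optional[int]


def require_size_bytes(meta: BlockMetadataSketch) -> int:
    """Return size_bytes, raising if the fixture did not provide it."""
    if meta.size_bytes is None:
        raise ValueError(f"The size in bytes of the block must be known: {meta}")
    return meta.size_bytes


# A fixture like the one in the failing tests: row count known, size unknown.
incomplete = BlockMetadataSketch(num_rows=10, size_bytes=None)
try:
    require_size_bytes(incomplete)
    error = None
except ValueError as exc:
    error = str(exc)

# The suggested fix: fill in size_bytes in the test data (80 is an arbitrary value).
complete = BlockMetadataSketch(num_rows=10, size_bytes=80)
size = require_size_bytes(complete)
```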

c21 · 2023-04-26T06:04:20Z

Tests are still failing, let's fix them in master.


raulchen requested review from ericl, scv119, c21, amogkam, scottjlee and bveeramani as code owners April 24, 2023 02:56

raulchen added 5 commits April 23, 2023 19:59

wip

babefde

Signed-off-by: Hao Chen <[email protected]>

Implemewnt physical operator

713e986

Signed-off-by: Hao Chen <[email protected]>

refine stage impl

97476f5

Signed-off-by: Hao Chen <[email protected]>

lint

8b16405

Signed-off-by: Hao Chen <[email protected]>

implement logical_plan

314159e

Signed-off-by: Hao Chen <[email protected]>

raulchen force-pushed the optimize-limit branch from e648694 to 7727fad Compare April 24, 2023 02:59

ericl reviewed Apr 24, 2023

raulchen changed the title ~~[WIP][Data] Optimize limit operator~~ [Data] Implement limit physical operator Apr 24, 2023

raulchen commented Apr 24, 2023

ericl assigned ericl, c21 and scottjlee Apr 24, 2023

ericl reviewed Apr 24, 2023

ericl approved these changes Apr 24, 2023

ericl reviewed Apr 24, 2023

raulchen added 5 commits April 24, 2023 17:19

refine

8e7e4c1

Signed-off-by: Hao Chen <[email protected]>

comment

8a1f8ed

Signed-off-by: Hao Chen <[email protected]>

support input_rows=None

90b54bd

Signed-off-by: Hao Chen <[email protected]>

optimize

43527da

Signed-off-by: Hao Chen <[email protected]>

Handle stats

5952c1f

Signed-off-by: Hao Chen <[email protected]>

refine

595d110

Signed-off-by: Hao Chen <[email protected]>

raulchen force-pushed the optimize-limit branch from 196fe77 to 595d110 Compare April 25, 2023 00:19

raulchen mentioned this pull request Apr 25, 2023

[data] [streaming] Limit operator shouldn't materialize stream #34234

Closed

extend

3b32d74

Signed-off-by: Hao Chen <[email protected]>

ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Apr 25, 2023

raulchen merged commit c4a29b2 into ray-project:master Apr 25, 2023

raulchen deleted the optimize-limit branch April 25, 2023 23:35

raulchen mentioned this pull request Apr 26, 2023

[Data] Fix limit operator #34800

Merged

raulchen added a commit that referenced this pull request Apr 27, 2023

[Data] Fix limit operator (#34800)

f419788

## Why are these changes needed? Fixes a bug introduced by #34705 ## Related issue number #34234

scottjlee mentioned this pull request May 1, 2023

[Dataset] Make limit() lazy and add logical operator for it #32761

Closed

architkulkarni pushed a commit to architkulkarni/ray that referenced this pull request May 16, 2023

[Data] Fix limit operator (ray-project#34800)

cb63c71

## Why are these changes needed? Fixes a bug introduced by ray-project#34705 ## Related issue number ray-project#34234

scottjlee mentioned this pull request May 30, 2023

[Data] Implement Limit Operator Pushdown #35900

Closed
