-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Data] Allow unknown estimate of operator output bundles and ProgressBar
totals
#46601
[Data] Allow unknown estimate of operator output bundles and ProgressBar
totals
#46601
Conversation
Signed-off-by: Scott Lee <[email protected]>
Signed-off-by: Scott Lee <[email protected]>
Signed-off-by: Scott Lee <[email protected]>
Signed-off-by: Scott Lee <[email protected]>
@@ -273,7 +275,7 @@ def num_outputs_total(self) -> int: | |||
return self._estimated_num_output_bundles | |||
if len(self.input_dependencies) == 1: | |||
return self.input_dependencies[0].num_outputs_total() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- should we remove the above if statement? this is the code logic that "assumes that each read task produces exactly one block"
- also, I'm thinking if we should make this method abstract to force each subclass to have a reasonable implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we remove the above if statement? this is the code logic that "assumes that each read task produces exactly one block"
Yeah, let me remove the if block, thanks.
In terms of forcing each operator to implement the method, I think for the majority of operators would simply return self._estimated_num_output_bundles
if available, otherwise None
. As a followup PR, I can refactor the individual logic to calculate self._estimated_num_output_bundles
for operators into the num_outputs_total()
method (for example, move this logic into num_outputs_total()
method). Let me know if you'd rather I include it in this PR instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I included the necessary changes in this PR, turned out to be simpler than I originally thought. For the default behavior, I included in PhysicalOperator.num_outputs_total()
which returns self._estimated_num_output_bundles
. Operators like AllToAllOperator
, Limit
, Union
, and Zip
have their own implementation which adds some more logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks much nicer. Can you update OutputSplitter
as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added logic to increment self._estimated_num_output_bundles
in OutputSplitter._get_next_inner()
. let me know if you have different logic in mind.
Signed-off-by: Scott Lee <[email protected]>
Signed-off-by: Scott Lee <[email protected]>
Signed-off-by: Scott Lee <[email protected]>
if len(self.input_dependencies) == 1: | ||
return self.input_dependencies[0].num_outputs_total() | ||
raise AttributeError | ||
return self._estimated_num_output_bundles |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment that subclasses should either override this method or update _estimated_num_output_bundles
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice
Signed-off-by: Scott Lee <[email protected]>
@@ -91,6 +91,9 @@ def has_next(self) -> bool: | |||
def _get_next_inner(self) -> RefBundle: | |||
output = self._output_queue.popleft() | |||
self._metrics.on_output_dequeued(output) | |||
if self._estimated_num_output_bundles is None: | |||
self._estimated_num_output_bundles = 0 | |||
self._estimated_num_output_bundles += 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better to just inherit num_outputs_total
from the input op.
because output splitter doesn't change the number of blocks.
Signed-off-by: Scott Lee <[email protected]>
Signed-off-by: Scott Lee <[email protected]>
Signed-off-by: Scott Lee <[email protected]>
Ray Data initially assumes that each read task produces exactly one block. Furthermore, one-to-one operators assume that the number of output blocks is the same as its upstream operator. Neither of these assumptions are always guaranteed to be accurate, which results in inaccurate progress bar estimations and can cause confusion. This PR updates
PhysicalOperator.num_outputs_total()
to allow for unknown estimated number of output bundles, which is the case when no tasks have finished, so it is not possible to provide a reasonable estimate.For example, given the following reproducible script:
We can compare the behavior before and after this PR (video links):
Why are these changes needed?
Related issue number
Closes #46420
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.