[WIP] Bulk executor initial implementation #30903
@@ -0,0 +1,99 @@

```python
import logging
from typing import Dict, List, Iterator, Optional

import ray
from ray.data._internal.execution.interfaces import (
    Executor,
    ExecutionOptions,
    RefBundle,
    PhysicalOperator,
)
from ray.data._internal.execution.operators.input_data_buffer import InputDataBuffer
from ray.data._internal.progress_bar import ProgressBar
from ray.data._internal.stats import DatasetStats

logger = logging.getLogger(__name__)


class BulkExecutor(Executor):
    def __init__(self, options: ExecutionOptions):
        super().__init__(options)
        self._stats = DatasetStats(stages={}, parent=None)
        self._executed = False

    def execute(
        self, dag: PhysicalOperator, initial_stats: Optional[DatasetStats] = None
    ) -> Iterator[RefBundle]:
        """Synchronously executes the DAG via bottom-up recursive traversal."""

        assert not self._executed, "Can only call execute once."
        self._executed = True
        if not isinstance(dag, InputDataBuffer):
            logger.info("Executing DAG %s", dag)

        if initial_stats:
            self._stats = initial_stats

        saved_outputs: Dict[PhysicalOperator, List[RefBundle]] = {}

        def execute_recursive(op: PhysicalOperator) -> List[RefBundle]:
            # Avoid duplicate executions.
            if op in saved_outputs:
                return saved_outputs[op]

            # Compute dependencies.
            inputs = [execute_recursive(dep) for dep in op.input_dependencies]

            # Fully execute this operator.
            logger.debug("Executing op %s", op.name)
            builder = self._stats.child_builder(op.name)
            try:
                for i, ref_bundles in enumerate(inputs):
                    for r in ref_bundles:
                        op.add_input(r, input_index=i)
                    op.inputs_done(i)
                output = _naive_run_until_complete(op)
            finally:
                op.shutdown()

            # Cache and return output.
            saved_outputs[op] = output
            op_stats = op.get_stats()
            op_metrics = op.get_metrics()
            if op_stats:
                self._stats = builder.build_multistage(op_stats)
                self._stats.extra_metrics = op_metrics
            return output

        return execute_recursive(dag)

    def get_stats(self) -> DatasetStats:
        assert self._stats is not None, self._stats
        return self._stats


def _naive_run_until_complete(op: PhysicalOperator) -> List[RefBundle]:
    """Run this operator until completion, assuming all inputs have been submitted.

    Args:
        op: The operator to run.

    Returns:
        The list of output ref bundles for the operator.
    """
    output = []
    tasks = op.get_work_refs()
    if tasks:
        bar = ProgressBar(op.name, total=op.num_outputs_total())
        while tasks:
            done, _ = ray.wait(tasks, fetch_local=True, timeout=0.1)
            for ready in done:
                op.notify_work_completed(ready)
            tasks = op.get_work_refs()
```
**Reviewer:** Nit: Since …

**Reviewer:** I understand that the …

**ericl:** This method is designed to work for "any possible operator" that correctly implements the interface, not just MapOperator. And yes, the operators fully support streaming.
```python
            while op.has_next():
                bar.update(1)
                output.append(op.get_next())
        bar.close()
    while op.has_next():
        output.append(op.get_next())
```
**Reviewer:** Two questions: …

If (1) is the case (i.e. this is not needed), this could be replaced with an assertion that this is in fact the case.

**ericl:** This is true for the Map operator, but other operators can have arbitrary behavior, such as no tasks at all.

**Reviewer:** I get the no-tasks case (and if that's the only case then maybe this should be in an …). Well, they can't have completely arbitrary behavior; the executor needs to have some contract around when it can expect the operator to have new outputs, and when it can expect the operator's output production to be finished. E.g. one sensible contract between the executor and the operator would be: …

For (1), we try to consume the operator outputs whenever a task finishes; for (2), we do a single pass of operator output consumption. Whatever the contract, I'm just arguing for making it as explicit as possible in the executor code, so devs (like me) aren't left wondering "wait, why would we need this last operator output consumption if the operator had tasks? Shouldn't it have been exhausted at the end of the task loop?"

**ericl:** Yes, good point. I added some comments in the docstring and here to clarify this completion behavior.

**Reviewer:** It might be clearer to add a method that indicates this completion state of the operator. The added comments at `has_next()` and `get_work_refs()` seem a bit off-topic for them.
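For illustration, here is a rough editorial sketch of what such an explicit completion check could look like. The `completed()` method is hypothetical (it is not part of this PR or of `PhysicalOperator` as shown above); it is assumed here to mean "the operator will produce no further outputs and has none buffered":

```python
import ray


def _run_until_complete_with_completion_flag(op):
    """Hypothetical variant of _naive_run_until_complete that relies on an
    explicit completion check instead of a trailing drain loop."""
    output = []
    while not op.completed():  # hypothetical: no more work and nothing buffered
        work = op.get_work_refs()
        if work:
            done, _ = ray.wait(work, fetch_local=True, timeout=0.1)
            for ready in done:
                op.notify_work_completed(ready)
        # Drain whatever outputs are available right now; the loop only exits
        # once the operator reports completion, so no extra final pass is needed.
        while op.has_next():
            output.append(op.get_next())
    return output
```

This makes the executor/operator contract explicit at the cost of one extra interface method, which appears to be the trade-off under discussion above.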
```python
    return output
```
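As an aside for readers of the diff: the heart of `BulkExecutor.execute` is the memoized bottom-up (post-order) traversal in `execute_recursive`. The toy below isolates that pattern with plain Python stand-ins; `Node` and `run` are illustrative only, not Ray APIs:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Node:
    """Stand-in for PhysicalOperator: a name plus upstream dependencies."""
    name: str
    inputs: Tuple["Node", ...] = ()

    def run(self, upstream: List[List[str]]) -> List[str]:
        # Pretend "execution" just records which upstream outputs were consumed.
        consumed = [item for bundle in upstream for item in bundle]
        return [f"{self.name}({', '.join(consumed)})"] if consumed else [self.name]


def execute_bottom_up(dag: Node) -> List[str]:
    saved: Dict[Node, List[str]] = {}

    def recurse(node: Node) -> List[str]:
        if node in saved:  # shared upstream nodes are executed only once
            return saved[node]
        upstream = [recurse(dep) for dep in node.inputs]
        saved[node] = node.run(upstream)  # fully execute, then cache the output
        return saved[node]

    return recurse(dag)


read = Node("read")
union = Node("union", (Node("map1", (read,)), Node("map2", (read,))))
print(execute_bottom_up(union))  # ['union(map1(read), map2(read))']
```

This mirrors how `saved_outputs` prevents a shared upstream operator from being executed twice.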
**Reviewer:** @ericl I don't think that this needs a timeout; the loop below appears to be a no-op if no new tasks are done. Could you remind me if/why this timeout is needed?
**ericl:** This avoids high-CPU spins, and also gives us a chance to interrupt. Btw, this isn't new code, just copying from the existing implementation of bulk wait.
**Reviewer:**

- Hmm, IIRC `ray.wait()` should result in a condition-variable wait somewhere in Ray Core, so shouldn't that avoid a high-CPU spin? Unless we're not compiling Ray with the correct flags on Linux...
- `ray.wait()` is already interruptible; it checks for signals every second regardless of the user-provided timeout, IIRC.

Understood, but we really don't want to cargo-cult suboptimal code into the new execution model if we can help it. We've already decided that we should use this redesign as an opportunity to address some long-standing tech debt and raise our quality bar.
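A minimal sketch of the two wait-loop shapes being debated here (the `noop` task is made up for illustration; neither call is code from this PR):

```python
import ray


@ray.remote
def noop():
    return None


tasks = [noop.remote() for _ in range(4)]

# Shape used in the diff: poll with a 100 ms timeout, so the surrounding loop
# body runs at least every 100 ms even if nothing has finished yet.
done, _ = ray.wait(tasks, fetch_local=True, timeout=0.1)

# Shape being suggested here: no timeout, so the call blocks until at least one
# task finishes; ray.wait still checks for signals periodically, so interrupts
# (e.g. Ctrl-C) remain responsive.
done, _ = ray.wait(tasks, fetch_local=True)
```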
**ericl:** This isn't the new execution model, though; it's just the legacy shim until we write the streaming executor. Hence, I'd like to avoid unnecessary execution-detail changes that could cause behavior changes.

I do think this is the correct choice, though, for a couple of reasons: …

Hence, this is a reasonable, if conservative, choice.
**Reviewer:** That makes sense; we can keep the old implementation for the legacy shim and defer any wait-loop tweaks until the streaming executor PR.

I don't think this would be ideal anyway, since we'd just be replacing the relatively efficient Core Worker wait plus a periodic signal check with an inefficient application-level spin loop. So agreed that this wouldn't be a good route.

Hmm, what do you mean by "batching" here? If you mean batching of multiple "ready" returns, `num_returns` is set to 1 by default, so that kind of batching is already not happening here: when `num_returns=1`, we immediately return a single ready ref when it becomes available, right? Quick example demonstrating these unbatched semantics:
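The reviewer's original snippet is not reproduced above; the following is an editorial sketch of the `num_returns=1` behavior being described (the task body and timings are made up):

```python
import time

import ray


@ray.remote
def sleep_then_return(seconds):
    time.sleep(seconds)
    return seconds


# One fast task and one slow task.
tasks = [sleep_then_return.remote(0.1), sleep_then_return.remote(30)]

start = time.time()
done, not_done = ray.wait(tasks, fetch_local=True, timeout=5)
print(len(done), len(not_done), time.time() - start)
# Expected: 1 1 ~0.1 (plus scheduling overhead). With the default num_returns=1,
# ray.wait returns as soon as the first ref is ready; it does not hold the
# result back for the rest of the 5 s timeout in order to batch more returns.
```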
If you are talking about concurrent object pulls (which I don't think you are, since that shouldn't matter here), that should have been fixed in Ray Core (with the bug + fix that you discovered) so that `len(tasks)` is fetched instead of `num_returns`. #30724

In any case, unlike the stream executor, which wants to immediately know that a task has finished in order to (potentially) launch new work (and should therefore have no timeout for its wait loop), the bulk executor only cares about updating the user-facing progress and stats reporting, so batching at a 100 ms granularity is pretty reasonable. If that 100 ms batching is the intention, then I think that we would want to specify `num_returns`:
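A sketch of the kind of change being suggested (editorial illustration, not necessarily the reviewer's exact suggestion): ask `ray.wait` for every ref that can finish within the 100 ms window rather than returning after the first one.

```python
import ray


@ray.remote
def noop():
    return None


tasks = [noop.remote() for _ in range(8)]

# As written in the diff (num_returns defaults to 1): return as soon as any
# single task finishes, or after 100 ms, whichever comes first.
done, _ = ray.wait(tasks, fetch_local=True, timeout=0.1)

# Batching variant: wait up to 100 ms and return every task that finished by then.
done, _ = ray.wait(tasks, num_returns=len(tasks), fetch_local=True, timeout=0.1)
```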
But yeah, we can just keep the existing bulk execution wait loop behavior, since this is just a legacy shim.
**ericl:** Ah, good catch. Seems like I missed this; indeed this is the batching I meant.