Deflake occasional deadlock in test_dataset.py::test_basic_actors[True] #21970
Conversation
If this is required for pipelining correctness, could the pipeline executor always force reads on the first stage? That way we wouldn't need to require each pipeline creation point to do it. |
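For what it's worth, a minimal sketch of that idea, assuming a hypothetical `execute_pipeline` helper and `@ray.remote` stage functions (this is not the actual Datasets pipeline executor):

```python
import ray

def execute_pipeline(read_stage_refs, later_stages):
    """Hypothetical sketch: block on every output of the read stage before
    submitting anything from later stages, so the read stage can never
    overlap (and contend for resources) with an actor-pool stage."""
    # Force the lazy read stage: wait until every block has been produced.
    ray.wait(read_stage_refs, num_returns=len(read_stage_refs),
             fetch_local=False)
    blocks = read_stage_refs
    for stage_fn in later_stages:
        # Each later stage is a @ray.remote function mapped over the blocks.
        blocks = [stage_fn.remote(b) for b in blocks]
    return blocks
```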
@clarkzinzow fixed, this is now passing reliably and ready for review.
LGTM, few nits
doesn't read blocks from the datasource until the first transform.
"""
blocks = self.get_internal_block_refs()
bar = ProgressBar("Force reads", len(blocks))
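For readers without the surrounding diff, here is a rough, self-contained approximation of the pattern above; `force_reads` is an illustrative name, and the internal `ProgressBar` (whose import path differs across Ray versions) is replaced by a plain `ray.wait` loop:

```python
import ray

def force_reads(ds):
    """Eagerly execute the read tasks backing `ds`.

    A dataset created from a datasource is lazy and doesn't read blocks until
    the first transform; waiting on the internal block refs forces the reads
    to finish now.
    """
    block_refs = list(ds.get_internal_block_refs())
    remaining = block_refs
    done_count = 0
    while remaining:
        # Wait for one block at a time so progress can be reported.
        done, remaining = ray.wait(remaining, num_returns=1, fetch_local=False)
        done_count += len(done)
        print(f"Force reads: {done_count}/{len(block_refs)} blocks ready")
    return ds
```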
Nice!
@@ -12,24 +12,31 @@
from ray.data.dataset_pipeline import DatasetPipeline

# Temporarily use an actor here to avoid ownership issues with tasks:
# https://github.com/ray-project/ray/issues/20554
@ray.remote(num_cpus=0, placement_group=None)
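To make the workaround concrete, a hedged sketch of the pattern (the decorator line mirrors the diff above; the class and method names are illustrative, not the actual Datasets code):

```python
import ray

# A long-lived, zero-CPU actor submits the stage's tasks, so the actor,
# rather than a transient worker task, owns the resulting object refs
# (see https://github.com/ray-project/ray/issues/20554).
@ray.remote(num_cpus=0, placement_group=None)
class StageRunner:
    def run_stage(self, fn, blocks):
        remote_fn = ray.remote(fn)  # wrap the plain function in a Ray task
        tasks = [remote_fn.remote(b) for b in blocks]
        return ray.get(tasks)

# Hypothetical usage:
# runner = StageRunner.remote()
# out = ray.get(runner.run_stage.remote(lambda b: [2 * x for x in b], blocks))
```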
Note that @stephanie-wang's PR that ports stage task launching to a threadpool will also fix this by launching all tasks from the driver, and she'll have to resolve some merge conflicts here. #21845
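Roughly, the thread-pool approach in #21845 amounts to something like the following sketch (hypothetical names, not the actual PR's code): every stage task is submitted from driver-side threads, so the driver owns all intermediate refs and the ownership issue never arises.

```python
import ray
from concurrent.futures import ThreadPoolExecutor

@ray.remote
def apply_stage(fn, block):
    return fn(block)

def launch_all_stages_from_driver(stages, blocks, max_threads=4):
    """Submit every stage's tasks from threads in the driver process."""
    for fn in stages:
        with ThreadPoolExecutor(max_workers=max_threads) as pool:
            # Submitting .remote() calls from multiple threads is safe; the
            # futures resolve to ObjectRefs owned by the driver.
            futures = [pool.submit(apply_stage.remote, fn, b) for b in blocks]
            blocks = [f.result() for f in futures]
    return blocks
```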
This might be an alternative to the thread pools as well, though it's not tolerant of actor failure.
Separately, @stephanie-wang, it seems to me that we could deadlock during reconstruction if the stages end up overlapping. We might need to rethink the actor pool implementation to avoid this (i.e., actor workers need to release resources if blocked on fetching data).
@ericl Great point!
On second thought, reconstruction isn't really supported with actor pools anyway, so perhaps we just need to make this clear in the documentation.
@ericl Don't actor tasks with the …
@clarkzinzow I don't think actors support reconstruction at all in any case.
@ericl The lineage reconstruction doc suggests that what I indicated above is the case, unless I'm missing something.
Not sure if normal tasks submitted by actor methods would be covered at the end of the Q1 work, however (multiple owners/borrowed refs in lineage). https://docs.google.com/document/d/1LVk6JFGmdgxzKOoR8lz90TBAIDhJTcsY2dkoCRRLSR8/edit?usp=drivesdk
Makes sense, but it wouldn't apply here since the actors are destroyed after stage execution.
Hmm, so the actors are destroyed once the executor goes out of scope, so this would be an issue for rewindowing, repeating, or splitting a pipeline. Makes sense.
Why are these changes needed?
This test is flaky. The underlying reason is that Datasets was not forcing eager evaluation of its blocks in the pipeline. This meant that the read stage could overlap with the actor pool stage, which can deadlock due to resource contention.
The fix is to force read tasks to finish prior to execution of the second stage, and to execute the stages from actors as a workaround to avoid ownership issues: #20554
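Putting the two parts together, a hedged end-to-end sketch of the fixed flow (API spellings such as `parallelism` and `compute="actors"` follow the Datasets API of this era and may differ in newer Ray releases):

```python
import ray

ray.init(num_cpus=2)

# Lazy dataset: the read tasks haven't produced any blocks yet.
ds = ray.data.range(100, parallelism=10)

# Part 1 of the fix: force the read tasks to finish before the next stage.
block_refs = ds.get_internal_block_refs()
ray.wait(block_refs, num_returns=len(block_refs), fetch_local=False)

# Part 2: the actor-pool stage now runs with the reads already complete, so
# the two stages cannot contend for the same CPUs and deadlock.
print(ds.map(lambda x: x * 2, compute="actors").take(5))
```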
To reproduce the hangs/failures: