
Deflake occasional deadlock in test_dataset.py::test_basic_actors[True] #21970

Merged: 11 commits into ray-project:master on Jan 31, 2022

Conversation

@ericl (Contributor) commented Jan 29, 2022

Why are these changes needed?

This test is flaky. The underlying reason is that Datasets was not forcing eager evaluation of its blocks in the pipeline. This meant that the read stage could overlap with the actor pool stage, which could deadlock due to resource contention.

The fix is to force read tasks to finish before the second stage executes, and to execute the stages from actors as a workaround for ownership issues: #20554. (A minimal sketch of the force-read idea appears after the repro command below.)

To reproduce the hangs/failures:

for i in `seq 1 10`; do pytest -v -s test_dataset.py::test_basic_actors[True]; done
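
For illustration, here is a minimal sketch of what forcing eager reads amounts to, assuming a Dataset-like object that exposes get_internal_block_refs() (the method name matches the snippet quoted in the review below; the helper itself is illustrative and not the PR's exact code):

import ray

def force_reads(ds):
    # Block until every read task has produced its block, so the read stage
    # cannot overlap with a later actor-pool stage.
    block_refs = ds.get_internal_block_refs()
    ray.wait(block_refs, num_returns=len(block_refs), fetch_local=False)
    return ds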

@ericl requested a review from scv119 as a code owner, January 29, 2022 02:36
@ericl (Contributor, Author) commented Jan 29, 2022

FYI @stephanie-wang @clarkzinzow

@clarkzinzow self-assigned this on Jan 29, 2022
@clarkzinzow (Contributor) commented:

If this is required for pipelining correctness, could the pipeline executor always force reads on the first stage? That way we wouldn't need to require each pipeline creation point to do it.
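
A rough sketch of that suggestion, reusing the same hypothetical force-read step as above (this executor code is invented for illustration, not the PR's implementation):

import ray

def execute_pipeline(read_stage, stages):
    # Hypothetical executor: 'read_stage' creates the dataset, 'stages' are
    # callables that each take and return a dataset-like object.
    dataset = read_stage()
    # Always finish the read tasks before any downstream stage (e.g. the
    # actor pool) is launched, so callers never have to force this themselves.
    refs = dataset.get_internal_block_refs()
    ray.wait(refs, num_returns=len(refs), fetch_local=False)
    for stage in stages:
        dataset = stage(dataset)
    return dataset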

@ericl changed the title from "[WIP] Try to fix test_dataset.py::test_basic_actors[True]" to "Deflake occasional deadlock in test_dataset.py::test_basic_actors[True]" on Jan 29, 2022
@ericl (Contributor, Author) commented Jan 29, 2022

@clarkzinzow fixed, this is now passing reliably and ready for review.

@clarkzinzow (Contributor) left a review:

LGTM, a few nits.

python/ray/data/dataset.py: two review threads (outdated, resolved)

A contributor commented on this snippet in python/ray/data/dataset.py:
doesn't read blocks from the datasource until the first transform.
"""
blocks = self.get_internal_block_refs()
bar = ProgressBar("Force reads", len(blocks))

Nice!

@@ -12,24 +12,31 @@
from ray.data.dataset_pipeline import DatasetPipeline


# Temporarily use an actor here to avoid ownership issues with tasks:
# https://github.com/ray-project/ray/issues/20554
@ray.remote(num_cpus=0, placement_group=None)
@clarkzinzow (Contributor) commented on this diff hunk, Jan 29, 2022:

Note that @stephanie-wang's PR that ports stage task launching to a threadpool will also fix this by launching all tasks from the driver, and she'll have to resolve some merge conflicts here. #21845

@ericl (Contributor, Author) replied:

This might be an alternative to the thread pools as well, though it's not tolerant of actor failure.
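
For context, a rough sketch of the actor-based workaround under discussion (the decorator arguments mirror the quoted diff; the class and method names are made up for illustration):

import ray

@ray.remote(num_cpus=0, placement_group=None)
class _StageRunner:
    # Running the stage inside this zero-CPU actor makes the actor, rather
    # than a possibly short-lived task, the owner of the objects the stage
    # produces (the ownership issue tracked in #20554).
    def run_stage(self, stage_fn, blocks):
        return stage_fn(blocks)

A caller would then do something like runner = _StageRunner.remote() followed by ray.get(runner.run_stage.remote(stage_fn, blocks)).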

@ericl (Contributor, Author) commented Jan 29, 2022

Separately, @stephanie-wang, it seems to me that we could deadlock during reconstruction if the stages end up overlapping. We might need to rethink the actor pool implementation to avoid this (i.e., actor workers need to release their resources when blocked on fetching data).

@clarkzinzow (Contributor):

@ericl Great point!

@ericl (Contributor, Author) commented Jan 29, 2022

On second thought, reconstruction isn't really supported with actor pools anyways, so perhaps we just need to make this clear in the documentation.

@clarkzinzow (Contributor) commented Jan 29, 2022

@ericl Don't actor tasks with the max_task_retries option specified support lineage-based reconstruction in current master, where the actor tasks are assumed to be idempotent? I know that actor tasks are not technically in scope for Q1 lineage reconstruction work, but I thought that the outstanding work was just an improved API and that the Datasets actor pool could possibly leverage the existing support.

@ericl (Contributor, Author) commented Jan 29, 2022

@clarkzinzow I don't think actors support reconstruction at all in any case.

@ericl added the tests-ok label (the tagger certifies test failures are unrelated and assumes personal liability) on Jan 29, 2022
@clarkzinzow (Contributor) commented Jan 29, 2022

@ericl The lineage reconstruction doc suggests that what I indicated above is the case, unless I'm missing something:

Actor tasks with the decorator max_task_retries are automatically retried if the actor dies. Currently, if lineage reconstruction is enabled, actors with this decorator may also get resubmitted tasks, even if the actor has not died, due to a downstream object being reconstructed. The assumption is that these tasks are idempotent so it is safe to re-execute them.

Not sure if normal tasks submitted by actor methods would be covered at the end of Q1 work, however (multiple owners/borrowed refs in lineage).

https://docs.google.com/document/d/1LVk6JFGmdgxzKOoR8lz90TBAIDhJTcsY2dkoCRRLSR8/edit?usp=drivesdk
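
For reference, the options being discussed are spelled like this on an actor definition (these are real Ray options, but the toy actor below is not from this PR):

import ray

@ray.remote(max_restarts=-1, max_task_retries=-1)
class BatchMapper:
    def apply(self, batch):
        # Assumed idempotent, so it is safe for Ray to re-execute this task
        # after an actor restart or during lineage reconstruction.
        return [x * 2 for x in batch]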

@ericl (Contributor, Author) commented Jan 30, 2022

Makes sense, but it wouldn't apply here since the actors are destroyed after stage execution.

@clarkzinzow (Contributor):

Hmm so the actors are destroyed once the executor goes out of scope, so this would be an issue for rewindowing, repeating, or splitting a pipeline. Makes sense.

@bveeramani (Member):

‼️ ACTION REQUIRED ‼️

We've switched our code formatter from YAPF to Black (see #21311).

To prevent issues with merging your code, here's what you'll need to do:

  1. Install Black:
pip install -I black==21.12b0
  2. Format changed files with Black:
curl -o format-changed.sh https://gist.githubusercontent.com/bveeramani/42ef0e9e387b755a8a735b084af976f2/raw/7631276790765d555c423b8db2b679fd957b984a/format-changed.sh
chmod +x ./format-changed.sh
./format-changed.sh
rm format-changed.sh
  3. Commit your changes:
git add --all
git commit -m "Format Python code with Black"
  4. Merge master into your branch:
git pull upstream master
  5. Resolve merge conflicts (if necessary).

After running these steps, you'll have the updated format.sh.

@ericl merged commit fe167c9 into ray-project:master on Jan 31, 2022