
Deflake occasional deadlock in test_dataset.py::test_basic_actors[True] #21970

Merged: 11 commits into ray-project:master on Jan 31, 2022

Conversation

@ericl (Contributor) commented Jan 29, 2022

Why are these changes needed?

This test is flaky. The underlying reason is that Datasets was not forcing eager evaluation of its blocks in the pipeline. This meant that the read stage could overlap with the actor pool stage, which could deadlock due to resource contention.

The fix is to force read tasks to finish before the second stage executes, and to execute the stages from actors as a workaround for ownership issues: #20554. (A minimal sketch of the force-read idea appears after the repro command below.)

To reproduce the hangs/failures:

for i in `seq 1 10`; do pytest -v -s test_dataset.py::test_basic_actors[True]; done
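
For illustration, here is a minimal sketch of what forcing eager reads amounts to, assuming a Dataset-like object that exposes get_internal_block_refs() (the method name matches the snippet quoted in the review below; the helper itself is illustrative and not the PR's exact code):

import ray

def force_reads(ds):
    # Block until every read task has produced its block, so the read stage
    # cannot overlap with a later actor-pool stage.
    block_refs = ds.get_internal_block_refs()
    ray.wait(block_refs, num_returns=len(block_refs), fetch_local=False)
    return ds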

@ericl requested a review from scv119 as a code owner, January 29, 2022 02:36
@ericl (Contributor, Author) commented Jan 29, 2022

FYI @stephanie-wang @clarkzinzow

@clarkzinzow self-assigned this on Jan 29, 2022
@clarkzinzow (Contributor) commented:

If this is required for pipelining correctness, could the pipeline executor always force reads on the first stage? That way we wouldn't need to require each pipeline creation point to do it.
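
A rough sketch of that suggestion, reusing the same hypothetical force-read step as above (this executor code is invented for illustration, not the PR's implementation):

import ray

def execute_pipeline(read_stage, stages):
    # Hypothetical executor: 'read_stage' creates the dataset, 'stages' are
    # callables that each take and return a dataset-like object.
    dataset = read_stage()
    # Always finish the read tasks before any downstream stage (e.g. the
    # actor pool) is launched, so callers never have to force this themselves.
    refs = dataset.get_internal_block_refs()
    ray.wait(refs, num_returns=len(refs), fetch_local=False)
    for stage in stages:
        dataset = stage(dataset)
    return dataset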

@ericl changed the title from "[WIP] Try to fix test_dataset.py::test_basic_actors[True]" to "Deflake occasional deadlock in test_dataset.py::test_basic_actors[True]" on Jan 29, 2022
@ericl (Contributor, Author) commented Jan 29, 2022

@clarkzinzow fixed, this is now passing reliably and ready for review.

@clarkzinzow (Contributor) left a review:

LGTM, a few nits.

python/ray/data/dataset.py: two review threads (outdated, resolved)

A contributor commented on this snippet in python/ray/data/dataset.py:
doesn't read blocks from the datasource until the first transform.
"""
blocks = self.get_internal_block_refs()
bar = ProgressBar("Force reads", len(blocks))

Nice!

@@ -12,24 +12,31 @@
from ray.data.dataset_pipeline import DatasetPipeline


# Temporarily use an actor here to avoid ownership issues with tasks:
# https://github.com/ray-project/ray/issues/20554
@ray.remote(num_cpus=0, placement_group=None)
@clarkzinzow (Contributor) commented on this diff hunk, Jan 29, 2022:

Note that @stephanie-wang's PR that ports stage task launching to a threadpool will also fix this by launching all tasks from the driver, and she'll have to resolve some merge conflicts here. #21845

@ericl (Contributor, Author) replied:

This might be an alternative to the thread pools as well, though it's not tolerant of actor failure.
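
For context, a rough sketch of the actor-based workaround under discussion (the decorator arguments mirror the quoted diff; the class and method names are made up for illustration):

import ray

@ray.remote(num_cpus=0, placement_group=None)
class _StageRunner:
    # Running the stage inside this zero-CPU actor makes the actor, rather
    # than a possibly short-lived task, the owner of the objects the stage
    # produces (the ownership issue tracked in #20554).
    def run_stage(self, stage_fn, blocks):
        return stage_fn(blocks)

A caller would then do something like runner = _StageRunner.remote() followed by ray.get(runner.run_stage.remote(stage_fn, blocks)).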

@ericl (Contributor, Author) commented Jan 29, 2022

Separately, @stephanie-wang, it seems to me that we could deadlock during reconstruction if the stages end up overlapping. We might need to rethink the actor pool implementation to avoid this (i.e., actor workers need to release their resources when blocked on fetching data).

@clarkzinzow (Contributor):

@ericl Great point!

@ericl (Contributor, Author) commented Jan 29, 2022

On second thought, reconstruction isn't really supported with actor pools anyways, so perhaps we just need to make this clear in the documentation.

@clarkzinzow (Contributor) commented Jan 29, 2022

@ericl Don't actor tasks with the max_task_retries option specified support lineage-based reconstruction in current master, where the actor tasks are assumed to be idempotent? I know that actor tasks are not technically in scope for Q1 lineage reconstruction work, but I thought that the outstanding work was just an improved API and that the Datasets actor pool could possibly leverage the existing support.

@ericl (Contributor, Author) commented Jan 29, 2022

@clarkzinzow I don't think actors support reconstruction at all in any case.

@ericl added the tests-ok label (the tagger certifies test failures are unrelated and assumes personal liability) on Jan 29, 2022
@clarkzinzow (Contributor) commented Jan 29, 2022

@ericl The lineage reconstruction doc suggests that what I indicated above is the case, unless I'm missing something:

Actor tasks with the decorator max_task_retries are automatically retried if the actor dies. Currently, if lineage reconstruction is enabled, actors with this decorator may also get resubmitted tasks, even if the actor has not died, due to a downstream object being reconstructed. The assumption is that these tasks are idempotent so it is safe to re-execute them.

Not sure if normal tasks submitted by actor methods would be covered at the end of Q1 work, however (multiple owners/borrowed refs in lineage).

https://docs.google.com/document/d/1LVk6JFGmdgxzKOoR8lz90TBAIDhJTcsY2dkoCRRLSR8/edit?usp=drivesdk
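
For reference, the options being discussed are spelled like this on an actor definition (these are real Ray options, but the toy actor below is not from this PR):

import ray

@ray.remote(max_restarts=-1, max_task_retries=-1)
class BatchMapper:
    def apply(self, batch):
        # Assumed idempotent, so it is safe for Ray to re-execute this task
        # after an actor restart or during lineage reconstruction.
        return [x * 2 for x in batch]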

@ericl (Contributor, Author) commented Jan 30, 2022

Makes sense, but it wouldn't apply here since the actors are destroyed after stage execution.

@clarkzinzow (Contributor):

Hmm so the actors are destroyed once the executor goes out of scope, so this would be an issue for rewindowing, repeating, or splitting a pipeline. Makes sense.

@bveeramani (Member):

‼️ ACTION REQUIRED ‼️

We've switched our code formatter from YAPF to Black (see #21311).

To prevent issues with merging your code, here's what you'll need to do:

  1. Install Black:
pip install -I black==21.12b0
  2. Format changed files with Black:
curl -o format-changed.sh https://gist.githubusercontent.com/bveeramani/42ef0e9e387b755a8a735b084af976f2/raw/7631276790765d555c423b8db2b679fd957b984a/format-changed.sh
chmod +x ./format-changed.sh
./format-changed.sh
rm format-changed.sh
  3. Commit your changes:
git add --all
git commit -m "Format Python code with Black"
  4. Merge master into your branch:
git pull upstream master
  5. Resolve merge conflicts (if necessary).

After running these steps, you'll have the updated format.sh.

@ericl merged commit fe167c9 into ray-project:master on Jan 31, 2022