[Merged by Bors] - Fix asset_debug_server hang. There should be at most one ThreadExecut… #7825

shuoli84 · 2023-02-26T12:56:44Z

…or's ticker for one thread.

Objective

Fix debug_asset_server hang.

Solution

Reuse the thread_local executor for the MainThreadExecutor resource, so there is only one ThreadExecutor for the main thread.
If two ThreadTickers come from the same executor, they conflict with each other, so only one of them is ticked.
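
The "only tick one" rule can be sketched with a pointer-identity check. This is a minimal illustration, not the code in this PR: ToyThreadExecutor and executors_to_tick are hypothetical names, and the assumption is that "same executor" means "same allocation".

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a per-thread executor; not bevy_tasks' real type.
struct ToyThreadExecutor {
    name: &'static str,
}

impl ToyThreadExecutor {
    /// Assumption: two references denote the same executor when they point
    /// at the same allocation.
    fn is_same(&self, other: &Self) -> bool {
        std::ptr::eq(self, other)
    }
}

/// Returns the names of the executors whose tickers should be ticked.
/// The external executor is skipped when it is the scope executor itself.
fn executors_to_tick(
    scope: &ToyThreadExecutor,
    external: &ToyThreadExecutor,
) -> Vec<&'static str> {
    let mut names = vec![scope.name];
    if !external.is_same(scope) {
        names.push(external.name);
    }
    names
}

fn main() {
    let main_exec = Arc::new(ToyThreadExecutor { name: "main" });
    let shared = Arc::clone(&main_exec); // second handle, same executor
    let other = Arc::new(ToyThreadExecutor { name: "other" });

    println!("{:?}", executors_to_tick(&main_exec, &shared)); // ["main"]
    println!("{:?}", executors_to_tick(&main_exec, &other)); // ["main", "other"]
}
```

Comparing addresses rather than contents keeps the check cheap and avoids needing any equality notion on the executor itself.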

hymm

Are you sure both these changes are necessary? The MainThreadExecutor doesn't get cloned into debug asset server app.

crates/bevy_tasks/src/task_pool.rs

hymm · 2023-02-26T22:57:17Z

Is this only a problem with tick and not with run? We technically have more executors than just the thread executors running on each thread. We also have the shared multithreaded executor and the thread local executors too.

Is it possible to write a test in bevy_tasks that shows that this deadlocks before this PR?

james7132 · 2023-02-27T00:55:48Z

@NiklasEi can you check if this fixes your use of the debug asset server? I want to make sure this fixes that key issue before giving this a more thorough review.

shuoli84 · 2023-02-27T01:25:55Z

Are you sure both these changes are necessary? The MainThreadExecutor doesn't get cloned into debug asset server app.

There are two main changes:
1. If two tickers are generated from the same executor, tick only one of them.
This fixes the deadlock for the following code, but is not enough for debug_asset_server.

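use bevy_app::App;
use bevy_ecs::prelude::*;

// System on the outer app: updates the nested app from inside the outer schedule.
fn run_sub_app(mut sub_app: NonSendMut<DebugApp>) {
    sub_app.app.update();
}

// Holds the nested app so it can be stored as a non-send resource.
struct DebugApp {
    app: App,
}

fn main() {
    let mut app = App::new();

    let sub_app = App::new();
    app.insert_non_send_resource(DebugApp { app: sub_app });
    app.add_system(run_sub_app);

    // Without the fix, this update is where the deadlock shows up.
    app.update();
}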

2. Share the ThreadExecutor instance. Changes 1 and 2 together fix the deadlock for debug_asset_server.

Is this only a problem with tick and not with run? We technically have more executors than just the thread executors running on each thread. We also have the shared multithreaded executor and the thread local executors too.

Is it possible to write a test in bevy_tasks that shows that this deadlocks before this PR?

Yes, it is possible to write a test to reproduce the deadlock. I'll give it a try.

This bug is hard to reason about; I spent 20+ hours on it :'). The only fact I am sure of is: "if the Ticker gets leaked, then the async_executor enters a "troubled" state in which it can't be notified." But this doesn't guarantee a deadlock: if the thread can be unparked by any other means, it is still able to proceed. I tried creating a separate thread that just unparks the main thread, and with that it was also able to run.
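
The "separate thread that just unparks the main thread" experiment can be sketched with the standard library alone. This is an illustration of the parking mechanism only, under the assumption that the stuck thread is waiting in a park call; park_until_helper_unparks is a hypothetical helper and no executor is involved.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Parks the calling thread until a helper thread unparks it.
/// Returns true once the caller has resumed.
fn park_until_helper_unparks(delay: Duration) -> bool {
    let parked_thread = thread::current();
    let resumed = Arc::new(AtomicBool::new(false));
    let helper_flag = Arc::clone(&resumed);

    // Helper thread: its only job is to wake the parked thread.
    let helper = thread::spawn(move || {
        thread::sleep(delay);
        helper_flag.store(true, Ordering::SeqCst);
        parked_thread.unpark();
    });

    // park may return spuriously, so the flag is the real exit condition.
    while !resumed.load(Ordering::SeqCst) {
        thread::park_timeout(Duration::from_millis(500));
    }

    helper.join().expect("helper thread panicked");
    true
}

fn main() {
    let resumed = park_until_helper_unparks(Duration::from_millis(20));
    println!("main thread resumed: {resumed}");
}
```

An external unpark like this only masks the missed notification; it does not restore the executor's own wake-up path.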

Is this only a problem with tick and not with run?

In theory run is also affected, since it is just a wrapper on top of ticker. It probably didn't happen only because we don't have such a code path.

EDIT: format

shuoli84 · 2023-02-27T07:18:04Z

Just added an example; without the fix, it would block. You can try disabling the check by returning false in conflict_with.

Cargo.toml

shuoli84 · 2023-02-28T11:00:36Z

I think I figured out the details. The deadlock is caused by the following steps.

1 The outer executor ticks with ticker.tick().or(ticker.tick()), which is how the code on the main branch works. It actually spins on three futures: ticker_1, ticker_2, and work_future, as in the following code:

let forever = async {
    loop {
        ticker_1.tick().or(ticker_2.tick()).await;
    }
};
// block_on takes a single future, so the tick loop is raced against the work future.
future::block_on(forever.or(work_future));

2 On the first run, ticker_1, ticker_2 and work_future all return pending. That means this thread is parked on work_future, and ticker_1 and ticker_2 are allocated and not freed.
3 Then a task is dispatched to this executor, which also unparks the main thread. So ticker_1 can proceed: it gets the task to run and calls runnable.run on it. At the same time, ticker_2 is still alive, which means the executor is in the notified state.
4 The task that ticker_1 is running (or polling) is the sub-app schedule code, which also calls into MultiThreadExecutor and finally does a future::block_on on the systems' run. So the main thread is parked again, on more or less the same stack as the outer app. The most significant difference is that the current executor is in the 'notified' state.
5 Here is the last but most important step: the ComputePool spawns a task (the exclusive system run) onto the ThreadExecutor and waits for it. But the ThreadExecutor is already notified, so this spawn cannot unpark the main thread, and the deadlock happens.

Back to the code fix: if we replace ticker_1.tick().or(ticker_2.tick()) with ticker_1.tick(), then at step 3, when ticker_1 gets a task, there is no hanging ticker_2, which puts the executor back in the unnotified state. The spawn in step 5 is then able to notify the executor and unpark the main thread successfully.
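
The lost wake-up in steps 3 to 5 can be illustrated with a toy model. This is a sketch under loud assumptions: ToyExecutor, notify and clear are hypothetical names, and the single boolean is a simplification rather than async_executor's real bookkeeping. It only shows why a flag stuck in the "notified" state swallows the next wake-up.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Toy stand-in for an executor's wake-up state (hypothetical, simplified).
struct ToyExecutor {
    notified: AtomicBool,
}

impl ToyExecutor {
    fn new() -> Self {
        ToyExecutor {
            notified: AtomicBool::new(false),
        }
    }

    /// Called when a task is spawned. Returns true if a wake-up would be
    /// sent: only the false -> true transition wakes a parked thread.
    fn notify(&self) -> bool {
        !self.notified.swap(true, Ordering::SeqCst)
    }

    /// Assumption: with a single ticker, picking up the task resets the flag.
    fn clear(&self) {
        self.notified.store(false, Ordering::SeqCst);
    }
}

fn main() {
    let exec = ToyExecutor::new();

    // Step 3: the first spawn finds the flag unset and wakes the thread.
    let first = exec.notify();
    // Step 5 without the fix: the flag is still set, so no wake-up is sent.
    let second = exec.notify();
    // With the fix: the flag is reset before the nested spawn happens.
    exec.clear();
    let third = exec.notify();

    println!("first spawn wakes the thread: {first}"); // true
    println!("nested spawn, stale flag: {second}"); // false
    println!("nested spawn, flag cleared: {third}"); // true
}
```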

EDIT: format

shuoli84 · 2023-02-28T12:50:04Z

@hymm @james7132 ping

hymm · 2023-02-28T17:46:50Z

I think I figured out the details. The deadlock is caused by the following steps.

This explanation makes sense to me. Thanks for figuring it out. So my PR #7564 fixes things by not reusing the executor, so the inner executor doesn't get into the weird state, while this PR fixes things by never having the second ticker in an or. My test code in the other PR's comments never deadlocked, because I needed to add a second executor that the outer schedule is using.

I'm pretty sure I prefer the change in this PR. We don't need to keep recreating the scope executor and we're no longer doing the double ticking which always felt a little weird.

In the longer term, this seems to be a bug in async executor and we should consider upstreaming a fix.

crates/bevy_tasks/src/thread_executor.rs

NiklasEi

@NiklasEi can you check if this fixes your use of the debug asset server? I want to make sure this fixes that key issue before giving this a more thorough review.

I just checked it and yes, this PR also fixes my stuck integration tests 👍

james7132 · 2023-02-28T21:09:50Z

Just wanted to quickly chime in and thank @shuoli84 for digging into this rather complex bug. I'll leave a full review soon. Definitely want this fix before 0.10 goes live.

hymm

Code looks good to me now. The logic for which tickers need to be ticked in scope is getting a little complicated, so it'd be nice to have some unit tests for that, but not going to block on that. The multiple tickers code should be getting removed when we remove !Send resources from the world.

crates/bevy_tasks/src/thread_executor.rs

james7132

Sans a few code quality nits, this looks good to me. Great work!

crates/bevy_tasks/src/task_pool.rs

crates/bevy_tasks/src/thread_executor.rs

shuoli84 · 2023-03-02T04:18:16Z

Just opened PR #7865, which basically runs the load_gltf example with an extra feature; the CI captures the hang. So it should be enough for now. I will continue the test validation in that PR.

hymm · 2023-03-02T06:38:13Z

crates/bevy_tasks/src/task_pool.rs

                let scope_ticker = scope_executor.ticker().unwrap();
-                if let Some(external_ticker) = external_executor.ticker() {
-                    if tick_task_pool_executor {
+                let external_ticker = if !external_executor.is_same(scope_executor) {


Nice change. This is definitely easier to follow.

cart · 2023-03-02T08:40:07Z

bors r+

bors · 2023-03-02T08:56:30Z

Pull request successfully merged into main.

Build succeeded:

Fix asset_debug_server hang. There should be at most one ThreadExecut…

9206146

…or's ticker for one thread.

shuoli84 mentioned this pull request Feb 26, 2023

create a new scope executor for every scope #7564

Closed

fix clippy

3bc434c

james7132 requested a review from hymm February 26, 2023 13:26

james7132 added C-Bug An unexpected or incorrect behavior A-Tasks Tools for parallel and async work labels Feb 26, 2023

alice-i-cecile added this to the 0.10 milestone Feb 26, 2023

fix wasm build

cb19298

hymm reviewed Feb 26, 2023

crates/bevy_tasks/src/task_pool.rs Outdated

crates/bevy_tasks/src/task_pool.rs Outdated

move ticker conflict check out of loop

f15ac4f

james7132 requested a review from NiklasEi February 27, 2023 00:54

Add nested app example to show ThreadExecutor deadlock. Disable the c…

6d22cc5

…onflict check, it would block.

fix ci

6d45a18

hymm reviewed Feb 27, 2023

Cargo.toml Outdated

shuoli84 added 2 commits February 28, 2023 20:40

clean code for review

f6b7b45

fix ci

46bf6e6

hymm reviewed Feb 28, 2023

crates/bevy_tasks/src/thread_executor.rs Outdated

NiklasEi reviewed Feb 28, 2023

shuoli84 added 2 commits March 1, 2023 10:16

add same executor checking to existing executor branching code

4f9af7f

remove unused code

45beaba

hymm approved these changes Mar 1, 2023

crates/bevy_tasks/src/thread_executor.rs Outdated

james7132 requested changes Mar 1, 2023

crates/bevy_tasks/src/task_pool.rs Outdated

crates/bevy_tasks/src/thread_executor.rs Outdated

NiklasEi mentioned this pull request Mar 1, 2023

Update to Bevy 0.10 NiklasEi/bevy_asset_loader#106

Merged

8 tasks

fix comment

c2bd158

shuoli84 mentioned this pull request Mar 2, 2023

ci able to override how example runs #7865

Closed

shuoli84 added 2 commits March 2, 2023 14:26

remove default branch from match

90e1d0b

refine code a bit

9312549

shuoli84 requested a review from james7132 March 2, 2023 06:37

james7132 approved these changes Mar 2, 2023

james7132 added the S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it label Mar 2, 2023

hymm reviewed Mar 2, 2023

cart approved these changes Mar 2, 2023

bors bot changed the title ~~Fix asset_debug_server hang. There should be at most one ThreadExecut…~~ [Merged by Bors] - Fix asset_debug_server hang. There should be at most one ThreadExecut… Mar 2, 2023

bors bot closed this Mar 2, 2023

james7132 mentioned this pull request Mar 2, 2023

Feature debug_asset_server causes freeze/infinite loop on startup #7563

Closed

janhohenheim mentioned this pull request Mar 14, 2023

The game sometimes gets stuck loading scene assets janhohenheim/foxtrot#129

Closed

hymm mentioned this pull request Oct 5, 2023

Manually running a schedule in a sub-world hangs #10032

Closed
