Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WorkerThreadPool: Revamp interaction with ScriptServer #96959

Merged
merged 4 commits into from
Sep 16, 2024

Conversation

RandomShaper
Copy link
Member

@RandomShaper RandomShaper commented Sep 13, 2024

First, this reverts 2d1dd41 from #96760. The approach there is not enough, given that with separate thread rendering, there's a cyclic dependency chain on shutdown:

  • WorkerThreadPool must finish before ScriptServer.
  • RenderingServer must finish before WorkerThreadPool.
  • ScriptServer must finish before RenderingServer.

A way of breaking the cycle, which this PR implements, is splitting WorkerThreadPool regular lifetime into two three🆕 "runlevels":

  • One that works as usual.
  • One that waits until all the threads are idle, with no more tasks queued, plus adding more tasks blocked (not rejected, just they have to wait). 🆕
  • One that also works normally, but where languages have already been terminated.

I've put the pure refactor into its own commit, which would help a lot for bisect and also for easier reviewing.

It should still fix #95809 (CC @CedNaru).

Fixes #96931.
Fixes #96978. 🆕
Fixes #97073. 🆕🆕


Cherry-picking note: We don't have MutexLock::temp_unlock()/temp_relock() in 4.3. Therefore, the most recent commit in this PR needs to have the following patch applied:

diff --git a/core/object/worker_thread_pool.cpp b/core/object/worker_thread_pool.cpp
index 37a605cb10..2f0e3d4c2a 100644
+++ b/core/object/worker_thread_pool.cpp
@@ -233,11 +233,11 @@ void WorkerThreadPool::_post_tasks(Task **p_tasks, uint32_t p_count, bool p_high
        // in custom builds.
        bool process_on_calling_thread = threads.size() == 0;
        if (process_on_calling_thread) {
-               p_lock.temp_unlock();
+               task_mutex.unlock();
                for (uint32_t i = 0; i < p_count; i++) {
                        _process_task(p_tasks[i]);
                }
-               p_lock.temp_relock();
+               task_mutex.lock();
                return;
        }

@@ -576,9 +576,9 @@ bool WorkerThreadPool::_handle_runlevel(ThreadData *p_thread_data, MutexLock<Bin
                } break;
                case RUNLEVEL_EXIT_LANGUAGES: {
                        if (!p_thread_data->exited_languages) {
-                               p_lock.temp_unlock();
+                               task_mutex.unlock();
                                ScriptServer::thread_exit();
-                               p_lock.temp_relock();
+                               task_mutex.lock();
                                p_thread_data->exited_languages = true;
                                runlevel_data.exit_languages.num_exited_threads++;
                                control_cond_var.notify_all();

@RandomShaper RandomShaper added bug topic:core cherrypick:4.3 Considered for cherry-picking into a future 4.3.x release labels Sep 13, 2024
@RandomShaper RandomShaper added this to the 4.4 milestone Sep 13, 2024
@RandomShaper RandomShaper requested a review from a team as a code owner September 13, 2024 12:57
@CedNaru
Copy link
Contributor

CedNaru commented Sep 13, 2024

Ho so it's not a full revert of the previous PR. We still have the bookkeeping done by the ScriptServer.

My bad, I thought I tested with the multithreaded renderer, but it seems I messed up (Doesn't change that I still tested the branch against 4.3-stable and confirmed the difference in behaviour).

Otherwise, it seems to be an improvement again. Now the WorkerThreadPool calls thread_exit() directly instead of letting the Thread do it itself, just like with thread_enter().

So if I understand, the order of execution is:

  • Godot starts

  • WorkThreadPool is initialized before all servers (including ScriptServer and so no thread_enter() possible)

  • Servers are started, and their loop added as a task to the pool if multithreaded.

  • ScriptServer and its languages are initialized.

  • Whenever a new task is started by the pool or that a long-running one yields, thread_enter() is called if not done yet.

  • Godot is closing and cleanup happens.

  • WorkThreadPool calls exit_languages_threads(), which change the "mode" of the pool. From now, the logic is inverted and any task finishing or yielding will call thread_exit().

  • ScriptServer terminates.

  • The rest of the servers terminate too.

  • WorkThreadPool is deleted last.

Now my doubt: What happens if tasks are still queued when we switch to RUNLEVEL_EXIT_LANGUAGES mode. Won't it mean that those tasks will run after thread_exit() was called. The ScriptServer (and its languages) is still up at this point, it's possible the remaining tasks execute code from another language during that short window.

@clayjohn
Copy link
Member

Testing locally I can confirm that this fixes #96931

@RandomShaper
Copy link
Member Author

@CedNaru, very good assessment. To address the issue about outstanding scripting tasks, I've added a pre-langs-exit runlevel that blocks the addition of new tasks and waits until the queues are empty.

I'm not quite happy with the added complexity even before the last push. For all the reviewers: the new push performs some additional refactoring so the new logic is more self-contained and can be more easily replaced or further unified in the future. I have considered many approaches and this is the only one that would work.

@RandomShaper RandomShaper force-pushed the revamp_languages_exit branch 6 times, most recently from 16d431b to f96ab59 Compare September 16, 2024 12:02
@CedNaru
Copy link
Contributor

CedNaru commented Sep 16, 2024

Indeed, that quite a bit of complexity added to handle the ScriptServer within the pool. If I may ask, wouldn't that be possible to invert the logic and make that the ScriptServer, the one with multiple "run levels" instead of the WorkerThreadPool ?

The core of the issue is that threads are started before the ScriptServer, and terminated after the ScriptServer. There are probably a lot of reasons why the ScriptServer has to be initialized last and ended first (I can already think of a few myself), Wouldn't a two-stage (before and after servers) initialization and termination of the ScriptServer do the trick ?

Sadly, I don't see it being possible without breaking changes to Language (and would probably force languages to register at the core phase at least) but asking anyway.

- The main thread function and the collaborative wait functions have a much more similar structure than earlier, which yields (pun intended) better maintainability.
- Also, there are not assertions anymore about the reason for ending a wait being valid, because spurious awakes can happen and so the assert would fail without that indicating an issue.
@RandomShaper
Copy link
Member Author

Sounds intriguing, but I'm not sure it'd be worth the hassle at this point. I've simplified a bit the code so the end result is not that daunting and, provided it does the trick and maintainability is good, I'd deem this one worth mergeable. Let's see if CI succeeds now (there was a missing initializer causing logic errors).

@adamscott
Copy link
Member

adamscott commented Sep 16, 2024

Tested out a single-thread export with this PR and it fixes the issues that currently occur on master.

@RandomShaper You could add this PR as fixing #97073.

Copy link
Member

@clayjohn clayjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested locally and confirmed that it is working for the MT servers

@clayjohn
Copy link
Member

I'm going to go ahead and merge this as we accidentally merged #95915 first which will introduce the deadlock to many more projects.

@clayjohn clayjohn merged commit 48403b5 into godotengine:master Sep 16, 2024
20 checks passed
@clayjohn
Copy link
Member

Thank you for the quick turnaround!

@RandomShaper RandomShaper deleted the revamp_languages_exit branch September 17, 2024 07:07
@RandomShaper
Copy link
Member Author

Added a note about cherry-picking for 4.3 to the main description.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug cherrypick:4.3 Considered for cherry-picking into a future 4.3.x release topic:core
Projects
None yet
4 participants