Wasm workers sbrk race #18171
Think I found the problem - there's a race condition caused by multiple threads calling sbrk at the same time. Ironically, the problem here is caused by the SAFE_HEAP option itself! The JS-side SAFE_HEAP_LOAD/STORE functions are calling sbrk(0).
To be more explicit, the _sbrk() call above sets
That races with the sbrk_ptr adjustment performed by malloc. So, the race here is
The above also helps explain why commenting out the emscripten_set_main_loop line removes the crash.
The root cause here is that sbrk() is not thread safe. In fact, it already provides an option for thread safety, but that is only enabled when __EMSCRIPTEN_PTHREADS__ is defined: in that case it adjusts sbrk_ptr using an atomic compare-and-swap, and retries the adjustment when it detects a race.
The simple fix is to enable that same thread safety protection when __EMSCRIPTEN_WASM_WORKERS__ is defined. I have submitted a separate PR #18174 with the suggested fix (not sure if that's the best way to submit such a patch).
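To illustrate the compare-and-swap retry pattern described above, here is a minimal sketch in C11. It is not the actual Emscripten sbrk() source: demo_brk, demo_sbrk, DEMO_HEAP_LIMIT and DEMO_SBRK_FAILED are hypothetical names, and the "heap" is just an integer offset rather than real Wasm memory.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real heap limit and failure value. */
#define DEMO_HEAP_LIMIT ((uintptr_t)1 << 20)
#define DEMO_SBRK_FAILED ((uintptr_t)-1)

/* Hypothetical program break, shared by all threads. */
static _Atomic uintptr_t demo_brk = 1024;

/* Moves the break by `increment` and returns the previous break,
 * or DEMO_SBRK_FAILED if the new break would be out of range. */
uintptr_t demo_sbrk(intptr_t increment) {
  uintptr_t old_brk = atomic_load(&demo_brk);
  for (;;) {
    assert(old_brk <= DEMO_HEAP_LIMIT);
    uintptr_t new_brk = old_brk + (uintptr_t)increment;
    /* Reject wrap-around in either direction and growth past the limit. */
    if ((increment > 0 && new_brk < old_brk) ||
        (increment < 0 && new_brk > old_brk) ||
        new_brk > DEMO_HEAP_LIMIT) {
      return DEMO_SBRK_FAILED;
    }
    /* Publish new_brk only if no other thread moved the break since we
     * read it. On failure old_brk is refreshed with the current value
     * and the loop recomputes new_brk from it. */
    if (atomic_compare_exchange_weak(&demo_brk, &old_brk, new_brk)) {
      return old_brk;
    }
  }
}
```

With a plain load followed by a plain store, two threads could both read the same break and one adjustment would be lost; the retry loop is what closes that window. Note that a read-only caller such as demo_sbrk(0) must also avoid writing back a stale value, which the compare-and-swap guarantees here.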
Awesome find!
There are a couple more emscripten/system/lib/libc/musl/src/stdio/__lockfile.c Lines 4 to 6 in e9ba359
emscripten/system/lib/libc/musl/src/stdio/__lockfile.c Lines 21 to 23 in e9ba359
I wonder if we should use Also, some of these are a bit trickier to resolve, see for example issue #13194. That one specifically requires setting
For example the code in __lockfile.c does depend on pthreads:
So I think that code should not be switched to use __EMSCRIPTEN_SHARED_MEMORY__ instead of __EMSCRIPTEN_PTHREADS__. I don't know particularly well what the original use case for the __EMSCRIPTEN_SHARED_MEMORY__ feature (sans __EMSCRIPTEN_PTHREADS__ or __EMSCRIPTEN_WASM_WORKERS__) is, so I dug up the original motivation comment/description I recall at: where it is stated "We had a user that asked for emscripten to emit atomics-supporting code (memory is shared, atomic operations are used when declared as such) but did not want to use the pthread API."
So it seems that in that mode, we would want to enable thread-safe operation of all these locations in __EMSCRIPTEN_SHARED_MEMORY__ builds, but without depending on any pthreads machinery. I think those locations may need a non-pthreads lock implementation.
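As a rough idea of what a non-pthreads lock could look like, here is a minimal spinlock sketch built only on C11 atomics. This is an assumption about one possible shape, not code from Emscripten or musl: demo_spinlock, the demo_spin_* functions and demo_guarded_increment are hypothetical names, and a real implementation would likely want to wait rather than spin where blocking is allowed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical lock type that needs shared memory and atomics, but no pthreads. */
typedef struct { atomic_flag held; } demo_spinlock;
#define DEMO_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* Returns 1 if the lock was acquired, 0 if another thread holds it. */
int demo_spin_trylock(demo_spinlock *lock) {
  assert(lock != NULL);
  return !atomic_flag_test_and_set_explicit(&lock->held, memory_order_acquire);
}

/* Busy-waits until the lock is acquired. */
void demo_spin_lock(demo_spinlock *lock) {
  while (!demo_spin_trylock(lock)) {
    /* spin */
  }
}

void demo_spin_unlock(demo_spinlock *lock) {
  assert(lock != NULL);
  atomic_flag_clear_explicit(&lock->held, memory_order_release);
}

/* Example of guarding shared state with the lock. */
static demo_spinlock demo_counter_lock = DEMO_SPINLOCK_INIT;
static long demo_guarded_counter = 0;

long demo_guarded_increment(void) {
  demo_spin_lock(&demo_counter_lock);
  long value = ++demo_guarded_counter;
  demo_spin_unlock(&demo_counter_lock);
  return value;
}
```

The acquire/release ordering makes writes done inside the critical section visible to the next thread that takes the lock. Spinning is the simplest option that works on any thread, at the cost of burning CPU while contended.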
Also, fix Node.js compatibility while we are here. See: emscripten-core#18171 (comment)
Ah, you're right. We should probably use I just opened a separate PR for that, see #18201.
This PR is an issue report in the form of a PR.
Contained is a test case that uncovers some kind of race condition in Wasm Workers + sbrk() memory growth usage.
The test case is the same crash as the test case from #18096 (comment) (thanks @debevv for providing this!), but minified to bare bones elements.
The test case initially crashes to an assert in #18170 , but fixing that assert does not fully fix the test, hence posting two different PRs, since #18170 is not enough to fix this test case.
The crash aborts on segfault in WASM_HEAP_STORE_i32_4_4() where the memory store address is above the sbrk end address. This is really peculiar since there is only one Worker that is performing any memory/malloc/sbrk operations in the whole test. Tried to use Chrome's DWARF debugger to figure out what is going wrong in the test case, but unfortunately it is not able to show the interesting information - have to dig deeper with manual prints.
What is most peculiar about the test case is that it shows a large amount of nondeterminism, even though there is no apparent source for it. The whole test case is sequential, as there is only one Worker doing linear work, and the main thread is asleep. Still, the Worker always crashes after a seemingly random number of calls to sbrk(), every time on a different count.
Sometimes it crashes immediately, other times it crashes e.g. after ~10 seconds of malloc()ing.
The crash occurs independently of using emmalloc or dlmalloc as the allocator, so probably easier to utilize emmalloc as it is smaller than dlmalloc. But the crash is not about the allocator, but something about sbrk/wasm heap size interaction.
The crash is not an OOM: the segfault occurs well before reaching the 2GB heap growth limit. Sometimes the test does pass without reaching the limit at all.
Printing stuff to console does not disrupt the crash frequency, i.e. it does not seem to be timing sensitive. Hence one can compile with -sMALLOC=emmalloc-verbose or -sMALLOC=emmalloc to observe the issue.
Likewise the crash occurs with -O1 ... -O3. But haven't observed it happening with -O0.
And finally, here is the kicker: the crash never happens if one comments out the line
emscripten_set_main_loop([](){}, 0, 1);
on line 27 of the test case, which is odd. The crash should not be about EXIT_RUNTIME at least, since removing emscripten_set_main_loop() but building with -sEXIT_RUNTIME=0 and calling emscripten_exit_with_live_runtime(); does not produce the crash - so the issue should not be that the main thread would be quitting the runtime on the Worker.
PR #18170 would be good to land already now. I'll splice that commit off of this PR afterwards, and land this PR when the test case here is figured out to pass.