
Wasm workers sbrk race #18171

Open · wants to merge 1 commit into base: main

Conversation

@juj (Collaborator) commented Nov 9, 2022

This PR is an issue report in the form of a PR.

Contained is a test case that uncovers some kind of race condition in Wasm Workers + sbrk() memory growth usage.

The test case is the same crash as the test case from #18096 (comment) (thanks @debevv for providing this!), but minimized to its bare-bones elements.

The test case initially crashes on the assert addressed in #18170, but fixing that assert does not fully fix the test, hence the two separate PRs: #18170 alone is not enough to fix this test case.

The crash is a segfault abort in WASM_HEAP_STORE_i32_4_4(), where the memory store address is above the sbrk end address. This is really peculiar, since only one Worker performs any memory/malloc/sbrk operations in the whole test. I tried to use Chrome's DWARF debugger to figure out what is going wrong in the test case, but unfortunately it is not able to show the interesting information, so I will have to dig deeper with manual prints.

What is most peculiar about the test case is that it behaves highly nondeterministically, even though there is no apparent source for any nondeterminism. The whole test case is sequential: there is only one Worker doing linear work, and the main thread is asleep. Still, the Worker always crashes after a seemingly random number of calls to sbrk(), a different count every time.

Sometimes it crashes immediately, other times it crashes e.g. after ~10 seconds of malloc()ing.

The crash occurs regardless of whether emmalloc or dlmalloc is used as the allocator, so it is probably easier to investigate with emmalloc, which is smaller than dlmalloc. In any case, the crash is not about the allocator, but about some sbrk/wasm heap size interaction.

The crash is not an OOM; the segfault occurs well before reaching the 2GB heap growth limit. Sometimes the test even passes without reaching the limit at all.

Printing to the console does not affect how often the crash occurs, i.e. the issue does not seem to be timing sensitive. Hence one can compile with either -sMALLOC=emmalloc-verbose or -sMALLOC=emmalloc to observe the issue.

Likewise, the crash occurs at -O1 through -O3, but I haven't observed it happening at -O0.

And finally, here is the kicker: the crash never happens if one comments out the line emscripten_set_main_loop([](){}, 0, 1); on line 27 of the test case, which is odd. The crash should not be about EXIT_RUNTIME at least: removing emscripten_set_main_loop(), building with -sEXIT_RUNTIME=0 and calling emscripten_exit_with_live_runtime(); does not reproduce the crash either, so the issue should not be that the main thread is quitting the runtime out from under the Worker.
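
For context, here is a minimal sketch of the shape of test being described; this is not the actual test case in this PR, and the worker entry point, allocation loop, and build flags are illustrative assumptions only:

// Hypothetical reduction of the scenario described above, not the actual test
// from this PR. Build flags are assumptions based on the discussion, e.g.:
//   emcc test.c -O1 -sWASM_WORKERS -sALLOW_MEMORY_GROWTH -sSAFE_HEAP -sMALLOC=emmalloc
#include <emscripten/emscripten.h>
#include <emscripten/wasm_worker.h>
#include <stdlib.h>

// The Worker is the only thread that calls malloc()/sbrk() in the test.
static void worker_main(void)
{
  for (int i = 0; i < 100000; ++i)
    malloc(4096); // intentionally not freed, so malloc keeps calling sbrk() to grow the heap
}

static void tick(void) {} // empty main loop body (the original test uses a C++ lambda here)

int main(void)
{
  emscripten_wasm_worker_t worker = emscripten_malloc_wasm_worker(/*stackSize=*/1024);
  emscripten_wasm_worker_post_function_v(worker, worker_main);
  // Per the report, commenting out this call makes the crash disappear.
  emscripten_set_main_loop(tick, 0, 1);
}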


PR #18170 would be good to land already now. I'll splice that commit off of this PR afterwards, and land this PR once the test case here is figured out and passes.

@mannk123 (Contributor) commented Nov 10, 2022

I think I found the problem: there's a race condition caused by multiple threads calling sbrk at the same time.

Ironically, the problem here is caused by the SAFE_HEAP option itself!

The JS-side SAFE_HEAP_LOAD/STORE functions call sbrk(0):

function SAFE_HEAP_STORE(dest, value, bytes, isFloat) {
...
    var brk = _sbrk() >>> 0;
...
}

To be more explicit, the _sbrk() call above sets

*sbrk_ptr = *sbrk_ptr + 0

That races with the sbrk_ptr adjustment performed by malloc.

So, the race here is

  1. the worker thread calls malloc, which adjusts sbrk_ptr
  2. the main thread does SAFE_HEAP_LOAD/STORE, which reads sbrk_ptr and writes the same value back, in a non-atomic manner
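
To make the lost update concrete, here is a sketch of the interleaving against a simplified, non-thread-safe sbrk; the names and numbers are illustrative, not the actual Emscripten implementation:

#include <stdint.h>

// Simplified model of a non-thread-safe sbrk: plain read plus write-back.
static uintptr_t sbrk_ptr; // the current program break (the "sbrk end address")

void *my_sbrk(intptr_t increment)
{
  uintptr_t old = sbrk_ptr;   // (a) read
  sbrk_ptr = old + increment; // (b) write back, even when increment == 0
  return (void *)old;
}

// One possible interleaving of the two threads:
//   worker (malloc):  old = sbrk_ptr           // reads e.g. 0x100000
//   main (SAFE_HEAP): old = sbrk_ptr           // sbrk(0) also reads 0x100000
//   worker (malloc):  sbrk_ptr = old + 0x10000 // break moves to 0x110000
//   main (SAFE_HEAP): sbrk_ptr = old + 0       // stale write: break is back at 0x100000
// The worker's freshly allocated block now lies above the recorded break, which
// matches the reported abort where a store address is above the sbrk end address.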

The above also helps explain why commenting out the emscripten_set_main_loop line removes the crash.

The root cause here is that sbrk() is not thread safe. In fact, it already has an option for thread safety, but it's only enabled when __EMSCRIPTEN_PTHREADS__ is defined: in that case it adjusts sbrk_ptr using an atomic compare-and-set, and retries the adjustment when it detects a race.

The simple fix is to enable that same thread-safety protection when __EMSCRIPTEN_WASM_WORKERS__ is defined. I have submitted the suggested fix as a separate PR, #18174 (not sure if that's the best way to submit such a patch).
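
For illustration, here is a sketch of that kind of guard extension, with a compare-and-swap retry loop; it mirrors the idea rather than the literal patch in #18174, and the names are simplified placeholders:

#include <stdatomic.h>
#include <stdint.h>

// Enable the thread-safe path for both pthreads and Wasm Workers builds.
#if defined(__EMSCRIPTEN_PTHREADS__) || defined(__EMSCRIPTEN_WASM_WORKERS__)
#define SBRK_MT_SAFE 1
#endif

static _Atomic uintptr_t sbrk_break; // placeholder for the real break location

void *sbrk_sketch(intptr_t increment)
{
#ifdef SBRK_MT_SAFE
  // Compare-and-swap loop: if another thread moved the break between our read
  // and our write, the CAS fails, 'old' is refreshed, and we try again.
  // (The real implementation also validates the range and grows wasm memory.)
  uintptr_t old = atomic_load(&sbrk_break);
  while (!atomic_compare_exchange_weak(&sbrk_break, &old, old + increment)) {
    // retry with the freshly observed break value
  }
  return (void *)old;
#else
  // Single-threaded build: a plain read-modify-write is sufficient.
  uintptr_t old = sbrk_break;
  sbrk_break = old + increment;
  return (void *)old;
#endif
}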

@juj (Collaborator, Author) commented Nov 11, 2022

Awesome find!

@kleisauke (Collaborator) commented:

There are a couple more __EMSCRIPTEN_PTHREADS__ guards within the system libraries that remain even after the above PR lands. For example:

int __lockfile(FILE *f)
{
#if defined(__EMSCRIPTEN_PTHREADS__)

void __unlockfile(FILE *f)
{
#if defined(__EMSCRIPTEN_PTHREADS__)

I wonder if we should use __EMSCRIPTEN_SHARED_MEMORY__ everywhere the code is not related to the pthread API?

Also, some of these are a bit trickier to resolve; see for example issue #13194. That one specifically requires setting libc.threaded / libc.need_locks and the file locks for stdin / stdout / stderr in musl when the first worker is started. Though I'm aware that this can also be done at compile time, see e.g. commit 27c2a3a (however, that may lead to redundant file locking if no workers are running).

@juj (Collaborator, Author) commented Nov 14, 2022

"There are a couple more __EMSCRIPTEN_PTHREADS__ guards within the system libraries after the above PR lands."
"I wonder if we should use __EMSCRIPTEN_SHARED_MEMORY__ everywhere if it's not related to the pthread API?"

For example the code in __lockfile.c does depend on pthreads:

int owner = f->lock, tid = __pthread_self()->tid;

So I think that code should not be switched to use __EMSCRIPTEN_SHARED_MEMORY__ instead of __EMSCRIPTEN_PTHREADS__.

I don't know particularly well what the original use case for the __EMSCRIPTEN_SHARED_MEMORY__ feature (sans __EMSCRIPTEN_PTHREADS__ or __EMSCRIPTEN_WASM_WORKERS__) is, so I dug up the original motivation comment/description that I recall, at:

#12833 (comment)

where it is stated

"We had a user that asked for emscripten to emit atomics-supporting code (memory is shared, atomic operations are used when declared as such) but did not want to use the pthread API."

So it seems that in that mode, we would want to enable thread-safe operation in all of these locations in __EMSCRIPTEN_SHARED_MEMORY__ builds, just without depending on any pthreads machinery. I think those locations may need a non-pthreads lock implementation.
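
To illustrate the direction, here is a minimal sketch of such a non-pthreads lock built only on C11 atomics; this is a plain busy-spin lock with hypothetical names, not a proposal for the actual musl integration (which would likely want a futex-style wait, e.g. via the Wasm Workers lock primitives, instead of pure spinning):

#include <stdatomic.h>

typedef struct {
  atomic_int locked; // 0 = free, 1 = held
} nonpthread_lock_t;

#define NONPTHREAD_LOCK_INITIALIZER { 0 }

static inline void nonpthread_lock(nonpthread_lock_t *l)
{
  // Spin until we win the 0 -> 1 transition.
  int expected = 0;
  while (!atomic_compare_exchange_weak_explicit(&l->locked, &expected, 1,
                                                memory_order_acquire,
                                                memory_order_relaxed)) {
    expected = 0; // the failed CAS overwrote 'expected' with the current value
  }
}

static inline void nonpthread_unlock(nonpthread_lock_t *l)
{
  atomic_store_explicit(&l->locked, 0, memory_order_release);
}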

kleisauke added a commit to kleisauke/emscripten that referenced this pull request Nov 14, 2022
@kleisauke (Collaborator) commented:

Ah, you're right. We should probably use emscripten_wasm_worker_self_id() there.

I just opened a separate PR for that, see #18201.
