
Wasm workers sbrk race #18171

Open · wants to merge 1 commit into base: main

Conversation

@juj (Collaborator) commented Nov 9, 2022

This PR is an issue report in the form of a PR.

Contained is a test case that uncovers some kind of race condition in Wasm Workers + sbrk() memory growth usage.

The test case is the same crash as the test case from #18096 (comment) (thanks @debevv for providing this!), but minimized to its bare-bones elements.

The test case initially crashes on the assert addressed in #18170, but fixing that assert does not fully fix the test, hence the two separate PRs: #18170 alone is not enough to fix this test case.

The crash is a segfault abort in WASM_HEAP_STORE_i32_4_4(), where the memory store address is above the sbrk end address. This is really peculiar, since only one Worker performs any memory/malloc/sbrk operations in the whole test. I tried to use Chrome's DWARF debugger to figure out what is going wrong in the test case, but unfortunately it is not able to show the interesting information, so I will have to dig deeper with manual prints.

What is most peculiar about the test case is that it behaves highly nondeterministically, even though there is no apparent source for any nondeterminism. The whole test case is sequential: there is only one Worker doing linear work, and the main thread is asleep. Still, the Worker always crashes after a seemingly random number of calls to sbrk(), a different count every time.

Sometimes it crashes immediately, other times it crashes e.g. after ~10 seconds of malloc()ing.

The crash occurs regardless of whether emmalloc or dlmalloc is used as the allocator, so it is probably easier to investigate with emmalloc, which is smaller than dlmalloc. In any case, the crash is not about the allocator, but about some sbrk/wasm heap size interaction.

The crash is not an OOM; the segfault occurs well before reaching the 2GB heap growth limit. Sometimes the test even passes without reaching the limit at all.

Printing to the console does not affect how often the crash occurs, i.e. the issue does not seem to be timing sensitive. Hence one can compile with either -sMALLOC=emmalloc-verbose or -sMALLOC=emmalloc to observe the issue.

Likewise, the crash occurs at -O1 through -O3, but I haven't observed it happening at -O0.

And finally, here is the kicker: the crash never happens if one comments out the line emscripten_set_main_loop([](){}, 0, 1); on line 27 of the test case, which is odd. The crash should not be about EXIT_RUNTIME at least: removing emscripten_set_main_loop(), building with -sEXIT_RUNTIME=0 and calling emscripten_exit_with_live_runtime(); does not reproduce the crash either, so the issue should not be that the main thread is quitting the runtime out from under the Worker.
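
For context, here is a minimal sketch of the shape of test being described; this is not the actual test case in this PR, and the worker entry point, allocation loop, and build flags are illustrative assumptions only:

// Hypothetical reduction of the scenario described above, not the actual test
// from this PR. Build flags are assumptions based on the discussion, e.g.:
//   emcc test.c -O1 -sWASM_WORKERS -sALLOW_MEMORY_GROWTH -sSAFE_HEAP -sMALLOC=emmalloc
#include <emscripten/emscripten.h>
#include <emscripten/wasm_worker.h>
#include <stdlib.h>

// The Worker is the only thread that calls malloc()/sbrk() in the test.
static void worker_main(void)
{
  for (int i = 0; i < 100000; ++i)
    malloc(4096); // intentionally not freed, so malloc keeps calling sbrk() to grow the heap
}

static void tick(void) {} // empty main loop body (the original test uses a C++ lambda here)

int main(void)
{
  emscripten_wasm_worker_t worker = emscripten_malloc_wasm_worker(/*stackSize=*/1024);
  emscripten_wasm_worker_post_function_v(worker, worker_main);
  // Per the report, commenting out this call makes the crash disappear.
  emscripten_set_main_loop(tick, 0, 1);
}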


PR #18170 would be good to land already now. I'll splice that commit off of this PR afterwards, and land this PR once the test case here is figured out and passes.

@mannk123 (Contributor) commented Nov 10, 2022

I think I found the problem: there's a race condition caused by multiple threads calling sbrk at the same time.

Ironically, the problem here is caused by the SAFE_HEAP option itself!

The JS-side SAFE_HEAP_LOAD/STORE functions call sbrk(0):

function SAFE_HEAP_STORE(dest, value, bytes, isFloat) {
...
    var brk = _sbrk() >>> 0;
...
}

To be more explicit, the _sbrk() call above sets

*sbrk_ptr = *sbrk_ptr + 0

That races with the sbrk_ptr adjustment performed by malloc.

So, the race here is

  1. the worker thread calls malloc, which adjusts sbrk_ptr
  2. the main thread does SAFE_HEAP_LOAD/STORE, which reads sbrk_ptr and writes the same value back, in a non-atomic manner
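
To make the lost update concrete, here is a sketch of the interleaving against a simplified, non-thread-safe sbrk; the names and numbers are illustrative, not the actual Emscripten implementation:

#include <stdint.h>

// Simplified model of a non-thread-safe sbrk: plain read plus write-back.
static uintptr_t sbrk_ptr; // the current program break (the "sbrk end address")

void *my_sbrk(intptr_t increment)
{
  uintptr_t old = sbrk_ptr;   // (a) read
  sbrk_ptr = old + increment; // (b) write back, even when increment == 0
  return (void *)old;
}

// One possible interleaving of the two threads:
//   worker (malloc):  old = sbrk_ptr           // reads e.g. 0x100000
//   main (SAFE_HEAP): old = sbrk_ptr           // sbrk(0) also reads 0x100000
//   worker (malloc):  sbrk_ptr = old + 0x10000 // break moves to 0x110000
//   main (SAFE_HEAP): sbrk_ptr = old + 0       // stale write: break is back at 0x100000
// The worker's freshly allocated block now lies above the recorded break, which
// matches the reported abort where a store address is above the sbrk end address.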

The above also helps explain why commenting out the emscripten_set_main_loop line removes the crash.

The root cause here is that sbrk() is not thread safe. In fact, it already has an option for thread safety, but it's only enabled when __EMSCRIPTEN_PTHREADS__ is defined: in that case it adjusts sbrk_ptr using an atomic compare-and-set, and retries the adjustment when it detects a race.

The simple fix is to enable that same thread-safety protection when __EMSCRIPTEN_WASM_WORKERS__ is defined. I have submitted the suggested fix as a separate PR, #18174 (not sure if that's the best way to submit such a patch).
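
For illustration, here is a sketch of that kind of guard extension, with a compare-and-swap retry loop; it mirrors the idea rather than the literal patch in #18174, and the names are simplified placeholders:

#include <stdatomic.h>
#include <stdint.h>

// Enable the thread-safe path for both pthreads and Wasm Workers builds.
#if defined(__EMSCRIPTEN_PTHREADS__) || defined(__EMSCRIPTEN_WASM_WORKERS__)
#define SBRK_MT_SAFE 1
#endif

static _Atomic uintptr_t sbrk_break; // placeholder for the real break location

void *sbrk_sketch(intptr_t increment)
{
#ifdef SBRK_MT_SAFE
  // Compare-and-swap loop: if another thread moved the break between our read
  // and our write, the CAS fails, 'old' is refreshed, and we try again.
  // (The real implementation also validates the range and grows wasm memory.)
  uintptr_t old = atomic_load(&sbrk_break);
  while (!atomic_compare_exchange_weak(&sbrk_break, &old, old + increment)) {
    // retry with the freshly observed break value
  }
  return (void *)old;
#else
  // Single-threaded build: a plain read-modify-write is sufficient.
  uintptr_t old = sbrk_break;
  sbrk_break = old + increment;
  return (void *)old;
#endif
}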

@juj (Collaborator, Author) commented Nov 11, 2022

Awesome find!

@kleisauke (Collaborator) commented:

There are a couple more __EMSCRIPTEN_PTHREADS__ guards within the system libraries that remain even after the above PR lands. For example:

int __lockfile(FILE *f)
{
#if defined(__EMSCRIPTEN_PTHREADS__)

void __unlockfile(FILE *f)
{
#if defined(__EMSCRIPTEN_PTHREADS__)

I wonder if we should use __EMSCRIPTEN_SHARED_MEMORY__ everywhere the code is not related to the pthread API?

Also, some of these are a bit trickier to resolve; see for example issue #13194. That one specifically requires setting libc.threaded / libc.need_locks and the file locks for stdin / stdout / stderr in musl when the first worker is started. Though I'm aware that this can also be done at compile time, see e.g. commit 27c2a3a (however, that may lead to redundant file locking if no workers are running).

@juj (Collaborator, Author) commented Nov 14, 2022

"There are a couple more __EMSCRIPTEN_PTHREADS__ guards within the system libraries after the above PR lands."
"I wonder if we should use __EMSCRIPTEN_SHARED_MEMORY__ everywhere if it's not related to the pthread API?"

For example the code in __lockfile.c does depend on pthreads:

int owner = f->lock, tid = __pthread_self()->tid;

So I think that code should not be switched to use __EMSCRIPTEN_SHARED_MEMORY__ instead of __EMSCRIPTEN_PTHREADS__.

I don't know particularly well what the original use case for the __EMSCRIPTEN_SHARED_MEMORY__ feature (sans __EMSCRIPTEN_PTHREADS__ or __EMSCRIPTEN_WASM_WORKERS__) is, so I dug up the original motivation comment/description that I recall, at:

#12833 (comment)

where it is stated

"We had a user that asked for emscripten to emit atomics-supporting code (memory is shared, atomic operations are used when declared as such) but did not want to use the pthread API."

So it seems that in that mode, we would want to enable thread-safe operation in all of these locations in __EMSCRIPTEN_SHARED_MEMORY__ builds, just without depending on any pthreads machinery. I think those locations may need a non-pthreads lock implementation.
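
To illustrate the direction, here is a minimal sketch of such a non-pthreads lock built only on C11 atomics; this is a plain busy-spin lock with hypothetical names, not a proposal for the actual musl integration (which would likely want a futex-style wait, e.g. via the Wasm Workers lock primitives, instead of pure spinning):

#include <stdatomic.h>

typedef struct {
  atomic_int locked; // 0 = free, 1 = held
} nonpthread_lock_t;

#define NONPTHREAD_LOCK_INITIALIZER { 0 }

static inline void nonpthread_lock(nonpthread_lock_t *l)
{
  // Spin until we win the 0 -> 1 transition.
  int expected = 0;
  while (!atomic_compare_exchange_weak_explicit(&l->locked, &expected, 1,
                                                memory_order_acquire,
                                                memory_order_relaxed)) {
    expected = 0; // the failed CAS overwrote 'expected' with the current value
  }
}

static inline void nonpthread_unlock(nonpthread_lock_t *l)
{
  atomic_store_explicit(&l->locked, 0, memory_order_release);
}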

kleisauke added a commit to kleisauke/emscripten that referenced this pull request Nov 14, 2022
@kleisauke (Collaborator) commented:

Ah, you're right. We should probably use emscripten_wasm_worker_self_id() there.

I just opened a separate PR for that, see #18201.
