Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"memory access out of bounds"/"operation does not support unaligned accesses" when using mutexes/condition_variables #12971

Closed
emaxx-google opened this issue Dec 5, 2020 · 18 comments

Comments

@emaxx-google
Copy link

In a program that uses std::thread, std::mutex and std::condition_variable, I'm frequently observing crashes with different messages but in a roughly the same piece of code. The exceptions observed so far are:

(1):

module.worker.js:24 RuntimeError: operation does not support unaligned accesses
    at __pthread_mutex_lock (module.wasm:wasm-function[13025]:0x1ee6ca)
    at std::__2::__libcpp_mutex_lock(pthread_mutex_t*) (module.wasm:wasm-function[12706]:0x1e72bb)
    at std::__2::mutex::lock() (module.wasm:wasm-function[12714]:0x1e7435)
    at std::__2::unique_lock<std::__2::mutex>::unique_lock(std::__2::mutex&) (module.wasm:wasm-function[476]:0xe414)
    at PushToReadBuffer(unsigned char const*, long long, bool*) (module.wasm:wasm-function[4361]:0x9cdb2)

(2):

RuntimeError: memory access out of bounds
    at __private_cond_signal (module.wasm:wasm-function[13038]:0x1ef24b)
    at pthread_cond_broadcast (module.wasm:wasm-function[13027]:0x1ee708)
    at std::__2::__libcpp_condvar_broadcast(pthread_cond_t*) (module.wasm:wasm-function[11543]:0x1d22ad)
    at std::__2::condition_variable::notify_all() (module.wasm:wasm-function[11542]:0x1d22a4)
    at PushToReadBuffer(unsigned char const*, long long, bool*) (module.wasm:wasm-function[4361]:0x9d047)

(3):

RuntimeError: operation does not support unaligned accesses
    at __private_cond_signal (module.wasm:wasm-function[13038]:0x1ef24b)
    at pthread_cond_broadcast (module.wasm:wasm-function[13027]:0x1ee708)
    at std::__2::__libcpp_condvar_broadcast(pthread_cond_t*) (module.wasm:wasm-function[11543]:0x1d22ad)
    at std::__2::condition_variable::notify_all() (module.wasm:wasm-function[11542]:0x1d22a4)
    at PushToReadBuffer(unsigned char const*, long long, bool*) (module.wasm:wasm-function[4361]:0x9d047)

I'm using Emscripten 2.0.6. I've tried various debugging flags (ASSERTIONS=2, SAFE_HEAP=1, PTHREADS_DEBUG=1, RUNTIME_LOGGING=1, O0, g4), but without luck.

Unfortunately, I don't have a minimal repro yet.

@sbc100
Copy link
Collaborator

sbc100 commented Dec 5, 2020

Is this a regression? (i.e. does the code in question work with older versions of emscripten error-free?)

@emaxx-google
Copy link
Author

emaxx-google commented Dec 5, 2020

I don't know. Basically, I'm migrating codebase that used to work under Native Client; this code was never ran under Emscripten before.
I was trying to write synthetic minified examples to repro this, but I'm hitting only #12970 most of the time.

@makslevental
Copy link

@emaxx-google did you ever resolve this issue? I'm having basically the same problem. One thing I don't understand is that the docs explicitly say that webasm can handle unaligned access

In WebAssembly, unaligned loads and stores will work.

and I'm pretty sure I am not in fact compiling to asm.js

@emaxx-google
Copy link
Author

emaxx-google commented Jun 22, 2021

Nope, I'm still struggling with this and other random crashes (some of them are more understood, like #13194, others are harder to even localize), and trying to debug it causes even more issues (debugger is useless due to crbug.com/1179045, ASan due to #13205 and https://groups.google.com/g/emscripten-discuss/c/QERnFWUT0_A/m/jMbw2VQ1BwAJ, etc.). So we're pretty much stuck with this, despite that NaCl deprecation requires us to find the solution ~soon. I've kept writing small test programs of increasing compelxity, hoping to catch the bug in a small repro some day...

@kripken
Copy link
Member

kripken commented Jun 22, 2021

Wasm supports unaligned accesses in almost all operations, except for but SIMD and some atomics.

To debug that, one option is to print out the address in the crashing function right before the crash. But almost certainly it is a proper crash, and indeed unaligned. Then tracing back to find where it was allocated in the source would be the next step.

@emaxx-google
Copy link
Author

emaxx-google commented Jul 7, 2021

@kripken

To debug that, one option is to print out the address in the crashing function right before the crash. But almost certainly it is a proper crash, and indeed unaligned. Then tracing back to find where it was allocated in the source would be the next step.

Thanks, but this approach is problematic due to #13194. But, yeah, I tried this a few months ago, and in "lucky" attempts in which the printf output didn't get garbled I was able to get some logs - however the root cause is still unclear (the original source code seems innocent, unless of course I'm still overlooking the bug in it).

@kripken
Copy link
Member

kripken commented Jul 8, 2021

If #13194 is blocking you, then until that is fixed I think you can work around that issue by calling a custom function that wraps around fprintf, and that function always locks as it runs. Conceptually

int fprintf_really_locked(FILE, fmt, ...) {
  static mutex_t mutex;
  mutex.lock();
  int ret = fprintf(FILE, fmt, ...);
  mutex.unlock();
  return ret;
}

If you have very many calls to fprintf and it's hard to replace them in the source, you can automate that by actually replacing fprintf itself. If fprintf is defined in the linker inputs, then fprintf.o will not be linked in, and your version will be used. You can clone the fprintf.c file in emscripten's system/ folder, add the mutex there, and include that as a linker input.

@emaxx-google
Copy link
Author

emaxx-google commented Jul 8, 2021

Thank you for the suggestion, @kripken. Unfortunately, I already tried that in the past, and it results in hanging DevTools, so that I can't see any logs at all. (I just quickly tested it again, using newer Emscripten 2.0.24 and Chrome 91, and the hanging issue is still there.)
I gave up trying to make sense of it (could it be a side effect of the same crash, e.g., with the memory corruption causing the main thread to crash before releasing the mutex?..), and returned back to living with #13194, which is unreliable but at least doesn't hang. Still, the original root cause is mysterious - and the lack of good debugging tools makes me continue struggling with this since last year...

@kripken
Copy link
Member

kripken commented Jul 8, 2021

Sorry to hear that @emaxx-google . Has a devtools issue been filed? (seems odd that devtools would hang when the app works otherwise) cc @pfaffe

Separately, is devtools necessary to debug this? If you can run your app in node, that might be easier anyhow.

@emaxx-google
Copy link
Author

emaxx-google commented Jul 8, 2021

I'll try to make a small repro for the devtools issue (IIRC, simple test programs I tried didn't hang, which means the issue is probably related to the memory corruption). Thanks for reminding me of that.

The option with running in node was pretty challenging (since my application is using Extensions APIs that need to be stubbed out under Node, and since I'm using the Closure Library which apparently doesn't play with Node nicely), and unfortunately after making all needed workarounds I got the exact same random crashes as I was observing in DevTools. But if you have some tips regarding extracting additional debugging information from Node, them I'm happy to repeat this exercise.

@emaxx-google
Copy link
Author

emaxx-google commented Jul 9, 2021

Hrm, after commenting out pthread_attr_setstacksize() calls in my code the crashes seem to have disappeared... That's pretty mysterious, given that this call was requesting more stack (262144 instead of the default 81920).

Can there be a bug in the pthread_attr_setstacksize() implementation? Where can one find this function's source code - is it a verbatim copy of Musl's code, or something else?

@emaxx-google
Copy link
Author

Another thing that looks suspicious - pthread_attr_getstacksize() returns 81920 in my program. Does that look right? Based on https://groups.google.com/g/emscripten-discuss/c/MgHWuq2oq7Q and #14177, I presume the default stack should be 5MB, so much bigger than the value I'm getting.

Can it be that pthread_attr_getstacksize/pthread_attr_setstacksize are returning some fake results, e.g., from the library_pthread_stub.c file? (CC @pfaffe who was wondering about the similar.)

@pfaffe
Copy link
Collaborator

pfaffe commented Jul 9, 2021

That seems to be expected. 81920 is the default stack size in musl.

As far as I can see, the DEFAULT_PTHREAD_STACK_SIZE is only applied when you pass a NULL pthread_attr_t* argument, and only on pthread_create.

@emaxx-google
Copy link
Author

I see, thanks. Then I'm out of ideas - how increasing the stack size can lead to memory corruption. I'd appreciate any hints from the Emscripten team for how to proceed with debugging this issue.

A side note: DevTools still "hang" when I'm wrapping each printf/std::cerr usage with a mutex, even now that I don't experience crashes. (But this doesn't seem to be easy to repro in a small test program, so still trying to understand what's the essential ingredient for this bug.)

@kripken
Copy link
Member

kripken commented Jul 9, 2021

If increasing the stack size leads to memory corruption, it might be that the corruption exists before but the stack size adjustment happens to lead to it actually being noticeable - bad luck, basically. So ASan is likely our best bet. I spent some time on #13205 now and opened #14611 which I believe fixes it.

@emaxx-google
Copy link
Author

emaxx-google commented Jul 9, 2021

@kripken: Thanks! Looking forward to the fix to #13205. I'm hoping it'll resolve the "AddressSanitizer: nested bug in the same thread, aborting" error that I'm experiencing right now, which renders ASan useless. (Unfortunately, when I tried to cherry-pick your fix, that error was still present, but I'm still hoping that I just did the cherry-picking wrong.)

P.S. @pfaffe also helped me today to debug my application, significantly improving the debugging ergonomics (viewable source, locals, even in the stdlib, global memory inspection, etc). Sadly, we didn't manage to find the culprit, so either fixing ASan or using some kind of experimental breakpoint-on-write would probably be needed...

@kleisauke
Copy link
Collaborator

You can clone the fprintf.c file in emscripten's system/ folder, add the mutex there, and include that as a linker input.

Note that one can also use the --wrap= linker option, which allows users to easily insert wrapper functions around existing functions. For example:

Details

printf-thread-safe.c:

#include <pthread.h>
#include <stdarg.h>
#include <stdio.h>

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

int __wrap_printf(const char *format, ...) {
    int rc;

    va_list args;
    va_start(args, format);
    pthread_mutex_lock(&mutex);
    rc = vprintf(format, args);
    pthread_mutex_unlock(&mutex);
    va_end(args);
    return rc;
}

int __wrap_fprintf(FILE *stream, const char *format, ...) {
    int rc;

    va_list args;
    va_start(args, format);
    pthread_mutex_lock(&mutex);
    rc = vfprintf(stream, format, args);
    pthread_mutex_unlock(&mutex);
    va_end(args);
    return rc;
}

testcase.c:

#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

pthread_t thread[2];

char *char_repeat(int n, char c) {
    char *dest = malloc(n + 1);
    memset(dest, c, n);
    dest[n] = '\0';
    return dest;
}

void *thread_main(void *arg) {
    char *msg = char_repeat(100, 'a');
    for (int i = 0; i < 10; ++i)
        printf("%s\n", msg);
    free(msg);
    return 0;
}

int main() {
    int rc = pthread_create(&thread[0], NULL, thread_main, NULL);
    assert(rc == 0);
    rc = pthread_create(&thread[1], NULL, thread_main, NULL);
    assert(rc == 0);
    void *thread_rtn = 0;
    rc = pthread_join(thread[0], &thread_rtn);
    assert(rc == 0);
    assert(thread_rtn == 0);
    rc = pthread_join(thread[1], &thread_rtn);
    assert(rc == 0);
    assert(thread_rtn == 0);

    return 0;
}
$ emcc testcase.c printf-thread-safe.c -o test.js -pthread -sEXIT_RUNTIME -sPTHREAD_POOL_SIZE=2 -Wl,--wrap=printf -Wl,--wrap=fprintf
$ node --experimental-wasm-threads --experimental-wasm-bulk-memory test.js
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

@emaxx-google
Copy link
Author

emaxx-google commented Jul 15, 2021

Thanks everyone for the ideas and tips!

It seems that the problem was, apparently, the too small stack size on one of my application's threads. I'm still trying to understand how much stack exactly do I need and why it's bigger than when running as native code or under NaCl, and why leaving the default (when not calling pthread_attr_setstacksize) was hiding the problem. Another thing that confused me was that ASan doesn't properly detect stack overflows under Emscripten - this is now tracked in #14628.

Given that the original issue reported here in #12971 is resolved (and appears to have nothing to do with mutexes/condition_variables specifically), I'm closing the bug. If anyone else is affected by similar issues they should probably file separate reports to avoid the confusion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants