Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing thread exit and thread cancellation #10524

Merged
merged 1 commit into from
Feb 16, 2021

Conversation

abujalski
Copy link
Contributor

@abujalski abujalski commented Feb 21, 2020

This PR aim at fixing some pthread implementation defects (in JavaScript glue layer) discovered when running POSIX tests from http://posixtest.sourceforge.net/ compiled to WebAssembly. It is done by fixes thread cancellation/exit methods and also unifying way how thread structures are updated when thread is canceled or exits.

Following problems are addressed:

  1. heap-use-after-free on std::thread heap-use-after-free on std::thread #12985
    Function cleanupThread() from src/library_pthread.js
    unconditionally set pthread.self member of pthread structure to 0.
    Problem was that cleanupThread() function might be called after
    that pthread structure has been deleted. To fix this problem, setting
    pthread.self field is now guarded by check if thread data hasn't
    been already freed.
  2. pthread_cond_wait/2-3 test hang
    • Disabling recursive thread cancellation.
    • Allowing __timedwait_cp to be true cancellation point as _cp
      suffix suggests
  3. pthread_getschedparam/1-3 test hangs sometimes:
    In pthread_barrier_wait adding check if lock is held by main thread
    and waiting on futex in small slices of time there, to check if
    there is some work to do on behalf of Worker Threads.

Copy link
Collaborator

@sbc100 sbc100 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Thanks for working on this.

.gitignore Outdated Show resolved Hide resolved
tests/runner.py Outdated Show resolved Hide resolved
__pthread_setcancelstate(PTHREAD_CANCEL_MASKED, &cs);
if (cs == PTHREAD_CANCEL_DISABLE) __pthread_setcancelstate(cs, 0);
#endif
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its sad to see emscripten differing from upstream here ... can you explain and maybe add comments as to why the default implementation doesn't work for us?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like musl bug. According to POSIX spec https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html "2.9.5 Thread Cancellation" pthread_cond_timedwait and pthread_cond_wait funcitons are cancellation points.

Test pthread_cond_wait/2-3 from Open POSIX suite checks such behavior and cancels thread waiting on conditional variable. With masked cancellation this test hangs.

Copy link
Collaborator

@juj juj Feb 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, if there is a musl bug, did you mark this as a #ifndef __EMSCRIPTEN__ with the intent to "let the native side retain its buggy behavior to not break anything there" and fix the bug for Emscripten? Or was the divergence for some other reason?

This code does implement a __pthread_testcancel() call in the beginning of this file. It is not wrong for the implementation to mask cancellation afterwards (to my interpretation of the spec). What kind of behavior are you seeing in the test suite with regards to this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I wanted to mark Emscripten related changes/fixes with this macro.

Problem is that test pthread_cond_wait/2-3 hangs, because there is scenario when thread waiting on condition variable is being cancelled. So if we set PTHREAD_CANCEL_MASKED for such thread it won't be cancelled.

I'll check behaviour of native musl with this test (pthread_cond_wait/2-3), maybe there is some other cancelation mechanism (like signals)...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've checked musl-1.2.0 behaviour and yes there is a bug there.

Original idea for this patch was to use PTHREAD_CANCEL_MASKED state distinguish:

  • waits on functions which are non-cancellation points (e.g. pthread_mutex_lock) and can
    be canceled only when user wants asynchronous cancellation (i.e. sets thread
    cancelability type to PTHREAD_CANCEL_ASYNCHRONOUS). Such functions call __timedwait() to perform waiting.
  • waits on functions which are defined as cancellation points (pthread_cond_wait).
    Such functions call __timedwait_cp()

This test was done by setting cancellation state to PTHREAD_CANCEL_MASKED and using check:

    int can_cancel_wait = pthread_self()->canceldisable == PTHREAD_CANCEL_ENABLE ||
                         (pthread_self()->canceldisable == PTHREAD_CANCEL_MASKED &&
                          pthread_self()->cancelasync == PTHREAD_CANCEL_ASYNCHRONOUS);

in __timedwait_cp.

Original musl implementation which disables thread cancelability internally during call to waits (__timedwait), disables also user request to allow thread to be asynchronously canceled (Cancelability Type set PTHREAD_CANCEL_ASYNCHRONOUS). This causes that test pthread_setcanceltype/1-1 fails with native musl implementation.

As there are many functions (~30) in musl which in their implementation disables thread cancellation and additionally flag PTHREAD_CANCEL_MASKED was added to avoid some cancellation race in pthread_cond_wait in commit https://git.musl-libc.org/cgit/musl/commit/?id=8741ffe625363a553e8f509dc3ca7b071bdbab47, so in acd89cd I changed test for Emscripten to keep changes in musl minimal.

tests/test_posixtestsuite.py Outdated Show resolved Hide resolved
tests/test_posixtestsuite.py Outdated Show resolved Hide resolved
(function () {
Module['onExit'] = function(status) {
console.log('POSIX test finished with status:' + status);
reportResultToServer(status);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its really a shame this doesn't work like this just out of the box for all browser tests.. but I know its complicated.

# '-I', posix_test_dir('pthread_attr_destroy')])


@requires_threads
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These test are marked as requires_threads but the above ones are not?

I wonder if we could run the other ones outside of browser tests?

For that matter, given that we now support node threads I wonder if we could make all of these just regular tests that run without the browser?

tests/test_posixtestsuite.py Outdated Show resolved Hide resolved
src/library_pthread.js Outdated Show resolved Hide resolved
if (is_main_thread || pthread_self()->cancelasync == PTHREAD_CANCEL_ASYNCHRONOUS) {
int can_cancel_wait = pthread_self()->canceldisable == PTHREAD_CANCEL_ENABLE ||
(pthread_self()->canceldisable == PTHREAD_CANCEL_MASKED &&
pthread_self()->cancelasync == PTHREAD_CANCEL_ASYNCHRONOUS);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's store pthread_self() to its own variable, there is no guarantee that it would be inlined.

@@ -80,9 +83,13 @@ int __timedwait(volatile int *addr, int val,
clockid_t clk, const struct timespec *at, int priv)
{
int cs, r;
__pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &cs);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, was this something that I had incorrectly added outside an #ifdef __EMSCRIPTEN__ block?

Copy link
Contributor Author

@abujalski abujalski Feb 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, now this call has been moved to #else block.

'-I', posix_test_dir('pthread_create')])


# Test throws exception: need to fix thread function si
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is soo much copy-paste in this file. :( The original test suite enumerated files in directories to run all tests in the suite, can we follow that approach to avoid hand-maintaining this file for each test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In 4c4fd41 I've refactored this to generate test methods based on list of valid Test Cases.

My idea is to have all valid TCs (valid in browser environment) listed in this file. In this approach we avoid running TCs which won't work in with Emscripten (i.e. ones testing fork and signals). WDYT?

@juj
Copy link
Collaborator

juj commented Feb 24, 2020

Open POSIX test suite is licensed under GPLv2, https://github.com/juj/posixtestsuite/blob/master/COPYING , which to my interpretation is not compatible with Emscripten licensing.

While my interpretation is that using the test suite does not require user to license the project it is used for under GPLv2 (similar to the way that using Linux or its GPLv2 tools does not require one to license whatever those tools are used for under GPLv2), but when developing the test suite, any changes to the test suite must be released under GPLv2. Because if this requirement, it is a big gray area to marry Emscripten's test suite tightly together with the Open POSIX test suite; or have a GPLv2 submodule in a repository that is not licensed under GPLv2. This was the reason why the original work was a maintained in a separate repository - keep all changes applied to it public and released under GPLv2, while retaining a strict boundary to what is Emcripten and what is the test suite.

Hence we should probably not merge these test suites directly together; though this is great work here you are doing!

@abujalski
Copy link
Contributor Author

While my interpretation is that using the test suite does not require user to license the project it is used for under GPLv2 (similar to the way that using Linux or its GPLv2 tools does not require one to license whatever those tools are used for under GPLv2), but when developing the test suite, any changes to the test suite must be released under GPLv2.

@juj Yes, but in my understanding only changes to test suite (GPLv2) code itself. We could release any changes to Emscripten with current licensing model

This was the reason why the original work was a maintained in a separate repository - keep all changes applied to it public and released under GPLv2, while retaining a strict boundary to what is Emcripten and what is the test suite.

@juj With submodule POSIX test suite is sill a separate repository, only change is that now emscripten repo has named reference to it. We are not bundling Open POSIX test suite with emscritpen.

This is my understanding, unfortunately I'm not a lawyer, so as you mentioned it is grey area to mary these two repos.

My first idea was to use code similar to emsdk to checkout testsuite directly form scripts running tests (see dcefb9c)

@kripken
Copy link
Member

kripken commented Feb 25, 2020

How does the submodule work here - I think this is the first submodule we have. Do we need to get people (and our CI) to do submodule init etc. in order to run the tests? Or are these tests optional and not run by default?

About the license issue, how about putting it under the third_party/ subfolder, maybe third_party/tests/? Then it's more clearly not a core part of the project, just separate code we invoke, but are not designed around / entertwined (and that if in the future we find a license issue, we can just remove the subfolder in a clear way).

@kripken
Copy link
Member

kripken commented Feb 25, 2020

Oh wait, it's already in third_party, I misread the diff! So that part sounds good, sorry.

@sbc100
Copy link
Collaborator

sbc100 commented Feb 25, 2020

I think this tests will be optional, so only those who want to run them would run the submodule init.

@kripken
Copy link
Member

kripken commented Feb 25, 2020

Sounds fine to me.

@abujalski
Copy link
Contributor Author

I removed adding POSIX test suite as a submodule from this PR (in 5de2332) and added test to check my fixes to browser test suite (in 9f530b5).

I'll create separate PR to (re)add changes related to POSIX test suite.

@kleisauke
Copy link
Collaborator

Is there any interest in rebasing this PR now that #12923 is merged? I had an attempt to rebase this, see:
master...kleisauke:pr-10524-update

(It fixes a couple of test failures on the Open POSIX test suite, but it also seems to have introduced two regressions)

@abujalski
Copy link
Contributor Author

(It fixes a couple of test failures on the Open POSIX test suite, but it also seems to have introduced two regressions)

@kleisauke Can you tell me what are the regressions?

@kleisauke
Copy link
Collaborator

@abujalski Sorry for not clarifying that. It seems to cause test failures on these two test cases:
https://github.com/kleisauke/emscripten/blob/4818ef3ca759e0e3f162847665c9ba981e8b8769/tests/test_posixtest.py#L60-L62

I still need to double check whether this is actually caused by this PR.

@abujalski
Copy link
Contributor Author

I've rebased this branch.

However I need to test it and verify that it doesn't cause any regressions in tests.

@sbc100
Copy link
Collaborator

sbc100 commented Dec 3, 2020

Oh no, I'm really sorry but I completely forgot about this PR and I my recent change duplicates some it : #12923. Hopefully its just the testsuite integration stuff that was duplicated and all the substantive changes here are still good.

src/library_pthread.js Outdated Show resolved Hide resolved
src/library_pthread.js Outdated Show resolved Hide resolved
src/library_pthread.js Outdated Show resolved Hide resolved
@kleisauke
Copy link
Collaborator

@abujalski test_pthread_setcanceltype_1_1 is the only reproducible testcase that has caused a test failure on the pr-10524-update branch. Although, I'm not sure if this is a false positive since the previous return code was PTS_UNRESOLVED instead of the expected PTS_PASS.

Details
$ EMTEST_VERBOSE=1 python3 tests/runner.py posixtest.test_pthread_setcanceltype_1_1

Before:

-- begin program output --
Test PASSED
Error in pthread_mutex_lock()
: No error information
-- end program output --
ok

https://github.com/juj/posixtestsuite/blob/26372421f53aeeeeeb4b23561c417886f1930ef6/conformance/interfaces/pthread_setcanceltype/1-1.c#L62-L69

After:

-- begin program output --
Test FAILED: Cancel request timed out
-- end program output --
FAIL

https://github.com/juj/posixtestsuite/blob/26372421f53aeeeeeb4b23561c417886f1930ef6/conformance/interfaces/pthread_setcanceltype/1-1.c#L125-L131

And the test_pthread_cond_signal_1_1 seems to be flaky sometimes on Node (not sure if this is caused by this PR).

Details
[Thread 0x0x7019f0] started and locked the mutex
[Thread 0x0x7019f0] is waiting for the cond
[Thread 0x0x901ce8] started and locked the mutex
[Thread 0x0x901ce8] is waiting for the cond
[Thread 0x0xb02000] started and locked the mutex
[Thread 0x0xb02000] is waiting for the cond
[Main thread] signals a condition
[Thread 0x0x7019f0] was wakened and acquired the mutex again
[Thread 0x0x7019f0] released the mutex
[Main thread] 1 waiters were wakened
Calling stub instead of sigaction()
[Main thread] signals to wake up the next thread
[Thread 0x0x901ce8] was wakened and acquired the mutex again
[Main thread] signals to wake up the next thread
[Thread 0x0x901ce8] was wakened and acquired the mutex again
[Main thread] signals to wake up the next thread
[Main thread] signals to wake up the next thread
[Thread 0x0x901ce8] released the mutex
main thead called exit: noExitRuntime=undefined
[Thread 0x0xb02000[Main thread] had to signal the condition 

@sbc100 Most of the tests that were disabled were only failing on Node, see commit kleisauke@6d62b9c. Do you have any idea why this works without problems in the browser?

@sbc100
Copy link
Collaborator

sbc100 commented Dec 3, 2020

@abujalski test_pthread_setcanceltype_1_1 is the only reproducible testcase that has caused a test failure on the pr-10524-update branch. Although, I'm not sure if this is a false positive since the previous return code was PTS_UNRESOLVED instead of the expected PTS_PASS.

Details
And the test_pthread_cond_signal_1_1 seems to be flaky sometimes on Node (not sure if this is caused by this PR).

Details
@sbc100 Most of the tests that were disabled were only failing on Node, see commit kleisauke@6d62b9c. Do you have any idea why this works without problems in the browser?

I think its probably the differences between web workers and node workers. They are surprisingly different IIUC.

You should mind submitting that as a PR?

// cp suffix in the function name means "cancellation point", so this wait can be cancelled
// by the users unless current threads cancelability is set to PTHREAD_CANCEL_DISABLE
// which may be either done by the user of __timedwait() function.
if (is_main_thread || pthread_self()->canceldisable != PTHREAD_CANCEL_DISABLE) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is causing 'test_pthread_setcanceltype_1_1' test to fail.

Call chain:

pthread_mutex_lock --> __timedwait() --> __pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &cs);

In test there are following flags set:

	pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, NULL);
	pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, NULL);

It looks that this condition has to be modified.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated test in 0eb64ed

@kleisauke
Copy link
Collaborator

I think its probably the differences between web workers and node workers. They are surprisingly different IIUC.

You should mind submitting that as a PR?

I just submitted PR #12963 which should fix this.

src/library_pthread.js Outdated Show resolved Hide resolved
@abujalski
Copy link
Contributor Author

abujalski commented Jan 8, 2021

Hi @kleisauke,
Thanks for information, and investigation :)

Sorry for late reply, but I've missed second problem :(

@abujalski test_pthread_setcanceltype_1_1 is the only reproducible testcase that has caused a test failure on the pr-10524-update branch. Although, I'm not sure if this is a false positive since the previous return code was PTS_UNRESOLVED instead of the expected PTS_PASS.

This was regression, fixed. See #10524 (comment)

And the test_pthread_cond_signal_1_1 seems to be flaky sometimes on Node (not sure if this is caused by this PR).

Indeed it looks flaky. But what is bit odd is that this tests failed once when I run whole test suite, but when I've run just this particular test it passes. I've looped this particular tests for 1000 times and each one of them passed.

I'll investigate this further.

@kleisauke
Copy link
Collaborator

This was regression, fixed. See #10524 (comment)

I can confirm that commit 0eb64ed fixes that, thanks.

Indeed it looks flaky. But what is bit odd is that this tests failed once when I run whole test suite, but when I've run just this particular test it passes. I've looped this particular tests for 1000 times and each one of them passed.

I'll investigate this further.

I've observed the same behavior of that test, it's really curious. Note that this test failure does not seem to be caused by this PR, as it appears to be happening on the master branch as well, see:
61721ba
https://circleci.com/gh/emscripten-core/emscripten/385709

Thanks for investigating this further!

@abujalski
Copy link
Contributor Author

Indeed it looks flaky. But what is bit odd is that this tests failed once when I run whole test suite, but when I've run just this particular test it passes. I've looped this particular tests for 1000 times and each one of them passed.
I'll investigate this further.

I've observed the same behavior of that test, it's really curious. Note that this test failure does not seem to be caused by this PR, as it appears to be happening on the master branch as well, see:
61721ba
https://circleci.com/gh/emscripten-core/emscripten/385709

Thanks for investigating this further!

I've checked that test_pthread_cond_signal_1_1 tests fails only when POSIX test suite is run in parallel (i.e. by ParallelTestSuite). When I either run this test alone or whole test suite sequentially this test passes.

This test awaits for some action to happen in side threads. However time threshold is quite low and it looks that when there are other tests/tasks running in background side threads didn't get CPU on time.

There can be two possible ways to fix this issue:

  1. Run whole test suite sequentially,
  2. I can try to increase timeouts in this test and see if this helps

Can you tell me which one is preferred?

@sbc100
Copy link
Collaborator

sbc100 commented Jan 12, 2021

Indeed it looks flaky. But what is bit odd is that this tests failed once when I run whole test suite, but when I've run just this particular test it passes. I've looped this particular tests for 1000 times and each one of them passed.
I'll investigate this further.

I've observed the same behavior of that test, it's really curious. Note that this test failure does not seem to be caused by this PR, as it appears to be happening on the master branch as well, see:
61721ba
https://circleci.com/gh/emscripten-core/emscripten/385709
Thanks for investigating this further!

I've checked that test_pthread_cond_signal_1_1 tests fails only when POSIX test suite is run in parallel (i.e. by ParallelTestSuite). When I either run this test alone or whole test suite sequentially this test passes.

This test awaits for some action to happen in side threads. However time threshold is quite low and it looks that when there are other tests/tasks running in background side threads didn't get CPU on time.

There can be two possible ways to fix this issue:

  1. Run whole test suite sequentially,
  2. I can try to increase timeouts in this test and see if this helps

Can you tell me which one is preferred?

Sounds like the test is flaky. I would mark it as such (with @disabled(<link_to_bug>) or something like that) and we we can consider fixing the test (to be less timing dependent) upstream if we think its important enough (or if we think we lack coverage of this features via other tests).

@abujalski
Copy link
Contributor Author

I've checked that test_pthread_cond_signal_1_1 tests fails only when POSIX test suite is run in parallel (i.e. by ParallelTestSuite). When I either run this test alone or whole test suite sequentially this test passes.
This test awaits for some action to happen in side threads. However time threshold is quite low and it looks that when there are other tests/tasks running in background side threads didn't get CPU on time.
There can be two possible ways to fix this issue:

  1. Run whole test suite sequentially,
  2. I can try to increase timeouts in this test and see if this helps

Can you tell me which one is preferred?

Sounds like the test is flaky. I would mark it as such (with @disabled(<link_to_bug>) or something like that) and we we can consider fixing the test (to be less timing dependent) upstream if we think its important enough (or if we think we lack coverage of this features via other tests).

Done in separate PR #13259

@abujalski abujalski force-pushed the samsung_pthread_fixes2 branch 2 times, most recently from 30db8fc to c7879b1 Compare January 14, 2021 15:09
Copy link
Collaborator

@juj juj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, nice fixes!

Copy link
Collaborator

@sbc100 sbc100 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm with better PR title and description.

src/library_pthread.js Outdated Show resolved Hide resolved
tests/pthread/test_pthread_cancel_cond_wait.cpp Outdated Show resolved Hide resolved
tests/pthread/test_pthread_cancel_cond_wait.cpp Outdated Show resolved Hide resolved
// Lock mutex to ensure that thread is waiting
pthread_mutex_lock(&mutex);

EM_ASM(out('Canceling thread..'););
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just use printf or puts instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

printf() and puts() locks stdout file. I wanted to avoid such locks here, that's why I've used EM_ASM here.

Replacing it with emscripten_log(EM_LOG_CONSOLE, ...) will be ok?

tests/pthread/test_pthread_cancel_cond_wait.cpp Outdated Show resolved Hide resolved
REPORT_RESULT(res);
#endif
return res != 1;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious how if this new test is testing similar things to the newly enabled posixtestsuite tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually yes, this test is supposed to check similar scenario to one tested in pthread_cond_wait_2_3 test. Although in this test scenario itself is simplified compared to test from posixtestsuite. Main purpose of this test was to check cancelling thread waiting on cv scenario without adding posixtestsuite.

I don't know if posixtestsuite is run by CI but if so then this test can be removed. Do you think it is better to keep this tests or remove it as redundant?

@@ -8080,7 +8080,6 @@ def test_pthread_c11_threads(self):
self.set_setting('INITIAL_MEMORY', '64mb')
self.do_run_in_out_file_test('tests', 'pthread', 'test_pthread_c11_threads.c')

@no_asan('flakey errors that must be fixed, https://github.com/emscripten-core/emscripten/issues/12985')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update the PR title and description including specifically how this change address this issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure no problem. I've updated them. Is my description clear and understandable?

tests/pthread/test_pthread_cancel_cond_wait.cpp Outdated Show resolved Hide resolved
@abujalski abujalski changed the title Samsung pthread fixes2 Fixing thread exit and thread cancellation Feb 5, 2021
Copy link
Collaborator

@juj juj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

@juj
Copy link
Collaborator

juj commented Feb 15, 2021

@sbc100 had some outstanding questions about the test suite, not sure if those were resolved to his liking, so letting him hit merge if so.

Copy link
Collaborator

@sbc100 sbc100 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm with a couple more questions/nits

I'm ok with keeping the (possibly redundant) test.

@@ -186,19 +204,8 @@ var LibraryPThread = {
Atomics.store(HEAPU32, (tb + {{{ C_STRUCTS.pthread.profilerBlock }}} ) >> 2, 0);
_free(profilerBlock);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason this cleanup of profilerBlock happens for threadExit but not for threadCancel?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually the only reason for that was fact that in original code it is not called in threadCancel.

Fixed - moved to runExitHandlersAndDeinitThread function.

src/library_pthread.js Show resolved Hide resolved
@@ -494,9 +495,9 @@ var LibraryPThread = {
$cleanupThread: function(pthread_ptr) {
if (ENVIRONMENT_IS_PTHREAD) throw 'Internal Error! cleanupThread() can only ever be called from main application thread!';
if (!pthread_ptr) throw 'Internal Error! Null pthread_ptr in cleanupThread!';
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two lines should probably be in #if ASSERTIONS shouldn't that? Although unrelated to this change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hard to tell that, as $killThread, $cancelThread and $spawnThread functions has similar checks not guarded by #if ASSERTIONS.

#ifdef __EMSCRIPTEN__
emscripten_futex_wait(&inst->finished, 1, INFINITY);
int is_main_thread = emscripten_is_main_runtime_thread();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this not be emscripten_is_main_browser_thread. IIUC its only the browser thread that can't block. This would also seem to match __timedwait.c.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remembered a comment about emscripten_is_main_browser_thread() versus emscripten_is_main_runtime_thread(), see: #12963 (comment).

Therefore shouldn't the __timedwait_cp function instead check for emscripten_is_main_runtime_thread() since that function is a cancellation point?

(note that emscripten_is_main_browser_thread() also returns true on non-worker threads on Node, I attempted to change this but unfortunately it didn't work out)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well it seems that both pthread_barrier_wait.c and __timedwait.c should use the same thing. Having them differ seems bad. Perhaps land this change with emscripten_is_main_browser_thread and then we can consider a separate PR to change them both to emscripten_is_main_runtime_thread as a followup?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good, catch. Fixed.

Copy link
Collaborator

@sbc100 sbc100 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! The details in that comment sounds like there are some future improvments we could make here but this looks good.

src/library_pthread.js Outdated Show resolved Hide resolved
This patch fixes thread cancellation/exit method and also unifies why
how thread structures are updated when thread is canceled or exits.

Following problems are addressed:
1. heap-use-after-free on std::thread emscripten-core#12985
   Function `cleanupThread()` from `src/library_pthread.js`
   unconditionally set `pthread.self` member of pthread structure to 0.
   Problem was that `cleanupThread()` function might be called after
   that pthread structure has been deleted. To fix this problem, setting
   `pthread.self` field is now guarded by check if thread data hasn't
   been already freed.
2. pthread_cond_wait/2-3 test hang
  - Disabling recursive thread cancellation.
  - Allowing __timedwait_cp to be true cancellation point as _cp
    suffix suggests
3. pthread_getschedparam/1-3 test hangs sometimes:
   In pthread_barrier_wait adding check if lock is held by main thread
   and waiting on futex in small slices of time there, to check if
   there is some work to do on behalf of Worker Threads.

Signed-off-by: Adam Bujalski <[email protected]>
@sbc100 sbc100 merged commit 3059251 into emscripten-core:master Feb 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants