Fix a race condition in pthread_mutex_timedlock.c #12245

Closed · wants to merge 3 commits

Conversation

kripken (Member) commented Sep 17, 2020

pthread_mutex_lock passes no timeout when it calls this function. In that case
we don't loop; we just do a single wait, forever. In a rare race, the condition
we care about may be set right before the wait begins, and the wait has no way
to notice it (it only waits for a wake event; it does not re-read the memory to
check the value).

Instead, just busy-wait. Since this is for pthread_mutex_lock, the normal use
case is probably something that needs to be fast anyhow.
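
To make the race concrete, here is a minimal sketch of the two patterns. This is not the actual musl/Emscripten code; `lock_word` and `wait_for_wake_event` are hypothetical stand-ins for the real lock word and the futex-style wait:

```c
#include <stdatomic.h>

/* Hypothetical lock word: 0 = unlocked, 1 = locked. */
extern _Atomic int lock_word;

/* Hypothetical wake-based wait: sleeps until some other thread sends a
 * wake event; it does NOT re-read lock_word before going to sleep. */
extern void wait_for_wake_event(void);

/* Racy pattern: if the holder unlocks (and sends its wake) between the
 * failed CAS and wait_for_wake_event(), the wake is missed and this
 * thread sleeps forever, since there is no timeout to bail it out. */
static void lock_racy(void) {
    int expected = 0;
    while (!atomic_compare_exchange_strong(&lock_word, &expected, 1)) {
        wait_for_wake_event();
        expected = 0; /* reset for the next CAS attempt */
    }
}

/* Busy-wait pattern from this PR: keep re-reading the lock word, so a
 * change can never be missed (at the cost of spinning while contended). */
static void lock_busy_wait(void) {
    int expected = 0;
    while (!atomic_compare_exchange_strong(&lock_word, &expected, 1)) {
        expected = 0; /* reset and retry immediately */
    }
}
```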

kripken requested a review from juj on Sep 17, 2020
sbc100 (Collaborator) commented Sep 17, 2020

I'm not sure turning all pthread_mutex_lock calls into busy loops is an acceptable solution. There could be threads that wait for long periods on locks, no?

sbc100 (Collaborator) commented Sep 17, 2020

Is this a bug that affects musl in general, or just Emscripten?

sbc100 (Collaborator) commented Sep 17, 2020

Also, does the same race condition apply to timedlock calls that do have a timeout?

What if I call timedlock with a 30-minute timeout? I could hit the race condition and end up waiting the full 30 minutes, right?

kripken added a commit that referenced this pull request Sep 17, 2020
kleisauke (Collaborator) commented

I wonder if PR #10524 could also help resolve these race conditions, since it ran the entire Open POSIX Test Suite in WebAssembly.

sbc100 (Collaborator) commented Sep 17, 2020

I also wonder, if this is a real bug in musl, whether it might have been fixed upstream already.

kripken (Member, Author) commented Sep 17, 2020

@kleisauke Good idea, but it doesn't look like #10524 can help here - it fixes other issues like thread cancellation, but not core mutex operations or our proxying logic.

@sbc100 I did look at upstream musl, and the code has not changed significantly, so it's not fixed upstream AFAICT.

I don't know if this only affects us or musl in general (it would be incredibly hard to test a native build of musl in a reliable enough way on a variant of #12258!).

Overall, I think the diagnoses in these three PRs are incorrect, as has been pointed out. However, they are all necessary to fix #12258, and each of them definitely fixes a specific deadlock I encountered while debugging that testcase. So I guess we need to debug those three deadlocks more. I am a little unsure how best to do that, though: how can I debug whether Atomics.wait is actually as atomic as it claims to be?
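
For what it's worth, the documented contract of a futex-style wait (and of Atomics.wait) is that it atomically compares the memory location against the expected value and only sleeps if they still match; if the value has already changed, it returns immediately. A minimal sketch of probing that from C, assuming Emscripten's `emscripten_futex_wait` helper (built on Atomics.wait on worker threads) and not relying on the exact return code, might look like:

```c
#include <emscripten/threading.h>
#include <stdio.h>

static volatile int word = 1;

int main(void) {
    /* Ask to wait only while word == 0. It is 1, so a correct wait must
     * notice the mismatch and return right away instead of blocking for
     * the full one-second timeout. */
    int rc = emscripten_futex_wait(&word, 0, 1000.0 /* ms */);
    printf("futex wait returned %d (should not have blocked)\n", rc);
    return 0;
}
```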

The only good news here is that this is likely not urgent, as these corner cases are very hard to hit. They are also all quite old, so I don't think we have any recent regression here.

kripken (Member, Author) commented Sep 22, 2020

I have found the actual cause here, and will open a refactoring PR and then a fix PR shortly.

kripken closed this on Sep 22, 2020
kripken deleted the pthread1 branch on Sep 22, 2020