Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fallback to a read lock on fork preparation #356

Merged
merged 3 commits into from
Mar 19, 2024
Merged

Conversation

simo5
Copy link
Member

@simo5 simo5 commented Mar 7, 2024

The reason to acquire a write lock is that there is a corner case where
the forking application is going to actually create additional threads
in an atfork() handler that runs before ours instead of doing something
more sensible like an immediate exec().
If such an application were to create threads then there is a chance
one of those threads could cause a call into pkcs11-provider that would
end up acquiring a read lock, which we will destroy as part of the post
fork atfork() handler.
Note that pthread creation or even pthread lock initialization functions
are not considered async-sginal-safe fucntions, so this is all a shot in
the dark anyway ...

Given this premise, falling back to a read lock, which will prevent
modification of the slots is sufficient in most cases.

Although there is a slight chance again that resetting the locks can
cause a thread to obtain a write lock while another thread holds a now
defunct read lock, such that there is modification of the slots while
a separate thread is reading this information, this is so rare it is
not worth dealing with (and there is nothing we can do anyway).
At least two threads won't be writing at the same time, that could be
really problematic.

Use of atfork() is also so rare that the chance of this happening in any
reasonable application can be considered infinitesimally small.
  • test fork deadlock

Fixes #355

@simo5 simo5 requested a review from Jakuje March 7, 2024 21:49
@simo5
Copy link
Member Author

simo5 commented Mar 7, 2024

Actually this patch may still not be ok, because the read lock could be nested in the same thread as the code calling the callback and already holding a read lock ...

@ifranzki
Copy link
Contributor

ifranzki commented Mar 8, 2024

I have tested this change with Apache and yes, it does avoid the deadlock.

Copy link
Contributor

@Jakuje Jakuje left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good

@ifranzki
Copy link
Contributor

ifranzki commented Mar 8, 2024

Small typo in commit title:
"Just acquire a read lock on for preparation"
should probably read "... on fork preparation"

@simo5
Copy link
Member Author

simo5 commented Mar 8, 2024

@ifranzki does apache fork from the thread where the callback is running, or does it fork from another thread?

@simo5
Copy link
Member Author

simo5 commented Mar 8, 2024

Ok so I verified that holding a read lock multiple times with pthread_rwlock_rdlock() is just fine, therefore the code will work as is.

However I think I may still change this patch to attempt to acquire a write lock, and fallback to a mere read lock only if that fails ...

@simo5
Copy link
Member Author

simo5 commented Mar 8, 2024

I added a trywrlock() and also added a much deeper explanation in the commit message of my reasoning why I do what I do now in the code.
Will modify the PR description accordingly as well.

@simo5 simo5 changed the title Just acquire a read lock on for preparation. Fallback to a read lock on fork preparation Mar 8, 2024
@ifranzki
Copy link
Contributor

does apache fork from the thread where the callback is running, or does it fork from another thread?

It does fork from the thread where the callback is running. You can see this in the backtrace in the issue:
#8 0x000003fff7d31a76 in apr_proc_create () from /lib/s390x-linux-gnu/libapr-1.so.0
This is doing the fork, it is a function of the Apache portable runtime (APR), here is the code (look for apr_proc_create):
https://svn.apache.org/viewvc/apr/apr/trunk/threadproc/unix/proc.c?view=markup

Also, the backtrace shows that its executing in the process's main thread, i.e. the backtrack shows main() as very first function in the backtrace.

@simo5
Copy link
Member Author

simo5 commented Mar 11, 2024

Thanks for confirming, I think I am ok with the current code.
I just nee dot find time to write a test so we do not regress.

@simo5
Copy link
Member Author

simo5 commented Mar 15, 2024

Finally found time to add the test.
The test fails without the commit that fixes the deadlock.
Tested failure locally by reverting the (now-last) commit.

This test will ensure we do not inadvertently reintroduce a deadlock in the future.

@simo5 simo5 force-pushed the fork_deadlock branch 3 times, most recently from 3f707ed to e21e6dd Compare March 15, 2024 15:34
@Jakuje
Copy link
Contributor

Jakuje commented Mar 18, 2024

Address sanitizer / CI with Address Sanitizer (fedora) (pull_request) Failing after 360m

I ran diff through the logs for fedora in this run and last working in main https://github.com/latchset/pkcs11-provider/actions/runs/8266635712/job/22615078760 and the only difference in the dependencies is unbound from 1.19.1-2 to 1.19.1-4.

Running the address sanitizer checks locally on master shows no issues, while running it in this branch shows two memory leaks:

Direct leak of 88 byte(s) in 1 object(s) allocated from:
    #0 0x7f742b2d92ef in malloc (/usr/lib64/libasan.so.8.0.0+0xd92ef) (BuildId: 7fcb7759bc17ef47f9682414b6d99732d6a6ab0c)
    #1 0x7f742adb5334 in CRYPTO_zalloc (/lib64/libcrypto.so.3+0x1b5334) (BuildId: 5d012dcc6f62f35dabb8a129d641e3df7b731f11)
    #2 0x7f742ae172dc in UI_create_method (/lib64/libcrypto.so.3+0x2172dc) (BuildId: 5d012dcc6f62f35dabb8a129d641e3df7b731f11)
    #3 0x4023f7 in main /home/jjelen/devel/pkcs11-provider/tests/tfork_deadlock.c:127
    #4 0x7f742aa46149 in __libc_start_call_main (/lib64/libc.so.6+0x28149) (BuildId: 7ea8d85df0e89b90c63ac7ed2b3578b2e7728756)
    #5 0x7f742aa4620a in __libc_start_main_impl (/lib64/libc.so.6+0x2820a) (BuildId: 7ea8d85df0e89b90c63ac7ed2b3578b2e7728756)
    #6 0x402cd4 in _start (/home/jjelen/devel/pkcs11-provider/tests/tfork_deadlock+0x402cd4) (BuildId: 4227a9452263f96357af137df2d8392b07b00ea6)

Indirect leak of 14 byte(s) in 1 object(s) allocated from:
    #0 0x7f742b2d92ef in malloc (/usr/lib64/libasan.so.8.0.0+0xd92ef) (BuildId: 7fcb7759bc17ef47f9682414b6d99732d6a6ab0c)
    #1 0x7f742adb6672 in CRYPTO_strdup (/lib64/libcrypto.so.3+0x1b6672) (BuildId: 5d012dcc6f62f35dabb8a129d641e3df7b731f11)
    #2 0x7f742ae172f4 in UI_create_method (/lib64/libcrypto.so.3+0x2172f4) (BuildId: 5d012dcc6f62f35dabb8a129d641e3df7b731f11)
    #3 0x4023f7 in main /home/jjelen/devel/pkcs11-provider/tests/tfork_deadlock.c:127
    #4 0x7f742aa46149 in __libc_start_call_main (/lib64/libc.so.6+0x28149) (BuildId: 7ea8d85df0e89b90c63ac7ed2b3578b2e7728756)
    #5 0x7f742aa4620a in __libc_start_main_impl (/lib64/libc.so.6+0x2820a) (BuildId: 7ea8d85df0e89b90c63ac7ed2b3578b2e7728756)
    #6 0x402cd4 in _start (/home/jjelen/devel/pkcs11-provider/tests/tfork_deadlock+0x402cd4) (BuildId: 4227a9452263f96357af137df2d8392b07b00ea6)

Can you try to fix them and rebuild it?

@simo5 simo5 force-pushed the fork_deadlock branch 2 times, most recently from 0b3c4f7 to a410b57 Compare March 19, 2024 16:33
A free() was used indtead of the proper OPENSSL_free()

While there reformatted the code to better conform to our style.

Signed-off-by: Simo Sorce <[email protected]>
The reason to acquire a write lock is that there is a corner case where
the forking application is going to actually create additional threads
in an atfork() handler that runs before ours instead of doing something
more sensible like an immediate exec().
If such an application were to create threads then there is a chance
one of those threads could cause a call into pkcs11-provider that would
end up acquiring a read lock, which we will destroy as part of the post
fork atfork() handler.
Note that pthread creation or even pthread lock initialization functions
are not considered async-signal-safe fucntions, so this is all a shot in
the dark anyway ...

Given this premise, falling back to a read lock, which will prevent
modification of the slots is sufficient in most cases.

Although there is a slight chance again that resetting the locks can
cause a thread to obtain a write lock while another thread holds a now
defunct read lock, such that there is modification of the slots while
a separate thread is reading this information, this is so rare it is
not worth dealing with (and there is nothing we can do anyway).
At least two threads won't be writing at the same time, that could be
really problematic.

Use of atfork() is also so rare that the chance of this happening in any
reasonable application can be considered infinitesimally small.

Signed-off-by: Simo Sorce <[email protected]>
@simo5
Copy link
Member Author

simo5 commented Mar 19, 2024

@Jakuje I sneaked in a commit to fix a Coverity regression in another PR I merged today.
However the Fedora Sanitizer seem randomly failing again.
Are you ok to push regardless ?

@simo5
Copy link
Member Author

simo5 commented Mar 19, 2024

Merging as the address sanitizer issue is a problem in the Fedora build not in the code.

@simo5 simo5 merged commit 673e4fd into latchset:main Mar 19, 2024
21 of 22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Deadlock on P11PROV_SLOTS_CTX->rwlock while in passphrase callback
3 participants