Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ALPN callback function sometimes leads to segfault in node.js >= 18.13.0 #47207

Closed
gugu opened this issue Mar 21, 2023 · 50 comments · Fixed by #49635
Closed

ALPN callback function sometimes leads to segfault in node.js >= 18.13.0 #47207

gugu opened this issue Mar 21, 2023 · 50 comments · Fixed by #49635
Labels
tls Issues and PRs related to the tls subsystem.

Comments

@gugu
Copy link

gugu commented Mar 21, 2023

Version

18.15.0

Platform

Linux 2a53a1799e0b 5.15.0-67-generic #74-Ubuntu SMP Wed Feb 22 14:14:39 UTC 2023 x86_64 GNU/Linux

Subsystem

tls

What steps will reproduce the bug?

I see this from logs of the my node.js server. I did not find a way to reproduce yet, and need some help with that. I've attached stacktrace, but did not yet found ALPN header value, which causes this error. According to my investigations, function SelectALPNCallback calls SSL_select_next_proto with NULL instead of correct pointer. Looks like some maliciously crafted ALPN header can lead to such error

How often does it reproduce? Is there a required condition?

It is a rare case I capture from logs (around 1 req/million). I can add some code to get more information about the bug, but don't know what to do

What is the expected behavior? Why is that the expected behavior?

Do not produce segfault

What do you see instead?

Segfoult with stacktrace:

PID 9 received SIGSEGV for address: 0x0
/srv/shorturl_redirector/node_modules/segfault-handler/build/Release/segfault-handler.node(+0x3236)[0x7f86dc0d0236]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x13140)[0x7f86dedcf140]
node(SSL_select_next_proto+0x4c)[0x17f0bfc]
node[0xd243e0]
node(tls_handle_alpn+0x53)[0x1834f83]
node(tls_parse_all_extensions+0x143)[0x18176f3]
node(tls_post_process_client_hello+0x70)[0x1835200]
node[0x1822b74]
node(ssl3_read_bytes+0x320)[0x1811050]
node(ssl3_read+0x60)[0x17e0c40]
node(SSL_read+0x87)[0x17ee3f7]
node(_ZN4node6crypto7TLSWrap8ClearOutEv+0x77)[0xd2c8f7]
node(_ZN4node6crypto7TLSWrap12OnStreamReadElRK8uv_buf_t+0xf8)[0xd2d5f8]
node(_ZN4node15LibuvStreamWrap8OnUvReadElPK8uv_buf_t+0x89)[0xc6f959]
node[0xc6fd68]
node[0x1676f67]
node[0x1677790]
node[0x167d534]
node(uv_run+0x14e)[0x166b95e]
node(_ZN4node13SpinEventLoopEPNS_11EnvironmentE+0x14d)[0xabda2d]
node(_ZN4node16NodeMainInstance3RunEv+0xf4)[0xbc1874]
node(_ZN4node22LoadSnapshotDataAndRunEPPKNS_12SnapshotDataEPKNS_20InitializationResultE+0xb4)[0xb36434]
node(_ZN4node5StartEiPPc+0x2df)[0xb3a02f]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xea)[0x7f86dec0ad0a]
node(_start+0x2e)[0xaba37e]
Segmentation fault (core dumped)

Additional information

No response

@gugu
Copy link
Author

gugu commented Mar 21, 2023

And another one few second ago:

PID 8 received SIGSEGV for address: 0x400000001081
/srv/shorturl_redirector/node_modules/segfault-handler/build/Release/segfault-handler.node(+0x3236)[0x7fbc68045236]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x13140)[0x7fbc6ad44140]
node(SSL_select_next_proto+0x4c)[0x17f0bfc]
node[0xd243e0]
node(tls_handle_alpn+0x53)[0x1834f83]
node(tls_parse_all_extensions+0x143)[0x18176f3]
node(tls_post_process_client_hello+0x70)[0x1835200]
node[0x1822b74]
node(ssl3_read_bytes+0x320)[0x1811050]
node(ssl3_read+0x60)[0x17e0c40]
node(SSL_read+0x87)[0x17ee3f7]
node(_ZN4node6crypto7TLSWrap8ClearOutEv+0x77)[0xd2c8f7]
node(_ZN4node6crypto7TLSWrap12OnStreamReadElRK8uv_buf_t+0xf8)[0xd2d5f8]
node(_ZN4node15LibuvStreamWrap8OnUvReadElPK8uv_buf_t+0x89)[0xc6f959]
node[0xc6fd68]
node[0x1676f67]
node[0x1677790]
node[0x167d534]
node(uv_run+0x14e)[0x166b95e]
node(_ZN4node13SpinEventLoopEPNS_11EnvironmentE+0x14d)[0xabda2d]
node(_ZN4node16NodeMainInstance3RunEv+0xf4)[0xbc1874]
node(_ZN4node22LoadSnapshotDataAndRunEPPKNS_12SnapshotDataEPKNS_20InitializationResultE+0xb4)[0xb36434]
node(_ZN4node5StartEiPPc+0x2df)[0xb3a02f]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xea)[0x7fbc6ab7fd0a]
node(_start+0x2e)[0xaba37e]
Segmentation fault (core dumped)

@bnoordhuis bnoordhuis added the tls Issues and PRs related to the tls subsystem. label Mar 22, 2023
@gugu
Copy link
Author

gugu commented Mar 22, 2023

It's 2AM and I finally extracted core dump from kubernetes. Does not tell much because I need debug symbols for node 18.15.0

(gdb) bt
#0  0x00000000017f0bfc in SSL_select_next_proto ()                                                                                                                                           
#1  0x0000000000d243e0 in node::crypto::(anonymous namespace)::SelectALPNCallback(ssl_st*, unsigned char const**, unsigned char*, unsigned char const*, unsigned int, void*) ()              
#2  0x0000000001834f83 in tls_handle_alpn ()                                                                                                                                                 
#3  0x00000000018176f3 in tls_parse_all_extensions ()                                                                                                                                        
#4  0x0000000001835200 in tls_post_process_client_hello ()                                                                                                                                   
#5  0x0000000001822b74 in state_machine.part ()
#6  0x0000000001811050 in ssl3_read_bytes ()
#7  0x00000000017e0c40 in ssl3_read ()
#8  0x00000000017ee3f7 in SSL_read ()
#9  0x0000000000d2c8f7 in node::crypto::TLSWrap::ClearOut() ()
#10 0x0000000000d2d5f8 in node::crypto::TLSWrap::OnStreamRead(long, uv_buf_t const&) ()
#11 0x0000000000c6f959 in node::LibuvStreamWrap::OnUvRead(long, uv_buf_t const*) ()
#12 0x0000000000c6fd68 in node::LibuvStreamWrap::ReadStart()::{lambda(uv_stream_s*, long, uv_buf_t const*)#2}::_FUN(uv_stream_s*, long, uv_buf_t const*) ()
#13 0x0000000001676f67 in uv__read (stream=stream@entry=0x7f4994e92e90) at ../deps/uv/src/unix/stream.c:1201
#14 0x0000000001677790 in uv__stream_io (loop=<optimized out>, w=0x7f4994e92f18, events=1) at ../deps/uv/src/unix/stream.c:1270
#15 0x000000000167d534 in uv__io_poll (loop=loop@entry=0x526dfc0 <default_loop_struct>, timeout=<optimized out>) at ../deps/uv/src/unix/epoll.c:374
#16 0x000000000166b95e in uv_run (loop=0x526dfc0 <default_loop_struct>, mode=UV_RUN_DEFAULT) at ../deps/uv/src/unix/core.c:406
#17 0x0000000000abda2d in node::SpinEventLoop(node::Environment*) ()
#18 0x0000000000bc1874 in node::NodeMainInstance::Run() ()
#19 0x0000000000b36434 in node::LoadSnapshotDataAndRun(node::SnapshotData const**, node::InitializationResult const*) ()
#20 0x0000000000b3a02f in node::Start(int, char**) ()
#21 0x00007f4ee8745d0a in __libc_start_main (main=0xab1f70 <main>, argc=3, argv=0x7ffdb3a07ad8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, 

and

(gdb) info frame
Stack level 1, frame at 0x7ffdb39ffe50:
 rip = 0xd243e0 in node::crypto::(anonymous namespace)::SelectALPNCallback(ssl_st*, unsigned char const**, unsigned char*, unsigned char const*, unsigned int, void*); saved rip = 0x1834f83
 called by frame at 0x7ffdb39ffe90, caller of frame at 0x7ffdb39ffe40
 Arglist at 0x7ffdb39ffe38, args: 
 Locals at 0x7ffdb39ffe38, Previous frame's sp is 0x7ffdb39ffe50
 Saved registers:
  rbp at 0x7ffdb39ffe40, rip at 0x7ffdb39ffe48

@bnoordhuis
Copy link
Member

A disassemble and info registers of the top frame might be enough for me to tell what's going on, and maybe a hex dump of the top 32 or 64 bytes of the stack.

@gugu
Copy link
Author

gugu commented Mar 23, 2023

Dump of assembler code for function SSL_select_next_proto:
   0x00000000017f0bb0 <+0>:     push   %rbp
   0x00000000017f0bb1 <+1>:     mov    %rsp,%rbp
   0x00000000017f0bb4 <+4>:     push   %r15
   0x00000000017f0bb6 <+6>:     mov    %r8,%r15
   0x00000000017f0bb9 <+9>:     push   %r14
   0x00000000017f0bbb <+11>:    mov    %r9d,%r14d
   0x00000000017f0bbe <+14>:    push   %r13
   0x00000000017f0bc0 <+16>:    push   %r12
   0x00000000017f0bc2 <+18>:    push   %rbx
   0x00000000017f0bc3 <+19>:    sub    $0x48,%rsp
   0x00000000017f0bc7 <+23>:    mov    %rdi,-0x60(%rbp)
   0x00000000017f0bcb <+27>:    mov    %rsi,-0x68(%rbp)
   0x00000000017f0bcf <+31>:    mov    %rdx,-0x58(%rbp)
   0x00000000017f0bd3 <+35>:    mov    %ecx,-0x48(%rbp)
   0x00000000017f0bd6 <+38>:    movl   $0x0,-0x44(%rbp)
   0x00000000017f0bdd <+45>:    test   %ecx,%ecx
   0x00000000017f0bdf <+47>:    je     0x17f0c90 <SSL_select_next_proto+224>
   0x00000000017f0be5 <+53>:    mov    -0x44(%rbp),%eax
   0x00000000017f0be8 <+56>:    mov    -0x58(%rbp),%rdi
   0x00000000017f0bec <+60>:    xor    %r12d,%r12d
   0x00000000017f0bef <+63>:    mov    %rax,%rcx
   0x00000000017f0bf2 <+66>:    add    %rdi,%rax
   0x00000000017f0bf5 <+69>:    mov    %rax,-0x50(%rbp)
   0x00000000017f0bf9 <+73>:    lea    0x1(%rcx),%ebx
=> 0x00000000017f0bfc <+76>:    movzbl (%rax),%eax
   0x00000000017f0bff <+79>:    add    %rdi,%rbx
   0x00000000017f0c02 <+82>:    mov    %rax,%r13
   0x00000000017f0c05 <+85>:    mov    %rax,-0x38(%rbp)
   0x00000000017f0c09 <+89>:    mov    %rbx,-0x40(%rbp)
   0x00000000017f0c0d <+93>:    test   %r14d,%r14d
   0x00000000017f0c10 <+96>:    jne    0x17f0c22 <SSL_select_next_proto+114>
   0x00000000017f0c12 <+98>:    jmp    0x17f0c78 <SSL_select_next_proto+200>
   0x00000000017f0c14 <+100>:   nopl   0x0(%rax)
   0x00000000017f0c18 <+104>:   lea    0x1(%r12,%rbx,1),%r12d
   0x00000000017f0c1d <+109>:   cmp    %r12d,%r14d
   0x00000000017f0c20 <+112>:   jbe    0x17f0c78 <SSL_select_next_proto+200>
   0x00000000017f0c22 <+114>:   mov    %r12d,%eax
   0x00000000017f0c25 <+117>:   movzbl (%r15,%rax,1),%ebx
   0x00000000017f0c2a <+122>:   cmp    %r13b,%bl
   0x00000000017f0c2d <+125>:   jne    0x17f0c18 <SSL_select_next_proto+104>
   0x00000000017f0c2f <+127>:   lea    0x1(%r12),%esi
   0x00000000017f0c34 <+132>:   mov    -0x38(%rbp),%rdx
   0x00000000017f0c38 <+136>:   mov    -0x40(%rbp),%rdi
   0x00000000017f0c3c <+140>:   add    %r15,%rsi
   0x00000000017f0c3f <+143>:   call   0xa84da0 <memcmp@plt>
   0x00000000017f0c44 <+148>:   test   %eax,%eax
(gdb) info registers
rax            0x75684f4f7156766e  8460099102637979246
rbx            0x1                 1
rcx            0x0                 0
rdx            0x75684f4f7156766e  8460099102637979246
rsi            0x7ffdb39ffe57      140727617060439
rdi            0x75684f4f7156766e  8460099102637979246
rbp            0x7ffdb39ffe30      0x7ffdb39ffe30
rsp            0x7ffdb39ffdc0      0x7ffdb39ffdc0
r8             0x7f499574c150      139954016797008
r9             0xc                 12
r10            0x75684f4f7156766e  8460099102637979246
r11            0x27ab84dcf749      43617621964617
r12            0x0                 0
r13            0x80                128
r14            0xc                 12
r15            0x7f499574c150      139954016797008
rip            0x17f0bfc           0x17f0bfc <SSL_select_next_proto+76>
eflags         0x10202             [ IF RF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
k0             0xfefef000          4278120448
k1             0xfc000000          4227858432
k2             0xffffffff          4294967295
k3             0xffffffff          4294967295
k4             0xffffffff          4294967295
k5             0xffffffff          4294967295
k6             0xffffffff          4294967295
k7             0x7                 7

@gugu
Copy link
Author

gugu commented Mar 23, 2023

ALPN callback

(gdb) info frame
Stack level 1, frame at 0x7ffdb39ffe50:
 rip = 0xd243e0 in node::crypto::(anonymous namespace)::SelectALPNCallback(ssl_st*, unsigned char const**, unsigned char*, unsigned char const*, unsigned int, void*); saved rip = 0x1834f83
 called by frame at 0x7ffdb39ffe90, caller of frame at 0x7ffdb39ffe40
 Arglist at 0x7ffdb39ffe38, args: 
 Locals at 0x7ffdb39ffe38, Previous frame's sp is 0x7ffdb39ffe50
 Saved registers:
  rbp at 0x7ffdb39ffe40, rip at 0x7ffdb39ffe48
(gdb) x/8xw 0x7ffdb39ffe38
0x7ffdb39ffe38: 0x00d243e0      0x00000000      0xb39ffe80      0x00007ffd
0x7ffdb39ffe48: 0x01834f83      0x00000000      0x0000001c      0x00000000
(gdb) 

@gugu
Copy link
Author

gugu commented Mar 23, 2023

OpenSSL:

(gdb) info frame
Stack level 0, frame at 0x7ffdb39ffe40:
 rip = 0x17f0bfc in SSL_select_next_proto; saved rip = 0xd243e0
 called by frame at 0x7ffdb39ffe50
 Arglist at 0x7ffdb39ffe30, args: 
 Locals at 0x7ffdb39ffe30, Previous frame's sp is 0x7ffdb39ffe40
 Saved registers:
  rbx at 0x7ffdb39ffe08, rbp at 0x7ffdb39ffe30, r12 at 0x7ffdb39ffe10, r13 at 0x7ffdb39ffe18, r14 at 0x7ffdb39ffe20, r15 at 0x7ffdb39ffe28, rip at 0x7ffdb39ffe38
(gdb) x/16wx 0x7ffdb39ffe30
0x7ffdb39ffe30: 0xb39ffe40      0x00007ffd      0x00d243e0      0x00000000
0x7ffdb39ffe40: 0xb39ffe80      0x00007ffd      0x01834f83      0x00000000
0x7ffdb39ffe50: 0x0000001c      0x00000000      0x00000000      0x00000000
0x7ffdb39ffe60: 0x95ca0898      0x00007f49      0x95ca09d8      0x00007f49

@bnoordhuis
Copy link
Member

I can't be 100% sure because the data is buried deeper in the stack than I anticipated (isn't it always like that?) but it looks like the ALPN string from the ClientHello packet is zero-sized.

SSL_select_next_proto() selects the first item from the client's string if there's no matching entry in the server's string but yeah, that won't work if the client's string is empty.

Do you have the opportunity to try out a patch locally? Does this one-liner fix it?

diff --git a/src/crypto/crypto_tls.cc b/src/crypto/crypto_tls.cc
index f14adec767a..fea3186c17e 100644
--- a/src/crypto/crypto_tls.cc
+++ b/src/crypto/crypto_tls.cc
@@ -225,6 +225,8 @@ int SelectALPNCallback(
     const unsigned char* in,
     unsigned int inlen,
     void* arg) {
+  if (inlen == 0) return SSL_TLSEXT_ERR_ALERT_FATAL;
+
   TLSWrap* w = static_cast<TLSWrap*>(arg);
   const std::vector<unsigned char>& alpn_protos = w->alpn_protos_;

@gugu
Copy link
Author

gugu commented Mar 23, 2023

No, but I'll try to patch openssl locally to try to reproduce this error on my dev end. If it is true it looks like a possibility DoS attack on node.js

@gugu
Copy link
Author

gugu commented Mar 23, 2023

Found another one, tls client:

> s = tls.connect('short.io', 443, {ALPNProtocols: [""]})
node[138836]: ../src/crypto/crypto_tls.cc:1533:static void node::crypto::TLSWrap::SetALPNProtocols(const v8::FunctionCallbackInfo<v8::Value>&): Assertion `SetALPN(w->ssl_, args[0])' failed.
 1: 0xb06730 node::Abort() [node]
 2: 0xb067ae  [node]
 3: 0xca27ea node::crypto::TLSWrap::SetALPNProtocols(v8::FunctionCallbackInfo<v8::Value> const&) [node]
 4: 0xd3e33e  [node]
 5: 0xd3f75f v8::internal::Builtin_HandleApiCall(int, unsigned long*, v8::internal::Isolate*) [node]
 6: 0x15da1b9  [node]

@gugu
Copy link
Author

gugu commented Mar 23, 2023

I can't be 100% sure because the data is buried deeper in the stack than I anticipated (isn't it always like that?) but it looks like the ALPN string from the ClientHello packet is zero-sized.

SSL_select_next_proto() selects the first item from the client's string if there's no matching entry in the server's string but yeah, that won't work if the client's string is empty.

Do you have the opportunity to try out a patch locally? Does this one-liner fix it?

diff --git a/src/crypto/crypto_tls.cc b/src/crypto/crypto_tls.cc
index f14adec767a..fea3186c17e 100644
--- a/src/crypto/crypto_tls.cc
+++ b/src/crypto/crypto_tls.cc
@@ -225,6 +225,8 @@ int SelectALPNCallback(
     const unsigned char* in,
     unsigned int inlen,
     void* arg) {
+  if (inlen == 0) return SSL_TLSEXT_ERR_ALERT_FATAL;
+
   TLSWrap* w = static_cast<TLSWrap*>(arg);
   const std::vector<unsigned char>& alpn_protos = w->alpn_protos_;

inlen equal to zero does not cause any issues, the problem is somewhere in out variable I think. Instead of prod I'm trying to reproduce it in this simple program:

#include <iostream>
#include <openssl/ssl.h>
using namespace std;
int main() {
        unsigned char *out;
        unsigned char outlen;
        const char *server = "\x08http/1.1";
        const char *client = "\x08http/1.1";
        cout << strlen(server) << endl;
        int status = SSL_select_next_proto(&out, &outlen, (const unsigned char*)server, strlen(server), (const unsigned char*)client, strlen(client));
        cout << status << endl;
}

UPD: Setting ALPNProtocols: [] in createServer solves the issue for me

@bnoordhuis
Copy link
Member

UPD: Setting ALPNProtocols: [] in createServer solves the issue for me

What was it set to before? Which createServer method are you using (net/http/https/http2)?

@gugu
Copy link
Author

gugu commented Mar 24, 2023

it is https.createServer(), default value is ['http/1.1']

@bnoordhuis
Copy link
Member

FWIW, I've not been able to reproduce the crash (so far at least.)

I tried crafting a ClientHello with an empty ALPN extension record but I get back a handshake_failure alert and the connection is subsequently closed. No crash.

@tniessen
Copy link
Member

I tried various combinations of invalid ALPN extensions last week and also wasn't able to reproduce a crash on the server side. Is there any chance you might be able to share a traffic dump of just the TLS handshake that causes the server to crash @gugu?

@gugu
Copy link
Author

gugu commented Mar 27, 2023

It required me to deploy crashing version of the app to production. If I can get debug symbols of the version released on nodejs.org - I can get much more data. But if it is not possible - then I'll deploy. One more thing from the available debug symbols of libuv:

#13 0x0000000001676f67 in uv__read (stream=stream@entry=0x7f4994e92e90) at ../deps/uv/src/unix/stream.c:1201

(gdb) info locals
buflen = 65536
buf = {base = 0x7f499466abb0 "\026\003\001\002", len = 65536}
nread = <optimized out>
msg = {msg_name = 0x5c11278, msg_namelen = 96539248, msg_iov = 0x0, msg_iovlen = 0, msg_control = 0x0, msg_controllen = 0, msg_flags = 0}
cmsg_space = '\000' <repeats 40 times>, "\210\022\301\005\000\000\000\000@2\301\005\000\000\000\000\200\071Z\344N\177\000\000!\000\000\000\000\000\000\000\200\022\301\005\000\000\000\000\002\000\000\000\000\000\000\000`E\240\263\375\177\000\000@E\240\263\375\177\000\000\215\350\254", '\000' <repeats 13 times>, "\002", '\000' <repeats 19 times>, "\230\206\vA@E\240\263\375\177\000\000\200\071Z\344N\177\000\000\200\022\301\005\000\000\000\000\200\071Z\344N\177\000\000\200\022\301\005\000\000\000\000\260\361\353\345N\177\000\000\300E\240\263\375\177\000\000"...
count = <optimized out>
err = <optimized out>
is_ipc = 0
__PRETTY_FUNCTION__ = "uv__read"

@bnoordhuis
Copy link
Member

Just a random observation but this:

buf = {base = 0x7f499466abb0 "\026\003\001\002", len = 65536}

Means the buffer contains a handshake frame that's between 512 and 767 bytes big. That's about average for a ClientHello.

(That \002 is the upper octet of a 16 bits big-endian value so at best it's 2*256 + 255 bytes big.)

If you can get me those bytes, I can probably figure out what's wrong.

@gugu
Copy link
Author

gugu commented Apr 4, 2023

I forgot about network byte order, now things looking much better. ALPN is completely ok. Here is what I've described so far:

16 - message type
0301 - message version
0200 - content length


01 - client hello message
0001fc - client hello length
0303 - client hello protocol version
125c42b2d5c2396d61c9be6c6e189eefb211d64bdafe85ba6131cabc8aedb2f8 - random value
20 - session id length
8a08d47ae46281fefdfb732c131ecd7c58dba0fb95f842a292ef6ac584f723c4 - session id
0020 - cipher suites length
dada - GREASE compatibility check
1301 - TLS_AES_128_GCM_SHA256 (All of them are decoded by copilot, did not check if it is correct)
1302 - TLS_AES_256_GCM_SHA384
1303 - TLS_CHACHA20_POLY1305_SHA256
c02b - TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
c02f - TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
c02c - TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
c030 - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
cca9 - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
cca8 - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
c013 - TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA
c014 - TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA
009c - TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256
009d - TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384
002f - TLS_ECDHE_ECDSA_WITH_NULL_SHA
0035 - TLS_ECDHE_ECDSA_WITH_RC4_128_SHA
01 - number of compression methods to follow
00 - compression method "null"
0193 - length of extensions
8a8a0000 - GREASE compatibility check
0000 - server name extension
000e 000c 00 0009 xxxxxxxxxxxxxxxxxx - SNI (hostname can share if needed)
0017 0000 xxxx 0001 - extended master secret, session ticket
0000 0a00 0a - signature algorithms
0008 xxxx - server_authz, have no idea what is it and don't want to know
0010 000e 000c 02 683208 687474702f312e31 - ALPN (h2, http/1.1)

my decoding journey continues

@gugu
Copy link
Author

gugu commented Apr 4, 2023

16 - message type
0301 - message version
0200 - content length


01 - client hello message
0001fc - client hello length
0303 - client hello protocol version
125c42b2d5c2396d61c9be6c6e189eefb211d64bdafe85ba6131cabc8aedb2f8 - random value
20 - session id length
8a08d47ae46281fefdfb732c131ecd7c58dba0fb95f842a292ef6ac584f723c4 - session id
0020 - cipher suites length
dada - GREASE compatibility check
1301 - TLS_AES_128_GCM_SHA256 (All of them are decoded by copilot, did not check if it is correct)
1302 - TLS_AES_256_GCM_SHA384
1303 - TLS_CHACHA20_POLY1305_SHA256
c02b - TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
c02f - TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
c02c - TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
c030 - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
cca9 - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
cca8 - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
c013 - TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA
c014 - TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA
009c - TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256
009d - TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384
002f - TLS_ECDHE_ECDSA_WITH_NULL_SHA
0035 - TLS_ECDHE_ECDSA_WITH_RC4_128_SHA
01 - number of compression methods to follow
00 - compression method "null"
0193 - length of extensions
8a8a0000 - GREASE compatibility check
0000 - server name extension
000e 000c 00 0009 xxxxxxxxxxxxxxxxxx - SNI xxxx.xxxx
0017 0000 ff01 0001 - extended master secret, session ticket
     0000 0a00 0a - signature algorithms
0008 eaea 001d 0017 0018 000b 0002 0100 0023 0000 - server_authz, have no idea what is it
0010 000e 000c 02 683208 687474702f312e31 - ALPN (h2, http/1.1)
0005 0005 0100000000 # OCSP
000d 0012 0010 040308040401050308050501080606010012 00000033002b0029eaea000100
001d 00203f782c482d92e2d7f8e7dcfee21effeaa128b30614bc01f200e051cb57ceaa6b002d00020101002b000b0ababa0304030303020301001b0003020002446900050003026832aaaa000100001500ca00
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000f3e774

Here is a full message. I tried to repeat it locally in a single thread in a loop 1M iterations, nothing fails. I'll try running in parallel

@bnoordhuis
Copy link
Member

Just a quick note but this:

0017 0000 ff01 0001 - extended master secret, session ticket
0000 0a00 0a - signature algorithms

Isn't right. 0017 0000 is the extended master secret extension, ff01 0001 00 is Renegotiation Indication with a (expected for a new connection) empty payload, and then 000a is EC Supported Groups.

@gugu
Copy link
Author

gugu commented Apr 4, 2023

I've checked 3 dumps, all of them have perfectly valid ALPN record. My guess now it is some use after free in ALPN callback

@gugu
Copy link
Author

gugu commented Apr 5, 2023

One more note: %rax points to incorrect memory address.
Can not reproduce it locally (still happens in prod). My C++ knowledge is poor and my asm is even worse, can:

  1. TLSWrap* w = static_cast<TLSWrap*>(arg); - arg be destroyed when callback happens?
  2. w->alpn_protos_ be destroyed when callback happens?
  3. out point to incorrect place? (I think no because it prepared by OpenSSL before each callback call)

Fails always this line:

=> 0x00000000017f0bfc <+76>:    movzbl (%rax),%eax

PS: I see src: optimize ALPN callback commit in node 18.13 (bf17f8d), I'll try to deploy 18.12 and tell the status

@gugu
Copy link
Author

gugu commented Apr 5, 2023

@bnoordhuis yep, reverting to version before bf17f8d fixes the problem (18.12.0)

@gugu gugu changed the title Maliciously crafted ALPN header [looks like] can lead to segfault ALPN callback function sometimes leads to segfault in node.js >= 18.13.0 Apr 5, 2023
@gugu
Copy link
Author

gugu commented Apr 5, 2023

image

I think reverting this change will help

@bnoordhuis
Copy link
Member

It seems really unlikely to me that bf17f8d is the culprit but maybe it exposes a preexisting condition.

Apropros that diff, can you explain why you think it makes a difference?

@gugu
Copy link
Author

gugu commented Apr 6, 2023

It makes a difference according to my production as well, 18.12.0 has 0 core dumps since yesterday on 10 servers. Default ALPN produced core dump around once per hour, empty ALPN - once per few hours.

  1. According to core dump it tries to write in piece of memory not available for the application. When I try to check %rax registry, it says that memory at that address in not avaialble
  2. The issue does not happen in stress test, but on production there are a lot of memory allocations and gc is frequently called
  3. This comment describes workaround of exactly the problem happens on production

I think it can be either GC running in a wrong time or client closing connection immediately after sending client hello. My service is a link shortener and we have tons of short living requests

@bnoordhuis
Copy link
Member

But why specifically do you think retaining a reference to options.ALPNProtocols as ssl._secureContext.alpnBuffer makes a difference? Nothing references it.

@gugu
Copy link
Author

gugu commented Apr 6, 2023

Okay, I can not prove 100% it is this line. Do you have config params you use to build nodejs.org/download/... builds? I'll try to create builder docker container and try to remove this line

@LewisWolfgang
Copy link

@bnoordhuis : My knowledge of openSSL internals is a bit rusty, but I believe the problem here was introduced with bf17f8d.
Previously the TLSWrap pointer was stored with the SSL object and used via (SSL_get_app_data(s)).

In the new code, the TLSWrap pointer is stored as a fixed arg on the SSL_CTX object, which IIUC, is shared by multiple TLSWrap objects. (So, if the last one to update it goes away, bad things happen.)

Like the other commenters, we have been unable to reproduce this under controlled circumstances, but it happened routinely when we tries to deploy node 18.17.1 in a production environment.
A deep dive into the resulting core dump showed that at the time of the crash, the TLSWrap and SSL objects involved looked fine (they were not deleted or anything), but ssl->ctx->ext.alpn_select_cb_arg was pointing to some memory that no longer appeared to be a TLSWrap object.

bnoordhuis added a commit to bnoordhuis/io.js that referenced this issue Sep 13, 2023
Retrieve the TLSWrap from the SSL object, not SSL_CTX.

A SSL_CTX object is the parent of zero or more SSL objects. TLSWrap is a
wrapper around SSL, SecureContext around SSL_CTX.

Node.js normally uses a SecureContext per TLSWrap but it is possible to
use a SecureContext object more than once.

It was therefore possible for an ALPN callback to use the wrong
(possibly already freed) TLSWrap object.

Having said that, while the bug is clear once you see it, I'm not able
to trigger it (and hence no test, not for lack of trying.)

None of the bug reporters were able to reliably reproduce it either so
the stars probably need to align just right in order to hit it.

Fixes: nodejs#47207
@bnoordhuis
Copy link
Member

Right, that sounds plausible. Can you see if #49635 fixes it for you? I tried writing a regression test but I'm not actually able to trigger the bug.

@LewisWolfgang
Copy link

Thank you so much for responding quickly.

As I said, I've been unable to repro this under test conditions. But maybe now that we know what's happening, it may be possible to contrive a situation to reproduce it...
It's a pretty narrow window of time for a second connection to come in, change the callback arg, finish up, get GC'ed, and the memory get reused, all before the client hello is parsed. I'm going to try slamming a server with a combination of connections with very small and very large client hellos to see if the small ones can corrupt things for the large ones....

In terms of a more typical regression test... I think it's possible to drive this machinery from the JS side of things without using any actual sockets/networking... it might be possible to do it that way, but honestly, I'm not familiar enough with the code to know how to do that.

@bnoordhuis
Copy link
Member

My test is basically: start two servers with different ALPN callback but sharing the same SecureContext through server.addContext('*', context), then connect to the first one.

You'd expect to end up in the second one's ALPN callback but that doesn't happen (or vice versa.)

@LewisWolfgang
Copy link

I'm still a bit fuzzy on how all this works, but doesn't the callback arg in the SecureContext get updated whenever a connection comes in?
So in order to get the wrong one, a connection on the other one would need to come in between the socket open and the parsing of client hello?

@bnoordhuis
Copy link
Member

Could be. I tried stress testing but that didn't change anything.

@LewisWolfgang
Copy link

I still haven't been able to reproduce the issue. After reading the code more carefully, I don't believe the SecureContext objects should be reused in our application. (It's a pretty basic websocket server, as far as this stuff goes.)
I am a bit confused about the fact that the context can be changed in the SNI callback. I'm having trouble following the logic there and knowing when it might be used. (We aren't doing anything with virtual servers, but I'm not sure if the callback may be called based on some errant/malicious client input...). But this is probably a red herring...

@LewisWolfgang
Copy link

@bnoordhuis : I can confirm that if I do just a basic server like:

      const httpB = https.createServer(
        {
          key: key,
          cert: cert,
          ciphers: CIPHER_SUITE,
        }
      );
      httpB.listen(48000, () => {
        console.log(`HACK server listening on 48000 with tls`);
      });

And have 2 client connections do the following:
open connection A
wait 1 second
open connection B
send client hello on B
close B
wait 50 seconds
send client hello on A

Run this with gdb attached, and break on node::crypto::(anonymous namespace)::SelectALPNCallback.
They both get the same value in r9 (the callback arg stored in SSL_CTX).

@LewisWolfgang
Copy link

And IIUC, that agrees with the theory... the TLSWrap objects will share an SSL_CTX... it's the one associated with server._sharedCreds in the JS side of things.

@bnoordhuis
Copy link
Member

@LewisWolfgang that looks plausible, but it's probably too non-determinate to turn into a reliable regression test (not that I mind being proven wrong)

@LewisWolfgang
Copy link

Yeah. In an ideal world, objects (TLSWrap) associated with a "child" SSL object shouldn't be making changes to a "parent", shared SSL_CTX object. But as long as all the child objects are making the same change (just a pointer to the class callback), things are fine, and it was working this way for a long time. (Writing a pointer to a specific child instance into the parent has the bad consequences that one might expect...)
I can imagine that in the longer term, some sort of refactoring to avoid the "child writing into parent" pattern could be a good idea. In the short term, I think your PR #49635 is the way to go.

bnoordhuis added a commit to bnoordhuis/io.js that referenced this issue Sep 21, 2023
Retrieve the TLSWrap from the SSL object, not SSL_CTX.

A SSL_CTX object is the parent of zero or more SSL objects. TLSWrap is a
wrapper around SSL, SecureContext around SSL_CTX.

Node.js normally uses a SecureContext per TLSWrap but it is possible to
use a SecureContext object more than once.

It was therefore possible for an ALPN callback to use the wrong
(possibly already freed) TLSWrap object.

Having said that, while the bug is clear once you see it, I'm not able
to trigger it (and hence no test, not for lack of trying.)

None of the bug reporters were able to reliably reproduce it either so
the stars probably need to align just right in order to hit it.

Fixes: nodejs#47207
@sempi
Copy link

sempi commented Sep 30, 2023

We can confirm that this issue still exists on node v18.18.0 and node v20.8.0.

We can also confirm that setting ALPNProtocols: [] does NOT resolve the issue.

About 1 of 2M HTTP requests in a representative production environment crashes the node.js process. Our production workload includes HTTP connection from almost all active browser versions. We do not believe this are malicious attempts as the failure rate follows seasonal patterns so its just a function of how many connections are made overall.

@sempi
Copy link

sempi commented Oct 1, 2023

We can confirm that the fix resolves this crash in node 20.8.0.

#49635

nodejs-github-bot pushed a commit that referenced this issue Oct 4, 2023
Retrieve the TLSWrap from the SSL object, not SSL_CTX.

A SSL_CTX object is the parent of zero or more SSL objects. TLSWrap is a
wrapper around SSL, SecureContext around SSL_CTX.

Node.js normally uses a SecureContext per TLSWrap but it is possible to
use a SecureContext object more than once.

It was therefore possible for an ALPN callback to use the wrong
(possibly already freed) TLSWrap object.

Having said that, while the bug is clear once you see it, I'm not able
to trigger it (and hence no test, not for lack of trying.)

None of the bug reporters were able to reliably reproduce it either so
the stars probably need to align just right in order to hit it.

Fixes: #47207
PR-URL: #49635
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Minwoo Jung <[email protected]>
targos pushed a commit that referenced this issue Oct 28, 2023
Retrieve the TLSWrap from the SSL object, not SSL_CTX.

A SSL_CTX object is the parent of zero or more SSL objects. TLSWrap is a
wrapper around SSL, SecureContext around SSL_CTX.

Node.js normally uses a SecureContext per TLSWrap but it is possible to
use a SecureContext object more than once.

It was therefore possible for an ALPN callback to use the wrong
(possibly already freed) TLSWrap object.

Having said that, while the bug is clear once you see it, I'm not able
to trigger it (and hence no test, not for lack of trying.)

None of the bug reporters were able to reliably reproduce it either so
the stars probably need to align just right in order to hit it.

Fixes: #47207
PR-URL: #49635
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Minwoo Jung <[email protected]>
alexfernandez pushed a commit to alexfernandez/node that referenced this issue Nov 1, 2023
Retrieve the TLSWrap from the SSL object, not SSL_CTX.

A SSL_CTX object is the parent of zero or more SSL objects. TLSWrap is a
wrapper around SSL, SecureContext around SSL_CTX.

Node.js normally uses a SecureContext per TLSWrap but it is possible to
use a SecureContext object more than once.

It was therefore possible for an ALPN callback to use the wrong
(possibly already freed) TLSWrap object.

Having said that, while the bug is clear once you see it, I'm not able
to trigger it (and hence no test, not for lack of trying.)

None of the bug reporters were able to reliably reproduce it either so
the stars probably need to align just right in order to hit it.

Fixes: nodejs#47207
PR-URL: nodejs#49635
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Minwoo Jung <[email protected]>
targos pushed a commit that referenced this issue Nov 11, 2023
Retrieve the TLSWrap from the SSL object, not SSL_CTX.

A SSL_CTX object is the parent of zero or more SSL objects. TLSWrap is a
wrapper around SSL, SecureContext around SSL_CTX.

Node.js normally uses a SecureContext per TLSWrap but it is possible to
use a SecureContext object more than once.

It was therefore possible for an ALPN callback to use the wrong
(possibly already freed) TLSWrap object.

Having said that, while the bug is clear once you see it, I'm not able
to trigger it (and hence no test, not for lack of trying.)

None of the bug reporters were able to reliably reproduce it either so
the stars probably need to align just right in order to hit it.

Fixes: #47207
PR-URL: #49635
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Minwoo Jung <[email protected]>
@bnoordhuis
Copy link
Member

We can confirm that the fix resolves this crash in node 20.8.0.

For posterity: the fix hasn't been released in v20.x yet. v20.10.0 is about to be released and contains the fix.

@nromito
Copy link

nromito commented Nov 16, 2023

@bnoordhuis are there plans to backport this to 18.x?

debadree25 pushed a commit to debadree25/node that referenced this issue Apr 15, 2024
Retrieve the TLSWrap from the SSL object, not SSL_CTX.

A SSL_CTX object is the parent of zero or more SSL objects. TLSWrap is a
wrapper around SSL, SecureContext around SSL_CTX.

Node.js normally uses a SecureContext per TLSWrap but it is possible to
use a SecureContext object more than once.

It was therefore possible for an ALPN callback to use the wrong
(possibly already freed) TLSWrap object.

Having said that, while the bug is clear once you see it, I'm not able
to trigger it (and hence no test, not for lack of trying.)

None of the bug reporters were able to reliably reproduce it either so
the stars probably need to align just right in order to hit it.

Fixes: nodejs#47207
PR-URL: nodejs#49635
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Minwoo Jung <[email protected]>
sercher added a commit to sercher/graaljs that referenced this issue Apr 25, 2024
Retrieve the TLSWrap from the SSL object, not SSL_CTX.

A SSL_CTX object is the parent of zero or more SSL objects. TLSWrap is a
wrapper around SSL, SecureContext around SSL_CTX.

Node.js normally uses a SecureContext per TLSWrap but it is possible to
use a SecureContext object more than once.

It was therefore possible for an ALPN callback to use the wrong
(possibly already freed) TLSWrap object.

Having said that, while the bug is clear once you see it, I'm not able
to trigger it (and hence no test, not for lack of trying.)

None of the bug reporters were able to reliably reproduce it either so
the stars probably need to align just right in order to hit it.

Fixes: nodejs/node#47207
PR-URL: nodejs/node#49635
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Minwoo Jung <[email protected]>
sercher added a commit to sercher/graaljs that referenced this issue Apr 25, 2024
Retrieve the TLSWrap from the SSL object, not SSL_CTX.

A SSL_CTX object is the parent of zero or more SSL objects. TLSWrap is a
wrapper around SSL, SecureContext around SSL_CTX.

Node.js normally uses a SecureContext per TLSWrap but it is possible to
use a SecureContext object more than once.

It was therefore possible for an ALPN callback to use the wrong
(possibly already freed) TLSWrap object.

Having said that, while the bug is clear once you see it, I'm not able
to trigger it (and hence no test, not for lack of trying.)

None of the bug reporters were able to reliably reproduce it either so
the stars probably need to align just right in order to hit it.

Fixes: nodejs/node#47207
PR-URL: nodejs/node#49635
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Minwoo Jung <[email protected]>
@magnetnation
Copy link

We are using v20.11.0. The error is still present.

@ivosh
Copy link

ivosh commented May 17, 2024

I confirm this bug is present with Node 18.18.2 and can be reproduced with 60% probability on the production system where there is quite a heavy load. I cannot reproduce it at all on the pre-production system which is almost identical.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tls Issues and PRs related to the tls subsystem.
Projects
None yet
Development

Successfully merging a pull request may close this issue.