Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FSEvents-related SIGBUS crashes on macOS 10.15.7 #37697

Closed
gluxon opened this issue Mar 10, 2021 · 4 comments
Closed

FSEvents-related SIGBUS crashes on macOS 10.15.7 #37697

gluxon opened this issue Mar 10, 2021 · 4 comments
Labels
macos Issues and PRs related to the macOS platform / OSX.

Comments

@gluxon
Copy link
Contributor

gluxon commented Mar 10, 2021

What steps will reproduce the bug?

I'm running the TypeScript compiler in watch mode against a large proprietary monorepo. TypeScript is configured with TSC_NONPOLLING_WATCHER, so it uses FSEvents to watch for source code changes.

Unfortunately I haven't reproduced this in a more isolated setting yet without proprietary source code. I'm currently working on that and will update this issue as soon as one is available. I opted to create this issue early so others have a place to comment if they see similar problems.

How often does it reproduce? Is there a required condition?

I can consistently reproduce this in tsc with the aforementioned proprietary repo:

TSC_NONPOLLING_WATCHER=true node --max-old-space-size=6144 node_modules/typescript/bin/tsc --build --watch --preserveWatchOutput --pretty --verbose --extendedDiagnostics

After TypeScript sets up FSEvent watchers, making a change to any source code file (with touch) triggers:

[3:33:50 AM] Building project 'tsconfig.json'...

[3:34:12 AM] Updating unchanged output timestamps of project 'tsconfig.json'...

[1]    47110 bus error  node --max-old-space-size=6144  --build --watch

I apologize for the non-actionable repro for now. Although this is primarily observable through TypeScript, I do believe this is a Node.js bug. See Additional information below.

What is the expected behavior?

Node.js does not crash.

What do you see instead?

Here's a crash report summary from Console.app.

Process:               node [42748]
Path:                  /usr/local/Cellar/node/15.11.0/bin/node
Identifier:            node
Version:               0
Code Type:             X86-64 (Native)
Parent Process:        ??? [42749]
Responsible:           node [42748]
User ID:               1082347596
System Integrity Protection: enabled

Crashed Thread:        7

Exception Type:        EXC_BAD_ACCESS (SIGBUS)
Exception Codes:       KERN_PROTECTION_FAILURE at 0x00007000058e6ff8
Exception Note:        EXC_CORPSE_NOTIFY

Termination Signal:    Bus error: 10
Termination Reason:    Namespace SIGNAL, Code 0xa
Terminating Process:   exc handler [42748]
Thread 7 Crashed:
0   libsystem_pthread.dylib       	0x00007fff70093bc7 ___chkstk_darwin + 55
1   com.apple.CoreServices.FSEvents	0x00007fff37510dfe _Xcallback_rpc + 233
2   com.apple.CoreServices.FSEvents	0x00007fff37510cf7 FSEventsD2F_server + 55
3   com.apple.CoreServices.FSEvents	0x00007fff3751392b FSEventsClientProcessMessageCallback + 46
4   com.apple.CoreFoundation      	0x00007fff35e22b05 __CFMachPortPerform + 250
5   com.apple.CoreFoundation      	0x00007fff35df4304 __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE1_PERFORM_FUNCTION__ + 41
6   com.apple.CoreFoundation      	0x00007fff35df4250 __CFRunLoopDoSource1 + 541
7   com.apple.CoreFoundation      	0x00007fff35df2d79 __CFRunLoopRun + 2270
8   com.apple.CoreFoundation      	0x00007fff35df1e3e CFRunLoopRunSpecific + 462
9   com.apple.CoreFoundation      	0x00007fff35e7a489 CFRunLoopRun + 40
10  node                          	0x00000001007d17a4 uv__cf_loop_runner + 70
11  libsystem_pthread.dylib       	0x00007fff70098109 _pthread_start + 148
12  libsystem_pthread.dylib       	0x00007fff70093b8b thread_start + 15

Additional information

From the stack trace, it looks like the pthread responsible for handling FSEvents notifications from macOS is overflowing its stack, which triggers kernel-level stack canary checks. (Edit: This was actually a more simple stack bounds check. See comments below.) I noticed that this is similar to an issue previously seen on Node.js 0.10.18 and OS X 10.9.

nodejs/node-v0.x-archive#6251

The fix from back then was to simply increase the stack size.

joyent/libuv#964

   if (attr != NULL)
-    if (pthread_attr_setstacksize(attr, 3 * PTHREAD_STACK_MIN))
+    if (pthread_attr_setstacksize(attr, 4 * PTHREAD_STACK_MIN))
       abort();

The limit hasn't changed since:

if (attr != NULL)
if (pthread_attr_setstacksize(attr, 4 * PTHREAD_STACK_MIN))
abort();

I recompiled Node.js with 5 * PTHREAD_STACK_MIN and still saw crashes. 8 * PTHREAD_STACK_MIN eliminated the problem though. Removing the pthread_attr_setstacksize call entirely (so the default macOS pthread stack size is used) also fix the issue.

I'm hesitant to suggest either of those changes as a real fix since this appears to be dependent on how many fs.watch calls are made.

Other related issues

nodejs/help#3168 looks similar, but was due to an issue with the fsevents npm module. This issue is happening in Node.js core. (Specifically fs.)

I found libuv/libuv#669 interesting as well, which changed libuv worker threads on macOS to use RLIMIT_STACK.

@gluxon
Copy link
Contributor Author

gluxon commented Mar 10, 2021

Additional data point that may be helpful. I'm seeing 24,680 bytes on the stack from thread_start to ___chkstk_darwin.

(lldb) bt
* thread #8, stop reason = signal SIGSTOP
    frame #0: 0x00007fff70093bc7 libsystem_pthread.dylib`___chkstk_darwin + 55
    frame #1: 0x00007fff70093b90 libsystem_pthread.dylib`thread_start + 20
    frame #2: 0x00007fff37510dfe FSEvents`_Xcallback_rpc + 233
    frame #3: 0x00007fff37510cf7 FSEvents`FSEventsD2F_server + 55
    frame #4: 0x00007fff3751392b FSEvents`FSEventsClientProcessMessageCallback + 46
    frame #5: 0x00007fff35e22b05 CoreFoundation`__CFMachPortPerform + 250
    frame #6: 0x00007fff35df4304 CoreFoundation`__CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE1_PERFORM_FUNCTION__ + 41
    frame #7: 0x00007fff35df4250 CoreFoundation`__CFRunLoopDoSource1 + 541
    frame #8: 0x00007fff35df2d79 CoreFoundation`__CFRunLoopRun + 2270
    frame #9: 0x00007fff35df1e3e CoreFoundation`CFRunLoopRunSpecific + 462
    frame #10: 0x00007fff35e7a489 CoreFoundation`CFRunLoopRun + 40
    frame #11: 0x00000001007d17a4 node`uv__cf_loop_runner + 70
    frame #12: 0x00007fff70098109 libsystem_pthread.dylib`_pthread_start + 148
  * frame #13: 0x00007fff70093b8b libsystem_pthread.dylib`thread_start + 15
(lldb) register read rbp
     rbp = 0x000070001211dff0
(lldb) frame select 0
frame #0: 0x00007fff70093bc7 libsystem_pthread.dylib`___chkstk_darwin + 55
libsystem_pthread.dylib`___chkstk_darwin:
->  0x7fff70093bc7 <+55>: testq  %rcx, -0x8(%rcx)
    0x7fff70093bcb <+59>: cmpq   $0x1000, %rax             ; imm = 0x1000
    0x7fff70093bd1 <+65>: jb     0x7fff70093bed            ; <+93>
    0x7fff70093bd3 <+67>: pushq  %rax
(lldb) expr 0x000070001211dff0-$rsp
(unsigned long) $22 = 24680

Most of this is FSEvents FSEventsClientProcessMessageCallback, which has 16,400 bytes.

(lldb) bt
* thread #8, stop reason = signal SIGSTOP
    frame #0: 0x00007fff70093bc7 libsystem_pthread.dylib`___chkstk_darwin + 55
    frame #1: 0x00007fff70093b90 libsystem_pthread.dylib`thread_start + 20
    frame #2: 0x00007fff37510dfe FSEvents`_Xcallback_rpc + 233
    frame #3: 0x00007fff37510cf7 FSEvents`FSEventsD2F_server + 55
  * frame #4: 0x00007fff3751392b FSEvents`FSEventsClientProcessMessageCallback + 46
    frame #5: 0x00007fff35e22b05 CoreFoundation`__CFMachPortPerform + 250
    frame #6: 0x00007fff35df4304 CoreFoundation`__CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE1_PERFORM_FUNCTION__ + 41
    frame #7: 0x00007fff35df4250 CoreFoundation`__CFRunLoopDoSource1 + 541
    frame #8: 0x00007fff35df2d79 CoreFoundation`__CFRunLoopRun + 2270
    frame #9: 0x00007fff35df1e3e CoreFoundation`CFRunLoopRunSpecific + 462
    frame #10: 0x00007fff35e7a489 CoreFoundation`CFRunLoopRun + 40
    frame #11: 0x00000001007d17a4 node`uv__cf_loop_runner + 70
    frame #12: 0x00007fff70098109 libsystem_pthread.dylib`_pthread_start + 148
    frame #13: 0x00007fff70093b8b libsystem_pthread.dylib`thread_start + 15
(lldb) expr $rbp-$rsp
(unsigned long) $19 = 16400

Interestingly 24,680 is still less than 4 * PTHREAD_STACK_MIN (or 32,768). I'm guessing the stack for this pthread starts earlier.

@gluxon
Copy link
Contributor Author

gluxon commented Mar 11, 2021

[10:06:54 PM] Updating unchanged output timestamps of project 'tsconfig.json'...

Process 21622 stopped
* thread #8, stop reason = EXC_BAD_ACCESS (code=2, address=0x70000d71dff8)
    frame #0: 0x00007fff70093bc7 libsystem_pthread.dylib`___chkstk_darwin + 55
libsystem_pthread.dylib`___chkstk_darwin:
->  0x7fff70093bc7 <+55>: testq  %rcx, -0x8(%rcx)
    0x7fff70093bcb <+59>: cmpq   $0x1000, %rax             ; imm = 0x1000
    0x7fff70093bd1 <+65>: jb     0x7fff70093bed            ; <+93>
    0x7fff70093bd3 <+67>: pushq  %rax
Target 0: (node) stopped.
(lldb) p (size_t)pthread_get_stackaddr_np((void*)pthread_self()) - $rsp
(unsigned long) $0 = 24696

Hm. The 25 KB figure I gave in the last comment seems to be correct. p (size_t)pthread_get_stackaddr_np((void*)pthread_self()) - $rsp returned 24,696 bytes, which should be an accurate estimate of the stack usage.

@gluxon
Copy link
Contributor Author

gluxon commented Mar 11, 2021

Just discovered that ___chkstk_darwin is open source and commented. It's not checking for a stack canary like I thought. It instead takes an argument (%rax) to determine if any new allocations would cause the current stack to overflow, before performing the stack allocation that would actually make it overflow.

____chkstk_darwin: // %rax == alloca size
	pushq  %rcx
	leaq   0x10(%rsp), %rcx

	// validate that the frame pointer is on our stack (no alt stack)
	cmpq   %rcx, %gs:_PTHREAD_STRUCT_DIRECT_STACKADDR_OFFSET
	jb     Lprobe
	cmpq   %rcx, %gs:_PTHREAD_STRUCT_DIRECT_STACKBOTTOM_OFFSET
	jae    Lprobe

	// validate alloca size
	subq   %rax, %rcx
	jb     Lcrash
	cmpq   %rcx, %gs:_PTHREAD_STRUCT_DIRECT_STACKBOTTOM_OFFSET
	ja     Lcrash

	popq   %rcx
	retq

https://github.com/apple/darwin-libpthread/blob/c60d249cc84dfd6097a7e71c68a36b47cbe076d1/src/pthread_asm.s#L74-L75

The process above seems to be crashing on an allocation size of 25,008 bytes from the repro above.

24,696 + 25,008 is slightly larger than 6 * PTHREAD_STACK_MIN (49,152), which explains why a multiplier of 8 works but 5 does not.

Going to look into what the 25,008 byte allocation is for. It's probably an array of signals from FSEvents. If so, I'd be curious if the maximum size for this is documented anywhere.

@gluxon
Copy link
Contributor Author

gluxon commented Mar 12, 2021

I think I've gotten to the bottom of this crash. The FSEvents`implementation_callback_rpc function stores a variable length* stack array to perform string prefix removal on each eventPath array element. In my case the char** eventPaths array had 3125 elements in it, which accounted for a giant 25,000 byte allocation.

Psuedo-code for the problematic block looks like this:

void implementation_callback_rpc(char** eventPaths, long nEventPaths) {
  // ...
  char** strippedEventPaths = alloca(nEventPaths)
  for (int i = 0; i < nEventPaths; i++) {
    strippedEventPaths[i] = caller_path(eventPaths[i])
  }
  // ...
}

The chkstk_darwin function turned out to just be an inlined alloca call.

So what does caller_path do? My rudimentary understanding from reading the disassembly is:

char* caller_path(char* eventPath) {
  if (startsWith(eventPath, "/System/Volumes/Data/") {
    // ...
    return eventPath + strlen("/System/Volumes/Data/")
  }
  // ...
  return eventPath;
}

So it basically just strips a constant prefix path. I was curious if there's any significance to /System/Volumes/Data/ and found this StackExchange question: https://apple.stackexchange.com/questions/367158/whats-system-volumes-data

Catalina introduces a new file system layout. Where Mojave and earlier had one filesystem that combines the system and user data, Catalina has a read-only system volume and a read-write user volume interleaved on a folder by folder basis using firmlinks.

Which is fascinating. Multiple coworkers reported that this Node.js crash only started happening after upgrading to 10.15/Catalina. I bet this stack allocation loop isn't present in 10.14 since /System/Volumes/Data/ wasn't a concept yet.

* Technically Variable-Length Arrays (VLAs) and alloca are different, but the difference isn't significant here.

gluxon added a commit to gluxon/node that referenced this issue Mar 12, 2021
Fixes nodejs#37697.

This fixes `SIGBUS` crashes on macOS 10.15 due to FSEvents code
attempting to allocate large arrays on the stack.

The existing size (`4 * PTHREAD_STACK_MIN` or 32KB) causes a stack
overflow when more than ~1000 events are received at once.
gluxon added a commit to gluxon/node that referenced this issue Mar 12, 2021
Fixes nodejs#37697.

This fixes `SIGBUS` crashes on macOS 10.15 due to FSEvents code
attempting to allocate large arrays on the stack.

The existing size (`4 * PTHREAD_STACK_MIN` or 32KB) causes a stack
overflow when more than ~1000 events are received at once.
gluxon added a commit to gluxon/node that referenced this issue Mar 12, 2021
Fixes nodejs#37697.

This fixes `SIGBUS` crashes on macOS 10.15 due to FSEvents code
attempting to allocate large arrays on the stack.

The existing size (`4 * PTHREAD_STACK_MIN` or 32KB) causes a stack
overflow when more than ~1000 events are received at once.
gluxon added a commit to gluxon/node that referenced this issue Mar 12, 2021
This fixes `SIGBUS` crashes on macOS 10.15 due to FSEvents code
attempting to allocate large arrays on the stack.

The existing size (`4 * PTHREAD_STACK_MIN` or 32KB) causes a stack
overflow when more than ~1000 events are received at once.

Fixes: nodejs#37697
Refs: joyent/libuv#964
gluxon added a commit to gluxon/libuv that referenced this issue Mar 12, 2021
This fixes `SIGBUS` crashes on macOS 10.15 due to FSEvents code
attempting to allocate large arrays on the stack.

The existing size (`4 * PTHREAD_STACK_MIN` or 32KB) causes a stack
overflow when more than ~1000 events are received at once. Removing the
manual `pthread_attr_setstacksize` call here defaults to a stack size of
512KB.

Fixes: nodejs/node#37697
Refs: joyent/libuv#964
gluxon added a commit to gluxon/libuv that referenced this issue Mar 12, 2021
This fixes `SIGBUS` crashes on macOS 10.15 due to FSEvents code
attempting to allocate large arrays on the stack.

The existing size (`4 * PTHREAD_STACK_MIN` or 32KB) causes a stack
overflow when more than ~1000 events are received at once. Removing the
manual `pthread_attr_setstacksize` call here defaults to a stack size of
512KB.

Fixes: nodejs/node#37697
Refs: joyent/libuv#964
gluxon added a commit to gluxon/libuv that referenced this issue Mar 12, 2021
This fixes `SIGBUS` crashes on macOS 10.15 due to FSEvents code
attempting to allocate large arrays on the stack.

The existing size (`4 * PTHREAD_STACK_MIN` or 32KB) causes a stack
overflow when more than ~1000 events are received at once. Removing the
manual `pthread_attr_setstacksize` call here defaults to a stack size of
512KB.

Fixes: nodejs/node#37697
Refs: joyent/libuv#964
gluxon added a commit to gluxon/libuv that referenced this issue Mar 12, 2021
This fixes `SIGBUS` crashes on macOS 10.15 due to FSEvents code
attempting to allocate large arrays on the stack.

The existing size (`4 * PTHREAD_STACK_MIN` or 32KB) causes a stack
overflow when more than ~1000 events are received at once. Removing the
manual `pthread_attr_setstacksize` call here defaults to a stack size of
512KB.

Fixes: nodejs/node#37697
Refs: joyent/libuv#964
gluxon added a commit to gluxon/libuv that referenced this issue Mar 12, 2021
This fixes `SIGBUS` crashes on macOS 10.15 due to an FSEvents change
that causes it to allocate large arrays on the stack.

The existing size (`4 * PTHREAD_STACK_MIN` or 32KB) causes a stack
overflow when more than ~1000 events are received at once. Removing the
manual `pthread_attr_setstacksize` call here defaults to a stack size of
512KB.

Fixes: nodejs/node#37697
Refs: joyent/libuv#964
@PoojaDurgad PoojaDurgad added the macos Issues and PRs related to the macOS platform / OSX. label Mar 12, 2021
gluxon added a commit to gluxon/libuv that referenced this issue Mar 13, 2021
This fixes `SIGBUS` crashes on macOS 10.15 due a new change in
`FSEvents.framework` that makes it allocate a large stack array for
event paths. (See the linked nodejs/node issue for details on the
`FSEvents.framework` memory requirements change itself.)

The existing size (`4 * PTHREAD_STACK_MIN` or 32KB) causes a stack
overflow when more than ~1000 events are received at once. Setting this
to `uv__thread_stack_size()` increases it to 8192KB (by default) on
64-bit machines. This value can be configured at runtime on macOS with
`ulimit -s <size-kb>`.

The 32KB limit was originally added to reduce virtual memory
fragmentation on 32-bit systems, which is not a concern on 64-bit
systems.

Fixes: nodejs/node#37697
Refs: joyent/libuv#964
gluxon added a commit to gluxon/libuv that referenced this issue Mar 13, 2021
This fixes `SIGBUS` crashes on macOS 10.15 due a new change in
`FSEvents.framework` that makes it allocate a large stack array for
event paths. (See the linked nodejs/node issue for details on the
`FSEvents.framework` memory requirements change itself.)

The existing size (`4 * PTHREAD_STACK_MIN` or 32KB) causes a stack
overflow when more than ~1000 events are received at once. Setting this
to `uv__thread_stack_size()` increases it to 8192KB (by default) on
64-bit machines. This value can be configured at runtime on macOS with
`ulimit -s <size-kb>`.

The 32KB limit was originally added to reduce virtual memory
fragmentation on 32-bit systems, which is not a concern on 64-bit
systems.

Fixes: nodejs/node#37697
Refs: joyent/libuv#964
JeffroMF pushed a commit to JeffroMF/libuv that referenced this issue May 16, 2022
This fixes `SIGBUS` crashes on macOS 10.15 due a new change in
`FSEvents.framework` that makes it allocate a large stack array for
event paths. (See the linked nodejs/node issue for details on the
`FSEvents.framework` memory requirements change itself.)

The existing size (`4 * PTHREAD_STACK_MIN` or 32KB) causes a stack
overflow when more than ~1000 events are received at once. Setting this
to `uv__thread_stack_size()` increases it to 8192KB (by default) on
64-bit machines. This value can be configured at runtime on macOS with
`ulimit -s <size-kb>`.

The 32KB limit was originally added to reduce virtual memory
fragmentation on 32-bit systems, which is not a concern on 64-bit
systems.

Fixes: nodejs/node#37697
Refs: joyent/libuv#964
PR-URL: libuv#3132
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: Santiago Gimeno <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
macos Issues and PRs related to the macOS platform / OSX.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants