
🐛 BUG: wrangler dev --experimental-local slows down over time and leaks memory #2386

Closed
IgorMinar opened this issue Dec 10, 2022 · 27 comments · Fixed by cloudflare/workerd#803 or #3585
Labels: bug (Something that isn't working), miniflare (Relating to Miniflare)

@IgorMinar
Contributor

What version of Wrangler are you using?

2.6.2

What operating system are you using?

Mac

Describe the Bug

If I test my simple worker with wrangler dev --experimental-local and monitor the browser devtools network tab, I see that after the initial (cold) requests, subsequent requests return in 2ms.

If I keep refreshing the browser, the response time slowly starts to creep up: 4ms, 6ms, 8ms. If I keep refreshing the tab a lot, it slows down to 50-70ms and keeps slowing down further with additional HTTP requests.

At the same time I observe that the memory usage increases from 17MB to 34MB, 50MB, 75MB, all the way to over 1GB.

@IgorMinar IgorMinar added the bug Something that isn't working label Dec 10, 2022
@workers-devprod workers-devprod moved this to Untriaged in workers-sdk Dec 10, 2022
@IgorMinar
Contributor Author

The issue is very easy to repro when load testing the local server with hey:

$ hey -n 1000 http://127.0.0.1:8787/

Summary:
  Total:	0.8201 secs
  Slowest:	0.7063 secs
  Fastest:	0.0024 secs
  Average:	0.0334 secs
  Requests/sec:	1219.3387


$ hey -n 1000 http://127.0.0.1:8787/

Summary:
  Total:	5.1450 secs
  Slowest:	4.2399 secs
  Fastest:	0.0153 secs
  Average:	0.2053 secs
  Requests/sec:	194.3653
  
 
$ hey -n 1000 http://127.0.0.1:8787/

Summary:
  Total:	17.4745 secs
  Slowest:	16.9587 secs
  Fastest:	0.0227 secs
  Average:	0.8165 secs
  Requests/sec:	57.2262


$ hey -n 1000 http://127.0.0.1:8787/

Summary:
  Total:	32.9817 secs
  Slowest:	2.6376 secs
  Fastest:	0.0976 secs
  Average:	1.6490 secs
  Requests/sec:	30.3199

Interestingly, I occasionally see workerd crash with the following:

workerd/server/server.c++:967: info: Inspector client attaching [core:user:workers-hot]
workerd/io/worker.c++:999: info: console warning; message = There was an error while reporting an unhandled promise rejection: RangeError: Maximum call stack size exceeded
*** Received signal #10: Bus error: 10
stack: 

@penalosa
Contributor

Could you post the script of your worker, and your wrangler.toml, if possible?

@IgorMinar
Contributor Author

Hey Samuel, apologies for the late response. This was a simple "hello world" worker with no bindings or special config.
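
For context, a minimal module worker of this kind (a sketch only; the exact script and wrangler.toml from this report weren't posted, and the config needs nothing beyond name, main, and compatibility_date) looks like:

// Minimal "hello world" module worker: no bindings, no special config.
export default {
  async fetch(request: Request): Promise<Response> {
    return new Response("Hello World!");
  },
};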

@mrbbot mrbbot moved this from Untriaged to Backlog in workers-sdk Jan 27, 2023
@mrbbot mrbbot added the miniflare Relating to Miniflare label Jan 27, 2023
@mrbbot
Contributor

mrbbot commented Feb 21, 2023

Hey @IgorMinar! I've just tried to reproduce this with the latest wrangler@beta, and I'm not seeing the response slowdown over time anymore, even with large -n. Would you be able to try out npx wrangler@beta dev --experimental-local? Memory usage does creep up (not as high as you were seeing), but this might just be slow GC, as it usually comes down after a while.

That being said, I do still get *** Received signal #10: Bus error: 10 stack: (indicating memory access out of process address space) if I try -n 10000 a few times, so there's still something strange happening here. Notably, this also happens when you run workerd serve samples/helloworld/config.capnp in the workerd repo, so I don't think this is a Wrangler/Miniflare problem.

@IgorMinar
Contributor Author

IgorMinar commented Feb 23, 2023 via email

@lrapoport-cf
Contributor

@IgorMinar were you able to confirm whether this is still a problem?

@IgorMinar
Contributor Author

oops.. I dropped the ball on this one. I'll retest today. My apologies.

@lrapoport-cf
Contributor

lrapoport-cf commented Apr 7, 2023

hey @IgorMinar! :) just bumping this -- did you have a chance to retest?

@IgorMinar
Contributor Author

I was not able to verify/test because I got blocked on #3007, which suggests that --experimental-local is currently broken in the 2.14.0 release for some, if not all, users.

@re-thc

re-thc commented May 3, 2023

Getting a similar issue on the latest wrangler. I don't know if it "slows down", but memory usage quickly climbs from 10GB to 20GB to 30GB and doesn't come back down unless the process is killed.

@avaer

avaer commented Jun 6, 2023

I can report this is still the case in wrangler 3.0.1 on macOS.

Note that this does not require the wrangler server to be used at all. It will leak just by starting it.

Command:

wrangler dev --port 2222

I also get this error after some time; I'm not sure if it's related to the leak.

Failed to get worker definitions TypeError: fetch failed

      at fetch (/Users/a/webaverse/node_modules/wrangler/wrangler-dist/cli.js:16584:17)
      at async getRegisteredWorkers
  (/Users/a/webaverse/node_modules/wrangler/wrangler-dist/cli.js:124280:22)
      at async getBoundRegisteredWorkers
  (/Users/a/webaverse/node_modules/wrangler/wrangler-dist/cli.js:124299:29) {
    cause: HeadersTimeoutError: Headers Timeout Error
        at Timeout.onParserTimeout [as callback]
  (/Users/a/webaverse/node_modules/wrangler/wrangler-dist/cli.js:8431:33)
        at Timeout.onTimeout [as _onTimeout]
  (/Users/a/webaverse/node_modules/wrangler/wrangler-dist/cli.js:681:17)
        at listOnTimeout (node:internal/timers:573:17)
        at process.processTimers (node:internal/timers:514:7) {
      code: 'UND_ERR_HEADERS_TIMEOUT'
    }
  }

@csabatuz-chess

csabatuz-chess commented Jun 7, 2023

I'd like to double down on this issue, as it's currently blocking our effort to use wrangler dev in a local development / integration testing environment.

3.1.0 on macOS

For measurement I used

hey -c 1 -n 10000 http://127.0.0.1:8787/

Bare minimum worker, no deps, no bindings

export default {
  async fetch(request: Request, env: any, executionCtx: ExecutionContext) {
    return new Response("Hello", {status: 200})
  },
}

Observed:
A progressively worsening slowdown and, in the end, a crash.

[mf:inf] GET / 200 OK (18ms)
[mf:inf] GET / 200 OK (20ms)
[mf:inf] GET / 200 OK (20ms)
[mf:inf] GET / 200 OK (20ms)
*** Received signal #10: Bus error: 10
stack: 
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ [b] open a browser, [d] open Devtools, [l] turn off local mode, [c] clear console, [x] to exit                                                                                                               │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

I did a control measurement comparing the Service Worker format with the Module Worker format:

addEventListener('fetch', (event) => {
  event.respondWith(new Response("Hello", {status: 200}))
})

The worker performs a lot better with wrangler dev when in the Service Worker format, but crashes all the same with the same error, just later (after many more requests).

I'll perform an experiment with emulated amd64 arch a bit later

@lrapoport-cf lrapoport-cf added the v3 label Jun 9, 2023
@Cragady

Cragady commented Jun 10, 2023

Hopefully this will help in some small way, but I've had issues with the slowdown and the bus 10 errors as well.

Started on 3.0.1, and now on 3.1.0, on Mac.

Both gave me a Bus error: 10 message along with massive slow-downs out of the gate.

I ran softwareupdate --install "Command Line Tools for Xcode-14.3", and that seemed to get rid of the bus 10 error. The speed was still abysmal, taking a couple of minutes to load vs. the fraction of a second it takes on prod.

Command used:

wrangler dev --log-level debug --ip 0.0.0.0 --port 3000 --env development

I did, however, get a different error from the bus 10 error after the Xcode Command Line Tools update:

workerd/jsg/util.c++:278: info: exception = kj/async-io-unix.c++:778: disconnected: ::writev(fd, iov.begin(), iov.size()): Broken pipe; iovTotal = 1108; iov.size() = 1
 stack: 10210f62f 1021100d3 10124d3cc 1020e7df8 1020ec88c 1020e37a8 1020f2c54 1009f95e8 1020f16d0 100df1130
 workerd/io/worker.c++:1703: info: uncaught exception; source = Uncaught (in promise); exception = Error: Network connection lost.
 workerd/io/worker.c++:1703: info: uncaught exception; source = Uncaught (in response); exception = Error: Network connection lost.
 workerd/jsg/util.c++:278: info: exception = kj/async-io-unix.c++:778: disconnected: ::writev(fd, iov.begin(), iov.size()): Broken pipe; iovTotal = 1108; iov.size() = 1
 stack: 10210f62f 1021100d3 10124d3cc 1020e7df8 1020e802f 1020d349c 1020e7df8 1020e802f 1020d349c 1020e7df8 1020e802f 1020d349c 1020f9314 1020f16d0 100f77494 1009f95e8 100e9e7c0 100dd8868 100eaf900 100d40c08 100df01c8 100b074b4
 workerd/io/worker.c++:1703: info: uncaught exception; source = Uncaught (in promise); exception = Error: Network connection lost.
 workerd/io/worker.c++:1703: info: uncaught exception; source = Uncaught (async); exception = Error: Network connection lost.
 ✘ [ERROR] Uncaught (in promise) Error: Network connection lost.
 
 
 ✘ [ERROR] Uncaught (async) Error: Network connection lost.

For additional context, the application runs as expected on the miniflare 2.X versions.

Command used with miniflare:

miniflare --live-reload --debug --port 3000 --env .env --modules <script path>

I hope this isn't just adding static

@csabatuz-chess

csabatuz-chess commented Jun 12, 2023

I can confirm what @Cragady reported:

  • The Bus 10 crash is apparently specific to macOS (it does not occur in e.g. a linux-arm64 Docker container)
  • The slowdown persists across multiple CPU architectures (I tried Linux on an Intel CPU, macOS on Apple M1 (arm64), and some forms of virtualization), using the simple Module Worker format helloworld_esm sample from the workerd GitHub repo
  • When running the same worker in the latest Miniflare 2.x.x, the issue does not occur.

However:

  • I was unable to reproduce the slowdown by using workerd directly (I used the same workerd binary from node_modules that wrangler dev was running, and tried to match the command line as closely as I could, enabling the inspector and so on). The bus 10 error still occurred on macOS, but that was indeed just noise; I did not see it when switching to Docker.

IMO the main issue here is the slowdown and the outrageous memory pressure accompanying it.

@ohodson

ohodson commented Jun 14, 2023

We were looking into the signal #10: Bus error in workerd yesterday.

This appears to be limited to workerd on OS X and only for optimized builds (bazel build -c opt). It hasn't failed this way with fastbuild or dbg builds.

xcodebuild -version reports Xcode 14.3.1 Build version 14E300c.

The good news is that we have a stack trace and can pinpoint where the failure occurs; the bad news is that it's an optimized build, so there are only partial symbols along with registers and memory. We'll figure something out :-)

For the slowdown and memory pressure issue, you can capture heap snapshots with devtools by enabling the inspector in workerd, e.g. workerd serve -i0.0.0.0. If you open https://devtools.devprod.cloudflare.dev/js_app?ws=localhost:9229/main (or chrome://inspect) in Chrome on the dev machine, you can go to the Memory tab and take a snapshot. You can keep capturing heap snapshots between runs of hey to see what is growing.

The heap looks stable across batches of 10000 requests from hey using the workerd helloworld_esm sample on OS X M1 and Linux X64 (opt and dbg), with both the last workerd release and tip-of-tree. The overall process size does not grow without bound. This check is probably too brief, and the problem might only be exposed by a config and scripts that exercise more functionality.

ohodson added a commit to cloudflare/workerd that referenced this issue Jun 14, 2023
The root cause of bus errors from workerd opt builds is being
investigated. In the interim, use fastbuild instead.

The impact of this workaround is that the uncompressed macOS arm64
binary goes from 61MB to 155MB. The compressed binary increases from
21MB to 35MB.

Bug: cloudflare/workers-sdk#2386
Bug: CUSTESC-29094
ohodson added a commit to cloudflare/workerd that referenced this issue Jun 14, 2023
Allows running optimized builds in the vscode debugger (gdb, lldb).

Bug: cloudflare/workers-sdk#2386
ohodson added a commit to cloudflare/workerd that referenced this issue Jun 22, 2023
This reverts commit d0ac3f3.

The preceding commit adding --single-threaded-gc addresses the issue.

Bug: CUSTESC-29094
Bug: cloudflare/workers-sdk#2386
ohodson added a commit to cloudflare/workerd that referenced this issue Jun 22, 2023
Consecutive requests could lead to calling into the isolate whilst
code pages are being collected. This problem has only been observed
on macOS arm64.

Bug: CUSTESC-29094
Bug: cloudflare/workers-sdk#2386
Review: #803
@ohodson ohodson removed their assignment Jun 23, 2023
@ohodson

ohodson commented Jun 23, 2023

Hi @lrapoport-cf, the fix to workerd is in and will be available in the next workers-sdk release. We can close this ticket when that happens. (I unassigned the ticket while trying to re-assign it, but I don't appear to have permission to do that.)

@lrapoport-cf
Contributor

Thanks @ohodson! I've linked your fix to the ticket and reassigned :)

@daniloteodoro

Looking at a merged PR in the comments above, it seems like there is a fix 🎉 Do we have an expected date for this to be released?

@mrbbot
Contributor

mrbbot commented Jun 30, 2023

Hey everyone! 👋 We've just released a new beta version of Wrangler that includes a fix in workerd for this issue. Try it out with npx wrangler@beta dev. Please let us know if you encounter any issues. 🙂

@mrbbot mrbbot closed this as completed Jun 30, 2023
@github-project-automation github-project-automation bot moved this from In Progress to Done in workers-sdk Jun 30, 2023
@BillBrower-Shopify

@mrbbot wrangler@beta seems to fix the signal #10: Bus error, but we still get 500s due to the maximum call stack size being exceeded after about 2.5 minutes of constant requests in our reproduction repo. The response times also climb from ~50ms at the start of the stress test to ~350ms by the time the 500s appear.


@mrbbot
Contributor

mrbbot commented Jun 30, 2023

@BillBrower-Shopify thanks for letting us know. Will see if we can translate this to a workerd-only reproduction. 👍

@mrbbot
Contributor

mrbbot commented Jun 30, 2023

@BillBrower-Shopify I've been able to reproduce this just using workerd and handed it off to the runtime team. I'm OOO all of next week so won't be able to respond for a bit.

@BillBrower-Shopify

Thanks @mrbbot! Is there a separate GitHub issue we can use to track progress on that work?

@o-az

o-az commented Jul 2, 2023

I can confirm wrangler@beta seems to fix it. Before the beta I was using ^3.1.1 and kept getting the same error.

@ohodson

ohodson commented Jul 3, 2023

@BillBrower-Shopify I've filed cloudflare/workerd#831 against workerd to investigate the 500 error that you are seeing. Thanks.

mrbbot added a commit that referenced this issue Jul 11, 2023
Previously, the middleware loader for module workers would register
all middleware on each request. These registrations would be in
addition to previous requests' registrations, meaning the middleware
chain grew linearly with each request. This eventually led to a
stack overflow, as the middleware dispatcher is recursive. It also
significantly slowed down requests over time. Editing code and
reloading the dev server would reset the middleware chain.

This change ensures middleware is only registered on the first
request.

Closes #2386
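
To illustrate the mechanism this commit describes, here is a rough TypeScript sketch (hypothetical names only, not the actual wrangler middleware loader) of how per-request registration makes a recursive middleware chain grow until the stack overflows, and how registering only on the first request avoids it:

// Hypothetical sketch of the bug pattern; names are illustrative only.
type Middleware = (
  request: Request,
  next: (request: Request) => Promise<Response>
) => Promise<Response>;

const chain: Middleware[] = [];

function registerMiddleware(middleware: Middleware[]) {
  // Buggy behaviour: called on every request, so the same middleware
  // is appended again and again and the chain grows linearly.
  chain.push(...middleware);
}

// The dispatcher is recursive, so every extra copy of a middleware adds a
// stack frame; a long enough chain throws "Maximum call stack size exceeded".
function dispatch(
  request: Request,
  index: number,
  finalHandler: (request: Request) => Promise<Response>
): Promise<Response> {
  if (index === chain.length) return finalHandler(request);
  return chain[index](request, (req) => dispatch(req, index + 1, finalHandler));
}

// The fix, in spirit: register middleware only on the first request.
let registered = false;

export default {
  async fetch(request: Request): Promise<Response> {
    if (!registered) {
      registerMiddleware([/* middleware injected by the dev server */]);
      registered = true;
    }
    return dispatch(request, 0, async () => new Response("Hello", { status: 200 }));
  },
};

Before the fix, every call to fetch appended another copy of each middleware, which matches both the steady slowdown and the eventual "Maximum call stack size exceeded" errors reported above.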
@mrbbot
Contributor

mrbbot commented Jul 13, 2023

@BillBrower-Shopify the maximum call stack exceeded error and slowdowns should've been resolved with #3585, and released in [email protected]. That seems to have fixed the issue with your reproduction, but let us know if you have any other problems. 👍
