
TLSWrap objects grow with HTTP keep-alive enabled until out of memory error occurs #33468

Closed
mgilbey opened this issue May 19, 2020 · 16 comments
Labels
http Issues or PRs related to the http subsystem. https Issues or PRs related to the https subsystem.

Comments

@mgilbey

mgilbey commented May 19, 2020

  • Version:
$ node -p process.versions
{
  node: '14.2.0',
  v8: '8.1.307.31-node.33',
  uv: '1.37.0',
  zlib: '1.2.11',
  brotli: '1.0.7',
  ares: '1.16.0',
  modules: '83',
  nghttp2: '1.40.0',
  napi: '6',
  llhttp: '2.0.4',
  openssl: '1.1.1g',
  cldr: '36.1',
  icu: '66.1',
  tz: '2019c',
  unicode: '13.0'
}
  • Platform: Darwin <host> 19.4.0 Darwin Kernel Version 19.4.0: Wed Mar 4 22:28:40 PST 2020; root:xnu-6153.101.6~15/RELEASE_X86_64 x86_64 and node docker image node:13.13.0-alpine

  • Subsystem: https

What steps will reproduce the bug?

I have created a script to reproduce at https://github.com/mgilbey/node-tls-wrap

The script runs both a TLS server and an https Agent so that it is self-contained, but only the Agent is needed to reproduce the problem, given an HTTPS server to connect to. The classes are used only to help identify which objects are holding references in the heap snapshots.

The script will open 100 connections, send and receive two HTTP requests on each connection (demonstrating that keep-alive functions as expected) and then let the sockets close when the keep-alive timeout elapses. This repeats five times, leaving 501 TLSWrap objects in memory. The script forces a garbage collection and then collects a heap dump before exiting.

The agent reports the expected number of active / free sockets (never more than 100).

How often does it reproduce? Is there a required condition?

The script will consistently reproduce the issue.

I believe the salient points for reproduction are:

  • https Agent with keepAlive enabled
  • The number of open sockets never exceeds the maxSockets Agent option.
  • Sockets are closed due to the keepAlive timeout specified by the Agent being reached.

What is the expected behavior?

TLSWrap and TLSSocket objects are garbage collected when the socket from the agent is closed due to the keep-alive timeout, and no further keep-alive socket re-use is possible.

What do you see instead?

TLSWrap and TLSSocket objects grow in memory until the old-space heap limit (e.g. 10 MB or 300 MB) is reached. The number of TLSWrap and TLSSocket objects in memory exceeds the maxSockets, maxFreeSockets, and maxCachedSessions Agent options.

An allocation timeline and heap snapshot are also included in the repository.

[screenshot: heap snapshot showing retained TLSWrap objects]

Each TLSWrap object has four retainers, although none appear to be the https Agent.

[screenshot: retainers of a TLSWrap object]

Additional information

To demonstrate the process crashing due to heap limits, run with --max-old-space-size=10 and increase the number of iterations on line 128 to 50. You can increase the heap size and the number of iterations if you are more patient.

The issue does not reproduce with the http Agent.

The cause appears to be very similar to the resolved issue #19859 that affected the http Agent, particularly the observation that the number of open sockets stays below the maxSockets option.

@mgilbey mgilbey changed the title TLSWrap objects grow with HTTP keep-alive enabled until of out memory errors occurs TLSWrap objects grow with HTTP keep-alive enabled until out memory error occurs May 19, 2020
@addaleax
Member

@mgilbey Can you check whether this was introduced between v13.8.0 and v13.9.0? I’m kind of suspecting 9fdb6e6 here, based on the retainer output.

If yes, #31745 might fix this, although it requires some updating as indicated in the PR. I’ll get on that.

@mgilbey
Author

mgilbey commented May 19, 2020

I've checked versions 13.0.0, 13.8.0 and 13.9.0, but all of them fail to close connections after the keep-alive timeout has elapsed, meaning they bypass the test entirely and leak unclosed sockets instead of closed ones.

I went back to 12.6.3, and it behaves as expected: sockets close after the keep-alive timeout elapses and TLSWrap objects are not retained. When it completes, the snapshot references only a single TLSWrap object, which is expected when both the client and server are destroyed.

Conversely, the 13.x versions above retain the TLSWrap objects even after the client, the server and all sockets are closed, which suggests that the retention was introduced before 9fdb6e6.

@lpinca lpinca added http Issues or PRs related to the http subsystem. https Issues or PRs related to the https subsystem. labels May 20, 2020
@omsmith
Contributor

omsmith commented Jun 2, 2020

Ran into what appears to be the same or a similar issue when updating Node.js to 14.3 (from 8.x) in one of our services. In our case we're also using keep-alive, and, exacerbating the issue, we deliberately destroy sockets 10 minutes after first use to ensure we respond to DNS changes for servers that may not necessarily close connections.

We've set up the service to take heap snapshots every hour. Here's a comparison of one of the workers between hour 1 and hour 2 of operation, June 2 00:48Z and 01:48Z respectively. In this case I don't have an hour 3, because a graceful restart was necessary ahead of any OOM issues.

[screenshot: heap snapshot comparison, hour 1 vs hour 2 (2020-06-02T12:46:10+00:00)]

I compiled and deployed a package of v14.x with 6f6bf01 cherry-picked on top. This appears to have resolved the issue, and the service thankfully made it through the night without intervention. The load profiles differ slightly due to the difference in time of day, but the comparison is meaningful nonetheless. Here's a comparison of one of the workers between hour 1 and hour 9 of operation.

[screenshot: heap snapshot comparison, hour 1 vs hour 9 (2020-06-02T12:51:35+00:00)]

Thank you @addaleax, and thanks @mgilbey for reporting the issue - hope it's fixed for you as well.

addaleax added a commit to addaleax/node that referenced this issue Jun 9, 2020
Since f59ec2a, `BaseObject` instances were tracked in heap snapshots
through their associated `CleanupHookCallback`s which were stored on
the `Environment`; however, this is inaccurate, because:

- Edges in heap dumps imply a keeps-alive relationship, but cleanup
  hooks do not keep the `BaseObject`s that they point to alive.
- It loses information about whether `BaseObject` instances are
  GC roots: Even weak `BaseObject`s are now, practically speaking,
  showing up as hanging off a GC root when that isn’t actually the case
  (e.g. in the description of nodejs#33468).

Thus, this is a partial revert of f59ec2a.

Refs: nodejs#33468
Refs: nodejs#27018
addaleax added a commit that referenced this issue Jun 11, 2020
Since f59ec2a, `BaseObject` instances were tracked in heap snapshots
through their associated `CleanupHookCallback`s which were stored on
the `Environment`; however, this is inaccurate, because:

- Edges in heap dumps imply a keeps-alive relationship, but cleanup
  hooks do not keep the `BaseObject`s that they point to alive.
- It loses information about whether `BaseObject` instances are
  GC roots: Even weak `BaseObject`s are now, practically speaking,
  showing up as hanging off a GC root when that isn’t actually the case
  (e.g. in the description of #33468).

Thus, this is a partial revert of f59ec2a.

Refs: #33468
Refs: #27018

PR-URL: #33809
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
@abartomeu

abartomeu commented Jun 15, 2020

Since I upgraded from 12.16.3 to 12.17 I have had a memory leak in production that I am trying to figure out. I took a memory snapshot in the staging environment and I can see 898 TLSWrap objects with the same 4 retainers mentioned by @mgilbey. I couldn't yet validate that this is the leak, because I was only able to take one snapshot, but if it is, it was introduced in 12.17.

@mgilbey mgilbey changed the title TLSWrap objects grow with HTTP keep-alive enabled until out memory error occurs TLSWrap objects grow with HTTP keep-alive enabled until out of memory error occurs Jun 15, 2020
@christhegrand

We also started seeing this after upgrading from 12.16 to 12.17.

@abartomeu

I have done further testing on this issue. I took a memory snapshot after 24 hours and I see 64,000 TLSWrap objects using about 400 MB of memory. In previous experiments I have seen that the number of these TLSWrap objects is always growing. I also verified that even 10 minutes after I stopped all traffic to my service, when HTTP keep-alive should have expired, the TLSWrap objects were still around.
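
A simple way to watch for this kind of monotonic growth is to log `process.memoryUsage()` periodically (an illustrative sketch):

```javascript
// Log heap usage; steadily rising numbers while traffic is constant
// (or stopped entirely, as above) are a strong hint of a leak.
const toMB = (n) => (n / 1024 / 1024).toFixed(1) + ' MB';

function logHeap() {
  const { heapUsed, heapTotal, external } = process.memoryUsage();
  console.log(`heapUsed=${toMB(heapUsed)} heapTotal=${toMB(heapTotal)} external=${toMB(external)}`);
  return heapUsed;
}

const used = logHeap();
// In a long-running service: setInterval(logHeap, 60_000).unref();
```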

Is there any chance we could get this fixed?

codebytere pushed a commit that referenced this issue Jun 18, 2020
@addaleax
Member

@christhegrand @abartomeu I think it’s a good sign that for @omsmith this has been fixed through e8cc269.

I’ve opened a backport PR for v12.x to expedite getting the fix in: #33962

It would be very helpful if you could build Node.js from source for that PR and confirm that it does solve the issue for you as well.

@abartomeu

Hi @addaleax, we have tried the patched version for 12 hours and the memory trend seems similar to that of v12.16.3. That alone would justify promoting this fix. I will leave it running for at least one more day and will post an update.

@abartomeu

abartomeu commented Jun 24, 2020

About 30 hours running, and memory continues to be similar to that of v12.16.3. I haven't noticed any change in other performance parameters (CPU usage, response time, etc.).

@addaleax, could you kindly comment on how long this would usually take to land in an official Node.js release?

@mortenbo

I've been debugging a memory leak in our Node.js app after upgrading from Node 10 to Node 14. I was close to giving up when I stumbled upon this issue. Downgrading from 14.4.0 to 12.6.0 fixed it for us.

@karicem

karicem commented Jun 25, 2020

We recently upgraded from 12.16 to 12.18 and noticed the memory leak.
Will the fix for the memory leak be included in the next release of Node.js 12.x? When will that be available? We would like to stay on the most current version, rather than rolling back, to ensure we have all security fixes.

@addaleax
Member

We recently upgraded from 12.16 to 12.18 and noticed the memory leak.

@karicem Just to make sure, are you confident that it is this one or is there a chance it’s a separate bug?

Will the fix for the memory leak be included in the next release of Node.js 12.x?

See #33962; probably yes.

When will that be available?

According to nodejs/Release#494, that would be July 14th. However, as you can see in the PR, we’re discussing potentially releasing this earlier than that.

@karicem

karicem commented Jun 27, 2020

We haven't rebuilt from source with the patch yet, but we have some data that seems to indicate it will help. We have one app running 12.18 that does not use keep-alive, and it does not show signs of growing memory usage. When we look at the memory usage of other apps running 12.18 that do use keep-alive, it keeps increasing, at different rates depending on the app.

@BethGriggs
Member

Opened a v12.18.2 release proposal (#34077) which contains this patch. Aiming for it to be released next Tuesday, 30th June. Here's a release candidate build that you may be able to test with - https://nodejs.org/download/rc/v12.18.2-rc.0/

@karicem

karicem commented Jun 27, 2020

Thanks, we will have a look.

codebytere pushed a commit that referenced this issue Jun 30, 2020
@mgilbey
Author

mgilbey commented Jul 2, 2020

Confirming that v12.18.2 and v14.5.0 no longer exhibit this behaviour. Thank you for the fix!

I've updated the reproduction repository with new certificates, as the originals had expired, in case people want to test other versions.

@mgilbey mgilbey closed this as completed Jul 7, 2020
codebytere pushed a commit that referenced this issue Jul 10, 2020
codebytere pushed a commit that referenced this issue Jul 14, 2020