http.Agent tcp keepalive not working #41965

JohnTJohnston · 2022-02-14T14:34:08Z

Version

v16.14.0

Platform

Linux ip-192-168-4-97.us-west-2.compute.internal 5.10.96-90.460.amzn2.x86_64 #1 SMP Fri Feb 4 17:12:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Subsystem

http

What steps will reproduce the bug?

Client:

"use strict";

const http = require("http");

(async function main() {

    process.env.SERVER_HOST = process.env.SERVER_HOST || "192.168.4.155";
    process.env.SERVER_PORT = process.env.SERVER_PORT || 3000;

    await issueRequest();

})();

async function issueRequest() {

    const request = http.request({
        host   : process.env.SERVER_HOST,
        port   : process.env.SERVER_PORT,
        path   : "/tcp-keepalive-test",
        method : "GET",
        agent  : new http.Agent({
            keepAlive      : true,
            keepAliveMsecs : 1000 * 10,
        }),
    });

    const start = new Date();

    const result = await new Promise(resolve => {

        request.on("response", response => {

            let responseBody = "";

            response.on("data", data => {
                responseBody += data;
            });

            response.on("end", () => {
                resolve(responseBody);
            });
        });

        request.end();
    });

    const stop = new Date();

    console.log(`start: ${start.toISOString()}, stop: ${stop.toISOString()}, result: ${result}`);
}

Server:

"use strict";

const http = require("http");

(function main() {

    process.env.SERVER_PORT = process.env.SERVER_PORT || 3000;

    const server = http.createServer(async (req, resp) => {
        await new Promise(resolve => setTimeout(resolve, 1000 * 60 * 5));
        resp.statusCode = 200;
        resp.end("success");
    });

    server.listen(process.env.SERVER_PORT);

})();

How often does it reproduce? Is there a required condition?

100% reproducible.

What is the expected behavior?

TCP keepalive packets should be seen in the TCP stream every 10 seconds. This shows all of the packets between client and server and the keepalives are absent:

[ec2-user@ip-192-168-4-97 testv3]$ tshark -r keepalive.pcap ip.addr == 192.168.4.155
 50          9 192.168.4.97 -> 192.168.4.155 TCP 74 57602 > hbci [SYN] Seq=0 Win=62727 Len=0 MSS=8961 SACK_PERM=1 TSval=1629361317 TSecr=0 WS=128
 51          9 192.168.4.155 -> 192.168.4.97 TCP 74 hbci > 57602 [SYN, ACK] Seq=0 Ack=1 Win=26847 Len=0 MSS=8961 SACK_PERM=1 TSval=2679957088 TSecr=1629361317 WS=128
 52          9 192.168.4.97 -> 192.168.4.155 TCP 66 57602 > hbci [ACK] Seq=1 Ack=1 Win=62848 Len=0 TSval=1629361317 TSecr=2679957088
 53          9 192.168.4.97 -> 192.168.4.155 HTTP 152 GET /tcp-keepalive-test HTTP/1.1
 54          9 192.168.4.155 -> 192.168.4.97 TCP 66 hbci > 57602 [ACK] Seq=1 Ack=87 Win=26880 Len=0 TSval=2679957090 TSecr=1629361319
188        309 192.168.4.155 -> 192.168.4.97 HTTP 195 HTTP/1.1 200 OK
191        309 192.168.4.97 -> 192.168.4.155 TCP 66 57602 > hbci [ACK] Seq=87 Ack=130 Win=62720 Len=0 TSval=1629661377 TSecr=2680257146
194        309 192.168.4.97 -> 192.168.4.155 TCP 66 57602 > hbci [FIN, ACK] Seq=87 Ack=130 Win=62720 Len=0 TSval=1629661383 TSecr=2680257146
195        309 192.168.4.155 -> 192.168.4.97 TCP 66 hbci > 57602 [FIN, ACK] Seq=130 Ack=88 Win=26880 Len=0 TSval=2680257153 TSecr=1629661383
196        309 192.168.4.97 -> 192.168.4.155 TCP 66 57602 > hbci [ACK] Seq=88 Ack=131 Win=62720 Len=0 TSval=1629661384 TSecr=2680257153

This shows all captured keepalive packets, and the only keepalives present are associated with the ssh session used to invoke the client:

[ec2-user@ip-192-168-4-97 testv3]$ tshark -r keepalive.pcap tcp.analysis.keep_alive
 60         30 192.168.1.39 -> 192.168.4.97 TCP 66 [TCP Keep-Alive] 42370 > ssh [ACK] Seq=556 Ack=1265 Win=885 Len=0 TSval=170273892 TSecr=3357619098
 66         51 192.168.1.39 -> 192.168.4.97 TCP 66 [TCP Keep-Alive] 42370 > ssh [ACK] Seq=556 Ack=1265 Win=885 Len=0 TSval=170294372 TSecr=3357640683
 72         71 192.168.1.39 -> 192.168.4.97 TCP 66 [TCP Keep-Alive] 42370 > ssh [ACK] Seq=556 Ack=1265 Win=885 Len=0 TSval=170314852 TSecr=3357661163
 78         92 192.168.1.39 -> 192.168.4.97 TCP 66 [TCP Keep-Alive] 42370 > ssh [ACK] Seq=556 Ack=1265 Win=885 Len=0 TSval=170335332 TSecr=3357681643
 86        112 192.168.1.39 -> 192.168.4.97 TCP 66 [TCP Keep-Alive] 42370 > ssh [ACK] Seq=556 Ack=1265 Win=885 Len=0 TSval=170355812 TSecr=3357702123
 91        133 192.168.1.39 -> 192.168.4.97 TCP 66 [TCP Keep-Alive] 42370 > ssh [ACK] Seq=556 Ack=1265 Win=885 Len=0 TSval=170376292 TSecr=3357722603
 99        153 192.168.1.39 -> 192.168.4.97 TCP 66 [TCP Keep-Alive] 42370 > ssh [ACK] Seq=556 Ack=1265 Win=885 Len=0 TSval=170396772 TSecr=3357743083
107        174 192.168.1.39 -> 192.168.4.97 TCP 66 [TCP Keep-Alive] 42370 > ssh [ACK] Seq=556 Ack=1265 Win=885 Len=0 TSval=170417252 TSecr=3357763563
153        194 192.168.1.39 -> 192.168.4.97 TCP 66 [TCP Keep-Alive] 42370 > ssh [ACK] Seq=556 Ack=1265 Win=885 Len=0 TSval=170437732 TSecr=3357784043
157        215 192.168.1.39 -> 192.168.4.97 TCP 66 [TCP Keep-Alive] 42370 > ssh [ACK] Seq=556 Ack=1265 Win=885 Len=0 TSval=170458212 TSecr=3357804523
167        235 192.168.1.39 -> 192.168.4.97 TCP 66 [TCP Keep-Alive] 42370 > ssh [ACK] Seq=556 Ack=1265 Win=885 Len=0 TSval=170478692 TSecr=3357825003
171        256 192.168.1.39 -> 192.168.4.97 TCP 66 [TCP Keep-Alive] 42370 > ssh [ACK] Seq=556 Ack=1265 Win=885 Len=0 TSval=170499172 TSecr=3357845483
178        276 192.168.1.39 -> 192.168.4.97 TCP 66 [TCP Keep-Alive] 42370 > ssh [ACK] Seq=556 Ack=1265 Win=885 Len=0 TSval=170519652 TSecr=3357865963
184        297 192.168.1.39 -> 192.168.4.97 TCP 66 [TCP Keep-Alive] 42370 > ssh [ACK] Seq=556 Ack=1265 Win=885 Len=0 TSval=170540132 TSecr=3357886443

What do you see instead?

No keepalive packets are captured.

Additional information

The motivation for this is to solve a problem we have in AWS where long running transactions complete successfully, yet the result is not returned to the client. Packet captures in this environment show that there are no keepalives.

theanarkh · 2022-06-11T12:39:17Z

Currently, the agent is not setting keepalive on each sockets, it only set keepalive when the socket is free which means there are no request use this socket.

JohnTJohnston · 2022-06-13T20:59:39Z

@theanarkh, thanks for looking. I agree with your code analysis.

I just got back to this late last week.

I have completed some experimentation, so far focusing on Windows out of convenience, with Linux testing to follow.

The following two snippets behave very differently. Again, the problem I am trying to solve is to engage keepalive for long running http requests.

The code immediately below starts TCP keepalive, but only AFTER the first request completes (consistent with theanarkh's comment).

async function issueRequest() {

    const request = http.request({
        host      : process.env.SERVER_HOST,
        port      : process.env.SERVER_PORT,
        path      : "/tcp-keepalive-test",
        method    : "GET",
        keepAlive : true,
        agent     : new http.Agent({
            keepAlive      : true,
            keepAliveMsecs : 1000 * 10,
        }),
    });

    const start = new Date();

    const result = await new Promise(resolve => {

        request.on("response", response => {

            let responseBody = "";

            response.on("data", data => {
                responseBody += data;
            });

            response.on("end", () => {
                resolve(responseBody);
            });

        });

        request.end();
    });

    const stop = new Date();

    console.log(`start: ${start.toISOString()}, stop: ${stop.toISOString()}, result: ${result}`);
}

And the following code behaves as desired, initiating keepalive pings while waiting for the long running request to complete. However, on Windows at least, the initialDelay value passed in to setKeepAlive appears to be ignored, with keepAlive packets always initiating 1 minute after the initial request was sent.

async function issueRequest() {

    const request = http.request({
        host      : process.env.SERVER_HOST,
        port      : process.env.SERVER_PORT,
        path      : "/tcp-keepalive-test",
        method    : "GET",
     });

    const start = new Date();

    const result = await new Promise(resolve => {

        request.on("socket", (sock) => {
            sock.setKeepAlive(true, /* 1000 * 10 */ 1);
        });

        request.on("response", response => {

            let responseBody = "";

            response.on("data", data => {
                responseBody += data;
            });

            response.on("end", () => {
                resolve(responseBody);
            });

        });

        request.end();
    });

    const stop = new Date();

    console.log(`start: ${start.toISOString()}, stop: ${stop.toISOString()}, result: ${result}`);
}

I'm hoping that the Linux implementation honors the initialDelay parameter. This portion of the documentation indicates keepalive won't support a 15 minute request if initialDelay is tied to 1 minute like Windows (https://nodejs.org/api/net.html#socketsetkeepaliveenable-initialdelay):

Enabling the keep-alive functionality will set the following socket options:
SO_KEEPALIVE=1
TCP_KEEPIDLE=initialDelay
TCP_KEEPCNT=10
TCP_KEEPINTVL=1

Next step for me is Linux experimentation. I'm guessing that it is best to set Linux defaults (eg https://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html) as nodejs does not offer socket support for TCP_KEEPCNT or TCP_KEEPINTVL.

JohnTJohnston · 2022-06-15T18:04:51Z

Some more progress. We have deployed v16.14.0 via AWS Elastic Beanstalk, which is the most recent version supported by beanstalk.

Nodejs hard codes the delay in this version as per >https://github.com/nodejs/node/blob/b6b65101873c32655c8d71b4d73363d624f58770/deps/uv/src/unix/stream.c>:

  if (stream->type == UV_TCP) {
    if ((stream->flags & UV_HANDLE_TCP_NODELAY) && uv__tcp_nodelay(fd, 1))
      return UV__ERR(errno);

    /* TODO Use delay the user passed in. */
    if ((stream->flags & UV_HANDLE_TCP_KEEPALIVE) &&
        uv__tcp_keepalive(fd, 1, 60)) {
      return UV__ERR(errno);
    }
  }

So that would explain the hard coded 60 second delay I witnessed on both Linux and Windows. Further the 16.14.0 version of node/deps/uv/src/unix/tcp.c leaves TCP_KEEPINTVL and TCP_KEEPCNT at system defaults. It seems that https://nodejs.org/docs/latest-v16.x/api/net.html#socketsetkeepaliveenable-initialdelay does not document setKeepAlive correctly.

The code I shared works identically for Linux and Windows, and it does support 15 minute http requests because, contrary to the reference I was using, TCP_KEEPCNT is the number of consecutive failures that must accrue before socket teardown and NOT the total number of probes that will be sent.

So I should be good as of now, though still need to deploy to production and verify. That said, there is a bug preventing keepalive from working on the first request when passing keepalive parameters into new http.Agent().

Note: I ran across #38445 which appears to be related.

yunnysunny · 2022-10-04T03:28:34Z

The keepalive parameter of Agent constructor can only guarantee sending TCP keepalive package at fixed interval. This parameter can use use to check whether the server is online. And it works on transport layer.
On the other hand , server use server.keepAliveTimeout to check whether the client is alive. This parameter is diffrent from keepalive of Agent, it work on appliaction layer, in other words, the server will close the socket when client do not send http package after server.keepAliveTimeout milliseconds.
The default value of server.keepAliveTimeout is 5000 ms, so the time between package 188 and 194 would be 5000ms. It will take no effect when you set a low value of keepalive parameter of Agent constructor, node server does not read the keepalive package on transport layer.
You can increase the server.keepAliveTimeout on server side. And set timeout parameter to http.request(options[, callback]), it indicates the idle time on the client side, it should be smaller than server.keepAliveTimeout .

VoltrexKeyva added the http Issues or PRs related to the http subsystem. label Feb 14, 2022

theanarkh mentioned this issue Jun 11, 2022

http: fix http agent keep alive #43380

Merged

4 tasks

Alexander-Ludwig mentioned this issue Mar 18, 2023

Keep Alive option is not working on https.agent #47137

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

http.Agent tcp keepalive not working #41965

http.Agent tcp keepalive not working #41965

JohnTJohnston commented Feb 14, 2022 •

edited by VoltrexKeyva

Loading

theanarkh commented Jun 11, 2022 •

edited

Loading

JohnTJohnston commented Jun 13, 2022

JohnTJohnston commented Jun 15, 2022 •

edited

Loading

yunnysunny commented Oct 4, 2022

http.Agent tcp keepalive not working #41965

http.Agent tcp keepalive not working #41965

Comments

JohnTJohnston commented Feb 14, 2022 • edited by VoltrexKeyva Loading

Version

Platform

Subsystem

What steps will reproduce the bug?

How often does it reproduce? Is there a required condition?

What is the expected behavior?

What do you see instead?

Additional information

theanarkh commented Jun 11, 2022 • edited Loading

JohnTJohnston commented Jun 13, 2022

JohnTJohnston commented Jun 15, 2022 • edited Loading

yunnysunny commented Oct 4, 2022

JohnTJohnston commented Feb 14, 2022 •

edited by VoltrexKeyva

Loading

theanarkh commented Jun 11, 2022 •

edited

Loading

JohnTJohnston commented Jun 15, 2022 •

edited

Loading