
Socket timeout - race condition #57

Closed
moscohen opened this issue Feb 19, 2018 · 39 comments

Comments

@moscohen

While using the library in my code, I get sporadic connection resets caused by the agent.
The problem is in the onTimeout() function. Say a socket has been idle for 4.99 seconds and freeSocketTimeout is 5 seconds: the agent gives me this specific free socket to use on my next request, but because there is no preemption, the timeout fires while my request is in flight, and the agent destroys the socket while my app is still using it.

I suggest adding an 'inUse' flag to the socket and, on a timeout event, destroying it only when inUse is false.

I'll try to create code that reproduces this 100% and attach it here.

Latest version of the module, Node 8.9 (it happens on previous versions as well).

@tony-gutierrez
Contributor

Did you ever get demonstrable code? Is this a confirmed issue or not?

@tony-gutierrez
Copy link
Contributor

tony-gutierrez commented Aug 8, 2018

@moscohen Is this an issue? How can we set up replication or a test?
@fengmk2 ?

@moscohen
Author

moscohen commented Aug 8, 2018

@tony-gutierrez I haven't got code which reproduces this. It happens sporadically in our production environment, so it's hard to add debugging there.

@tony-gutierrez
Contributor

So I played with the code, trying to generate this race condition, but could not.

Is there any node documentation on the socket "free" event? I do not see it in the docs for net.Socket: https://nodejs.org/api/net.html#net_class_net_socket

@fengmk2
Member

fengmk2 commented Oct 20, 2018

See the new implementation: a reused free socket has its timeout reset to agent.options.timeout before it is reused.
https://github.com/node-modules/agentkeepalive/pull/69/files#diff-7403da6ce2244c65d641fdfcc095be93R146

@ronag

ronag commented Oct 31, 2018

I can confirm that we have an issue that sounds very similar to this. Sporadic connection reset just at the start of requests. Had to stop using agentkeepalive for now. Node 10.11, agentkeepalive 4.0.0.

@admirkadriu

@ronag May I ask, what are you using now?

@ronag

ronag commented Dec 14, 2018

@admirkadriu we are not using keep-alive anymore. The risk outweighs the gains at the moment.

@loris

loris commented Dec 16, 2018

Confirming the issue, Node 11.3.0 & agentkeepalive 4.0.0. It is sporadic, but it has happened at least once a day since we started using agentkeepalive. The trouble is that we have no way to catch the thrown error, and it crashes the whole process. We have to stop using the lib for now :(

@fengmk2
Member

fengmk2 commented Dec 17, 2018

@loris can you show me the error message and stack?

@loris

loris commented Dec 17, 2018

@fengmk2 Sure, here it is:

Process exited with status 1
State changed from up to crashed
events.js:167
throw er; // Unhandled 'error' event
^

Error: Socket timeout
at TLSSocket.onTimeout (/app/node_modules/agentkeepalive/lib/agent.js:304:23)
at TLSSocket.emit (events.js:182:13)
at TLSSocket.EventEmitter.emit (domain.js:441:20)
at TLSSocket.Socket._onTimeout (net.js:452:8)
at listOnTimeout (timers.js:324:15)
at processTimers (timers.js:268:5)
Emitted 'error' event at:
at ClientRequest.origin.emit.args [as emit] (/app/node_modules/@szmarczak/http-timer/source/index.js:36:11)
at TLSSocket.socketErrorListener (_http_client.js:399:9)
at TLSSocket.emit (events.js:187:15)
at TLSSocket.EventEmitter.emit (domain.js:441:20)
at emitErrorNT (internal/streams/destroy.js:82:8)
at process.internalTickCallback (internal/process/next_tick.js:72:19)

@fengmk2
Member

fengmk2 commented Dec 17, 2018

@loris do you handle the error event when using https://github.com/szmarczak/http-timer#usage ?

You need to handle request error event by yourself.

const https = require('https');
const timer = require('@szmarczak/http-timer');

const request = https.get('https://httpbin.org/anything');
const timings = timer(request);

request.on('response', response => {
  response.on('data', () => {}); // Consume the data somehow
  response.on('end', () => {
    console.log(timings);
  });
});

request.on('error', err => {
  // handle request error here
  console.error(err);
});

@loris

loris commented Dec 17, 2018

@fengmk2 Actually I'm not using http-timer directly; it is a dependency of https://github.com/sindresorhus/got. I will check how to improve error event handling, but it is tricky since I was not able to reproduce the error locally.
FYI, the agent options I'm using are (not sure if optimal):

const keepAliveOptions = {
  maxSockets: 100,
  maxFreeSockets: 10,
  timeout: 60000,
  freeSocketTimeout: 30000,
};

@palamccc

I'm also getting the same error when using agentkeepalive with got.
I'm also not able to reproduce it; it happens in production once in a while.
Node 8.15, Got 9, KeepAliveAgent 4.

@fengmk2 Same 'Error: Socket timeout' stack trace as @loris posted above.

@gluwer

gluwer commented Jan 4, 2019

I think we have a similar issue with the unhandled 'error' event, but it only starts to happen when we go from 8.11.3 to anything 10.x (I've also checked 11.6 with the setTimeout fixes, and it happens there too).

I can reproduce the problem using our integration tests (it does not happen every time, but very regularly). Switching to 8.x silences the error. Because the integration tests use several emulators and Redis, I also cannot easily create a small code sample.

I've tried to check what really happens with debug logging for the lib enabled, but I was not able to get anywhere; I can share the logs if needed.

I was trying to make a fix, but without success. I'm able to avoid the Node.js crash from the unhandled exception, but then there are 'socket hang up' errors.

There is an else branch in the onTimeout function with a check for listenerCount === 1. We know from the error that this situation happens for count = 1, but why is agent.removeSocket(socket, options) not called in the other cases? If there is another request socket timeout handler, I understand it should call socket.destroy(), but likely not removeSocket. Because destroy must come first, I've rearranged the code to the version below:

    } else {
      // if there is no request socket timeout handler,
      // the agent needs to handle the socket timeout itself.
      //
      // custom request socket timeout handling logic must follow these rules:
      //  1. Destroy the socket first.
      //  2. Must emit the socket 'agentRemove' event to tell the agent to
      //     remove the socket from the freeSockets list immediately.
      //     Otherwise you may get a 'socket hang up' error when a free
      //     socket is reused and the timeout happens at the same time.
      if (!socket.destroyed) socket.destroy();
      agent.removeSocket(socket, options);
      if (listenerCount === 1) {
        const error = new Error('Socket timeout2');
        error.code = 'ERR_SOCKET_TIMEOUT';
        error.timeout = getSocketTimeout(socket);
        // must manually call socket.end() or socket.destroy() to end the connection.
        // https://nodejs.org/dist/latest-v10.x/docs/api/net.html#net_socket_settimeout_timeout_callback
        socket.emit('error', error);
        debug('%s destroy with timeout error', socket[SOCKET_NAME]);
      }
    }

After this change there are no unhandled errors, but then 'socket hang up' happens for some of the requests.

I've also tried to check which Node 10.x change may be the cause, and I found that socket.setTimeout changed slightly between releases, but that is likely not the cause.

I can help test any other idea anyone has.

@da1nerd

da1nerd commented Mar 13, 2019

I used this module on a project years ago, but with recent Node updates I started getting socket timeout errors. After switching over to the standard http.Agent and https.Agent and manually configuring things, I'm getting the performance I need without any socket timeouts.

const http = require('http');
const https = require('https');

const agentOptions = {
    keepAlive: true,
    timeout: 30000
};
const httpAgent = new http.Agent(agentOptions);
const httpsAgent = new https.Agent(agentOptions);

@tony-gutierrez
Contributor

If I understand correctly, the problem with going native is that Node fires an event on timeout, but doesn't actually close the socket: https://nodejs.org/api/net.html#net_event_timeout

This is why a lib like agentkeepalive makes a difference in environments like Azure which have really low available ports (~140 per app service), correct?

@gluwer

gluwer commented Mar 14, 2019

The original Agent does not have the same behavior as this lib, AFAIK. If there are no waiting requests, the connection is closed even if keepAlive = true. This lib waits the configured time even if there are no new requests, so there is no problem with available ports. Of course, it all depends on the traffic: if there is always some request waiting to be sent, both agents work the same way, maybe except for ECONNRESET errors.

@Freyert

Freyert commented Mar 18, 2019

I see a few potential issues here:

  1. socket.destroy() happens before we remove the socket from the list of available sockets.
  2. It should try to "politely" close the socket with socket.end(), wait for X period, then socket.destroy() to forcefully free the socket.
    • I imagine this is debatable, but these unclean socket shutdowns definitely cause issues for AWS Classic ELB.
  3. Node.js does not document the behavior of Agent.removeSocket.

It also seems that Atomics have existed since Node 8, which could help prevent this race condition.

@ronag

ronag commented Mar 31, 2019

Atomics are only relevant for worker threads... not sure how that is relevant here?

@tony-gutierrez
Contributor

@gluwer This seems to indicate that the native agent does keep unused sockets around??
https://nodejs.org/api/http.html#http_new_agent_options

@gluwer

gluwer commented Apr 18, 2019

@tony-gutierrez Yes. That is also likely the reason everything works fine in most situations; but because of the way free sockets are handled, there are 'socket hang up' errors (the other side does not close the socket normally). The native keep-alive agent does not handle that automatically, but agentkeepalive retries in that case, I suppose.

@chengB12

I can reproduce this issue 100% of the time with the following configuration:

const axios = require('axios');
const HttpAgent = require('agentkeepalive');
const { HttpsAgent } = require('agentkeepalive');

// sleep is our wrapper for setTimeout as a promise
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

const options = {
    maxSockets: 100,
    maxFreeSockets: 10,
    timeout: 60000, // active socket keepalive for 60 seconds
    freeSocketTimeout: 30000, // free socket keepalive for 30 seconds
}
const httpAgent = new HttpAgent(options)
const httpsAgent = new HttpsAgent(options)
const client = axios.create({ httpAgent, httpsAgent })
await client.get('..')
await sleep(5000) // wait exactly 5 seconds
await client.get('..')

Change the sleep time to 4 or 6 seconds and it runs without issue; at exactly 5 seconds, however, it hits a connection reset error 100% of the time.

version 4.1.3
node version 12.18.3

@chengB12

chengB12 commented Nov 2, 2020

(quoting my previous comment above: the 100% repro with an exact 5-second sleep between requests, agentkeepalive 4.1.3, Node 12.18.3)

Update: I tried the same code using Node's http.Agent instead of the agentkeepalive wrapper, and it has exactly the same issue; I tried both Node 12 and Node 14, same result.

So this bug, at least as I observed it, is an underlying bug in Node.js.

@tony-gutierrez
Contributor

tony-gutierrez commented Nov 2, 2020

https://nodejs.org/api/http.html#http_request_reusedsocket

Is your server a node server using the default timeout?

@cschockaert

cschockaert commented Nov 12, 2020

Hello,

we frequently have this error in our production env, making our Node process crash and exit:

events.js:174
      throw er; // Unhandled 'error' event
      ^

Error: Socket timeout
    at Socket.onTimeout (/snapshot/lark-crest/node_modules/agentkeepalive/lib/agent.js:345:23)
    at Socket.emit (events.js:198:13)
    at Socket._onTimeout (net.js:443:8)
    at ontimeout (timers.js:436:11)
    at tryOnTimeout (timers.js:300:5)
    at listOnTimeout (timers.js:263:5)
    at Timer.processTimers (timers.js:223:10)
Emitted 'error' event at:
    at Socket.socketErrorListener (_http_client.js:392:9)
    at Socket.emit (events.js:203:15)
    at emitErrorNT (internal/streams/destroy.js:91:8)
    at process._tickCallback (internal/process/next_tick.js:63:19)

with these settings:

  keepAlive: true
  timeout: 240000
  freeSocketTimeout: 120000
  socketActiveTTL: 300000

node: 12.18.4
agentkeepalive: 4.1.3

We cannot use the default Node.js http agent because we need the socketActiveTTL feature to force recycling of our persistent connections. Note that with the default http agent we do not have these crashes.

Does someone have a workaround for this bug? Are we missing an error listener somewhere?
I can't really figure out what is happening; the socket seems to time out while still in use if we look inside agent.js:345:23 ... but why?
[screenshot: the onTimeout handler in agent.js]

I saw that this part of the code has changed since this issue was opened...

Shouldn't the listener count be tested against the default listener count?

// node <= 10
let defaultTimeoutListenerCount = 1;
const majorVersion = parseInt(process.version.split('.', 1)[0].substring(1));
if (majorVersion >= 11 && majorVersion <= 12) {
  defaultTimeoutListenerCount = 2;
} else if (majorVersion >= 13) {
  defaultTimeoutListenerCount = 3;
}

Is it still linked to that issue?

@fengmk2

Thanks

@cschockaert

cschockaert commented Nov 13, 2020

Hey, got more logs (activated debug mode):

2020-11-13T04:38:25.974Z agentkeepalive sock[7709#product-renderer:80:](requests: 124, finished: 123) timeout after 240000ms, listeners 1, defaultTimeoutListenerCount 1, hasHttpRequest true, HttpRequest timeoutListenerCount 0
2020-11-13T04:38:25.974Z agentkeepalive timeout listeners: onTimeout
2020-11-13T04:38:25.974Z agentkeepalive sock[7709#product-renderer:80:] destroy with timeout error
2020-11-13T04:38:25.974Z agentkeepalive sock[7709#product-renderer:80:](requests: 124, finished: 123) error: Error: Socket timeout, listenerCount: 2
events.js:174
      throw er; // Unhandled 'error' event
      ^

Error: Socket timeout
    at Socket.onTimeout (/snapshot/lark-crest/node_modules/agentkeepalive/lib/agent.js:345:23)
    at Socket.emit (events.js:198:13)
    at Socket._onTimeout (net.js:443:8)
    at ontimeout (timers.js:436:11)
    at tryOnTimeout (timers.js:300:5)
    at listOnTimeout (timers.js:263:5)
    at Timer.processTimers (timers.js:223:10)
Emitted 'error' event at:
    at Socket.socketErrorListener (_http_client.js:392:9)
    at Socket.emit (events.js:203:15)
    at emitErrorNT (internal/streams/destroy.js:91:8)
    at process._tickCallback (internal/process/next_tick.js:63:19)

Seems that a request was in flight when the socket timeout fired.

@ronag

ronag commented Nov 13, 2020

For those of you still struggling with this I can recommend trying https://www.npmjs.com/package/undici which does not suffer from the same problem.

@cschockaert

cschockaert commented Nov 13, 2020

Again, more logs for this socket:

[screenshot: debug logs for the socket]

Seems that the bug appears when onClose is called.

@cschockaert

For those of you still struggling with this I can recommend trying https://www.npmjs.com/package/undici which does not suffer from the same problem.

Thanks, but we can't refactor all our code to use another HTTP client; plus, we are using this keepalive agent with http-proxy and the got client.

@ronag

ronag commented Nov 13, 2020

@cschockaert: This has been open for almost 3 years now. I don't think it's likely to get fixed.

@cschockaert

cschockaert commented Nov 13, 2020

Well, I'll try it, seeing that there is still some activity :)

@SimenB ;)

@cschockaert

cschockaert commented Nov 13, 2020

Whoops, I was on Node 10, not Node 12... retrying with Node 12.19.0 :)

@cschockaert

Update: our pod has been running for 17 hours and still no errors or crashes at all. It seems bumping Node from 10.x to 12.19.0 resolved my issue.

@SimenB
Contributor

SimenB commented Nov 14, 2020

Well, I'll try it, seeing that there is still some activity :)

@SimenB ;)

Not sure why I'm tagged? 😅

@harrylepotter-win

I still appear to be having this in Node 16. My situation is somewhat unique, though: I'm creating a connection over a Unix socket rather than a URL.

@jsongConvoy

(quoting @chengB12's comment above: the 100% repro with an exact 5-second sleep between requests, plus the update that Node's plain http.Agent shows exactly the same issue on Node 12 and 14)

For your situation, regarding the 5 seconds: is it possible it's due to the same issue described in this post, https://medium.com/ssense-tech/reduce-networking-errors-in-nodejs-23b4eb9f2d83, where Node.js's default keepAliveTimeout is 5 seconds?

fengmk2 added a commit that referenced this issue Dec 31, 2021
The default server-side timeout is 5000 milliseconds; to avoid
ECONNRESET exceptions, we set the default value to `4000` milliseconds.

See 'race condition' detail on #57
@fengmk2
Member

fengmk2 commented Dec 31, 2021

Fixed in [email protected]

@fengmk2 fengmk2 closed this as completed Dec 31, 2021
@fengmk2
Member

fengmk2 commented Dec 31, 2021

Thanks all.
