
Idle Connections are not being cleaned up under certain conditions. #3020

Closed
robert-pitt-foodhub opened this issue Sep 6, 2024 · 7 comments · Fixed by #3022

@robert-pitt-foodhub
Contributor

robert-pitt-foodhub commented Sep 6, 2024

Setup

  • Environment: AWS Lambda (x86)
  • mysql2 version: v3.11.0 (latest as of this post)
  • MySQL Versions Tested: 8 & 9

Description:

Hey, I have been noticing some unusual connection errors and have traced them back to the connection cleanup process within the Pool class. Below I outline my setup, the steps I took to find the issue, and how to replicate the problem using an existing test case.

We are getting a fairly sizeable percentage of requests failing due to socket timeouts or disconnection issues, which typically surface as PROTOCOL_CONNECTION_LOST, as shown below:

Error: Connection lost: The server closed the connection.
    at PromisePoolConnection.query (/opt/nodejs/node_modules/mysql2/promise.js:93:22)
    at /var/task/index.js:1435:25
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async Runtime.handler (/var/task/index.js:2659:15) {
  code: 'PROTOCOL_CONNECTION_LOST',
  errno: undefined,
  sql: undefined,
  sqlState: undefined,
  sqlMessage: undefined
}
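The error object above carries a `code` property, so user-land code can at least detect this failure mode. As a hedged sketch (my own illustration, not anything from the mysql2 docs), a caller could single out this error and retry once on a fresh connection; `runQuery` is a hypothetical caller-supplied function:

```javascript
// Detect the specific failure mode shown in the log above.
function isConnectionLost(err) {
  return Boolean(err) && err.code === 'PROTOCOL_CONNECTION_LOST';
}

// Illustrative retry wrapper: `runQuery` is a hypothetical function that
// acquires a pool connection and runs a query, returning a promise.
async function queryWithRetry(runQuery, retries = 1) {
  try {
    return await runQuery();
  } catch (err) {
    if (isConnectionLost(err) && retries > 0) {
      // One retry on a lost socket; anything else propagates to the caller.
      return queryWithRetry(runQuery, retries - 1);
    }
    throw err;
  }
}
```

This only papers over the symptom, of course; the rest of this issue is about why stale sockets are being handed out in the first place.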

The configuration we had set up was { maxIdle: 3, connectionLimit: 3, idleTimeout: 30000 }. After digging around, I found the following line of code, which implies that maxIdle MUST be less than connectionLimit; with this configuration the connection cleanup process is never enabled, meaning that once a connection is opened, it will never be removed.

if (this.config.maxIdle < this.config.connectionLimit) {

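To make the implication concrete, here is a toy model of that gate (illustrative names only, not mysql2 internals):

```javascript
// Toy model of the gate above: the idle-reaper timer is only started
// when maxIdle is strictly less than connectionLimit.
function reaperWouldStart(config) {
  return config.maxIdle < config.connectionLimit;
}

// With our production configuration, the cleanup process never starts:
console.log(reaperWouldStart({ maxIdle: 3, connectionLimit: 3, idleTimeout: 30000 })); // false
// Dropping maxIdle below the limit is what enables it:
console.log(reaperWouldStart({ maxIdle: 1, connectionLimit: 3, idleTimeout: 30000 })); // true
```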
So, with some excitement, I updated one of our production lambdas to use the config { maxIdle: 1, connectionLimit: 3, idleTimeout: 30000 } in the hope that activating the connection cleanup process would resolve the connection errors we were seeing. I went to bed hoping to wake up to no red blobs on our dashboard; to my surprise, however, the connection errors continued, following a similar pattern to the previous nights, as if the change had no effect.

Looking at some of the logs, I noticed that the timer object/reference was now present on the pool properties, since I had dropped maxIdle below the connectionLimit.

Before:
*(screenshot)*

After:
*(screenshot)*

So I was certain the timer was now running, yet I was still seeing the same connection errors 😠. I added some more logs to track how long each connection's thread id lasted and what the time delays were between queries.

At this point I suspected the bug might be within the node-mysql2 library itself, so I cloned the repository and set up a dockerised MySQL sandbox to run the integration test suite, which thankfully was a piece of cake to get up and running. I ran the full suite to validate that everything worked as expected, and then took a look at the test/integration/test-pool-release-idle-connection.test.cjs test file, as it was closest to what I was doing.

https://github.com/sidorares/node-mysql2/blob/dee0c0813854658baf5f73055e5de36a1a78e7bf/test/integration/test-pool-release-idle-connection.test.cjs

I set up my MySQL instance, ensured that there were no connections from anything else, and executed the test file as-is, without making any changes.

MySQL before the tests:

*(screenshot)*

Executed the tests via the following command:

yarn poku ./test/integration/test-pool-release-idle-connection.test.cjs

The test PASSES as expected 🎉

*(screenshot)*

Within the MySQL console, I was refreshing the process list every second; I could see 3 connections created, as expected, and those same three connections dropped off after around 7-8 seconds, which again is what I expect given the setTimeout within this test scenario.

After the test completed, MySQL was back down to the initial two base connections:
*(screenshot)*


OK, here's where it seems to go wrong, and I think this could be the cause of the connection socket timeouts I am experiencing: when I change the configuration to { connectionLimit: 3, maxIdle: 2 }, the test just hangs; it never finishes (even after 10-15+ minutes).

*(screenshot)*

When checking MySQL, it seems there are outstanding connections that have not been closed:

*(screenshot)*

I'm pretty certain that the cause of this bug is somewhere near this code; however, I don't have a huge amount of experience with this library, so I'm looking for some support here on replicating the bug and offering a solution that won't cause any regressions.

Thanks


@robert-pitt-foodhub
Contributor Author

robert-pitt-foodhub commented Sep 7, 2024

Hey, I just wanted to follow up with some more findings, as I really want to get this bug pinned down if possible. Last night I was thinking that the test might be hanging because the assertions within the setTimeout were firing after I changed connectionLimit and maxIdle, so this morning I commented those out and ran the test again, and it passed. I tried a few configurations and they all passed, but I still think there are connections that are not being destroyed under certain conditions.

I've just pulled some logs from one of our lambdas that is having the issue, and want to outline a scenario where this occurs:

Setup

  • Environment: AWS Lambda (x86)
  • mysql2 version: v3.11.0 (latest as of this post)
  • MySQL Version: 8
  • connectionLimit = 5
  • maxIdle = 1
  • idleTimeout = 5000

Timeline

I have taken a screenshot of my logs showing a single failure scenario within a single lambda instance. The screenshot shows the timeline of 5 requests over time against the same pool instance; for each request I am logging the delta since the last time the connection was used (not since it was created) and the connection.threadId.

*(screenshot)*

Here's Each Request separated (for better readability):

Request 1

*(screenshot)*

Request 2

*(screenshot)*

Request 3

*(screenshot)*

Request 4

*(screenshot)*

Request 5

*(screenshot)*

Observations

Given that I have set a 5000ms idleTimeout, it should not be possible for a connection to be returned to user-land code once it has surpassed the idleTimeout. However, with this configuration of maxIdle = 1, I think the logic within the connection cleanup process is not cleaning up connections below the maxIdle count.

You can observe that the connection was last used over 344 seconds earlier, which is backed up by the lambda timings column on the left, which also shows the large difference.
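One way to make that guarantee explicit would be to validate idle age at checkout time. A toy sketch of the idea (plain objects and a plain array, my own illustration rather than mysql2 internals):

```javascript
// Toy sketch: never hand back a connection that has been idle longer than
// idleTimeout; discard stale candidates and fall through to the next one.
// `freeConnections` is a plain array of { lastActiveTime } objects,
// oldest connection at the front.
function takeFreshConnection(freeConnections, idleTimeout, now = Date.now()) {
  while (freeConnections.length > 0) {
    const conn = freeConnections.shift(); // oldest idle connection first
    if (now - conn.lastActiveTime <= idleTimeout) {
      return conn; // still within the idle window: safe to reuse
    }
    // Stale: in a real pool this is where connection.destroy() would go.
  }
  return null; // caller should open a brand-new connection instead
}
```

With idleTimeout = 5000 and a connection last used 344 seconds ago, the stale entry would be skipped rather than handed back to user-land code.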

So far, I am looking at this function in Pool.js as the area that needs adjusting; however, I am not sure whether using maxIdle and connectionLimit is enough, or whether we need to introduce a new config option such as maxConnectionIdleTimeout.

*(screenshot)*

Here is my current suggestion:

```js
_removeIdleTimeoutConnections() {
  if (this._removeIdleTimeoutConnectionsTimer) {
    clearTimeout(this._removeIdleTimeoutConnectionsTimer);
  }

  this._removeIdleTimeoutConnectionsTimer = setTimeout(() => {
    try {
      const now = Date.now();
      while (this._freeConnections.length > 0) {
        const connection = this._freeConnections.get(0);
        const idleTime = now - connection.lastActiveTime;

        if (
          this._freeConnections.length > this.config.maxIdle ||
          idleTime > this.config.idleTimeout
        ) {
          // destroy() also removes the connection from the pool's free list
          connection.destroy();
        } else {
          // If we reach a connection that shouldn't be removed, we can stop checking
          break;
        }
      }
    } finally {
      // Re-arm the timer so the cleanup pass keeps running
      this._removeIdleTimeoutConnections();
    }
  }, 1000);
}
```

This approach loops over the free connections, checking whether the pool currently holds more idle connections than maxIdle or whether the connection has exceeded the idle timeout; if either case is true, the connection is destroyed, otherwise the loop stops.
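To sanity-check that pass in isolation, here is a stand-alone simulation (my own toy harness, not mysql2 code): a plain array stands in for the pool's free-connection queue, and destroy() is reduced to removing the entry.

```javascript
// Stand-alone simulation of the suggested cleanup pass. Connections are
// plain { id, lastActiveTime } objects, oldest at the front of the array.
function reapPass(freeConnections, { maxIdle, idleTimeout }, now = Date.now()) {
  const destroyed = [];
  while (freeConnections.length > 0) {
    const connection = freeConnections[0];
    const idleTime = now - connection.lastActiveTime;
    if (freeConnections.length > maxIdle || idleTime > idleTimeout) {
      freeConnections.shift(); // stands in for connection.destroy()
      destroyed.push(connection.id);
    } else {
      break; // the front is the oldest idle connection, so we can stop here
    }
  }
  return destroyed;
}

// Example: maxIdle = 1, idleTimeout = 5000, three idle connections.
const now = 1000000;
const free = [
  { id: 'a', lastActiveTime: now - 344000 }, // long expired
  { id: 'b', lastActiveTime: now - 6000 },   // just expired
  { id: 'c', lastActiveTime: now - 1000 },   // fresh, and within maxIdle
];
console.log(reapPass(free, { maxIdle: 1, idleTimeout: 5000 }, now)); // [ 'a', 'b' ]
```

Both triggers are exercised: 'a' and 'b' are reaped (over the maxIdle count, and 'a' also past the idle timeout), while 'c' survives because the pool is back at maxIdle and it is still fresh.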

Thoughts?

@robert-pitt-foodhub
Contributor Author

Hey @wellwelwel, I'm just wondering what the next steps are for moving from PR to release. I did attempt to deploy from my GitHub clone via npm, but during our CI it failed, saying that the lru-cache module was not found, so I am assuming there may be a build step on your side to produce a working package.

If you're able to provide me with an npm-installable package, I will be happy to deploy it to our staging infra across our 200+ lambdas that use the mysql2 library; we can update all the lambdas relatively easily and monitor for any observations.

Thanks

@wellwelwel
Collaborator

I'm just wondering what the next steps are for moving from PR to release

I'd like to take a closer look before merging (if @sidorares doesn't do it first).

during our CI it failed, saying that the lru-cache module was not found, so I am assuming there may be a build step on your side to produce a working package.

To better understand the lru-cache conflicts, see #2988 and its tracks 🙋🏻‍♂️

@robert-pitt-foodhub
Contributor Author

Thanks @wellwelwel, yes, please take a closer look. In regards to #2988, I have created a branch with that change plus the connection cleanup change, and am currently deploying it to our SIT environment, where I can leave it running for a few hours to see if there are any regressions.

I'll update here soon.

@robert-pitt-foodhub
Contributor Author

robert-pitt-foodhub commented Sep 8, 2024

My updates so far:

I am using AWS CDK to manage my infrastructure, so by making this version change and deploying, I have redeployed hundreds of lambda functions that now use the latest code, the latest code being the fix discussed in this issue plus the patch I applied from #2988.

A full deployment like this took close to 2 hours to fully roll out; below are my observations.

  1. The deployment started around 19:35 UTC. The CloudFormation deployment process updates the lambda functions of over 40+ stacks/services; during this window, CloudFormation deploys new instances/versions of the lambda functions running the new code, and once it sees all changes are deployed, it shuts down the previous deployment of lambdas.

During this time, I would expect the number of database connections to increase as more lambda instances run during the deployment, and then to reduce back to the running average.

Connections During Deployment:
*(screenshot)*

Total Connections:
*(screenshot)*

The above screenshot shows the pattern I expected, and also shows that the number of connections reduced back to the expected amount.

I have also attached a screenshot of the Aborted Clients metric, showing that during the deployment window, the number of aborted clients jumped around as connections were being made from new lambdas and aborted by lambdas being terminated.

Aborted Clients during deployment
*(screenshot)*

So far I am not observing any issues. I selected the sum of errors for 500 lambdas over the past 12 hours, and I see no difference in the volume of errors before vs after the release.

*(screenshot)*

I am still monitoring, as I also have a metric for the Socket Timeout error, but as it's intermittent, I will collect data for several hours and see if we are still experiencing the errors.

Will post back soon

@robert-pitt-foodhub
Contributor Author

Hey @wellwelwel, as far as regressions are concerned, there's nothing on our side that statistically shows any increase in errors or a significant change in the number of connections to our RDS, which tells me that the newly deployed code is working as expected.

Here's a screenshot of the errors we were tracking for the Socket Timeout error; as you can see, after I released the patch to our staging environment around 8:35pm, the Socket Timeout error seems to have stopped.

*(screenshot)*

@robert-pitt-foodhub
Contributor Author

@wellwelwel I think we still need to resolve the issue where the connection cleanup process doesn't start ticking unless maxIdle is lower than connectionLimit. My suggestion is that we simply remove the if condition and start the timeout ticks regardless of the values, so that all connections are cleaned up based on my logic change in the _removeIdleTimeoutConnections function.

if (this.config.maxIdle < this.config.connectionLimit) {

Thoughts?
