"timeout exceeded when trying to connect" spike after upgrading to 8.2.1 #2262

Open
mriedem opened this issue Jul 2, 2020 · 43 comments

@mriedem
Contributor

mriedem commented Jul 2, 2020

Yesterday we upgraded from pg@8.0.x (with its matching pg-pool 3.x) to pg@8.2.1 (with pg-pool@3.2.1), specifically from our package-lock.json:

    "pg": {
      "version": "8.2.1",
      "resolved": "https://registry.npmjs.org/pg/-/pg-8.2.1.tgz",
      "integrity": "sha512-DKzffhpkWRr9jx7vKxA+ur79KG+SKw+PdjMb1IRhMiKI9zqYUGczwFprqy+5Veh/DCcFs1Y6V8lRLN5I1DlleQ==",
      "requires": {
        "buffer-writer": "2.0.0",
        "packet-reader": "1.0.0",
        "pg-connection-string": "^2.2.3",
        "pg-pool": "^3.2.1",
        "pg-protocol": "^1.2.4",
        "pg-types": "^2.1.0",
        "pgpass": "1.x",
        "semver": "4.3.2"
      },
      "dependencies": {
        "pg-connection-string": {
          "version": "2.2.3",
          "resolved": "https://registry.npmjs.org/pg-connection-string/-/pg-connection-string-2.2.3.tgz",
          "integrity": "sha512-I/KCSQGmOrZx6sMHXkOs2MjddrYcqpza3Dtsy0AjIgBr/bZiPJRK9WhABXN1Uy1UDazRbi9gZEzO2sAhL5EqiQ=="
        },
        "semver": {
          "version": "4.3.2",
          "resolved": "https://registry.npmjs.org/semver/-/semver-4.3.2.tgz",
          "integrity": "sha1-x6BxWKgL7dBSNVt3DYLWZA+AO+c="
        }
      }
    }

We started seeing a spike in timeout exceeded when trying to connect errors with this stacktrace:

Error: timeout exceeded when trying to connect
    at Timeout._onTimeout (/usr/src/app/node_modules/pg-pool/index.js:188:27)
    at listOnTimeout (internal/timers.js:549:17)
    at processTimers (internal/timers.js:492:7)

This is a pretty basic express app with a postgres 12 backend running on node 12.

We report metrics on the connection pool max/total/idle/waiting count values and there is an obvious spike in the wait count from the time the 8.2.1 upgrade was deployed (around 9am CT yesterday) and then the drop when we reverted that change (about 6am CT today):

[graph: connection pool waiting count spiking after the 8.2.1 deploy and dropping after the revert]

That corresponds with our API request/response/error rates (again, just a simple express app over a pg db):

[graph: API request/response/error rates over the same period]

We're not sure how to debug this. These are the relevant values we're using related to the Pool config (see the sketch after the list):

  • connectionTimeoutMillis = 60000
  • idleTimeoutMillis = 60000
  • max = 150
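
A rough sketch of how those values map onto the pg.Pool constructor (connection details and ssl options omitted; this is illustrative, not our exact setup):

const { Pool } = require('pg')

// Illustrative only - host, credentials, and ssl options come from our config.
const pool = new Pool({
  connectionTimeoutMillis: 60000, // wait up to 60s for a client before erroring
  idleTimeoutMillis: 60000,       // close clients that have been idle for 60s
  max: 150,                       // per-process pool size
})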

We have a staging environment where this showed up as well, but we didn't have an alert set up for it (we do now). So if there is something we can do to help debug this and report information back, we can probably do that in our staging environment.

@jsatsvat

jsatsvat commented Jul 8, 2020

We're having the same issue, and it's causing us some headaches. Just looking at 8.2.2 now to see if that fixes it.

@mriedem
Contributor Author

mriedem commented Jul 8, 2020

Looking at the changes in 8.2.2 I'm not sure those are going to resolve my issue since I'm not using bytea type columns:

https://github.com/brianc/node-postgres/compare/pg@8.2.1...pg@8.2.2

@charmander
Collaborator

charmander commented Jul 8, 2020

@mriedem The fix applies to any long message.

@brianc
Owner

brianc commented Jul 8, 2020

hmm sorry this is happening - I'm assuming this isn't happening all the time or in a reproducible way, just sometimes at random? Those issues are hard to debug. I'm gonna check my app logs & see if I can see any of this - I usually run the latest on our systems and they're under pretty heavy load all day.

@mriedem
Contributor Author

mriedem commented Jul 9, 2020

I'm assuming this isn't happening all the time or in a reproducible way, just sometimes at random?

It's definitely reproducible / persistent, to the point that we had to revert the package upgrade in our app because of the regression. I mean, if we had some debug patch or something applied in our staging environment, I'm sure we'd hit it and could get more data, if you have something in mind. I can't say there is a specific query that fails 100% of the time though.

I'm not sure if it helps, but quite a few of the errors happened on queries fetching data from tables with json columns, so I'm not sure if that would be impacted by #2240? But we had other failures on queries that didn't involve json columns, e.g. a simple count query: SELECT COUNT(job_id) FROM jobs WHERE queue_name = $1 AND execution_time IS NULL;. Those might have been on the same pod (db connection pool), though, so if the problem affects all connections in the same pool, that might explain the failures on other basic queries.

@mriedem
Contributor Author

mriedem commented Jul 9, 2020

@jsatsvat did moving to 8.2.2 resolve the issue for you?

@jsatsvat

jsatsvat commented Jul 9, 2020

did moving to 8.2.2 resolve the issue for you?

No, it didn't unfortunately.

there were quite a few of the errors happening on queries to fetch data from tables with json columns, so I'm not sure if that would be impacted by #2240

FYI - We don't have any JSON columns as far as I am aware.

@brianc
Owner

brianc commented Jul 9, 2020

It's definitely reproducible / persistent to the point that we had to revert the package upgrade in our app because of the regression.

yeah I mean is there a way this can be reproduced in an isolated script so I could run it locally and analyze what's going on?

@brianc
Owner

brianc commented Jul 9, 2020

Are y'all using an ORM (knex, typeorm, etc) or using pg directly?

@mriedem
Contributor Author

mriedem commented Jul 9, 2020

Are y'all using an ORM (knex, typeorm, etc) or using pg directly?

We're not using an ORM, just pg directly.

@brianc
Owner

brianc commented Jul 9, 2020

okay - and I'm guessing it happens somewhat "at random", meaning your app has been running for a while & then you get a timeout? Once you get a timeout do you get a bunch, or is it just one every once in a while?

@jsatsvat

jsatsvat commented Jul 9, 2020

We're using slonik.

And yes, at least for us, it happens randomly, all of a sudden, with a bunch of timeouts at once.

@brianc
Owner

brianc commented Jul 9, 2020

dang - do you have numbers around how long it takes before it happens? What's the recovery you do? Restart the app?

@jsatsvat

jsatsvat commented Jul 9, 2020

For now we've drastically increased the pool size (previously set to 100) which temporarily takes some pressure off it; but as a quick recovery measure we do indeed restart the affected service.

As it seems to happen randomly, I'm not sure how helpful numbers will be, as we can't really determine a particular interval or pattern.

@mriedem
Contributor Author

mriedem commented Jul 9, 2020

okay - and I'm guessing it happens somewhat "at random", meaning your app has been running for a while & then you get a timeout? Once you get a timeout do you get a bunch, or is it just one every once in a while?

I would have to dig into this. In our production cluster we have the app running in 20 replicas, and each has a pool configured for 150 connections. Our readiness probe hits an API endpoint that runs a select now(); query to make sure that pod's connection pool is OK; if the pool is full we want that pod to go out of rotation for traffic until it can drain its pending connection requests. The pod will only crash and restart automatically if an uncaught error slips through.
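
(A minimal sketch of that kind of readiness check, assuming Express; the route path, port, and response shape are illustrative, not the actual app code:)

const express = require('express')
const { Pool } = require('pg')

const app = express()
const pool = new Pool({ max: 150, connectionTimeoutMillis: 60000, idleTimeoutMillis: 60000 })

app.get('/readiness', async (req, res) => {
  try {
    // If the pool is saturated this waits and eventually rejects with the
    // "timeout exceeded when trying to connect" error, failing the probe.
    await pool.query('SELECT now();')
    res.status(200).json({ ok: true })
  } catch (err) {
    res.status(503).json({ ok: false, error: err.message })
  }
})

app.listen(3000)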

I think I can say that when it hits we get a bunch of timeouts, which would probably explain why the waiting count per pod (in the graph in the issue description) spikes: presumably something is blocking in the pool, so other requests wait until a timeout occurs.

@brianc
Owner

brianc commented Jul 9, 2020

K, thanks for the info - this is helpful. One other question... are you all using SSL for the connection? I'm wondering if there's a weird edge case in the startup / SSL request packet handling.

@brianc
Owner

brianc commented Jul 9, 2020

Sorry...more questions. I'm trying to repro locally...no luck yet. What cloud are y'all running in? I wonder if it's networking related to k8s or the like.

@mriedem
Contributor Author

mriedem commented Jul 10, 2020

Yeah we're setting ssl.ca in the options that get passed to create the pool.

We're in IBM Cloud, using the Kubernetes service there to run the app and IBM Cloud Databases for PostgreSQL for the PG backend.
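
(For context, the ssl part of the pool options is roughly this shape - illustrative only, the CA path is a placeholder:)

const fs = require('fs')
const { Pool } = require('pg')

const pool = new Pool({
  // plus host/user/password/database settings
  ssl: {
    ca: fs.readFileSync('/path/to/ibm-postgres-ca.crt').toString(),
  },
})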

@brianc
Owner

brianc commented Jul 10, 2020

Alright... I let a script run all night that was connecting like this: local script -> gcloud sql proxy -> cloud sql instance in gcloud. It has a 10 connection pool, and every 1-2 seconds it checks out a client, runs a query, and returns it. The pool is set to an idleTimeoutMillis of 100 - so basically every time the script waits for 1-2 seconds the pool is cleaned out & there's a reconnection attempt. If it ever takes more than 3 seconds for the client to connect, query, and be returned to the pool, the script exits. It ran all night without problems... so maybe thinking the connection logic has a problem is a red herring?

Looking at the code, the timeout exceeded when trying to connect is actually a bad error message - that error happens when the pool is saturated & cannot give you an existing client in time. I'm going to change that error message to be more descriptive. You've logged the waiting count on your pool and it skyrockets. That leads me to believe one of three things is going on...

  1. Somewhere your code is checking out a client and not checking it back in. Eventually this starves the pool of available clients, the waiting count skyrockets, and yeah... bad things. Doing nothing but upgrading from pg@8.0 to pg@8.2.1 is unlikely to have caused this since theoretically your app code didn't change & all errors are handled the same... but worth considering. The best way to diagnose this would be for me to put a pool.emit('release', client) call into the code whenever a client is released back to the pool. Not having this event is just an oversight on my part. The pool already emits a pool.on('acquire', (client) => {}) event. With a matching release event you could do something like this to find orphaned clients:
pool.on('acquire', (client) => {
  client.orphanTimeoutId = setTimeout(() => {
    console.error('client was not returned within 5 seconds', client)
    process.exit(-1)  // do whatever you need to do here to trip alarms
  }, 5000)
})

pool.on('release', (client) => {
  clearTimeout(client.orphanTimeoutId)
})
  2. Something changed between 8.0 and 8.2 which makes a client enter some kind of exceptionally bad run time (we fixed one of those things, the major bytea performance issue from #2240, with #2241 - there might be more?). The best way to debug this would be to log all query execution times, and to use statement_timeout in your connection options set to something reasonably low like 3 seconds (or 2 * what your longest query ever takes), then monitor the logs for errors due to queries timing out. That could isolate it down to a particular query, at which point, given that query & estimated data volumes returned from it, I could likely build a reproduction and fix it on my end.

  3. Something changed between 8.0 and 8.2 which makes a client get "stuck" thinking it's executing a query when it's not. This isn't likely... but would be worth investigating. The easiest way to check this is, hopefully, that you have a single file all queries "funnel through" in your app; otherwise you'll need to monkeypatch Client.prototype.query and wrap it with a timing function (see the sketch after this list). Time the execution and log to the console (or crash the app entirely if it's in staging or something) if the duration between calling client.query and the query finishing exceeds some threshold. This is different from statement_timeout in that maybe all the queries finish fast, but for some reason something gets "stuck" in the parser or client and never reads the readyForQuery message properly.
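
A rough sketch of what the timing monkeypatch in (3) could look like - illustrative only; it times promise-style query calls, and the 5 second threshold is arbitrary:

const { Client } = require('pg')

const originalQuery = Client.prototype.query

Client.prototype.query = function (...args) {
  const started = Date.now()
  const result = originalQuery.apply(this, args)

  // Promise-style calls return a thenable; callback-style calls aren't timed here.
  if (result && typeof result.then === 'function') {
    const done = () => {
      const duration = Date.now() - started
      if (duration > 5000) {
        console.error(`client.query took ${duration}ms to settle`, args[0])
      }
    }
    result.then(done, done)
  }
  return result
}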

Sorry this is hard to track down....I'll go ahead and add the pool.on('release') event and change the error message when the pool is starved for available clients to not say "timeout when trying to connect" because it's actually not trying to connect, it's starved.

What do you think?

@mriedem
Contributor Author

mriedem commented Jul 10, 2020

Thanks for the detailed feedback and ideas of things to try. We can try (1) when that's released. Doing (2) would also be pretty easy but I'd need to tinker a bit with figuring out our slowest query times to set a proper timeout. Thankfully we do wrap query with a timer function to log query execution times so we should have that information in our logs. And because we have that wrapper function on query we should be able to try (3) as well.

I don't think I'll probably be able to get to this today but we should be able to try some of this early next week. Thanks again.

@PsyTae

PsyTae commented Jul 14, 2020

Looks like we are experiencing this same issue. We are using a pool and pool.query() for all our db queries, so if you would like additional info, we can provide it, and we can try anything you suggest as well.

@PsyTae

PsyTae commented Jul 14, 2020

Just updated our code to 8.3.0, and in the meantime raised our max connections from 50 to 150. We are using Amazon AWS.
Over the last 2 hours we have noticed that connections are steadily rising and not being released back to the pool for the next query to use.
[graph: RDS database connections climbing steadily]

@mriedem
Contributor Author

mriedem commented Jul 15, 2020

@brianc re: your earlier comment on things to try - for (3), we have a wrapper on Pool.query that logs how long the query function took; e.g. this is a log statement from a query in our staging environment that took longer than 45 seconds for whatever reason:

Jul 1 12:01:05 api-queue-764c7789b9-7hlpk api-queue 20 {"name":"pg-connect","hostname":"api-queue-764c7789b9-7hlpk","pid":20,"level":20,"time_taken_ms":48866,"args":["SELECT * FROM tasks\n WHERE timestamp_dequeued IS NULL AND queue_name = $1\n ORDER BY timestamp_queued LIMIT 1 FOR UPDATE SKIP LOCKED;",["status-devices-staging"]],"msg":"","time":"2020-07-01T17:01:05.837Z","v":0}

That was from July 1, when we upgraded to pg@8.2.1, and it excludes any logs from the timeout exceeded when trying to connect failures.

I'm not sure about (2) though. Sorry if this is a dumb question but I'm not sure how to do this:

The best way to debug this would be to log all query execution times

I was thinking maybe there would be a property on the pg.Result object to record how long a query took but I don't see anything like that. Am I missing something obvious?

@PsyTae

PsyTae commented Jul 15, 2020

In my connection file:

const mysql = require('mysql'); // mysql
const pg = require('pg'); // postgres
const log = false;
//--- My SQL Connection -----------------------------------------------------------------------------------------------------------------

const mysqlConfig = {
    connectionLimit: 10,
    waitForConnections: true,
    host: process.env.MYSQL_HOST,
    port: process.env.MYSQL_PORT,
    user: process.env.MYSQL_USER,
    password: process.env.MYSQL_PASS,
    database: process.env.MYSQL_DATABASE
};

const postgisConfig = {
    host: process.env.POSTGIS_HOST,
    user: process.env.POSTGIS_USER,
    password: process.env.POSTGIS_PASS,
    database: process.env.POSTGIS_DATABASE,
    port: process.env.POSTGIS_PORT,
    max: 300,
    idleTimeoutMillis: 30 * 1000,
    connectionTimeoutMillis: 2 * 1000
};

const mysqlConn = mysql.createPool(mysqlConfig);
const postGisConn = new pg.Pool(postgisConfig);

postGisConn.myQuery = (text, values, callback) => {
    if (typeof values === 'function') {
        callback = values;
        values = undefined;
    }
    callback = typeof callback === 'function' ? callback : () => {};
    const start = Date.now();
    return postGisConn.query(text, values, (err, res) => {
        const duration = Date.now() - start;
        if (err) return callback(err);
        if (!err && (log || duration > (1 * 1000))) console.log('executed query', { text, duration, rows: res.rowCount });
        callback(err, res);
    });
};

mysqlConn.on('error', err => {
    console.error('Mysql Error:', err);
});

postGisConn.on('error', (err, client) => {
    console.error('PostGis Error', err);
});

module.exports.mysql = mysqlConn;
module.exports.postGis = postGisConn;

I created a wrapper for our pool.query function to calculate durations and then updated every pool.query call in the rest of the project to pool.myQuery. Doing this, we were able to find that we did have an extremely long-running query that was hitting the DB very frequently. It turns out there was a trigger on that table we had to disable to get the long-running query taken care of.

This caused our entire connection pool to be consumed very quickly, with connections not released in time for other queries to use. Hopefully this code helps @mriedem find a way to log query execution times.

@mriedem
Contributor Author

mriedem commented Jul 15, 2020

We have something similar: everything goes through a function that wraps Pool.query and logs how long the original query function took.

I've deployed some code for that today in our staging environment which sets statement_timeout to 45 seconds and also logs a warning (which we can set up an alert on) if a query takes longer than 45 seconds. Checking our staging logs, we have some queries that routinely take about 30 seconds, so with that in place we can try the pg upgrade again in our staging environment and see what it triggers.

@mriedem
Contributor Author

mriedem commented Jul 16, 2020

We're starting to play around with log_min_duration_statement, which defaults to 100ms in the IBM managed Postgres DB instance, and we definitely have queries like this taking over 1 second:

SELECT * FROM tasks WHERE timestamp_dequeued IS NULL AND queue_name = $1 ORDER BY timestamp_queued LIMIT 1 FOR UPDATE SKIP LOCKED;

This is the definition of that table:

                                               Table "public.tasks"
       Column       |            Type             | Collation | Nullable |                Default                 
--------------------+-----------------------------+-----------+----------+----------------------------------------
 task_id            | integer                     |           | not null | nextval('tasks_task_id_seq'::regclass)
 task_content       | json                        |           | not null | 
 queue_name         | text                        |           | not null | 
 timestamp_queued   | timestamp without time zone |           | not null | timezone('utc'::text, now())
 timestamp_dequeued | timestamp without time zone |           |          | 
Indexes:
    "tasks_pkey" PRIMARY KEY, btree (task_id)
    "task_queuename" btree (queue_name)
    "task_ts_dequeued" btree (timestamp_dequeued)
    "tasks_qn_dequeued_queued" btree (queue_name, timestamp_dequeued, timestamp_queued)

In our production DB there are currently 2250262 records in that table and the task_content json blob in some of them can be relatively large (20MB).

@mriedem
Contributor Author

mriedem commented Aug 10, 2020

In our production DB there are currently 2250262 records in that table and the task_content json blob in some of them can be relatively large (20MB).

We're now actively trimming this table and running vacuum analyze in a cron job to keep the DB size down. We still need to retry the pg upgrade in our staging environment and see if we hit the regression (my guess is it might not be as noticeable now that we're routinely cleaning up old records).
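
(A rough sketch of that kind of cleanup job - the retention window is an assumption, and the real job runs from cron rather than an in-process timer:)

const { Pool } = require('pg')

const pool = new Pool()

async function trimTasks () {
  // Retention window is a placeholder value.
  await pool.query(
    "DELETE FROM tasks WHERE timestamp_dequeued IS NOT NULL AND timestamp_dequeued < now() - interval '30 days'"
  )
  // VACUUM can't run inside a transaction, but a plain pool.query statement is fine.
  await pool.query('VACUUM ANALYZE tasks')
}

setInterval(() => trimTasks().catch(err => console.error('trim job failed', err)), 24 * 60 * 60 * 1000)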

@EdenAzulay

Hi,
I'm having the same issue when using version 8.3.3.
Do you know if there are any recent changes in the last releases that might cause the same issue?

@brianc
Owner

brianc commented Oct 4, 2020

@EdenAzulay

I'm having the same issue when using version 8.3.3.
Do you know if there are any recent changes in the last releases that might cause the same issue?

What issue are you having? Can you describe it a bit more? Did you upgrade from an older version that didn't have the issue? I don't know of any changes which may have caused it. There have been several reports on this thread, but nothing conclusive yet.

@PsyTae

Looks like we are experiencing this same issue. We are using a pool and pool.query() for all our db queries, so if you would like additional info, we can provide it, and we can try anything you suggest as well.

Thanks for the info and it sounds like you cleared up your own issue? Did the issue only show up w/ pg@8 and not pg@7?

As an aside: I have a separate pool in our app for long running queries (we have a few that take several seconds) with longer connection timeouts and so on. It can help.

@briangonzalez

@PsyTae @mriedem

Did either of you get to the bottom of this? We're seeing something similar.

@mriedem
Contributor Author

mriedem commented Jan 4, 2021

Did either of you get to the bottom of this? We're seeing something similar.

@briangonzalez Not yet, no. We're still on pg@8.0.x since I've been dragging my feet on doing the upgrade. We have instrumented some things in our code to log warnings on slow queries and such, to hopefully help us identify / alert on any performance regression when we do upgrade.

@nilansht

nilansht commented Jun 3, 2021

@brianc Do we have some kind of easy timeout for queries, so that the pool is available for the waiting clients when queries take a long time to run? For now my connections keep building up more and more, but out of the blue they stop releasing and finally the application hangs. This is happening in production for me.

@mriedem
Contributor Author

mriedem commented Jun 3, 2021

@brianc Do we have some kind of easy timeout for queries, so that the pool is available for the waiting clients when queries take a long time to run? For now my connections keep building up more and more, but out of the blue they stop releasing and finally the application hangs. This is happening in production for me.

Have you tried setting statement_timeout or query_timeout? See the Client constructor config, which you can pass through the Pool constructor; those values are passed through to Postgres.

https://node-postgres.com/api/client
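
For example, a minimal sketch (the timeout values are placeholders):

const { Pool } = require('pg')

// Both options are Client config settings; the Pool passes them along to every
// client it creates. statement_timeout is enforced by Postgres on the server,
// query_timeout is enforced client-side by pg.
const pool = new Pool({
  statement_timeout: 5000,
  query_timeout: 10000,
})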

@heitordobeis

Hi @mriedem,
I have the same problem today. Did you find a solution?

@airhorns

I think this issue is definitely still a thing; we're seeing the occasional timeout exceeded when trying to connect error under load in production when the pool is seemingly not under much stress and still pretty far from its max limit. Here's an example from our tracing system where we count the number of times we connect with pool.connect, and each time we connect we log the current pool stats (idleCount, totalCount, waitingCount). Our max is 75, so the pool should definitely be able to allocate a new connection:

[screenshot: tracing data showing connect counts and pool stats well below the max of 75]

My first suspicion for a strange timeout like this would be Node event loop blocking, where the process is stuck doing something for the 5s, but the event loop utilization never goes above 0.3:

[screenshot: event loop utilization staying below 0.3]

So, I am inclined to think there's some very subtle bug in the pool logic that causes the timeout to be hit under load despite the pool not being full. It works almost all the time for us, but it seems to be load that occasionally triggers it.
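
(A minimal sketch of the kind of instrumentation described above, assuming direct use of pg.Pool; the interval and logging are illustrative:)

const { Pool } = require('pg')

const pool = new Pool({ max: 75, connectionTimeoutMillis: 5000 })

// Sample the counters pg-pool exposes; a growing waitingCount while totalCount
// stays well below max is the symptom described above.
setInterval(() => {
  console.log('pool stats', {
    total: pool.totalCount,
    idle: pool.idleCount,
    waiting: pool.waitingCount,
  })
}, 10000).unref()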

@oriarazi

oriarazi commented Jul 8, 2022

@mriedem any news on that? using [email protected] with [email protected] and facing the same issue under load in production.
@airhorns found any solution?

@karo-dc

karo-dc commented Mar 1, 2023

@mriedem did you solve that? I use "pg": "8.9.0", same issue here.

@sscots

sscots commented May 24, 2023

Having same issue on 8.8


@airhorns

Are y'all able to prove that you aren't being hit by long event loop blocks during connection acquisition?
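
(One possible way to check that, for anyone following along - a sketch using Node's built-in perf_hooks event loop delay monitor; the thresholds are arbitrary:)

const { monitorEventLoopDelay } = require('perf_hooks')

const histogram = monitorEventLoopDelay({ resolution: 20 })
histogram.enable()

setInterval(() => {
  const maxMs = histogram.max / 1e6 // histogram values are in nanoseconds
  if (maxMs > 200) {
    console.warn(`event loop was blocked for up to ${Math.round(maxMs)}ms in the last window`)
  }
  histogram.reset()
}, 10000).unref()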

@robaca

robaca commented Mar 11, 2024

Getting the same error with version 8.11.3, and we are sure there is no event loop lag (we measure it and would log it). Our service is also running on GCP/GKE and is using the @google-cloud/cloud-sql-connector together with the Pool to connect to a Cloud SQL for PostgreSQL instance. It's only affecting a small fraction of queries.

@brianc
Owner

brianc commented Mar 12, 2024

Dang! sorry about this! Do you have any code or way to reproduce this at all? I'd love to help fix this for you.
