"timeout exceeded when trying to connect" spike after upgrading to 8.2.1 #2262

Open
mriedem opened this issue Jul 2, 2020 · 43 comments

@mriedem
Contributor

mriedem commented Jul 2, 2020

Yesterday we upgraded from pg@8.0.x (with its matching pg-pool 3.x) to pg@8.2.1 (with pg-pool@3.2.1), specifically from our package-lock.json:

    "pg": {
      "version": "8.2.1",
      "resolved": "https://registry.npmjs.org/pg/-/pg-8.2.1.tgz",
      "integrity": "sha512-DKzffhpkWRr9jx7vKxA+ur79KG+SKw+PdjMb1IRhMiKI9zqYUGczwFprqy+5Veh/DCcFs1Y6V8lRLN5I1DlleQ==",
      "requires": {
        "buffer-writer": "2.0.0",
        "packet-reader": "1.0.0",
        "pg-connection-string": "^2.2.3",
        "pg-pool": "^3.2.1",
        "pg-protocol": "^1.2.4",
        "pg-types": "^2.1.0",
        "pgpass": "1.x",
        "semver": "4.3.2"
      },
      "dependencies": {
        "pg-connection-string": {
          "version": "2.2.3",
          "resolved": "https://registry.npmjs.org/pg-connection-string/-/pg-connection-string-2.2.3.tgz",
          "integrity": "sha512-I/KCSQGmOrZx6sMHXkOs2MjddrYcqpza3Dtsy0AjIgBr/bZiPJRK9WhABXN1Uy1UDazRbi9gZEzO2sAhL5EqiQ=="
        },
        "semver": {
          "version": "4.3.2",
          "resolved": "https://registry.npmjs.org/semver/-/semver-4.3.2.tgz",
          "integrity": "sha1-x6BxWKgL7dBSNVt3DYLWZA+AO+c="
        }
      }
    }

We started seeing a spike in timeout exceeded when trying to connect errors with this stacktrace:

Error: timeout exceeded when trying to connect
    at Timeout._onTimeout (/usr/src/app/node_modules/pg-pool/index.js:188:27)
    at listOnTimeout (internal/timers.js:549:17)
    at processTimers (internal/timers.js:492:7)

This is a pretty basic express app with a postgres 12 backend running on node 12.

We report metrics on the connection pool max/total/idle/waiting count values and there is an obvious spike in the wait count from the time the 8.2.1 upgrade was deployed (around 9am CT yesterday) and then the drop when we reverted that change (about 6am CT today):

[graph: connection pool waiting count spiking after the 8.2.1 deploy and dropping after the revert]

That corresponds with our API request/response/error rates (again, just a simple express app over a pg db):

[graph: API request/response/error rates over the same period]

We're not sure how to debug this. These are the relevant values we're using related to the Pool config (see the sketch after the list):

  • connectionTimeoutMillis = 60000
  • idleTimeoutMillis = 60000
  • max = 150
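
A rough sketch of how those values map onto the pg.Pool constructor (connection details and ssl options omitted; this is illustrative, not our exact setup):

const { Pool } = require('pg')

// Illustrative only - host, credentials, and ssl options come from our config.
const pool = new Pool({
  connectionTimeoutMillis: 60000, // wait up to 60s for a client before erroring
  idleTimeoutMillis: 60000,       // close clients that have been idle for 60s
  max: 150,                       // per-process pool size
})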

We have a staging environment where this showed up as well, but we didn't have an alert set up for it (we do now). So if there is something we can do to help debug this and report information back, we can probably do that in our staging environment.

@jsatsvat

jsatsvat commented Jul 8, 2020

We're having the same issue, and it's causing us some headaches. Just looking at 8.2.2 now to see if that fixes it.

@mriedem
Contributor Author

mriedem commented Jul 8, 2020

Looking at the changes in 8.2.2 I'm not sure those are going to resolve my issue since I'm not using bytea type columns:

https://github.com/brianc/node-postgres/compare/pg@8.2.1...pg@8.2.2

@charmander
Collaborator

charmander commented Jul 8, 2020

@mriedem The fix applies to any long message.

@brianc
Owner

brianc commented Jul 8, 2020

hmm sorry this is happening - I'm assuming this isn't happening all the time or in a reproducible way, just sometimes at random? Those issues are hard to debug. I'm gonna check my app logs & see if I can see any of this - I usually run the latest on our systems and they're under pretty heavy load all day.

@mriedem
Contributor Author

mriedem commented Jul 9, 2020

I'm assuming this isn't happening all the time or in a reproducible way, just sometimes at random?

It's definitely reproducible / persistent, to the point that we had to revert the package upgrade in our app because of the regression. I mean, if we had some debug patch or something applied in our staging environment, I'm sure we'd hit it and could get more data, if you have something in mind. I can't say there is a specific query that fails 100% of the time though.

I'm not sure if it helps, but quite a few of the errors happened on queries fetching data from tables with json columns, so I'm not sure if that would be impacted by #2240? But we had other failures on queries that didn't involve json columns, e.g. a simple count query: SELECT COUNT(job_id) FROM jobs WHERE queue_name = $1 AND execution_time IS NULL;. Those might have been on the same pod (db connection pool), though, so if the problem affects all connections in the same pool, that might explain the failures on other basic queries.

@mriedem
Contributor Author

mriedem commented Jul 9, 2020

@jsatsvat did moving to 8.2.2 resolve the issue for you?

@jsatsvat

jsatsvat commented Jul 9, 2020

did moving to 8.2.2 resolve the issue for you?

No, it didn't unfortunately.

there were quite a few of the errors happening on queries to fetch data from tables with json columns, so I'm not sure if that would be impacted by #2240

FYI - We don't have any JSON columns as far as I am aware.

@brianc
Owner

brianc commented Jul 9, 2020

It's definitely reproducible / persistent to the point that we had to revert the package upgrade in our app because of the regression.

yeah I mean is there a way this can be reproduced in an isolated script so I could run it locally and analyze what's going on?

@brianc
Owner

brianc commented Jul 9, 2020

Are y'all using an ORM (knex, typeorm, etc) or using pg directly?

@mriedem
Contributor Author

mriedem commented Jul 9, 2020

Are y'all using an ORM (knex, typeorm, etc) or using pg directly?

We're not using an ORM, just pg directly.

@brianc
Owner

brianc commented Jul 9, 2020

okay - and I'm guessing it happens somewhat "at random", meaning your app has been running for a while & then you get a timeout? Once you get a timeout do you get a bunch, or is it just one every once in a while?

@jsatsvat

jsatsvat commented Jul 9, 2020

We're using slonik.

And yes, at least for us, it happens randomly, all of a sudden, with a bunch of timeouts at once.

@brianc
Owner

brianc commented Jul 9, 2020

dang - do you have numbers around how long it takes before it happens? What's the recovery you do? Restart the app?

@jsatsvat

jsatsvat commented Jul 9, 2020

For now we've drastically increased the pool size (previously set to 100) which temporarily takes some pressure off it; but as a quick recovery measure we do indeed restart the affected service.

As it seems to happen randomly, I'm not sure how helpful numbers will be, as we can't really determine a particular interval or pattern.

@mriedem
Contributor Author

mriedem commented Jul 9, 2020

okay - and I'm guessing it happens somewhat "at random", meaning your app has been running for a while & then you get a timeout? Once you get a timeout do you get a bunch, or is it just one every once in a while?

I would have to dig into this. In our production cluster we have the app running in 20 replicas, and each has a pool configured for 150 connections. Our readiness probe hits an API endpoint that runs a select now(); query to make sure that pod's connection pool is OK; if the pool is full we want that pod to go out of rotation for traffic until it can drain its pending connection requests. The pod will only crash and restart automatically if an uncaught error slips through.
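
(A minimal sketch of that kind of readiness check, assuming Express; the route path, port, and response shape are illustrative, not the actual app code:)

const express = require('express')
const { Pool } = require('pg')

const app = express()
const pool = new Pool({ max: 150, connectionTimeoutMillis: 60000, idleTimeoutMillis: 60000 })

app.get('/readiness', async (req, res) => {
  try {
    // If the pool is saturated this waits and eventually rejects with the
    // "timeout exceeded when trying to connect" error, failing the probe.
    await pool.query('SELECT now();')
    res.status(200).json({ ok: true })
  } catch (err) {
    res.status(503).json({ ok: false, error: err.message })
  }
})

app.listen(3000)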

I think I can say that when it hits we get a bunch of timeouts, which would probably explain why the waiting count per pod (in the graph in the issue description) spikes: presumably something is blocking in the pool, so other requests wait until a timeout occurs.

@brianc
Owner

brianc commented Jul 9, 2020

K, thanks for the info - this is helpful. One other question... are you all using SSL for the connection? I'm wondering if there's a weird edge case in the startup / SSL request packet handling.

@brianc
Owner

brianc commented Jul 9, 2020

Sorry...more questions. I'm trying to repro locally...no luck yet. What cloud are y'all running in? I wonder if it's networking related to k8s or the like.

@mriedem
Contributor Author

mriedem commented Jul 10, 2020

Yeah we're setting ssl.ca in the options that get passed to create the pool.

We're in IBM Cloud, using the Kubernetes service there to run the app and IBM Cloud Databases for PostgreSQL for the PG backend.
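
(For context, the ssl part of the pool options is roughly this shape - illustrative only, the CA path is a placeholder:)

const fs = require('fs')
const { Pool } = require('pg')

const pool = new Pool({
  // plus host/user/password/database settings
  ssl: {
    ca: fs.readFileSync('/path/to/ibm-postgres-ca.crt').toString(),
  },
})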

@brianc
Owner

brianc commented Jul 10, 2020

Alright... I let a script run all night that was connecting like this: local script -> gcloud sql proxy -> cloud sql instance in gcloud. It has a 10 connection pool, and every 1-2 seconds it checks out a client, runs a query, and returns it. The pool is set to an idleTimeoutMillis of 100 - so basically every time the script waits for 1-2 seconds the pool is cleaned out & there's a reconnection attempt. If it ever takes more than 3 seconds for the client to connect, query, and be returned to the pool, the script exits. It ran all night without problems... so maybe thinking the connection logic has a problem is a red herring?

Looking at the code, the timeout exceeded when trying to connect is actually a bad error message - that error happens when the pool is saturated & cannot give you an existing client in time. I'm going to change that error message to be more descriptive. You've logged the waiting count on your pool and it skyrockets. That leads me to believe one of three things is going on...

  1. Somewhere your code is checking out a client and not checking it back in. Eventually this starves the pool of available clients, the waiting count skyrockets, and yeah... bad things. Doing nothing but upgrading from pg@8.0 to pg@8.2.1 is unlikely to have caused this since theoretically your app code didn't change & all errors are handled the same... but worth considering. The best way to diagnose this would be for me to put a pool.emit('release', client) call into the code whenever a client is released back to the pool. Not having this event is just an oversight on my part. The pool already emits a pool.on('acquire', (client) => {}) event. With a matching release event you could do something like this to find orphaned clients:
pool.on('acquire', (client) => {
  client.orphanTimeoutId = setTimeout(() => {
    console.error('client was not returned within 5 seconds', client)
    process.exit(-1)  // do whatever you need to do here to trip alarms
  }, 5000)
})

pool.on('release', (client) => {
  clearTimeout(client.orphanTimeoutId)
})
  2. Something changed between 8.0 and 8.2 which makes a client enter some kind of exceptionally bad run time (we fixed one of those things, the major bytea performance issue from #2240, with #2241 - there might be more?). The best way to debug this would be to log all query execution times, and to use statement_timeout in your connection options set to something reasonably low like 3 seconds (or 2 * what your longest query ever takes), then monitor the logs for errors due to queries timing out. That could isolate it down to a particular query, at which point, given that query & estimated data volumes returned from it, I could likely build a reproduction and fix it on my end.

  3. Something changed between 8.0 and 8.2 which makes a client get "stuck" thinking it's executing a query when it's not. This isn't likely... but would be worth investigating. The easiest way to check this is, hopefully, that you have a single file all queries "funnel through" in your app; otherwise you'll need to monkeypatch Client.prototype.query and wrap it with a timing function (see the sketch after this list). Time the execution and log to the console (or crash the app entirely if it's in staging or something) if the duration between calling client.query and the query finishing exceeds some threshold. This is different from statement_timeout in that maybe all the queries finish fast, but for some reason something gets "stuck" in the parser or client and never reads the readyForQuery message properly.
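
A rough sketch of what the timing monkeypatch in (3) could look like - illustrative only; it times promise-style query calls, and the 5 second threshold is arbitrary:

const { Client } = require('pg')

const originalQuery = Client.prototype.query

Client.prototype.query = function (...args) {
  const started = Date.now()
  const result = originalQuery.apply(this, args)

  // Promise-style calls return a thenable; callback-style calls aren't timed here.
  if (result && typeof result.then === 'function') {
    const done = () => {
      const duration = Date.now() - started
      if (duration > 5000) {
        console.error(`client.query took ${duration}ms to settle`, args[0])
      }
    }
    result.then(done, done)
  }
  return result
}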

Sorry this is hard to track down....I'll go ahead and add the pool.on('release') event and change the error message when the pool is starved for available clients to not say "timeout when trying to connect" because it's actually not trying to connect, it's starved.

What do you think?

@mriedem
Contributor Author

mriedem commented Jul 10, 2020

Thanks for the detailed feedback and ideas of things to try. We can try (1) when that's released. Doing (2) would also be pretty easy but I'd need to tinker a bit with figuring out our slowest query times to set a proper timeout. Thankfully we do wrap query with a timer function to log query execution times so we should have that information in our logs. And because we have that wrapper function on query we should be able to try (3) as well.

I don't think I'll probably be able to get to this today but we should be able to try some of this early next week. Thanks again.

@PsyTae

PsyTae commented Jul 14, 2020

Looks like we are experiencing this same issue. We are using a pool and pool.query() for all our db queries, so if you would like additional info, we can provide it, and we can try anything you suggest as well.

@PsyTae

PsyTae commented Jul 14, 2020

Just updated our code to 8.3.0, and in the meantime raised our max connections from 50 to 150. We are using Amazon AWS.
Over the last 2 hours we have noticed that connections are steadily rising and not being released back to the pool for the next query to use.
[graph: RDS database connections climbing steadily]

@mriedem
Contributor Author

mriedem commented Jul 15, 2020

@brianc re: your earlier comment on things to try - for (3), we have a wrapper on Pool.query that logs how long the query function took; e.g. this is a log statement from a query in our staging environment that took longer than 45 seconds for whatever reason:

Jul 1 12:01:05 api-queue-764c7789b9-7hlpk api-queue 20 {"name":"pg-connect","hostname":"api-queue-764c7789b9-7hlpk","pid":20,"level":20,"time_taken_ms":48866,"args":["SELECT * FROM tasks\n WHERE timestamp_dequeued IS NULL AND queue_name = $1\n ORDER BY timestamp_queued LIMIT 1 FOR UPDATE SKIP LOCKED;",["status-devices-staging"]],"msg":"","time":"2020-07-01T17:01:05.837Z","v":0}

That was from July 1, when we upgraded to pg@8.2.1, and it excludes any logs from the timeout exceeded when trying to connect failures.

I'm not sure about (2) though. Sorry if this is a dumb question but I'm not sure how to do this:

The best way to debug this would be to log all query execution times

I was thinking maybe there would be a property on the pg.Result object to record how long a query took but I don't see anything like that. Am I missing something obvious?

@PsyTae

PsyTae commented Jul 15, 2020

In my connection file:

const mysql = require('mysql'); // mysql
const pg = require('pg'); // postgres
const log = false;
//--- My SQL Connection -----------------------------------------------------------------------------------------------------------------

const mysqlConfig = {
    connectionLimit: 10,
    waitForConnections: true,
    host: process.env.MYSQL_HOST,
    port: process.env.MYSQL_PORT,
    user: process.env.MYSQL_USER,
    password: process.env.MYSQL_PASS,
    database: process.env.MYSQL_DATABASE
};

const postgisConfig = {
    host: process.env.POSTGIS_HOST,
    user: process.env.POSTGIS_USER,
    password: process.env.POSTGIS_PASS,
    database: process.env.POSTGIS_DATABASE,
    port: process.env.POSTGIS_PORT,
    max: 300,
    idleTimeoutMillis: 30 * 1000,
    connectionTimeoutMillis: 2 * 1000
};

const mysqlConn = mysql.createPool(mysqlConfig);
const postGisConn = new pg.Pool(postgisConfig);

postGisConn.myQuery = (text, values, callback) => {
    if (typeof values === 'function') {
        callback = values;
        values = undefined;
    }
    callback = typeof callback === 'function' ? callback : () => {};
    const start = Date.now();
    return postGisConn.query(text, values, (err, res) => {
        const duration = Date.now() - start;
        if (err) return callback(err);
        if (!err && (log || duration > (1 * 1000))) console.log('executed query', { text, duration, rows: res.rowCount });
        callback(err, res);
    });
};

mysqlConn.on('error', err => {
    console.error('Mysql Error:', err);
});

postGisConn.on('error', (err, client) => {
    console.error('PostGis Error', err);
});

module.exports.mysql = mysqlConn;
module.exports.postGis = postGisConn;

I created a wrapper for our pool.query function to calculate durations and then updated every pool.query call in the rest of the project to pool.myQuery. Doing this, we were able to find that we did have an extremely long-running query that was hitting the DB very frequently. It turns out there was a trigger on that table we had to disable to get the long-running query taken care of.

This caused our entire connection pool to be consumed very quickly, with connections not released in time for other queries to use. Hopefully this code helps @mriedem find a way to log query execution times.

@mriedem
Contributor Author

mriedem commented Jul 15, 2020

We have something similar: everything goes through a function that wraps Pool.query and logs how long the original query function took.

I've deployed some code for that today in our staging environment which sets statement_timeout to 45 seconds and also logs a warning (which we can set up an alert on) if a query takes longer than 45 seconds. Checking our staging logs, we have some queries that routinely take about 30 seconds, so with that in place we can try the pg upgrade again in our staging environment and see what it triggers.

@mriedem
Contributor Author

mriedem commented Jul 16, 2020

We're starting to play around with log_min_duration_statement, which defaults to 100ms in the IBM managed Postgres DB instance, and we definitely have queries like this taking over 1 second:

SELECT * FROM tasks WHERE timestamp_dequeued IS NULL AND queue_name = $1 ORDER BY timestamp_queued LIMIT 1 FOR UPDATE SKIP LOCKED;

This is the definition of that table:

                                               Table "public.tasks"
       Column       |            Type             | Collation | Nullable |                Default                 
--------------------+-----------------------------+-----------+----------+----------------------------------------
 task_id            | integer                     |           | not null | nextval('tasks_task_id_seq'::regclass)
 task_content       | json                        |           | not null | 
 queue_name         | text                        |           | not null | 
 timestamp_queued   | timestamp without time zone |           | not null | timezone('utc'::text, now())
 timestamp_dequeued | timestamp without time zone |           |          | 
Indexes:
    "tasks_pkey" PRIMARY KEY, btree (task_id)
    "task_queuename" btree (queue_name)
    "task_ts_dequeued" btree (timestamp_dequeued)
    "tasks_qn_dequeued_queued" btree (queue_name, timestamp_dequeued, timestamp_queued)

In our production DB there are currently 2250262 records in that table and the task_content json blob in some of them can be relatively large (20MB).

@mriedem
Contributor Author

mriedem commented Aug 10, 2020

In our production DB there are currently 2250262 records in that table and the task_content json blob in some of them can be relatively large (20MB).

We're now actively trimming this table and running vacuum analyze in a cron job to keep the DB size down. We still need to retry the pg upgrade in our staging environment and see if we hit the regression (my guess is it might not be as noticeable now that we're routinely cleaning up old records).
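
(A rough sketch of that kind of cleanup job - the retention window is an assumption, and the real job runs from cron rather than an in-process timer:)

const { Pool } = require('pg')

const pool = new Pool()

async function trimTasks () {
  // Retention window is a placeholder value.
  await pool.query(
    "DELETE FROM tasks WHERE timestamp_dequeued IS NOT NULL AND timestamp_dequeued < now() - interval '30 days'"
  )
  // VACUUM can't run inside a transaction, but a plain pool.query statement is fine.
  await pool.query('VACUUM ANALYZE tasks')
}

setInterval(() => trimTasks().catch(err => console.error('trim job failed', err)), 24 * 60 * 60 * 1000)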

@EdenAzulay

Hi,
I'm having the same issue when using version 8.3.3.
Do you know if there are any recent changes in the last releases that might cause the same issue?

@brianc
Owner

brianc commented Oct 4, 2020

@EdenAzulay

I'm having the same issue when using version 8.3.3.
Do you know if there are any recent changes in the last releases that might cause the same issue?

What issue are you having? Can you describe it a bit more? Did you upgrade from an older version that didn't have the issue? I don't know of any changes which may have caused it. There have been several reports on this thread, but nothing conclusive yet.

@PsyTae

Looks like we are experiencing this same issue. We are using a pool and pool.query() for all our db queries, so if you would like additional info, we can provide it, and we can try anything you suggest as well.

Thanks for the info and it sounds like you cleared up your own issue? Did the issue only show up w/ pg@8 and not pg@7?

As an aside: I have a separate pool in our app for long running queries (we have a few that take several seconds) with longer connection timeouts and so on. It can help.

@briangonzalez

@PsyTae @mriedem

Did either of you get to the bottom of this? We're seeing something similar.

@mriedem
Contributor Author

mriedem commented Jan 4, 2021

Did either of you get to the bottom of this? We're seeing something similar.

@briangonzalez Not yet, no. We're still on pg@8.0.x since I've been dragging my feet on doing the upgrade. We have instrumented some things in our code to log warnings on slow queries and such, to hopefully help us identify / alert on any performance regression when we do upgrade.

@nilansht

nilansht commented Jun 3, 2021

@brianc Do we have some kind of easy timeout for queries, so that the pool is available for the waiting clients when queries take a long time to run? For now my connections keep building up more and more, but out of the blue they stop releasing and finally the application hangs. This is happening in production for me.

@mriedem
Contributor Author

mriedem commented Jun 3, 2021

@brianc Do we have some kind of easy timeout for queries, so that the pool is available for the waiting clients when queries take a long time to run? For now my connections keep building up more and more, but out of the blue they stop releasing and finally the application hangs. This is happening in production for me.

Have you tried setting statement_timeout or query_timeout? See the Client constructor config, which you can pass through the Pool constructor; those values are passed through to Postgres.

https://node-postgres.com/api/client
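
For example, a minimal sketch (the timeout values are placeholders):

const { Pool } = require('pg')

// Both options are Client config settings; the Pool passes them along to every
// client it creates. statement_timeout is enforced by Postgres on the server,
// query_timeout is enforced client-side by pg.
const pool = new Pool({
  statement_timeout: 5000,
  query_timeout: 10000,
})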

@heitordobeis

Hi @mriedem,
I have the same problem today. Did you find a solution?

@airhorns

I think this issue is definitely still a thing; we're seeing the occasional timeout exceeded when trying to connect error under load in production when the pool is seemingly not under much stress and still pretty far from its max limit. Here's an example from our tracing system where we count the number of times we connect with pool.connect, and each time we connect we log the current pool stats (idleCount, totalCount, waitingCount). Our max is 75, so the pool should definitely be able to allocate a new connection:

[screenshot: tracing data showing connect counts and pool stats well below the max of 75]

My first suspicion for a strange timeout like this would be Node event loop blocking, where the process is stuck doing something for the 5s, but the event loop utilization never goes above 0.3:

[screenshot: event loop utilization staying below 0.3]

So, I am inclined to think there's some very subtle bug in the pool logic that causes the timeout to be hit under load despite the pool not being full. It works almost all the time for us, but it seems to be load that occasionally triggers it.
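
(A minimal sketch of the kind of instrumentation described above, assuming direct use of pg.Pool; the interval and logging are illustrative:)

const { Pool } = require('pg')

const pool = new Pool({ max: 75, connectionTimeoutMillis: 5000 })

// Sample the counters pg-pool exposes; a growing waitingCount while totalCount
// stays well below max is the symptom described above.
setInterval(() => {
  console.log('pool stats', {
    total: pool.totalCount,
    idle: pool.idleCount,
    waiting: pool.waitingCount,
  })
}, 10000).unref()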

@oriarazi

oriarazi commented Jul 8, 2022

@mriedem any news on that? using [email protected] with [email protected] and facing the same issue under load in production.
@airhorns found any solution?

@karo-dc

karo-dc commented Mar 1, 2023

@mriedem did you solve that? I use "pg": "8.9.0", same issue here.

@sscots

sscots commented May 24, 2023

Having same issue on 8.8


@airhorns

Are y'all able to prove that you aren't being hit by long event loop blocks during connection acquisition?
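
(One possible way to check that, for anyone following along - a sketch using Node's built-in perf_hooks event loop delay monitor; the thresholds are arbitrary:)

const { monitorEventLoopDelay } = require('perf_hooks')

const histogram = monitorEventLoopDelay({ resolution: 20 })
histogram.enable()

setInterval(() => {
  const maxMs = histogram.max / 1e6 // histogram values are in nanoseconds
  if (maxMs > 200) {
    console.warn(`event loop was blocked for up to ${Math.round(maxMs)}ms in the last window`)
  }
  histogram.reset()
}, 10000).unref()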

@robaca

robaca commented Mar 11, 2024

Getting the same error with version 8.11.3, and we are sure there is no event loop lag (we measure it and would log it). Our service is also running on GCP/GKE and is using the @google-cloud/cloud-sql-connector together with the Pool to connect to a Cloud SQL for PostgreSQL instance. It's only affecting a small fraction of queries.

@brianc
Owner

brianc commented Mar 12, 2024

Dang! sorry about this! Do you have any code or way to reproduce this at all? I'd love to help fix this for you.
