Connection pool not discarding bad connections #494

Closed
mhartenbower opened this issue Dec 14, 2018 · 17 comments

@mhartenbower

I have run into a situation in which my postgres database is being restarted (I'm not sure whether it's a graceful restart or not), which drops all of the connections on the server side.

The connection pool does not seem to be handling this well. When I try to query using one of the bad connections, I receive a TCP write timeout error. The connection pool does not recognize this (even though I'd expect it to check that the connection is alive before every query) and puts the connection back into the pool.

I've dug through the code a bit and am not seeing any obvious problems that could be causing this, but I'm going to keep looking.

Can someone please tell me if I should expect the connection pool to discard the connection upon receiving a TCP timeout error? I have resorted to detecting this error in my application and resetting the entire pool, but in my opinion that is too heavy a solution to use permanently.

Thanks!

@jackc
Owner

jackc commented Dec 15, 2018

Yes, it should be expected for the connection pool to discard a connection upon receiving a TCP error. That said, connections are not tested before use, i.e. they don't try to talk to the remote server. The pool is only responsible for discarding a connection once an error has occurred.

If the remote server is restarted, the next N uses of the connection pool can be expected to fail, where N is the number of existing connections. The problem is that the pool has no way of knowing whether just a single connection has failed or whether many / all have failed.

The one possible exception to this is beginning a transaction. Since begin is always safe to retry, the pool automatically retries failed begin calls.
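
To make the "next N uses may fail" consequence concrete for queries (which, unlike begin, are not retried automatically), here is a minimal sketch of application-side retry against a pgx v3 ConnPool configured elsewhere. The function name, attempt count, and retry policy are illustrative, not pgx API:

    // Each failed use lets the pool discard one dead connection, so after a
    // server restart at most MaxConnections attempts fail before a healthy
    // connection is handed out again.
    func queryWithRetry(pool *pgx.ConnPool, attempts int, sql string, args ...interface{}) (*pgx.Rows, error) {
        var rows *pgx.Rows
        var err error
        for i := 0; i < attempts; i++ {
            rows, err = pool.Query(sql, args...)
            if err == nil {
                return rows, nil
            }
        }
        return nil, err
    }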

@jackc
Owner

jackc commented Dec 15, 2018

P.S. I agree this is not ideal. I'd like to completely revamp the connection pool in the next major release of pgx.

@mhartenbower
Author

In that case, I think I may be experiencing a bug in v3.2.0.

My pool has a max of 25 connections. When the postgres database is restarted, we will continue to get TCP write errors until we manually do a restart of our application. The number of TCP write errors is much greater than N = 25.

So it seems like the pool isn't recognizing the TCP error at all. I'm planning on looking more into this today; I'll post an update when I have one.

@andrewtian

andrewtian commented Dec 21, 2018

I've just run into the same issue here on AWS when the underlying IP for the hostname changed. The existing connection seems to have been kept alive on the other end.

The only indications were that the previous connections were gone from PG's side and that queries were failing with write timeouts. The pool recycles connections after write timeout errors, since timeouts are presumably considered benign this deep in. I think it may have to do with how routing is handled inside AWS.

pgx/conn.go, line 1382 (commit 4618730):

    return !(is && netErr.Timeout())

pgx/conn.go, lines 1364 to 1365 (commit 4618730):

    if fatalWriteErr(n, err) {
        c.die(err)

What'll probably work for us is connecting to a static internal IP for now.
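
For readers following the snippets above, the check they reference looks roughly like this; a hedged reconstruction assembled from the quoted lines, not a verbatim copy of pgx/conn.go:

    // A Write error is treated as fatal (and the connection killed via c.die)
    // unless nothing was written and the error is a net.Error timeout, which
    // pgx assumed could only come from its own context-cancellation deadline.
    func fatalWriteErr(bytesWritten int, err error) bool {
        if bytesWritten > 0 {
            return true // a partial write corrupts the protocol stream
        }
        netErr, is := err.(net.Error)
        return !(is && netErr.Timeout())
    }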

@lwithers

I've also hit this error within a deployment; specifically with errors like write tcp 100.127.118.16:42102->100.69.250.227:5432: write: connection timed out.

I guess that the original problem is triggered by some underlying network fault/change, but once that has been cleared, our application frequently encounters the same error (same source port number in the error), so it absolutely looks like a connection in a bad state is being put back into the pool.

What is the intention behind the fatalWriteErr function? I can see that it's basically saying "not fatal if we haven't written any partial data and if the error is a timeout", but from reading the net package documentation it's not actually clear whether a timeout error state persists or clears on the underlying TCP connection (net.Conn), or if it's required to explicitly clear it. I could be wrong, but I think the underlying connection becomes basically stuck in this state, so all future read/write syscalls will return ETIMEDOUT.

@jackc
Owner

jackc commented Jan 26, 2019

You are correct about the purpose of fatalWriteErr. The timeout error state will persist until the underlying deadline that caused the error is cleared. But if the deadline was set by pgx due to context cancellation, then it should also be cleared by the cancelQuery function.

Is it possible that net timeout errors can occur without a deadline being set? This would definitely confuse the current logic.
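
For plain TCP, the deadline behaviour under discussion looks like the sketch below, using only the standard library (the address is a placeholder; as it turns out later in this thread, tls.Conn behaves differently):

    package main

    import (
        "fmt"
        "net"
        "time"
    )

    func main() {
        conn, err := net.Dial("tcp", "db.internal:5432")
        if err != nil {
            panic(err)
        }
        defer conn.Close()

        // Force an immediate write timeout.
        conn.SetWriteDeadline(time.Now())
        if _, err := conn.Write([]byte("x")); err != nil {
            netErr, ok := err.(net.Error)
            fmt.Println("timeout error:", ok && netErr.Timeout()) // true
        }

        // Writes keep failing with a timeout until the deadline is cleared.
        conn.SetWriteDeadline(time.Time{})
        _, err = conn.Write([]byte("x")) // a healthy plain TCP conn succeeds again
        fmt.Println("after clearing the deadline:", err)
    }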

@lwithers

I believe this may be possible if TCP keepalives are enabled (SO_KEEPALIVE). I can confirm that my code does indeed set a custom Dialer that enables TCP keepalives.
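
For context, wiring such a Dialer into pgx v3 looks roughly like this (a sketch; the host, credentials, and keepalive interval are placeholders, and the net and time imports are assumed):

    // With SO_KEEPALIVE enabled, a peer that silently disappears eventually
    // surfaces as ETIMEDOUT, which Go reports as a net.Error whose Timeout()
    // is true, even though pgx never set a deadline itself.
    config := pgx.ConnConfig{
        Host:     "db.internal",
        User:     "app",
        Database: "app",
        Dial: (&net.Dialer{
            KeepAlive: 30 * time.Second,
        }).Dial,
    }
    conn, err := pgx.Connect(config)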

@jackc
Owner

jackc commented Jan 26, 2019

That looks like the problem. pgx assumes that timeout errors are only caused by a context cancellation and are therefore not fatal. That logic needs to change so that an error is only considered recoverable if a context has actually set a deadline. Not sure how big a change that will be.

jackc added a commit that referenced this issue Jan 29, 2019
With TLS connections, a Write timeout caused by a SetDeadline permanently breaks the connection. However, the errors are reported as temporary, so there is no way to determine whether the connection really is recoverable. As these were the only kind of Write error that was recovered, all Write errors are now fatal to the connection.

#494
#506
golang/go#29971
@jackc
Owner

jackc commented Jan 29, 2019

Upon further investigation I found that it is impossible to recover from a Write error caused by SetDeadline on a tls.Conn. That was the only kind of Write error that was previously recoverable, so this basically makes it impossible to recover from any Write errors at all.

I just committed a change to master that considers all Write errors fatal. Try it out and see if it works. I think that should resolve this.
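
A hedged illustration of the tls.Conn behaviour behind this change (see golang/go#29971); tlsConn and payload here are placeholders, not pgx internals:

    // After a write deadline fires mid-record, tls.Conn's internal state is
    // left inconsistent, so clearing the deadline does not make the
    // connection usable again, unlike a plain *net.TCPConn. The error is
    // nonetheless reported as a temporary timeout, so the old recovery check
    // could not tell the two cases apart.
    tlsConn.SetWriteDeadline(time.Now().Add(time.Millisecond))
    if _, err := tlsConn.Write(payload); err != nil {
        tlsConn.SetWriteDeadline(time.Time{})
        if _, err := tlsConn.Write(payload); err != nil {
            // still failing: this connection has to be discarded
        }
    }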

@lwithers

Thank you for this, I've deployed a build using the latest master and I'll let you know what I observe over the next few days.

@lwithers

Since I've deployed applications built with commit 6067cfa, I've observed a couple of timeouts that have been handled gracefully, and no other regressions in behaviour. Looks good to me!

@sean-
Contributor

sean- commented Jan 31, 2019

Same. We were running into issues with broken pipes on write when using connections coming out of the connection pool. Now we only see that once and the next retry has a valid connection. This has been a huge improvement for us (we're not sure why the other end was closing its connection, but think it was idle connection timeouts inside of crdb).

It would be nice if there was a periodic callback handler or some such that could be installed that would ping the connection every N idle seconds in order to keep the connection refreshed/alive/valid, but wouldn't count against the idle connection limit. Having usable connections coming out of the connection pool dramatically improves the usability of this library.

For example: connections that have not been used by the application for 5 min are automatically closed by the client; however, the pool periodically runs a SELECT TRUE as its ping command every ~15-30 s just to ensure the idle connection is still valid. This reduces the window of time in which a connection can become invalid to ~15-30 s and prevents the app from having to proactively execute a query to verify that the connection is valid.
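
Something in that spirit can be approximated from the application side today; a minimal sketch against pgx v3 (the interval, function name, and pool variable are assumptions, not pgx API):

    // Periodically runs a trivial query so that an idle connection which has
    // died is detected, and discarded by the pool, before real traffic hits
    // it. This narrows the window in which a stale connection can linger in
    // the pool to roughly one ping interval.
    func keepAlive(ctx context.Context, pool *pgx.ConnPool, every time.Duration) {
        t := time.NewTicker(every)
        defer t.Stop()
        for {
            select {
            case <-ctx.Done():
                return
            case <-t.C:
                if _, err := pool.Exec("SELECT TRUE"); err != nil {
                    log.Printf("keep-alive ping failed: %v", err)
                }
            }
        }
    }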

@lwithers

Having been running with this for several weeks now, we do still sometimes see a connection stuck in the pool in a bad state (so e.g. repeated errors of write tcp 100.112.30.50:56282->100.65.115.217:5432: write: connection timed out, with the same source port each time).

One thing that stood out to me is that I think we still have a repeat of the pattern that looks for net timeout errors and treats them specially:

pgx/conn.go, lines 1407 to 1409 (commit 6067cfa):

    if netErr, ok := err.(net.Error); !(ok && netErr.Timeout()) {
        c.die(err)
    }

I'm not sure if that's ultimately the cause of what I'm seeing, though.

@sean-
Contributor

sean- commented Feb 25, 2019

@lwithers One of the things we've done is move all of our connections from raw jackc/pgx DB connections to the pgx/stdlib facade, because it has better error handling. We're >= 3 weeks in with the sql.DB interface and it's made a marked improvement to our service availability.

@lwithers

Yeah, we've been using stdlib.OpenDB as the entry point too since we adopted this driver a few months ago.
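
For reference, the entry point both comments above describe is used roughly like this (a sketch; the connection string is a placeholder, and the log, pgx, and stdlib imports are assumed):

    // database/sql owns the pool here. When the pgx driver reports a broken
    // connection as driver.ErrBadConn, database/sql discards it and, where
    // safe, retries the operation on a fresh connection.
    connConfig, err := pgx.ParseConnectionString("postgres://app:secret@db.internal:5432/app")
    if err != nil {
        log.Fatal(err)
    }
    db := stdlib.OpenDB(connConfig)
    defer db.Close()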

@jackc
Owner

jackc commented Mar 9, 2019

See #517 for a potential solution.

@jackc
Owner

jackc commented Sep 16, 2019

This should be resolved in the v4 release. It is more aggressive about closing potentially broken connections rather than trying to recover from them (and possibly failing).
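
For anyone migrating, the v4 pool lives in the github.com/jackc/pgx/v4/pgxpool package and is used roughly like this (a sketch; the connection string is a placeholder and the context and log imports are assumed):

    // Per the comment above, the v4 pool prefers to close a connection it
    // suspects is broken and establish a fresh one, rather than trying to
    // recover the existing socket.
    pool, err := pgxpool.Connect(context.Background(), "postgres://app:secret@db.internal:5432/app")
    if err != nil {
        log.Fatal(err)
    }
    defer pool.Close()

    if _, err := pool.Exec(context.Background(), "SELECT 1"); err != nil {
        log.Printf("query failed: %v", err)
    }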
