Connection pool not discarding bad connections #494

Closed
mhartenbower opened this issue Dec 14, 2018 · 17 comments

@mhartenbower

I have run into a situation in which my postgres database is being restarted (I'm not sure whether it's a graceful restart or not), which drops all of the connections on the server side.

The connection pool does not seem to be handling this well. When I try to query using one of the bad connections, I receive a TCP write timeout error. The connection pool does not recognize this (even though I'd expect it to check that the connection is alive before every query) and puts the connection back into the pool.

I've dug through the code a bit and am not seeing any obvious problems that could be causing this, but I'm going to keep looking.

Can someone please tell me if I should expect the connection pool to discard the connection upon receiving a TCP timeout error? I have resorted to detecting this error in my application and resetting the entire pool, but in my opinion that is too heavy a solution to use permanently.

Thanks!

@jackc
Owner

jackc commented Dec 15, 2018

Yes, it should be expected for the connection pool to discard a connection upon receiving a TCP error. That said, connections are not tested before use, i.e. they don't try to talk to the remote server. The pool is only responsible for discarding a connection once an error has occurred.

If the remote server is restarted, the next N uses of the connection pool can be expected to fail, where N is the number of existing connections. The problem is that the pool has no way of knowing whether just a single connection has failed or whether many / all have failed.

The one possible exception to this is beginning a transaction. Since begin is always safe to retry, the pool automatically retries failed begin calls.
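
To make the "next N uses may fail" consequence concrete for queries (which, unlike begin, are not retried automatically), here is a minimal sketch of application-side retry against a pgx v3 ConnPool configured elsewhere. The function name, attempt count, and retry policy are illustrative, not pgx API:

    // Each failed use lets the pool discard one dead connection, so after a
    // server restart at most MaxConnections attempts fail before a healthy
    // connection is handed out again.
    func queryWithRetry(pool *pgx.ConnPool, attempts int, sql string, args ...interface{}) (*pgx.Rows, error) {
        var rows *pgx.Rows
        var err error
        for i := 0; i < attempts; i++ {
            rows, err = pool.Query(sql, args...)
            if err == nil {
                return rows, nil
            }
        }
        return nil, err
    }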

@jackc
Owner

jackc commented Dec 15, 2018

P.S. I agree this is not ideal. I'd like to completely revamp the connection pool in the next major release of pgx.

@mhartenbower
Author

In that case, I think I may be experiencing a bug in v3.2.0.

My pool has a max of 25 connections. When the postgres database is restarted, we will continue to get TCP write errors until we manually do a restart of our application. The number of TCP write errors is much greater than N = 25.

So it seems like the pool isn't recognizing the TCP error at all. I'm planning on looking more into this today; I'll post an update when I have one.

@andrewtian

andrewtian commented Dec 21, 2018

I've just run into the same issue here on AWS when the underlying IP for the hostname changed. The existing connection seems to have been kept alive on the other end.

The only indications were that the previous connections were gone from PG's side and that queries were failing with write timeouts. The pool recycles connections after write timeout errors, since timeouts are presumably considered benign this deep in. I think it may have to do with how routing is handled inside AWS.

pgx/conn.go, line 1382 (commit 4618730):

    return !(is && netErr.Timeout())

pgx/conn.go, lines 1364 to 1365 (commit 4618730):

    if fatalWriteErr(n, err) {
        c.die(err)

What'll probably work for us is connecting to a static internal IP for now.
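
For readers following the snippets above, the check they reference looks roughly like this; a hedged reconstruction assembled from the quoted lines, not a verbatim copy of pgx/conn.go:

    // A Write error is treated as fatal (and the connection killed via c.die)
    // unless nothing was written and the error is a net.Error timeout, which
    // pgx assumed could only come from its own context-cancellation deadline.
    func fatalWriteErr(bytesWritten int, err error) bool {
        if bytesWritten > 0 {
            return true // a partial write corrupts the protocol stream
        }
        netErr, is := err.(net.Error)
        return !(is && netErr.Timeout())
    }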

@lwithers

I've also hit this error within a deployment; specifically with errors like write tcp 100.127.118.16:42102->100.69.250.227:5432: write: connection timed out.

I guess that the original problem is triggered by some underlying network fault/change, but once that has been cleared, our application frequently encounters the same error (same source port number in the error), so it absolutely looks like a connection in a bad state is being put back into the pool.

What is the intention behind the fatalWriteErr function? I can see that it's basically saying "not fatal if we haven't written any partial data and if the error is a timeout", but from reading the net package documentation it's not actually clear whether a timeout error state persists or clears on the underlying TCP connection (net.Conn), or if it's required to explicitly clear it. I could be wrong, but I think the underlying connection becomes basically stuck in this state, so all future read/write syscalls will return ETIMEDOUT.

@jackc
Owner

jackc commented Jan 26, 2019

You are correct about the purpose of fatalWriteErr. The timeout error state will persist until the underlying deadline that caused the error is cleared. But if the deadline was set by pgx due to context cancellation, then it should also be cleared by the cancelQuery function.

Is it possible that net timeout errors can occur without a deadline being set? This would definitely confuse the current logic.
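
For plain TCP, the deadline behaviour under discussion looks like the sketch below, using only the standard library (the address is a placeholder; as it turns out later in this thread, tls.Conn behaves differently):

    package main

    import (
        "fmt"
        "net"
        "time"
    )

    func main() {
        conn, err := net.Dial("tcp", "db.internal:5432")
        if err != nil {
            panic(err)
        }
        defer conn.Close()

        // Force an immediate write timeout.
        conn.SetWriteDeadline(time.Now())
        if _, err := conn.Write([]byte("x")); err != nil {
            netErr, ok := err.(net.Error)
            fmt.Println("timeout error:", ok && netErr.Timeout()) // true
        }

        // Writes keep failing with a timeout until the deadline is cleared.
        conn.SetWriteDeadline(time.Time{})
        _, err = conn.Write([]byte("x")) // a healthy plain TCP conn succeeds again
        fmt.Println("after clearing the deadline:", err)
    }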

@lwithers

I believe this may be possible if TCP keepalives are enabled (SO_KEEPALIVE). I can confirm that my code does indeed set a custom Dialer that enables TCP keepalives.
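
For context, wiring such a Dialer into pgx v3 looks roughly like this (a sketch; the host, credentials, and keepalive interval are placeholders, and the net and time imports are assumed):

    // With SO_KEEPALIVE enabled, a peer that silently disappears eventually
    // surfaces as ETIMEDOUT, which Go reports as a net.Error whose Timeout()
    // is true, even though pgx never set a deadline itself.
    config := pgx.ConnConfig{
        Host:     "db.internal",
        User:     "app",
        Database: "app",
        Dial: (&net.Dialer{
            KeepAlive: 30 * time.Second,
        }).Dial,
    }
    conn, err := pgx.Connect(config)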

@jackc
Owner

jackc commented Jan 26, 2019

That looks like the problem. pgx assumes that timeout errors are only caused by a context cancellation and are therefore not fatal. That logic needs to change so that an error is only considered recoverable if a context has actually set a deadline. Not sure how big a change that will be.

jackc added a commit that referenced this issue Jan 29, 2019
With TLS connections, a Write timeout caused by a SetDeadline permanently breaks the connection. However, the errors are reported as temporary, so there is no way to determine whether the connection really is recoverable. As these were the only kind of Write error that was recovered, all Write errors are now fatal to the connection.

#494
#506
golang/go#29971
@jackc
Owner

jackc commented Jan 29, 2019

Upon further investigation I found that it is impossible to recover from a Write error caused by SetDeadline on a tls.Conn. That was the only kind of Write error that was previously recoverable, so this basically makes it impossible to recover from any Write errors at all.

I just committed a change to master that considers all Write errors fatal. Try it out and see if it works. I think that should resolve this.
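
A hedged illustration of the tls.Conn behaviour behind this change (see golang/go#29971); tlsConn and payload here are placeholders, not pgx internals:

    // After a write deadline fires mid-record, tls.Conn's internal state is
    // left inconsistent, so clearing the deadline does not make the
    // connection usable again, unlike a plain *net.TCPConn. The error is
    // nonetheless reported as a temporary timeout, so the old recovery check
    // could not tell the two cases apart.
    tlsConn.SetWriteDeadline(time.Now().Add(time.Millisecond))
    if _, err := tlsConn.Write(payload); err != nil {
        tlsConn.SetWriteDeadline(time.Time{})
        if _, err := tlsConn.Write(payload); err != nil {
            // still failing: this connection has to be discarded
        }
    }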

@lwithers

Thank you for this, I've deployed a build using the latest master and I'll let you know what I observe over the next few days.

@lwithers

Since I've deployed applications built with commit 6067cfa, I've observed a couple of timeouts that have been handled gracefully, and no other regressions in behaviour. Looks good to me!

@sean-
Contributor

sean- commented Jan 31, 2019

Same. We were running into issues with broken pipes on write when using connections coming out of the connection pool. Now we only see that once and the next retry has a valid connection. This has been a huge improvement for us (we're not sure why the other end was closing its connection, but think it was idle connection timeouts inside of crdb).

It would be nice if there was a periodic callback handler or some such that could be installed that would ping the connection every N idle seconds in order to keep the connection refreshed/alive/valid, but wouldn't count against the idle connection limit. Having usable connections coming out of the connection pool dramatically improves the usability of this library.

For example: connections that have not been used by the application for 5 min are automatically closed by the client; however, the pool periodically runs a SELECT TRUE as its ping command every ~15-30 s just to ensure the idle connection is still valid. This reduces the window of time in which a connection can become invalid to ~15-30 s and prevents the app from having to proactively execute a query to verify that the connection is valid.
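
Something in that spirit can be approximated from the application side today; a minimal sketch against pgx v3 (the interval, function name, and pool variable are assumptions, not pgx API):

    // Periodically runs a trivial query so that an idle connection which has
    // died is detected, and discarded by the pool, before real traffic hits
    // it. This narrows the window in which a stale connection can linger in
    // the pool to roughly one ping interval.
    func keepAlive(ctx context.Context, pool *pgx.ConnPool, every time.Duration) {
        t := time.NewTicker(every)
        defer t.Stop()
        for {
            select {
            case <-ctx.Done():
                return
            case <-t.C:
                if _, err := pool.Exec("SELECT TRUE"); err != nil {
                    log.Printf("keep-alive ping failed: %v", err)
                }
            }
        }
    }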

@lwithers

Having been running with this for several weeks now, we do still sometimes see a connection stuck in the pool in a bad state (so e.g. repeated errors of write tcp 100.112.30.50:56282->100.65.115.217:5432: write: connection timed out, with the same source port each time).

One thing that stood out to me is that I think we still have a repeat of the pattern that looks for net timeout errors and treats them specially:

pgx/conn.go, lines 1407 to 1409 (commit 6067cfa):

    if netErr, ok := err.(net.Error); !(ok && netErr.Timeout()) {
        c.die(err)
    }

I'm not sure if that's ultimately the cause of what I'm seeing, though.

@sean-
Contributor

sean- commented Feb 25, 2019

@lwithers One of the things we've done is move all of our connections from raw jackc/pgx DB connections to the pgx/stdlib facade, because it has better error handling. We're >= 3 weeks in with the sql.DB interface and it's made a marked improvement to our service availability.

@lwithers

Yeah, we've been using stdlib.OpenDB as the entry point too since we adopted this driver a few months ago.
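
For reference, the entry point both comments above describe is used roughly like this (a sketch; the connection string is a placeholder, and the log, pgx, and stdlib imports are assumed):

    // database/sql owns the pool here. When the pgx driver reports a broken
    // connection as driver.ErrBadConn, database/sql discards it and, where
    // safe, retries the operation on a fresh connection.
    connConfig, err := pgx.ParseConnectionString("postgres://app:secret@db.internal:5432/app")
    if err != nil {
        log.Fatal(err)
    }
    db := stdlib.OpenDB(connConfig)
    defer db.Close()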

@jackc
Owner

jackc commented Mar 9, 2019

See #517 for a potential solution.

@jackc
Owner

jackc commented Sep 16, 2019

This should be resolved in the v4 release. It is more aggressive about closing potentially broken connections rather than trying to recover from them (and possibly failing).
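
For anyone migrating, the v4 pool lives in the github.com/jackc/pgx/v4/pgxpool package and is used roughly like this (a sketch; the connection string is a placeholder and the context and log imports are assumed):

    // Per the comment above, the v4 pool prefers to close a connection it
    // suspects is broken and establish a fresh one, rather than trying to
    // recover the existing socket.
    pool, err := pgxpool.Connect(context.Background(), "postgres://app:secret@db.internal:5432/app")
    if err != nil {
        log.Fatal(err)
    }
    defer pool.Close()

    if _, err := pool.Exec(context.Background(), "SELECT 1"); err != nil {
        log.Printf("query failed: %v", err)
    }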
