Connection pool not discarding bad connections #494
Comments
Yes, it should be expected for the connection pool to discard a connection upon receiving a TCP error. That said, connections are not tested before use, i.e. they don't try to talk to the remote server. The pool is only responsible for discarding a connection once an error has occurred. If the remote server is restarted, the next N uses of the connection pool can be expected to fail, where N is the number of existing connections. The problem is the pool has no way of knowing if just a single connection has failed or many / all have failed. The one possible exception to this is beginning a transaction. Since begin is always safe to retry, the pool automatically retries failed begin calls.
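To illustrate that behaviour with a sketch (the helper below is hypothetical, written against the v3 ConnPool API as I understand it): since each failed use evicts one stale connection, a single retry per call is usually enough to ride out a server restart.

```go
// Hypothetical helper, not pgx API: retry a query once, because the first
// failure discards the dead connection and the retry gets a fresh one.
func queryWithRetry(pool *pgx.ConnPool, sql string, args ...interface{}) (*pgx.Rows, error) {
	rows, err := pool.Query(sql, args...)
	if err != nil {
		rows, err = pool.Query(sql, args...)
	}
	return rows, err
}
```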
P.S. I agree this is not ideal. I'd like to completely revamp the connection pool in the next major release of pgx.
I think I may be experiencing a bug in v3.2.0, in that case. My pool has a max of 25 connections. When the postgres database is restarted, we continue to get TCP write errors until we manually restart our application. The number of TCP write errors is much greater than N = 25, so it seems like the pool isn't recognizing the TCP error at all. I'm planning on looking into this more today; I'll post an update when I have one.
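For context, the pool is set up roughly like this (a sketch from memory of the v3 API; host and credentials are placeholders):

```go
pool, err := pgx.NewConnPool(pgx.ConnPoolConfig{
	ConnConfig: pgx.ConnConfig{
		Host:     "db.internal", // placeholder
		User:     "app",         // placeholder
		Database: "appdb",       // placeholder
	},
	MaxConnections: 25, // so N = 25 failures expected after a restart
})
if err != nil {
	log.Fatal(err)
}
defer pool.Close()
```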
I've just run into the same issue here on AWS when the underlying IP for the hostname changed. The existing connection seems to have been kept alive on the other end. The only indications were that the previous connections were gone from PG's side and queries were failing with write timeouts. The pool recycles write timeout errors, since timeouts are presumed benign this deep in. I think it may have to do with how routing is handled inside AWS.
Line 1382 in 4618730
Lines 1364 to 1365 in 4618730
What'll probably work for us is connecting to a static internal IP for now.
I've also hit this issue within a deployment, with similar errors. I guess that the original problem is triggered by some underlying network fault/change, but once that has been cleared, our application frequently encounters the same error (same source port number in the error), so it absolutely looks like a connection in a bad state is being put back into the pool. What is the intention behind the
You are correct about the purpose of the
Is it possible that net timeout errors can occur without a deadline being set? This would definitely confuse the current logic.
I believe this may be possible if TCP keepalives are enabled (SO_KEEPALIVE). I can confirm that my code does indeed set a custom Dialer that enables TCP keepalives.
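For reference, the dialer setup is roughly this (a sketch; `ConnConfig.Dial` is pgx v3's dial hook as I understand it, and the host is a placeholder). A `net.Dialer` with a non-zero `KeepAlive` enables `SO_KEEPALIVE`, which is how timeout errors can surface on a connection that never had a deadline set:

```go
dialer := &net.Dialer{
	Timeout:   5 * time.Second,
	KeepAlive: 30 * time.Second, // sets SO_KEEPALIVE with this period
}
config := pgx.ConnConfig{
	Host: "db.internal", // placeholder
	User: "app",         // placeholder
	Dial: dialer.Dial,   // net.Dialer.Dial matches pgx's DialFunc signature
}
```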
That looks like the problem. pgx assumes that timeout errors are only caused by a context cancellation and are therefore not fatal. That logic needs to change so that a timeout is only considered recoverable if a context has set a deadline. Not sure how big a change that will be.
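A sketch of the distinction being proposed (the helper and its flag are hypothetical, not pgx code):

```go
// A timeout error is only plausibly recoverable when we know we set a
// deadline ourselves (e.g. from a context). A timeout with no deadline set
// likely means a keepalive probe failed and the connection is broken.
func isRecoverableTimeout(err error, deadlineWasSet bool) bool {
	if netErr, ok := err.(net.Error); ok && netErr.Timeout() {
		return deadlineWasSet
	}
	return false
}
```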
With TLS connections, a Write timeout caused by a SetDeadline permanently breaks the connection. However, the errors are reported as temporary, so there is no way to determine if it really is recoverable. As these were the only kind of Write error that was recovered, all Write errors are now fatal to the connection. #494 #506 golang/go#29971
Upon further investigation I found that it is impossible to recover from a Write error caused by SetDeadline with a tls.Conn. That was the only kind of Write error that was recoverable, so this basically makes it impossible to recover any Write errors at all. I just committed a change to master that considers all Write errors fatal. Try it out and see if it works. I think that should resolve this.
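The shape of the change, as a simplified sketch (stand-in types, not the actual pgx source):

```go
// conn is a stand-in for pgx's internal connection type.
type conn struct {
	netConn net.Conn
	dead    bool // once set, the pool must discard this connection
}

func (c *conn) die() { c.dead = true }

// Every Write error is now fatal: a tls.Conn cannot be used again after a
// deadline-triggered write timeout, even though the error reports itself
// as temporary, so no Write error is safe to treat as recoverable.
func (c *conn) send(buf []byte) error {
	if _, err := c.netConn.Write(buf); err != nil {
		c.die()
		return err
	}
	return nil
}
```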
Thank you for this. I've deployed a build using the latest master and I'll let you know what I observe over the next few days.
Since deploying applications built with commit 6067cfa, I've observed a couple of timeouts that have been handled gracefully, and no other regressions in behaviour. Looks good to me!
Same. We were running into issues with broken pipes on write when using connections coming out of the connection pool. Now we only see that once, and the next retry has a valid connection. This has been a huge improvement for us (we're not sure why the other end was closing its connection, but we think it was idle connection timeouts inside of crdb). It would be nice if there were a periodic callback handler or some such that could be installed to ping each connection every N idle seconds in order to keep it refreshed/alive/valid, but which wouldn't count against the idle connection limit (see the sketch below). Having usable connections come out of the connection pool dramatically improves the usability of this library. For example: connections that have not been used by the application for 5min are automatically closed by the client, however the pool will periodically run a
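In the meantime, a rough workaround we could run ourselves (a sketch; the interval is arbitrary, and each tick only exercises whichever single connection the pool hands out):

```go
go func() {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for range ticker.C {
		// A trivial query forces one pooled connection to be used; a broken
		// one should now error out and be discarded rather than sit idle.
		if _, err := pool.Exec("select 1"); err != nil {
			log.Printf("keepalive query failed: %v", err)
		}
	}
}()
```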
Having been running with this for several weeks now, we do still sometimes see a connection stuck in the pool in a bad state (e.g. repeated errors). One thing that stood out to me is that I think we still have a repeat of the pattern that looks for net timeout errors and treats them specially:
Lines 1407 to 1409 in 6067cfa
I'm not sure if that's ultimately the cause of what I'm seeing, though.
@lwithers One of the things we've done is move all of our connections from raw
Yeah, we've been using
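In case it helps others: if the move being described is onto database/sql (the elided text above makes it ambiguous), that route looks roughly like this sketch, using pgx v3's stdlib adapter and letting database/sql's lifetime controls recycle stale connections (values are arbitrary):

```go
import (
	"database/sql"
	"time"

	_ "github.com/jackc/pgx/stdlib" // registers the "pgx" driver
)

func openDB(dsn string) (*sql.DB, error) {
	db, err := sql.Open("pgx", dsn)
	if err != nil {
		return nil, err
	}
	db.SetMaxOpenConns(25)
	db.SetConnMaxLifetime(5 * time.Minute) // recycle conns before they go stale
	return db, nil
}
```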
See #517 for a potential solution.
This should be resolved in the
I have run into a situation in which my postgres database is being restarted (not sure if it's a graceful restart or not), thus getting rid of all of the connections on the server side.
The connection pool does not seem to be responding well to this. When I try to query using one of the bad connections, I receive a TCP write timeout error. The connection pool does not recognize this (even though it should be checking whether the connection is alive before every query), and puts the connection back into the pool.
I've dug through the code a bit and am not seeing any obvious problems that could be causing this, but I'm going to keep looking.
Can someone please tell me whether I should expect the connection pool to discard the connection upon receiving a TCP timeout error? I have resorted to detecting this error in my application and resetting the entire pool (sketched below), but that is too heavy a solution to use permanently, in my opinion.
Thanks!
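For completeness, the heavy workaround looks roughly like this (a sketch; it assumes v3's ConnPool.Reset, which closes every pooled connection):

```go
if _, err := pool.Exec(query, args...); err != nil {
	if _, ok := err.(net.Error); ok {
		// Drops every connection, good and bad alike, which is why this is
		// far too heavy to keep permanently.
		pool.Reset()
	}
}
```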