Fix SSL error queue cleanup for backend conns #4602
Merged
The fix in this PR and its associated report are the result of a follow-up to issue #4556. Due to the nature
of the issue and of the initial report, extra analysis of the exact conditions and potential consequences was
required. First, we analyze the scenario presented by #4556:
Scenario - #4556
An `SSL` error takes place in a frontend connection, and this error fills the per-thread `SSL` error queue.
Further attempts to perform `SSL` operations on the same thread result in invalid errors being received by
the `SSL` calls. Under the right conditions, if the error queue isn't cleared by further operations, this
could result in:
It's important to remark that, under the right conditions, the backend connection destruction will
manifest as a cascading effect, resulting in many connections being closed at once, without any
apparent reason from either the `ProxySQL` or the `MySQL` side. These would be all the connections that
the offending thread, the one with the non-cleared `SSL` error queue, attempted to use.

Reproduction - #4556
For reproduction, the following conditions should be met:

- Backend connections with `SSL` enabled; these connections will be the ones affected by the non-cleared
  `SSL` error queue, after the error in the frontend connection takes place.
- `non-SSL` traffic in the frontend connections; this traffic will make use of the `SSL` backend
  connections, without making use of `SSL` itself. Remark: It's possible to trigger the issue with `SSL`
  traffic in the frontend, but under normal load it would be much harder, as `SSL` operations for the
  frontend connections clear the `SSL` error queue (via `get_sslstatus`).
- A frontend connection using `SSL` will generate an error, placing an entry in the `SSL` error queue.
Since the rest of the frontend traffic is `non-SSL`, no queue cleanup (`ERR_clear_error`) takes place;
this cleanup would otherwise happen during `get_sslstatus` (see `mysql_data_stream.cpp`). So, even with
traffic, the `SSL` error queue will remain filled. The issue will propagate to all the other backend
connections handled by that `MySQL_Thread`. If `ProxySQL` is using `--idle-threads`, this includes
connections created by other threads but now assigned to the offending thread for query processing.
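The mechanism described above can be illustrated with a small model. This is only a sketch: `ssl_error_queue` and every function below are hypothetical stand-ins, not the OpenSSL or ProxySQL API. The model assumes what the OpenSSL documentation states for `SSL_get_error`, namely that the calling thread's error queue must be empty before an `SSL` I/O operation for the result to be reliable.

```cpp
#include <cassert>
#include <deque>
#include <string>

// Toy stand-in for the per-thread SSL error queue (NOT the OpenSSL API).
thread_local std::deque<std::string> ssl_error_queue;

// A failed frontend SSL operation leaves an entry in the queue.
void frontend_ssl_failure() {
    ssl_error_queue.push_back("frontend SSL error");
}

// Frontend SSL traffic clears the queue, modelling the ERR_clear_error
// call that the PR description attributes to get_sslstatus.
void frontend_ssl_request() {
    ssl_error_queue.clear();
}

// Non-SSL frontend traffic never touches the queue.
void frontend_plain_request() {}

// A backend SSL operation that is healthy at the transport level, but
// whose outcome is judged by inspecting the queue: any stale entry is
// misread as a failure of this operation.
bool backend_ssl_operation_ok() {
    return ssl_error_queue.empty();
}
```

In this model, calling `frontend_ssl_failure()` followed by any number of `frontend_plain_request()` calls leaves `backend_ssl_operation_ok()` returning `false`; a single `frontend_ssl_request()` on the same thread makes it return `true` again.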
The provided regression test (`reg_test_4556-ssl_error_queue-t.cpp`) achieves reproduction trivially by
inducing an SSL error in the frontend connection. It also lets us verify the per-thread error
distribution, showing that the errors are concentrated on one `mysql_thread`. This is done by making use
of `mysql-session_idle_ms`, which makes sure that connections don't leave their creation thread (they are
not considered idle) before the imposed `idle` timeout:

The actual test output is much more complete; the excerpt is heavily edited for readability. The test, of
course, makes use of this capability to ensure that errors are not present with or without connection
sharing between the threads (again via `mysql-session_idle_ms`).

Impact - #4556
As discussed previously, the impact that can be expected from this issue is highly dependent on the
workload that ProxySQL is receiving. Even if `SSL` errors take place, under a heavy load of frontend
`SSL` connections the queue will be cleaned up with every client request. With an even distribution of
queries across the threads, it could be hard to hit the timing that results in a cascading close of
backend connections from the connection pool. Certain conditions would also be required from the backend
connections, as connection creation could also clear the error queue; more on this later.
So, the scenario in which this bug is expected to have higher impact is one with `SSL` connections
enabled on the backend, low `SSL` traffic received in the frontend (required for triggering the error),
and low per-thread creation of backend conns, meaning high efficiency of the connection pool in
connection reuse. The amount of `non-SSL` traffic on the frontend isn't relevant, as long as the
connection pool is efficient in connection reuse.
Implications - Issue beyond #4556
The implications of #4556 go beyond the original issue, and extra fixes are required to ensure that
issues similar in nature don't take place. Revisiting the now complete scenario:
- Backend connections with `SSL` enabled.
- `non-SSL` traffic in the frontend connections.
- A frontend connection using `SSL` will generate an error, placing an entry in the `SSL` error queue.
So, if there are `SSL` backend connections, and these backend connections are serving traffic, why isn't
the SSL error queue cleaned up when performing these operations? Why isn't it cleaned up after an error
is detected in one of them and the connection is closed?
This suggested that no SSL queue cleanup was being performed by the backend connections. The `SSL`
handling of the backend connections is, in principle, managed by `libmariadbclient`. This implies that
no error cleanup, or maybe an error cleanup only for certain cases, was taking place inside the library.
This theory can be proven by creating a bunch of backend connections, filling up the connection pool,
and killing a connection from `MySQL` itself. In the next `ping` or attempt by `ProxySQL` to use that
connection, the non-cleared error queue will propagate to the rest of the backend connections being
handled by that thread.
First we create a bunch of backend connections:
Second, we kill one of those backend connections from `MySQL`; in `ProxySQL` all appears to be fine aside
from the killed connection:
But if we wait long enough for the next ping operations, we can see that the connections used by the
offending thread, the one that received the kill during the query, will present the following errors:
When pinging the connections, the ping errors might not be seen immediately, especially with a low
number of connections and few backend errors, and especially in a `multi-threaded` environment with
`idle-threads`. This is because the thread selected for pinging the connections may not match the one
that received the error. But if we try to exercise the connection pool using `NON-SSL` traffic:

We see that as soon as the thread that received the error starts serving traffic, errors start taking
place:
A more fine-tuned and aggressive case for this scenario is present in the TAP test
(`reg_test_4556-ssl_error_queue-t.cpp`) added by this PR. It tests both for failures detected during
backend connection keep-alive pings and for failures while exercising the connections present in the
connection pool.
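The cascading, per-thread nature of the failures described in this section can be sketched with a toy model. Everything here is hypothetical and only for illustration: `Worker` stands in for a `MySQL_Thread`, `BackendConn` for a pooled backend connection, and a boolean flag stands in for a non-empty per-thread `SSL` error queue. It is not ProxySQL code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a MySQL_Thread.
struct Worker {
    // Models a non-empty, never-cleared per-thread SSL error queue.
    bool stale_ssl_error = false;
};

// Hypothetical stand-in for a pooled backend connection.
struct BackendConn {
    bool uses_ssl = true;
    bool alive = true;
};

// A genuine backend error (e.g. the connection was killed from MySQL):
// the connection is dropped, but the thread's queue is left uncleared.
void backend_error(Worker& w, BackendConn& c) {
    c.alive = false;
    w.stale_ssl_error = true;
}

// The worker uses a connection (keep-alive ping or query).
// Returns true on success; on a spurious failure the connection is destroyed.
bool use_conn(Worker& w, BackendConn& c) {
    if (!c.alive) return false;
    if (c.uses_ssl && w.stale_ssl_error) {
        c.alive = false;  // healthy connection closed because of the stale entry
        return false;
    }
    return true;
}

// Number of connections still alive in the pool.
std::size_t alive_count(const std::vector<BackendConn>& pool) {
    std::size_t n = 0;
    for (const BackendConn& c : pool) {
        if (c.alive) ++n;
    }
    return n;
}
```

With two workers and a pool of six connections, a single `backend_error` on one worker makes every later `use_conn` by that same worker fail and destroy the connection, while the other worker keeps succeeding. This matches the observation that errors concentrate on one thread.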
Impact
The impact of this issue could be very similar to that of #4556. The main difference would be
the source of the error, now originating in a backend connection, but as with #4556 it could result in:
The main difference is that this issue doesn't need frontend traffic to be triggered, so it could be
observed in a completely idle ProxySQL with just a filled up connection pool.
Proposed Solution - Patch
The error lies within `libmariadbclient`, which should perform a cleanup of the `SSL` error queue
whenever an error of this class is detected. But, since in `ProxySQL` we never share connections
before the completion of operations performed by `libmariadbclient`, patching the library doesn't
seem necessary; it should be safe to handle the error and clear the queue from `ProxySQL` itself.
Commit #ece3694e packs this proposed fix: whenever a client-side error is detected and the
backend connection was making use of `SSL`, the error queue is cleared via `ERR_clear_error`.
This should be sufficient for keeping a clear `SSL` error queue for backend connections.
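The shape of this fix can be sketched as follows. This is a minimal model under stated assumptions, not the code from the commit: `ssl_error_queue` and `model_ERR_clear_error` are toy stand-ins for the per-thread queue and for OpenSSL's `ERR_clear_error()`, `BackendConn` and `on_backend_error` are hypothetical names, and treating error codes in the 2000..2999 range as client-side errors is an assumption based on the MySQL client error numbering (for example, 2013 is `CR_SERVER_LOST`).

```cpp
#include <cassert>
#include <deque>

// Toy stand-in for the per-thread SSL error queue (NOT the OpenSSL API).
thread_local std::deque<unsigned long> ssl_error_queue;

// Models ERR_clear_error(): empties the calling thread's queue.
void model_ERR_clear_error() {
    ssl_error_queue.clear();
}

// Hypothetical view of a backend connection.
struct BackendConn {
    bool uses_ssl;
};

// Assumption: client-side (library) errors use the 2000..2999 range.
bool is_client_side_error(unsigned int myerr) {
    return myerr >= 2000 && myerr < 3000;
}

// Hypothetical error hook: clear the queue only when a client-side error
// was reported on a backend connection that was using SSL.
void on_backend_error(const BackendConn& conn, unsigned int myerr) {
    if (conn.uses_ssl && is_client_side_error(myerr)) {
        model_ERR_clear_error();
    }
}
```

Restricting the cleanup to `SSL` connections and client-side errors keeps the hook a no-op for server-reported errors (such as a syntax error), which do not involve the `SSL` layer in this model.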