Fix SSL error queue cleanup for backend conns #4602

Merged
3 commits merged into v2.x on Aug 13, 2024

Conversation

JavierJF (Collaborator) commented Aug 8, 2024

The fix in this PR and its associated report are the result of a follow-up on issue #4556. Due to the
nature of the issue, and of the initial report, extra analysis of the exact conditions and potential
consequences was required. First we are going to analyze the scenario presented by #4556:

  • Scenario: which scenario arises and what can be expected from the misbehavior.
  • Reproduction: the exact conditions required to reproduce the issue.
  • Impact: what impact can be expected depending on the load being received.
  • Implications: the full extent of the implications and the further actions performed by this PR.

Scenario - #4556

An SSL error takes place in a frontend connection and fills the per-thread SSL error queue.
Further attempts to perform SSL operations on the same thread result in invalid errors being
received by the SSL calls. Under the right conditions, if the error queue isn't cleared by further
operations, this can result in:

  • Incorrect destruction of connections in conn-pool.
  • Invalid error propagation to clients.

It's important to remark that, under the right conditions, the backend connection destruction will
manifest as a cascading effect, with many connections being closed at once, without any
apparent reason from either the ProxySQL or the MySQL side. These would be all the connections
that the offending thread, the one with the non-cleared SSL error queue, attempted to use.
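
The underlying mechanism is OpenSSL's per-thread error queue. The following minimal, standalone
sketch (not ProxySQL code; it only assumes the OpenSSL headers and links against libcrypto) shows
how an entry left behind by one connection stays visible to every later SSL call made on the same
thread until something drains the queue:

// err_queue_demo.cpp -- illustration only; build: g++ err_queue_demo.cpp -lcrypto
#include <openssl/err.h>
#include <cstdio>

int main() {
    // Simulate connection A leaving an error behind (illustrative reason code;
    // real entries are pushed internally by failed SSL operations).
    ERR_put_error(ERR_LIB_SSL, 0, ERR_R_INTERNAL_ERROR, __FILE__, __LINE__);

    // Later work on the SAME thread (e.g. connection B) still sees the entry,
    // so SSL_get_error() would misreport B's status.
    printf("stale entry: %s\n", ERR_error_string(ERR_peek_error(), nullptr));

    ERR_clear_error();                              // drains this thread's queue
    printf("after clear: %lu\n", ERR_peek_error()); // prints 0
    return 0;
}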

Reproduction - #4556

For reproduction, the following conditions should be met:

  1. Backend servers have SSL enabled; these connections will be the ones affected by the non-cleared
    SSL error queue after the error in the frontend connection takes place.
  2. There is non-SSL traffic in the frontend connections; this traffic will make use of the SSL
    backend connections without making use of SSL itself. Remark: it's possible to trigger
    the issue with SSL traffic in the frontend, but under normal load it would be much harder, as
    SSL operations for the frontend connections clear the SSL error queue (via get_sslstatus).
  3. A frontend connection making use of SSL generates an error, placing an entry in the SSL
    error queue.

Since the rest of the frontend traffic is non-SSL, nothing triggers the queue cleanup
(ERR_clear_error) that would otherwise take place during get_sslstatus (see
mysql_data_stream.cpp). So, even with traffic, the SSL error queue remains filled. The
issue then propagates to all the other backend connections handled by that MySQL_Thread; if
ProxySQL is using --idle-threads, this includes connections created by other threads but now
assigned to the offending thread for query processing.
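
For reference, the shape of that helper is roughly as follows (a paraphrased sketch, not a verbatim
copy of mysql_data_stream.cpp); the detail that matters for this issue is the ERR_clear_error()
call, which is why SSL frontend traffic masks the bug while purely non-SSL frontend traffic does not:

#include <openssl/ssl.h>
#include <openssl/err.h>

enum sslstatus { SSLSTATUS_OK, SSLSTATUS_WANT_IO, SSLSTATUS_FAIL };

static enum sslstatus get_sslstatus(SSL* ssl, int n) {
    int err = SSL_get_error(ssl, n);
    // Translating the status also drains the thread's SSL error queue, so
    // frontend SSL operations keep the queue clean as a side effect.
    ERR_clear_error();
    switch (err) {
    case SSL_ERROR_NONE:
        return SSLSTATUS_OK;
    case SSL_ERROR_WANT_READ:
    case SSL_ERROR_WANT_WRITE:
        return SSLSTATUS_WANT_IO;
    default:
        return SSLSTATUS_FAIL;
    }
}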

The provided regression test (reg_test_4556-ssl_error_queue-t.cpp) reproduces the issue
trivially by inducing an SSL error in the frontend connection; it also lets us verify the per-thread
error distribution, showing that the errors are concentrated on one mysql_thread. This is done
by making use of mysql-session_idle_ms, which ensures that connections don't leave their
creation thread (they are not considered idle) before the imposed idle timeout:

# 2024-08-06 18:17:01: Issuing query 'SET mysql-session_idle_ms=10000' to ('127.0.0.1':6032)
...
# Dumping per-thread conn count:
Map entry   thread=0x71ee06481000 count=6
Map entry   thread=0x71ee064c0000 count=5
Map entry   thread=0x71ee07016000 count=3
Map entry   thread=0x71ee07048000 count=8
Map entry   thread=0x71ee070bf000 count=5
Map entry   thread=0x71ee070f2000 count=4
Map entry   thread=0x71ee07e36000 count=5
Map entry   thread=0x71ee07e50000 count=4
ok 1 - No thread should be left without connections   lo_count=3
# Force early SSL failure in thread
...
# Checking thread conns   thread_addr=0x71ee06481000
ok 2 - Query should execute without error   rc=0 mysql_error=''
...
ok 13 - Resultset should be properly retreived   myres=0x5729e0f37e10 mysql_error=''
# Checking thread conns   thread_addr=0x71ee064c0000
ok 14 - Query should execute without error   rc=0 mysql_error=''
...
ok 23 - Resultset should be properly retreived   myres=0x5729e0f36e30 mysql_error=''
...
...
# Checking thread conns   thread_addr=0x71ee07e36000
ok 64 - Query should execute without error   rc=0 mysql_error=''
...
ok 73 - Resultset should be properly retreived   myres=0x5729e0f37e10 mysql_error=''
# Checking thread conns   thread_addr=0x71ee07e50000
not ok 74 - Query should execute without error   rc=1 mysql_error='Lost connection to server during query'
...
not ok 81 - Resultset should be properly retreived   myres=(nil) mysql_error='Lost connection to server during query'

The actual test output is much more complete; it has been heavily edited here for readability. The
test, of course, makes use of this capability to check for the presence of errors both with and
without connection sharing between the threads (again via mysql-session_idle_ms).

Impact - #4556

As discussed previously, the impact that can be expected from this issue is highly dependent on the
workload that ProxySQL is receiving. Even if SSL errors take place, under a heavy load of frontend
SSL connections the queue will be cleaned up with every client request, and with an even
distribution of queries across the threads it could be hard to hit the timing that results in a
cascading backend connection close from the connection pool. Certain conditions would also be
required from the backend connections, as connection creation could also clear the error queue; more
on this later.

So, the scenario in which this bug is expected to have the highest impact is: SSL connections
enabled on the backend, low SSL traffic received in the frontend (required for triggering
the error), and low per-thread creation of backend connections, meaning high efficiency of the
connection pool in connection reuse. The amount of non-SSL traffic on the frontend isn't relevant,
as long as the connection pool is efficient in connection reuse.

Implications - Issue beyond #4556

The implications of #4556 go beyond the original issue, and extra fixes are required to ensure that
issues similar in nature don't take place. Revising the now complete scenario:

  1. Backend servers have SSL enabled.
  2. There is non-SSL traffic in the frontend connections.
  3. A frontend connection making use of SSL generates an error, placing an entry in the SSL
    error queue.

So, if there are SSL backend connections, and these backend connections are serving traffic, why
isn't the SSL error queue cleaned up when performing these operations? Why isn't it cleaned up
after an error is detected in one of them and the connection is closed?

This suggested that no SSL error queue cleanup was being performed for the backend connections. The
SSL handling of the backend connections is, in principle, managed by libmariadbclient. This
implies that no error cleanup, or perhaps a cleanup only in certain cases, was taking place
inside the library.

This theory can be proven by creating a bunch of backend connections, filling up the connection pool,
and killing one connection from MySQL itself. On the next ping, or the next attempt by ProxySQL to use
that connection, the non-cleared error queue will propagate to the rest of the backend connections
being handled by that thread.

First we create a bunch of backend connections:

query="SELECT SLEEP(10)"

for i in {1..10}; do
    mysql -uroot -proot -h0.0.0.0 -P6033 --ssl-mode=DISABLED -e"$query" &
done

Second, we kill one of those backend connections from MySQL. In ProxySQL everything appears to be
fine aside from the killed connection:

2024-07-29 09:58:22 MySQL_Session.cpp:4811:handler_minus1_ClientLibraryError(): [ERROR] Detected a broken connection while running query on (0,127.0.0.1,13306,7237) , FD (Conn:66 , MyDS:66) , user root , last_used 80513004ms ago : 2013, Lost connection to server during query

But if we wait long enough for the next ping operations, we can see that the connections used by
the offending thread, the one that received the kill during the query, present the
following errors:

2024-07-29 09:59:22 MySQL_Session.cpp:1962:handler_again___status_PINGING_SERVER(): [ERROR] Detected a broken connection while during ping on (0,127.0.0.1,13306,15358) , FD (Conn:83 , MyDS:83) , user root , last_used 0ms ago : 2013, Lost connection to server during query

When pinging the connections, the ping errors might not show up immediately, especially with a
low number of connections and few backend errors, and especially in a multi-threaded environment
with idle-threads. This is because the thread selected for pinging the connections may not match
the one that received the error. But if we exercise the connection pool using non-SSL traffic:

while true; do mysql -c --ssl-mode=DISABLED -uroot -proot -P6033 --protocol=TCP --verbose -e"SELECT 1"; [ $? -ne 0 ] && break; done

We see that as soon as the thread that received the error starts serving traffic, errors
start taking place:

2024-07-29 10:12:02 MySQL_Session.cpp:4811:handler_minus1_ClientLibraryError(): [ERROR] Detected a broken connection while running query on (0,127.0.0.1,13306,7236) , FD (Conn:65 , MyDS:65) , user root , last_used 4185ms ago : 2013, Lost connection to server during query
2024-07-29 10:12:05 MySQL_Session.cpp:4811:handler_minus1_ClientLibraryError(): [ERROR] Detected a broken connection while running query on (0,127.0.0.1,13306,7231) , FD (Conn:60 , MyDS:60) , user root , last_used 14ms ago : 2013, Lost connection to server during query

A more fine-tuned and aggressive version of this scenario is present in the TAP test
(reg_test_4556-ssl_error_queue-t.cpp) added by this PR, testing both failures detected during
backend connection keep-alive pings and failures while exercising the connections present in the
connection pool.

Impact

The impact of this issue could be very similar to that of #4556. The main difference would be
the source of the error, now originating in a backend connection, but as with #4556 it could result in:

  • Incorrect destruction of connections in conn-pool.
  • Invalid error propagation to clients.

Another difference is that this issue doesn't need frontend traffic to be triggered, so it could be
observed in a completely idle ProxySQL with just a filled-up connection pool.

Proposed Solution - Patch

The error lies within libmariadbclient, which should perform a cleanup of the SSL error queue
whenever an error of this class is detected. But since in ProxySQL we never share connections
before the completion of the operations performed by libmariadbclient, patching the library doesn't
seem necessary; it should be safe to handle the error and clear the queue from ProxySQL itself.

Commit ece3694e packs this proposed fix: whenever a client-side error is detected and the
backend connection was making use of SSL, the error queue is cleared via ERR_clear_error.
This should be sufficient for keeping the SSL error queue of backend connections clear.
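
In rough terms, the change amounts to the following sketch (hypothetical function and parameter
names, not the exact ProxySQL code):

#include <openssl/err.h>

// Invoked on the paths that report a broken backend connection
// (e.g. handler_minus1_ClientLibraryError and the ping-failure handling).
void on_backend_client_library_error(bool backend_conn_uses_ssl) {
    // ... existing handling: log the error and retire the broken backend
    //     connection from the connection pool ...
    if (backend_conn_uses_ssl) {
        // Drain the per-thread OpenSSL error queue so the stale entry cannot
        // corrupt SSL_get_error() results for the other backend connections
        // handled by this same MySQL_Thread.
        ERR_clear_error();
    }
}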

The SSL error queue wasn't cleaned up after an SSL-related error took place
in a backend connection. This would result in the propagation of the
error to other conns handled by the thread, which could result in:

- Incorrect destruction of connections in conn-pool.
- Invalid error propagation to clients.

This is a consequence of 'libmariadbclient' not performing a cleanup of
this queue by itself. The situation was mitigated by the fact that the library
**does** perform a cleanup of such queue during the connect phase
('auth_caching_sha2_client|auth_sha256_client'), and ProxySQL does a
cleanup of this queue during frontend SSL traffic.
renecannao mentioned this pull request Aug 8, 2024

renecannao (Contributor) commented:

retest this please

1 similar comment

renecannao (Contributor) commented:

retest this please

renecannao merged commit 1f4b088 into v2.x on Aug 13, 2024
78 of 99 checks passed