
ProxySQL sets state of all servers to OFFLINE_HARD for no obvious reasons #850

Closed
ovaistariq opened this issue Dec 22, 2016 · 13 comments

Comments

@ovaistariq

ProxySQL is configured in an environment where it's acting as a load balancer for PXC nodes. Randomly, it sets the nodes to OFFLINE_HARD state for no obvious reason. This has happened 3 times in the last 4 days. The cluster doesn't have any load at all; it's a test cluster used for periodic load testing. The ProxySQL node and all the cluster nodes are in Amazon AWS within the same private subnet.

Excerpts from ProxySQL's error log:

2016-12-19 11:10:56 MySQL_HostGroups_Manager.cpp:477:commit(): [WARNING] Removed server at address 140401366121088, hostgroup 11, address 172.31.5.18 port 3306. Setting status OFFLINE HARD and immediately dropping all free connections. Used connections will be dropped when trying to use them
2016-12-19 11:10:56 MySQL_HostGroups_Manager.cpp:477:commit(): [WARNING] Removed server at address 140401392652032, hostgroup 11, address 172.31.10.29 port 3306. Setting status OFFLINE HARD and immediately dropping all free connections. Used connections will be dropped when trying to use them
2016-12-19 11:10:56 MySQL_HostGroups_Manager.cpp:477:commit(): [WARNING] Removed server at address 140401392651776, hostgroup 10, address 172.31.15.163 port 3306. Setting status OFFLINE HARD and immediately dropping all free connections. Used connections will be dropped when trying to use them
HID: 11 , address: 172.31.5.18 , port: 3306 , weight: 1000 , status: OFFLINE_HARD , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
HID: 11 , address: 172.31.10.29 , port: 3306 , weight: 1000 , status: OFFLINE_HARD , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
HID: 10 , address: 172.31.15.163 , port: 3306 , weight: 1000 , status: OFFLINE_HARD , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: WRITE
HID: 11 , address: 172.31.5.18 , port: 3306 , weight: 1000 , status: OFFLINE_HARD , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
HID: 11 , address: 172.31.10.29 , port: 3306 , weight: 1000 , status: OFFLINE_HARD , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
HID: 10 , address: 172.31.15.163 , port: 3306 , weight: 1000 , status: OFFLINE_HARD , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: WRITE
HID: 11 , address: 172.31.10.29 , port: 3306 , weight: 1000 , status: OFFLINE_HARD , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
HID: 11 , address: 172.31.10.29 , port: 3306 , weight: 1000 , status: OFFLINE_HARD , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
@ovaistariq
Author

The galera checker and node monitor scripts being used can be found at https://github.com/twindb/proxysql/tree/master/support/percona

@ovaistariq
Author

ovaistariq commented Dec 22, 2016

Excerpt from Galera checker script for the same time period:

Mon Dec 19 11:10:56 UTC 2016 Check server 10:172.31.15.163:3306 , status ONLINE , wsrep_local_state 4
Mon Dec 19 11:10:56 UTC 2016 Check server 11:172.31.10.29:3306 , status ONLINE , wsrep_local_state 4
Mon Dec 19 11:10:56 UTC 2016 Check server 11:172.31.11.165:3306 , status ONLINE , wsrep_local_state 4
Mon Dec 19 11:10:56 UTC 2016 Check server 11:172.31.5.18:3306 , status ONLINE , wsrep_local_state 4
Mon Dec 19 11:10:56 UTC 2016 Number of writers online: 0 : hostgroup: 10
Mon Dec 19 11:10:56 UTC 2016 Trying to set an available reader node as the writer node of the cluster
Mon Dec 19 11:10:56 UTC 2016 Number of writers online: 0 : hostgroup: 10
Mon Dec 19 11:10:56 UTC 2016 Trying to enable last available node of the cluster (in Donor/Desync state)
Mon Dec 19 11:10:56 UTC 2016 Enabling config

@ovaistariq
Author

@renecannao
Contributor

@ovaistariq : I think something happened before that time.
The warning you get at MySQL_HostGroups_Manager.cpp:477:commit() means that during LOAD MYSQL SERVERS TO RUNTIME there are no servers in mysql_servers_incoming (an internal table used to move entries from Admin to Hostgroup_Manager), resulting in no matches in the LEFT JOIN there. In other words, mysql_servers in Hostgroup_Manager has entries that do not exist in mysql_servers in Admin.
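For reference, a rough way to spot that divergence from the admin interface is to compare runtime_mysql_servers against mysql_servers. This is only a sketch of the idea, not the internal commit() code; it assumes the admin interface listens on 127.0.0.1:6032 with the default admin/admin credentials and that runtime_mysql_servers is exposed by your version:

# Sketch only: servers visible at runtime but missing from the configured
# mysql_servers table; these are the entries commit() would set to
# OFFLINE_HARD on the next LOAD MYSQL SERVERS TO RUNTIME.
mysql -h127.0.0.1 -P6032 -uadmin -padmin -t -e "
SELECT r.hostgroup_id, r.hostname, r.port, r.status
FROM runtime_mysql_servers r
LEFT JOIN mysql_servers m
  ON  r.hostgroup_id = m.hostgroup_id
  AND r.hostname     = m.hostname
  AND r.port         = m.port
WHERE m.hostname IS NULL;"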

Could you please attach an hour or so of logs from both ProxySQL's error log and the Galera checker script log, before such an event?
I want to search for any other anomaly.
Thanks

@ovaistariq
Author

Sure will do that tonight.

@ovaistariq
Author

@ovaistariq
Author

Let me know if you need more info.

As you can see, just before seeing 0 writers online, galera_checker did see all nodes in ONLINE state:

Mon Dec 19 11:10:56 UTC 2016 Check server 10:172.31.15.163:3306 , status ONLINE , wsrep_local_state 4
Mon Dec 19 11:10:56 UTC 2016 Check server 11:172.31.10.29:3306 , status ONLINE , wsrep_local_state 4
Mon Dec 19 11:10:56 UTC 2016 Check server 11:172.31.11.165:3306 , status ONLINE , wsrep_local_state 4
Mon Dec 19 11:10:56 UTC 2016 Check server 11:172.31.5.18:3306 , status ONLINE , wsrep_local_state 4
Mon Dec 19 11:10:56 UTC 2016 Number of writers online: 0 : hostgroup: 10

@ovaistariq
Author

Similarly, this is interesting:

2016-12-19 11:10:56 [INFO] Received LOAD MYSQL SERVERS TO RUNTIME command
HID: 11 , address: 172.31.5.18 , port: 3306 , weight: 1000 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
HID: 11 , address: 172.31.11.165 , port: 3306 , weight: 1000 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
HID: 11 , address: 172.31.10.29 , port: 3306 , weight: 1000 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
HID: 10 , address: 172.31.15.163 , port: 3306 , weight: 1000 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: WRITE
2016-12-19 11:10:56 MySQL_HostGroups_Manager.cpp:477:commit(): [WARNING] Removed server at address 140401366121088, hostgroup 11, address 172.31.5.18 port 3306. Setting status OFFLINE HARD and immediately dropping all free connections. Used connections will be dropped when trying to use them
2016-12-19 11:10:56 MySQL_HostGroups_Manager.cpp:477:commit(): [WARNING] Removed server at address 140401392652032, hostgroup 11, address 172.31.10.29 port 3306. Setting status OFFLINE HARD and immediately dropping all free connections. Used connections will be dropped when trying to use them
2016-12-19 11:10:56 MySQL_HostGroups_Manager.cpp:477:commit(): [WARNING] Removed server at address 140401392651776, hostgroup 10, address 172.31.15.163 port 3306. Setting status OFFLINE HARD and immediately dropping all free connections. Used connections will be dropped when trying to use them

@ovaistariq
Author

ProxySQL receives a LOAD MYSQL SERVERS TO RUNTIME command and sees all servers as ONLINE; however, it immediately removes all of them.

@renecannao
Contributor

All versions up to 1.3.2 handle LOAD MYSQL SERVERS TO RUNTIME this way:
a) print in the error log the current content of main.mysql_servers (as seen in Admin)
b) pass the servers one by one to MySQL_HostGroups_Manager, which writes them into myhgm.mysql_servers_incoming (an internal table used by MySQL_HostGroups_Manager)
c) call MySQL_HostGroups_Manager::commit(), which:
c.1) removes all servers from myhgm.mysql_servers (an internal table used by MySQL_HostGroups_Manager)
c.2) based on memory structures, rebuilds myhgm.mysql_servers and prints in the error log the memory representation of the MySQL servers
c.3) performs some joins between myhgm.mysql_servers and myhgm.mysql_servers_incoming to figure out what has changed
c.4) based on memory structures, rebuilds myhgm.mysql_servers and prints in the error log the memory representation of the MySQL servers (as in c.2)
A way to observe these phases from the outside is sketched right after this list.
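The following is only an illustration of that, assuming the default admin interface on 127.0.0.1:6032 with admin/admin credentials; it uses the same mysql-client pattern as the checker scripts:

# Sketch only: dump the Admin view (what phase a prints), trigger the reload,
# then dump the runtime view (what phases c.2/c.4 reflect), so the output can
# be lined up with the error-log blocks below.
PROXYSQL_CMD="mysql -h127.0.0.1 -P6032 -uadmin -padmin -t -e"

$PROXYSQL_CMD "SELECT hostgroup_id, hostname, port, status, comment FROM mysql_servers;"
$PROXYSQL_CMD "LOAD MYSQL SERVERS TO RUNTIME;"
$PROXYSQL_CMD "SELECT hostgroup_id, hostname, port, status, comment FROM runtime_mysql_servers;"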

Extracting a portion of the error log without errors, we can divide it this way:

  • phase a:
2016-12-19 11:10:51 [INFO] Received LOAD MYSQL SERVERS TO RUNTIME command
HID: 11 , address: 172.31.5.18 , port: 3306 , weight: 1000 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
HID: 11 , address: 172.31.11.165 , port: 3306 , weight: 1000 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
HID: 11 , address: 172.31.10.29 , port: 3306 , weight: 1000 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
HID: 10 , address: 172.31.15.163 , port: 3306 , weight: 1000 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: WRITE
  • phase c.2 :
HID: 11 , address: 172.31.5.18 , port: 3306 , weight: 1000 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
HID: 11 , address: 172.31.11.165 , port: 3306 , weight: 1000 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
HID: 11 , address: 172.31.10.29 , port: 3306 , weight: 1000 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
HID: 10 , address: 172.31.15.163 , port: 3306 , weight: 1000 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: WRITE
  • phase c.4:
HID: 11 , address: 172.31.5.18 , port: 3306 , weight: 1000 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
HID: 11 , address: 172.31.11.165 , port: 3306 , weight: 1000 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
HID: 11 , address: 172.31.10.29 , port: 3306 , weight: 1000 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
HID: 10 , address: 172.31.15.163 , port: 3306 , weight: 1000 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: WRITE

But what happened 5 seconds later, at 11:10:56, is interesting:

2016-12-19 11:10:56 [INFO] Received LOAD MYSQL VARIABLES TO RUNTIME command
2016-12-19 11:10:56 ProxySQL_Admin.cpp:2890:flush_mysql_variables___database_to_runtime(): [ERROR] Impossible to set not existing variable session_debug with value "(null)". Deleting
2016-12-19 11:10:56 [INFO] Received SAVE MYSQL VARIABLES TO DISK command
2016-12-19 11:10:56 [INFO] Received LOAD MYSQL SERVERS TO RUNTIME command
HID: 11 , address: 172.31.5.18 , port: 3306 , weight: 1000 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
HID: 11 , address: 172.31.11.165 , port: 3306 , weight: 1000 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
HID: 11 , address: 172.31.10.29 , port: 3306 , weight: 1000 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
HID: 10 , address: 172.31.15.163 , port: 3306 , weight: 1000 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: WRITE
2016-12-19 11:10:56 MySQL_HostGroups_Manager.cpp:477:commit(): [WARNING] Removed server at address 140401366121088, hostgroup 11, address 172.31.5.18 port 3306. Setting status OFFLINE HARD and immediately dropping all free connections. Used connections will be dropped when trying to use them
2016-12-19 11:10:56 MySQL_HostGroups_Manager.cpp:477:commit(): [WARNING] Removed server at address 140401392652032, hostgroup 11, address 172.31.10.29 port 3306. Setting status OFFLINE HARD and immediately dropping all free connections. Used connections will be dropped when trying to use them
2016-12-19 11:10:56 MySQL_HostGroups_Manager.cpp:477:commit(): [WARNING] Removed server at address 140401392651776, hostgroup 10, address 172.31.15.163 port 3306. Setting status OFFLINE HARD and immediately dropping all free connections. Used connections will be dropped when trying to use them
HID: 11 , address: 172.31.5.18 , port: 3306 , weight: 1000 , status: OFFLINE_HARD , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
HID: 11 , address: 172.31.11.165 , port: 3306 , weight: 1000 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
HID: 11 , address: 172.31.10.29 , port: 3306 , weight: 1000 , status: OFFLINE_HARD , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
HID: 10 , address: 172.31.15.163 , port: 3306 , weight: 1000 , status: OFFLINE_HARD , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: WRITE
HID: 11 , address: 172.31.5.18 , port: 3306 , weight: 1000 , status: OFFLINE_HARD , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
HID: 11 , address: 172.31.11.165 , port: 3306 , weight: 1000 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
HID: 11 , address: 172.31.10.29 , port: 3306 , weight: 1000 , status: OFFLINE_HARD , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
HID: 10 , address: 172.31.15.163 , port: 3306 , weight: 1000 , status: OFFLINE_HARD , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: WRITE
2016-12-19 11:10:56 [INFO] New mysql_replication_hostgroups table
2016-12-19 11:10:56 [INFO] Received LOAD MYSQL SERVERS TO RUNTIME command
HID: 11 , address: 172.31.10.29 , port: 3306 , weight: 1000 , status: OFFLINE_HARD , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
HID: 11 , address: 172.31.11.165 , port: 3306 , weight: 1000 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
HID: 11 , address: 172.31.10.29 , port: 3306 , weight: 1000 , status: OFFLINE_HARD , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
HID: 11 , address: 172.31.11.165 , port: 3306 , weight: 1000 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
HID: 11 , address: 172.31.10.29 , port: 3306 , weight: 1000 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
HID: 11 , address: 172.31.11.165 , port: 3306 , weight: 1000 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
HID: 11 , address: 172.31.5.18 , port: 3306 , weight: 1000 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
HID: 10 , address: 172.31.15.163 , port: 3306 , weight: 1000 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: WRITE
2016-12-19 11:10:56 [INFO] New mysql_replication_hostgroups table
2016-12-19 11:10:56 [INFO] Received SAVE MYSQL SERVERS TO DISK command

There is a lot of output, but I think what is mostly relevant is the fact that there are 2 LOAD MYSQL SERVERS TO RUNTIME commands.
There is a mutex that serializes the execution of LOAD MYSQL SERVERS TO RUNTIME, therefore the second one starts only after the first one has completed.
Something went wrong during the first LOAD MYSQL SERVERS TO RUNTIME. I initially thought about some DELETE being executed during it, but according to the output of the first LOAD MYSQL SERVERS TO RUNTIME there were 4 servers, all ONLINE, during phase a.

The output of the second LOAD MYSQL SERVERS TO RUNTIME is surely interesting, as in "phase a" it seems there are only 2 nodes:

HID: 11 , address: 172.31.10.29 , port: 3306 , weight: 1000 , status: OFFLINE_HARD , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
HID: 11 , address: 172.31.11.165 , port: 3306 , weight: 1000 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ

Only 2 nodes seem available in "phase c.2":

HID: 11 , address: 172.31.10.29 , port: 3306 , weight: 1000 , status: OFFLINE_HARD , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
HID: 11 , address: 172.31.11.165 , port: 3306 , weight: 1000 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ

But suddenly there are 4 nodes in "phase c.4":

HID: 11 , address: 172.31.10.29 , port: 3306 , weight: 1000 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
HID: 11 , address: 172.31.11.165 , port: 3306 , weight: 1000 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
HID: 11 , address: 172.31.5.18 , port: 3306 , weight: 1000 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: READ
HID: 10 , address: 172.31.15.163 , port: 3306 , weight: 1000 , status: ONLINE , max_connections: 1000 , max_replication_lag: 0 , use_ssl: 0 , max_latency_ms: 0 , comment: WRITE

I am not yet sure what happened, but I believe there is some race condition involved.
Do you know why there were 2 LOAD MYSQL SERVERS TO RUNTIME commands?

I also noticed that in this log file there are 2 calls to LOAD MYSQL VARIABLES TO RUNTIME and 2 calls to SAVE MYSQL SERVERS TO DISK.

$ grep "LOAD MYSQL VARIABLES TO RUNTIME" proxysql.log.20161219.txt
2016-12-19 10:37:29 [INFO] Received LOAD MYSQL VARIABLES TO RUNTIME command
2016-12-19 11:10:56 [INFO] Received LOAD MYSQL VARIABLES TO RUNTIME command
$ grep "SAVE MYSQL SERVERS TO DISK" proxysql.log.20161219.txt
2016-12-19 10:37:29 [INFO] Received SAVE MYSQL SERVERS TO DISK command
2016-12-19 11:10:56 [INFO] Received SAVE MYSQL SERVERS TO DISK command
$ egrep "10:37:2[89].*(LOAD|SAVE) MYSQL" proxysql.log.20161219.txt
2016-12-19 10:37:28 [INFO] Received LOAD MYSQL SERVERS TO RUNTIME command
2016-12-19 10:37:29 [INFO] Received LOAD MYSQL VARIABLES TO RUNTIME command
2016-12-19 10:37:29 [INFO] Received SAVE MYSQL VARIABLES TO DISK command
2016-12-19 10:37:29 [INFO] Received LOAD MYSQL SERVERS TO RUNTIME command
2016-12-19 10:37:29 [INFO] Received SAVE MYSQL SERVERS TO DISK command

So I think the question is: what script runs LOAD MYSQL VARIABLES TO RUNTIME, SAVE MYSQL VARIABLES TO DISK, LOAD MYSQL SERVERS TO RUNTIME and SAVE MYSQL SERVERS TO DISK, and what does it change?
Understanding what this script does may help in understanding the race condition it causes.

One more note: proxysql_galera_checker.sh should be improved, as LOAD MYSQL SERVERS TO RUNTIME should be executed only if some change happened.
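A minimal sketch of that check, assuming the admin interface on 127.0.0.1:6032 with admin/admin credentials and a version that exposes runtime_mysql_servers (this is an illustration, not the current script):

# Sketch only: fingerprint the configured servers and the runtime servers,
# and reload to runtime only when they differ.
PROXYSQL_CMD="mysql -h127.0.0.1 -P6032 -uadmin -padmin -Nse"

conf_sum=$($PROXYSQL_CMD "SELECT hostgroup_id, hostname, port, status, weight, comment FROM mysql_servers ORDER BY 1,2,3;" | md5sum)
run_sum=$($PROXYSQL_CMD "SELECT hostgroup_id, hostname, port, status, weight, comment FROM runtime_mysql_servers ORDER BY 1,2,3;" | md5sum)

if [ "$conf_sum" != "$run_sum" ]; then
    $PROXYSQL_CMD "LOAD MYSQL SERVERS TO RUNTIME;"
fi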

Final note: version 1.4.0 is a lot faster than 1.3.x and there are fewer moving parts. A lot of code was simplified due to the work on #829, and 1.4.0 is now able to process a large number of hosts over 30 times faster than 1.3.x.

@ovaistariq
Author

LOAD MYSQL VARIABLES TO RUNTIME; SAVE MYSQL VARIABLES TO DISK; is executed by proxysql-admin, which gets executed by Chef. That can be seen here: https://github.com/twindb/proxysql/blob/master/support/percona/proxysql-admin#L385

proxysql_galera_checker.sh can definitely be improved. If nothing has changed, would LOAD MYSQL SERVERS TO RUNTIME have any impact?

@renecannao
Contributor

@ovaistariq : closing this issue as it seems not relevant anymore.

Thanks

@ovaistariq
Author

It indeed does not seem like a ProxySQL bug.
