Galera node wrongly set as offline during SST with xtrabackup / mariabackup #2953
Comments
I'm also having an issue which seems to be related to this one. After a while without restarting proxysql, all galera nodes end up offline and the cluster stops responding.
@maximumG Did the change in your PR actually resolve the issue for you? I applied your patch but I'm still seeing the problem where after some time all but one of my pxc nodes are removed from hostgroups by proxysql. I have to redeploy the proxysql pods in order for it to start working again.
Same for me. Ended up switching from proxysql to haproxy (Percona XtraDB Kubernetes setup), which has worked flawlessly. The only thing is that you'll have to specify a different DB connection for writes if you want them to go to a single node.
Ugh... I really want to stick with proxysql but I may not have a choice if I don't find a solution to this problem soon. 😔
Same issue under v2.0.15.
Hi, thanks @lots0logs for the PR, and @maximumG for reporting the issue. We know about it. The problem with this change is that it's a "breaking change": existing infrastructure could be relying on the current behavior, and changing it could lead to potentially dangerous unintended behaviors. Adding an option to choose between the old behavior and the new one proposed here is in our scope, but it's not something we are going to add in the short term. This issue will be kept open until that option is added. Thank you!
@JavierJF Do you understand that, due to this issue, using proxysql for a pxc cluster that takes multiple backups an hour is completely broken? It requires manually restarting the proxysql containers every couple of hours, because perfectly working nodes get completely dropped from the mysql servers table and are never picked back up until the proxysql pod is redeployed. I do not understand how fixing this is a breaking change. The current behavior is BROKEN. It's causing people to not use proxysql and go with haproxy instead. I'm willing to bet that it's part of the reason why Percona changed the default load balancer for their Kubernetes Operator from proxysql to haproxy. Just to be clear, the change in my PR has zero effect on existing behavior for setups that use a blocking SST. For those setups it will work exactly the same. The only setups that will see different behavior are those using a non-blocking SST, since the current behavior is broken for those setups, and with this PR it will work properly.
@lots0logs That's not the behavior we have been observing, and we haven't seen any other reports of nodes not coming back online after the SST has finished. If that is your case, please share your ProxySQL logs and we can follow up on that issue. Also, as a tip: if because of a bug (or another reason we don't yet know) your nodes are not detected as online automatically, redeploying ProxySQL should not be necessary; a simple "LOAD MYSQL SERVERS TO RUNTIME" against the admin interface should force them to be reloaded. Hope that does the trick until further log inspection reveals more details.
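For reference, a minimal sketch of that recovery step, assuming the admin interface is reachable locally on its default port 6032 with the default admin user (adjust host and credentials for your deployment):

```sql
-- Connect to the ProxySQL admin interface first, e.g.:
--   mysql -h 127.0.0.1 -P 6032 -u admin -p

-- Re-apply the configured servers to runtime; dropped nodes are
-- re-added and re-checked by the monitor:
LOAD MYSQL SERVERS TO RUNTIME;

-- Verify what is actually live:
SELECT hostgroup_id, hostname, port, status FROM runtime_mysql_servers;
```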
I just quickly want to +1 this fix. We had a crash the other day where 2 out of 3 galera servers went down. Upon restoring the two servers, the one and only working server went offline due to the SST process. Had to go back to HAProxy until this is fixed.
I should note that while this fix does indeed resolve the issue of going offline during SST, there apparently is still a bug elsewhere, because a few hours after starting the cluster I'm still seeing the servers table end up with only one node, which proxysql won't even send queries to (all three pxc nodes are up and working perfectly fine when this happens). I'm in the process of switching to haproxy now because I've wasted too much time trying to get proxysql to work right.
Does anyone bother to attach an error log?
I've tried several times to capture an error log. However, I don't have time to sit around waiting for the moment the failure occurs. By the time I've been able to get back and grab the logs, too much time has passed since the error and it's no longer in them. But that is not relevant to this PR, because the issue this PR resolves could not be more clear. As for whatever issue is causing the broken cluster, which is not related to the issue this PR solves, it is not hard to reproduce. You can deploy Percona's Kubernetes Operator configured to use proxysql (it uses haproxy by default) with a backup job running every hour at 00, 15, and 30. It'll work fine for a random amount of time (usually several hours) and then it will break. It always happens at least once in a 24-hour period. I say that because I would reset it before leaving work for the day, and by the time I got in the next day the cluster would be broken and need to be reset again.
Please allow me to understand.
Really? 🙄 Did you even read the original issue description? It explains the problem very clearly IN DETAIL. You don't need logs because there is no error; proxysql does what it's currently coded to do. The problem is that the code does not account for the fact that pxc nodes have non-blocking SST. Thus, anytime a node becomes a DONOR, proxysql marks it offline without checking whether it's actually blocking. I'm not sure what more you expect. What is it about the explanation provided here that you do not understand? To be clear, in my earlier response I thought we were commenting on my PR. I didn't realize we were commenting on this issue. Sorry for the confusion.
We also discontinued use of proxysql as part of Percona XtraDB due to this error. In our case, with nightly backups, the error occurred every 1-3 weeks. A nightmare to investigate. We found out Percona had switched away from proxysql in more recent releases, which led us to do the same. We needed a product that was stable right now.
The bottom line is that there is at least one, if not multiple, bugs in proxysql's galera support that make it unsuitable for use in production. Multiple users have reported this across multiple issues here on GitHub. Percona switched their default LB to haproxy a month or so ago, and while they did not say why publicly, it's pretty safe to say it was for the reason I just mentioned. It is what it is. 🤷♂️
Percona published a related article here:

"The use of the scheduler with a properly developed script/application that handles the Galera support can guarantee better consistency and proper behavior in respect to your custom expectations."

"If a node is the only one in a segment, the check will behave accordingly. IE if a node is the only one in the MAIN segment, it will not put the node in OFFLINE_SOFT when the node become donor, to prevent the cluster to become unavailable for the applications. As mention is possible to declare a segment as MAIN, quite useful when managing prod and DR site."
Closes #2953: Honor 'wsrep_sst_donor_rejects_queries' avoiding setting a DONOR node offline during a SST
We were just about to go into production with new infrastructure, using ProxySQL (2.0.14) as the load balancing layer for our Galera cluster. However, as previously pointed out in this issue and in articles online, we have seen unpredictable behaviour. We too have experienced hosts disappearing from the runtime mysql servers, only coming back when manually loading from config and then to runtime. We just redid everything using haproxy instead.
Hi, @Bazze. ProxySQL versions v2.0.16 and v2.1.0 pack several fixes for Galera cluster. v2.1.0 includes the requested change of behavior described in this issue, honoring 'wsrep_sst_donor_rejects_queries'. In the release notes for v2.1.0 and v2.0.16 you will find a detailed explanation of which Galera-related bugs have been fixed. If those fixes don't cover your case of servers 'disappearing from the runtime mysql servers', please feel free to open a new issue with the information related to that unexpected behavior. Because this issue has gone very off-topic at this point, I'm locking the conversation now. Thank you.
ProxySQL version
ProxySQL version 2.0.12-38-g58a909a0, codename Truls
OS Version
Debian 10
Infrastructure
ProxySQL in front of a 3-node galera cluster named G1, G2 and G3.
Issue description
G3 is removed from the galera cluster and added back a while later. This triggers an SST from G2 (desync/donor) to G3 (joiner), using mariabackup as the mechanism. G2 is wrongly moved to the offline hostgroup.
Mariabackup/xtrabackup enables a non-blocking State Snapshot Transfer that should NOT cause G2 to be moved to the offline hostgroup. This feature is already taken into account by ProxySQL, but it seems the check in the code has an issue.
https://github.com/sysown/proxysql/blob/v2.0.12/lib/MySQL_Monitor.cpp#L1814-L1851
On L.1814 the node is set to offline if it is not part of the primary partition OR wsrep_desync is ON OR wsrep_local_state is not 4. Because a donor never has wsrep_local_state = 4 while the SST is running, that condition always matches, so the subsequent check of wsrep_reject_queries and wsrep_sst_donor_rejects_queries is never reached.
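To make the failure mode concrete, here is roughly what the donor (G2) reports while a non-blocking mariabackup SST is running; the values are illustrative, not taken from an actual log:

```sql
-- Run against the donor node (G2) during the SST (illustrative output):
SHOW STATUS LIKE 'wsrep_local_state';
--  -> 2  (Donor/Desynced; only 4 means Synced)

SHOW VARIABLES LIKE 'wsrep_desync';
--  -> ON or OFF, depending on how the desync was triggered

SHOW VARIABLES LIKE 'wsrep_reject_queries';
--  -> NONE (the node is NOT rejecting queries)

SHOW VARIABLES LIKE 'wsrep_sst_donor_rejects_queries';
--  -> OFF (the donor keeps serving queries during a non-blocking SST)
```

With wsrep_local_state at 2 instead of 4, the offline condition already matches, even though the last two variables show the node is still accepting queries. The change requested here is to consult those variables before treating a desynced donor as offline.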
I guess this is similar to issue #2292