
Galera node wrongly set as offline during SST with xtrabackup / mariabackup #2953

Closed
maximumG opened this issue Jul 15, 2020 · 21 comments · Fixed by #3227

Comments

@maximumG

maximumG commented Jul 15, 2020

ProxySQL version

ProxySQL version 2.0.12-38-g58a909a0, codename Truls

OS Version

Debian 10

Infrastructure

ProxySQL in front of a 3-node Galera cluster with nodes G1, G2 and G3.

  • writer hostgroup: 10 (includes G1)
  • backup writer hostgroup: 20 (includes G2, G3)
  • reader hostgroup: 30 (includes G2, G3)
  • offline hostgroup: 9999
  • writer_is_also_reader set to 2
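The setup above corresponds to a `mysql_galera_hostgroups` row along these lines (a sketch only: the hostgroup numbers and `writer_is_also_reader=2` come from this report, while `max_writers` and `max_transactions_behind` are assumed values not stated in the issue):

```sql
-- Hypothetical config matching the reported topology
INSERT INTO mysql_galera_hostgroups
  (writer_hostgroup, backup_writer_hostgroup, reader_hostgroup,
   offline_hostgroup, active, max_writers, writer_is_also_reader,
   max_transactions_behind)
VALUES
  (10, 20, 30, 9999, 1, 1, 2, 100);

LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
```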

Issue description

G3 is removed from the Galera cluster and rejoins a while later. This triggers an SST from G2 (desync/donor) to G3 (joiner) using mariabackup as the transfer mechanism. G2 is wrongly moved to the offline hostgroup.

Mariabackup/xtrabackup enables a non-blocking State Snapshot Transfer (SST), which should NOT cause G2 to be moved to the offline hostgroup. ProxySQL already takes this feature into account, but the check in the code seems to have an issue.

https://github.com/sysown/proxysql/blob/v2.0.12/lib/MySQL_Monitor.cpp#L1814-L1851

On L.1814 the node is set offline if it is not part of the primary partition, OR wsrep_desync is ON, OR wsrep_local_state is not 4 (Synced). Since a donor node always matches the wsrep_local_state condition, the subsequent checks on wsrep_reject_queries and wsrep_sst_donor_rejects_queries are never reached.

I guess this is similar to issue #2292
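The ordering problem described above can be illustrated with a small model (Python here purely for illustration; the parameter names mirror the wsrep status variables, but this is a simplified sketch of the intended decision, not the actual ProxySQL monitor code):

```python
def node_action(cluster_status, wsrep_local_state, wsrep_desync,
                wsrep_reject_queries, wsrep_sst_donor_rejects_queries):
    """Simplified model of the Galera monitor decision.

    The buggy ordering sends any node with wsrep_local_state != 4
    (Synced) straight to the offline hostgroup, so donor-specific
    checks are never reached for a node in Donor/Desynced state (2).
    The ordering below checks the donor case first.
    """
    DONOR = 2   # wsrep_local_state: Donor/Desynced
    SYNCED = 4  # wsrep_local_state: Synced

    if cluster_status != 'Primary':
        return 'offline'
    if wsrep_reject_queries != 'NONE':
        return 'offline'
    # Non-blocking SST: the donor keeps serving queries unless it
    # explicitly rejects them via wsrep_sst_donor_rejects_queries.
    if wsrep_local_state == DONOR and not wsrep_sst_donor_rejects_queries:
        return 'online'
    if wsrep_desync == 'ON' or wsrep_local_state != SYNCED:
        return 'offline'
    return 'online'
```

With this ordering, a donor like G2 (state 2, wsrep_sst_donor_rejects_queries=OFF) stays online during a mariabackup SST, while a donor that does reject queries is still moved to the offline hostgroup.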

@mtryfoss

I'm also having an issue which seems to be related to this one. After a while without restarting proxysql, all galera nodes end up offline and the cluster stops responding.

@lots0logs

@renecannao @pondix

bump

@lots0logs

@maximumG Did the change in your PR actually resolve the issue for you? I applied your patch but I'm still seeing the problem where after some time all but one of my pxc nodes are removed from host groups by proxysql. I have to redeploy the proxysql pods in order for it to start working again.

@mtryfoss

> @maximumG Did the change in your PR actually resolve the issue for you? I applied your patch but I'm still seeing the problem where after some time all but one of my pxc nodes are removed from host groups by proxysql. I have to redeploy the proxysql pods in order for it to start working again.

Same for me. We ended up replacing proxysql with haproxy (Percona XtraDB Kubernetes setup), which has worked flawlessly. The only caveat is that you have to specify a different DB connection for writes if you want them to go to a single node.

@lots0logs

Ugh...I really want to stick with proxysql but I may not have a choice if I don't find a solution to this problem soon. 😔

@dyipon

dyipon commented Nov 13, 2020

same issues under v2.0.15

@lots0logs

@dyipon @maximumG @mtryfoss I submitted this PR which seems to have resolved this particular issue on my cluster.

@JavierJF
Collaborator

Hi, thanks @lots0logs for the PR, and @maximumG for reporting the issue.

We know about it. The problem with this change is that it's a "breaking change": at the moment, infrastructure could be relying on the current behavior, and changing it could lead to potentially dangerous unintended behaviors. Adding an option to choose between the old behavior and the one proposed here is in our scope, but it's not something we are going to add in the short term. This issue will be kept open until that option is added. Thank you!

@lots0logs

lots0logs commented Nov 19, 2020

@JavierJF Do you understand that, due to this issue, using proxysql for a pxc cluster that does multiple backups an hour is completely broken? It requires manually restarting the proxysql containers every couple of hours because perfectly working nodes get completely dropped from the mysql servers table and are never picked back up until the proxysql pod is redeployed. I do not understand how fixing this is a breaking change. The current behavior is BROKEN. It's causing people to not use proxysql and go with haproxy instead. I'm willing to bet that it's part of the reason why Percona changed the default load balancer for their Kubernetes Operator from proxysql to haproxy.

Just to be clear, the change in my PR has zero effect on existing behavior for setups that use a blocking SST. For those setups it will work exactly the same. The only setups that will have different behavior are those using a non-blocking SST as the behavior currently is broken for those setups and with this PR it will work properly.

@JavierJF
Collaborator

@lots0logs That's not the behavior we have been observing, and we haven't seen any other reports of nodes not coming back online after the SST has finished. If that is your case, please share your ProxySQL logs so we can follow up on that issue. Also, as a tip: if because of a bug (or another reason we don't yet know) your nodes are not detected as online automatically, redeploying ProxySQL should not be necessary; a simple "LOAD MYSQL SERVERS TO RUNTIME" against the admin interface should force them to be reloaded. Hopefully that does the trick until further log inspection reveals more details.
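For reference, the workaround suggested above is run against the ProxySQL admin interface (typically port 6032; the connection details below are an assumed example, adjust to your deployment):

```sql
-- e.g. connect with: mysql -h127.0.0.1 -P6032 -uadmin -p
-- Force ProxySQL to re-evaluate the configured servers without a restart:
LOAD MYSQL SERVERS TO RUNTIME;

-- Optionally inspect what the monitor currently sees:
SELECT hostgroup_id, hostname, status FROM runtime_mysql_servers;
```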

@twarkie

twarkie commented Nov 30, 2020

I just quickly want to +1 this fix.

We had a crash the other day where 2 out of 3 galera servers went down. Upon restoring the two servers, the one and only working server went offline due to the SST process. We had to go back to HAProxy until this one is fixed.

@lots0logs

lots0logs commented Nov 30, 2020

I should note that while this fix does indeed resolve the issue of going offline during SST, there is apparently still a bug elsewhere, because I'm still seeing the servers table end up with only one node, which proxysql won't even send queries to, within a few hours of starting the cluster (all three pxc nodes are up and working perfectly fine when this happens). I'm in the process of switching to haproxy now because I've wasted too much time trying to get proxysql to work right.

@renecannao
Contributor

Does anyone bother to attach an error log?

@lots0logs

I've tried several times to capture an error log. However, I don't have time to sit around waiting for the moment the failure occurs. By the time I've been able to get back to grab the logs, too much time has passed since the error and it's no longer in the logs. However, that is not relevant to this PR, because the issue this PR resolves could not be more clear.

As for whatever the issue is causing a broken cluster that is not related to the issue this PR solves, it is not hard to reproduce. You can deploy Percona's Kubernetes Operator configured to use proxysql (it uses haproxy by default) with a backup job running every hour at 00, 15, and 30. It'll work fine for a random amount of time (usually several hours) and then it will break. It always happens at least once in a 24 hour period. I say that because I would reset it before leaving work for the day and by the time I get in the next day the cluster would be broken and need to be reset again.

@renecannao
Contributor

Please allow me to understand.
You are saying it happens randomly (quoting : "It'll work fine for a random amount of time (usually several hours) and then it will break") , thus you don't have a reproducible test case.
You don't have any log to support your statement.
Nonetheless, you believe you know what the problem is and how to fix it.
Is my interpretation correct?

@lots0logs

lots0logs commented Nov 30, 2020

Really? 🙄 Did you even read the original issue description? It explains the problem very clearly IN DETAIL. You don't need logs because there is no error. proxysql does what it's coded to do currently. The problem is that the code does not currently account for the fact that pxc nodes have non-blocking SST. Thus, anytime a node becomes a DONOR, proxysql marks it offline without checking whether it's actually blocking. I'm not sure what more you expect. What is it about the explanation provided here that you do not understand?

To be clear, in my earlier response I thought we were commenting on my PR. I didn't realize we were commenting on this issue. Sorry for the confusion.

@mtryfoss

We also discontinued use of proxysql as part of Percona XtraDB due to this error. In our case, with nightly backups, the error occurred every 1-3 weeks. A nightmare to investigate. We found out Percona had switched away from proxysql in more recent releases, which led us to do the same. We needed a product that was stable right now.

@lots0logs

The bottom line is that there is at least one, if not multiple, bugs in proxysql's galera support that make it unsuitable for use in production. Multiple users have reported this on multiple issues here on GitHub. Percona switched their default LB to haproxy a month or so ago, and while they did not say why publicly, it's pretty safe to say it was for the reason I just mentioned. It is what it is. 🤷‍♂️

@dyipon

dyipon commented Dec 2, 2020

Percona published a related article here

"The use of the scheduler with a properly developed script/application that handles the Galera support can guarantee better consistency and proper behavior in respect to your custom expectations. "

"If a node is the only one in a segment, the check will behave accordingly. IE if a node is the only one in the MAIN segment, it will not put the node in OFFLINE_SOFT when the node become donor, to prevent the cluster to become unavailable for the applications. As mention is possible to declare a segment as MAIN, quite useful when managing prod and DR site."

renecannao added a commit that referenced this issue Jan 8, 2021
Closes #2953: Honor 'wsrep_sst_donor_rejects_queries' avoiding setting a DONOR node offline during a SST
@Bazze

Bazze commented Jan 12, 2021

We were just about to go into production with new infrastructure, using ProxySQL (2.0.14) as the load balancing layer for our Galera cluster. However, as previously pointed out in this issue and reading articles online we have seen unpredictable behaviour. We too have experienced hosts disappearing from the runtime mysql servers, only coming back when manually loading from config and then to runtime. We just redid everything utilizing haproxy instead.

@JavierJF
Collaborator

Hi, @Bazze

ProxySQL versions v2.0.16 and v2.1.0 pack several fixes for Galera clusters. v2.1.0 includes the requested change of behavior described in this issue, honoring 'wsrep_sst_donor_rejects_queries'.

In the release notes for v2.1.0 and v2.0.16 you will find a detailed explanation of which Galera-related bugs have been fixed. If those fixes don't cover your case of servers 'disappearing from the runtime mysql servers', please feel free to open a new issue with the information related to that unexpected behavior.

Because this issue has gone very off-topic at this point, I'm locking the conversation now.

Thank you.

@sysown sysown locked as off-topic and limited conversation to collaborators Jan 13, 2021