
Galera node wrongly set as offline during SST with xtrabackup / mariabackup #2953

Closed
maximumG opened this issue Jul 15, 2020 · 21 comments · Fixed by #3227

Comments

@maximumG

maximumG commented Jul 15, 2020

ProxySQL version

ProxySQL version 2.0.12-38-g58a909a0, codename Truls

OS Version

Debian 10

Infrastructure

ProxySQL in front of a 3-node Galera cluster with nodes G1, G2 and G3.

  • writer hostgroup: 10 (includes G1)
  • backup writer hostgroup: 20 (includes G2, G3)
  • reader hostgroup: 30 (includes G2, G3)
  • offline hostgroup: 9999
  • writer_is_also_reader set to 2
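The setup above corresponds to a `mysql_galera_hostgroups` row along these lines (a sketch only: the hostgroup numbers and `writer_is_also_reader=2` come from this report, while `max_writers` and `max_transactions_behind` are assumed values not stated in the issue):

```sql
-- Hypothetical config matching the reported topology
INSERT INTO mysql_galera_hostgroups
  (writer_hostgroup, backup_writer_hostgroup, reader_hostgroup,
   offline_hostgroup, active, max_writers, writer_is_also_reader,
   max_transactions_behind)
VALUES
  (10, 20, 30, 9999, 1, 1, 2, 100);

LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
```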

Issue description

G3 is removed from the Galera cluster and rejoins a while later. This triggers an SST from G2 (desync/donor) to G3 (joiner) using mariabackup as the transfer mechanism. G2 is wrongly moved to the offline hostgroup.

Mariabackup/xtrabackup enables a non-blocking State Snapshot Transfer (SST), which should NOT cause G2 to be moved to the offline hostgroup. ProxySQL already takes this feature into account, but the check in the code seems to have an issue.

https://github.com/sysown/proxysql/blob/v2.0.12/lib/MySQL_Monitor.cpp#L1814-L1851

On L.1814 the node is set offline if it is not part of the primary partition, OR wsrep_desync is ON, OR wsrep_local_state is not 4 (Synced). Since a donor node always matches the wsrep_local_state condition, the subsequent checks on wsrep_reject_queries and wsrep_sst_donor_rejects_queries are never reached.

I guess this is similar to issue #2292
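The ordering problem described above can be illustrated with a small model (Python here purely for illustration; the parameter names mirror the wsrep status variables, but this is a simplified sketch of the intended decision, not the actual ProxySQL monitor code):

```python
def node_action(cluster_status, wsrep_local_state, wsrep_desync,
                wsrep_reject_queries, wsrep_sst_donor_rejects_queries):
    """Simplified model of the Galera monitor decision.

    The buggy ordering sends any node with wsrep_local_state != 4
    (Synced) straight to the offline hostgroup, so donor-specific
    checks are never reached for a node in Donor/Desynced state (2).
    The ordering below checks the donor case first.
    """
    DONOR = 2   # wsrep_local_state: Donor/Desynced
    SYNCED = 4  # wsrep_local_state: Synced

    if cluster_status != 'Primary':
        return 'offline'
    if wsrep_reject_queries != 'NONE':
        return 'offline'
    # Non-blocking SST: the donor keeps serving queries unless it
    # explicitly rejects them via wsrep_sst_donor_rejects_queries.
    if wsrep_local_state == DONOR and not wsrep_sst_donor_rejects_queries:
        return 'online'
    if wsrep_desync == 'ON' or wsrep_local_state != SYNCED:
        return 'offline'
    return 'online'
```

With this ordering, a donor like G2 (state 2, wsrep_sst_donor_rejects_queries=OFF) stays online during a mariabackup SST, while a donor that does reject queries is still moved to the offline hostgroup.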

@mtryfoss

I'm also having an issue which seems to be related to this one. After a while without restarting proxysql, all galera nodes end up offline and the cluster stops responding.

@lots0logs

@renecannao @pondix

bump

@lots0logs

@maximumG Did the change in your PR actually resolve the issue for you? I applied your patch but I'm still seeing the problem where after some time all but one of my pxc nodes are removed from host groups by proxysql. I have to redeploy the proxysql pods in order for it to start working again.

@mtryfoss

> @maximumG Did the change in your PR actually resolve the issue for you? I applied your patch but I'm still seeing the problem where after some time all but one of my pxc nodes are removed from host groups by proxysql. I have to redeploy the proxysql pods in order for it to start working again.

Same for me. We ended up replacing proxysql with haproxy (Percona XtraDB Kubernetes setup), which has worked flawlessly. The only caveat is that you have to specify a different DB connection for writes if you want them to go to a single node.

@lots0logs

Ugh...I really want to stick with proxysql but I may not have a choice if I don't find a solution to this problem soon. 😔

@dyipon

dyipon commented Nov 13, 2020

same issues under v2.0.15

@lots0logs

@dyipon @maximumG @mtryfoss I submitted this PR which seems to have resolved this particular issue on my cluster.

@JavierJF
Collaborator

Hi, thanks @lots0logs for the PR, and @maximumG for reporting the issue.

We know about it. The problem with this change is that it's a "breaking change": at the moment, infrastructure could be relying on the current behavior, and changing it could lead to potentially dangerous unintended behaviors. Adding an option to choose between the old behavior and the one proposed here is in our scope, but it's not something we are going to add in the short term. This issue will be kept open until that option is added. Thank you!

@lots0logs

lots0logs commented Nov 19, 2020

@JavierJF Do you understand that, due to this issue, using proxysql for a pxc cluster that does multiple backups an hour is completely broken? It requires manually restarting the proxysql containers every couple of hours because perfectly working nodes get completely dropped from the mysql servers table and are never picked back up until the proxysql pod is redeployed. I do not understand how fixing this is a breaking change. The current behavior is BROKEN. It's causing people to not use proxysql and go with haproxy instead. I'm willing to bet that it's part of the reason why Percona changed the default load balancer for their Kubernetes Operator from proxysql to haproxy.

Just to be clear, the change in my PR has zero effect on existing behavior for setups that use a blocking SST. For those setups it will work exactly the same. The only setups that will have different behavior are those using a non-blocking SST as the behavior currently is broken for those setups and with this PR it will work properly.

@JavierJF
Collaborator

@lots0logs That's not the behavior we have been observing, and we haven't seen any other reports of nodes not coming back online after the SST has finished. If that is your case, please share your ProxySQL logs so we can follow up on that issue. Also, as a tip: if because of a bug (or another reason we don't yet know) your nodes are not detected as online automatically, redeploying ProxySQL should not be necessary; a simple "LOAD MYSQL SERVERS TO RUNTIME" against the admin interface should force them to be reloaded. Hopefully that does the trick until further log inspection reveals more details.
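For reference, the workaround suggested above is run against the ProxySQL admin interface (typically port 6032; the connection details below are an assumed example, adjust to your deployment):

```sql
-- e.g. connect with: mysql -h127.0.0.1 -P6032 -uadmin -p
-- Force ProxySQL to re-evaluate the configured servers without a restart:
LOAD MYSQL SERVERS TO RUNTIME;

-- Optionally inspect what the monitor currently sees:
SELECT hostgroup_id, hostname, status FROM runtime_mysql_servers;
```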

@twarkie

twarkie commented Nov 30, 2020

I just quickly want to +1 this fix.

We had a crash the other day where 2 out of 3 galera servers went down. Upon restoring the two servers, the one and only working server went offline due to the SST process. We had to go back to HAProxy until this one is fixed.

@lots0logs

lots0logs commented Nov 30, 2020

I should note that while this fix does indeed resolve the issue of going offline during SST, there is apparently still a bug elsewhere, because I'm still seeing the servers table end up with only one node, which proxysql won't even send queries to, within a few hours of starting the cluster (all three pxc nodes are up and working perfectly fine when this happens). I'm in the process of switching to haproxy now because I've wasted too much time trying to get proxysql to work right.

@renecannao
Contributor

Does anyone bother to attach an error log?

@lots0logs

I've tried several times to capture an error log. However, I don't have time to sit around waiting for the moment the failure occurs. By the time I've been able to get back to grab the logs, too much time has passed since the error and it's no longer in the logs. However, that is not relevant to this PR, because the issue this PR resolves could not be more clear.

As for whatever the issue is causing a broken cluster that is not related to the issue this PR solves, it is not hard to reproduce. You can deploy Percona's Kubernetes Operator configured to use proxysql (it uses haproxy by default) with a backup job running every hour at 00, 15, and 30. It'll work fine for a random amount of time (usually several hours) and then it will break. It always happens at least once in a 24 hour period. I say that because I would reset it before leaving work for the day and by the time I get in the next day the cluster would be broken and need to be reset again.

@renecannao
Contributor

Please allow me to understand.
You are saying it happens randomly (quoting : "It'll work fine for a random amount of time (usually several hours) and then it will break") , thus you don't have a reproducible test case.
You don't have any log to support your statement.
Nonetheless, you believe you know what the problem is and how to fix it.
Is my interpretation correct?

@lots0logs

lots0logs commented Nov 30, 2020

Really? 🙄 Did you even read the original issue description? It explains the problem very clearly IN DETAIL. You don't need logs because there is no error. proxysql does what it's coded to do currently. The problem is that the code does not currently account for the fact that pxc nodes have non-blocking SST. Thus, anytime a node becomes a DONOR, proxysql marks it offline without checking whether it's actually blocking. I'm not sure what more you expect. What is it about the explanation provided here that you do not understand?

To be clear, in my earlier response I thought we were commenting on my PR. I didn't realize we were commenting on this issue. Sorry for the confusion.

@mtryfoss

We also discontinued use of proxysql as part of Percona XtraDB due to this error. In our case, with nightly backups, the error occurred every 1-3 weeks. A nightmare to investigate. We found out Percona had switched away from proxysql in more recent releases, which led us to do the same. We needed a product that was stable right now.

@lots0logs

The bottom line is that there is at least one, if not multiple, bugs in proxysql's galera support that make it unsuitable for use in production. Multiple users have reported this on multiple issues here on GitHub. Percona switched their default LB to haproxy a month or so ago, and while they did not say why publicly, it's pretty safe to say it was for the reason I just mentioned. It is what it is. 🤷‍♂️

@dyipon

dyipon commented Dec 2, 2020

Percona published a related article here

"The use of the scheduler with a properly developed script/application that handles the Galera support can guarantee better consistency and proper behavior in respect to your custom expectations. "

"If a node is the only one in a segment, the check will behave accordingly. IE if a node is the only one in the MAIN segment, it will not put the node in OFFLINE_SOFT when the node become donor, to prevent the cluster to become unavailable for the applications. As mention is possible to declare a segment as MAIN, quite useful when managing prod and DR site."

renecannao added a commit that referenced this issue Jan 8, 2021
Closes #2953: Honor 'wsrep_sst_donor_rejects_queries' avoiding setting a DONOR node offline during a SST
@Bazze

Bazze commented Jan 12, 2021

We were just about to go into production with new infrastructure, using ProxySQL (2.0.14) as the load balancing layer for our Galera cluster. However, as previously pointed out in this issue and reading articles online we have seen unpredictable behaviour. We too have experienced hosts disappearing from the runtime mysql servers, only coming back when manually loading from config and then to runtime. We just redid everything utilizing haproxy instead.

@JavierJF
Collaborator

Hi, @Bazze

ProxySQL versions v2.0.16 and v2.1.0 pack several fixes for Galera clusters. v2.1.0 includes the requested change of behavior described in this issue, honoring 'wsrep_sst_donor_rejects_queries'.

In the release notes for v2.1.0 and v2.0.16 you will find a detailed explanation of which Galera-related bugs have been fixed. If those fixes don't cover your case of servers 'disappearing from the runtime mysql servers', please feel free to open a new issue with the information related to that unexpected behavior.

Because this issue has gone very off-topic at this point, I'm locking the conversation now.

Thank you.

@sysown sysown locked as off-topic and limited conversation to collaborators Jan 13, 2021