This repository has been archived by the owner on Feb 22, 2022. It is now read-only.

[stable/rabbitmq] Recover from "Waiting for Mnesia tables" after all nodes forced shutdown #13485

Closed
denis111 opened this issue May 3, 2019 · 33 comments · Fixed by #14149

Comments

@denis111
Contributor

denis111 commented May 3, 2019

Is your feature request related to a problem? Please describe.
Yes. In our dev/staging environment in AWS we shut down the EKS cluster nodes at night (the scaling group is set to 0) to save costs, and we have persistence enabled for RabbitMQ, so this acts like an unexpected shutdown of all RabbitMQ nodes. The cluster can't recover and keeps looping on "Waiting for Mnesia tables". I tried setting podManagementPolicy and service.alpha.kubernetes.io/tolerate-unready-endpoints: "true", but it had no effect; it still keeps looping with "Waiting for Mnesia tables".

Describe the solution you'd like
Could the solution from #9645 (comment) be applied as an option? We'd really prefer availability over integrity.

Describe alternatives you've considered
Don't use persistence.

@miguelaeh
Collaborator

Hi @denis111 ,
Do you know the order in which the nodes were shut down? I am not a RabbitMQ expert, but the documentation says this:

Normally when you shut down a RabbitMQ cluster altogether, the first node you restart should be the last one to go down, since it may have seen things happen that other nodes did not. But sometimes that's not possible: for instance if the entire cluster loses power then all nodes may think they were not the last to shut down.

Link to documentation: https://www.rabbitmq.com/rabbitmqctl.8.html#force_boot

Could you check whether force_boot works in the case where you don't know the order?
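For reference, a minimal sketch of the documented sequence when run on a node directly (on Kubernetes the pod may be restarted as soon as the app is stopped, as the following comments show):

# Sketch of the documented force_boot flow; run on the affected node.
rabbitmqctl stop_app
rabbitmqctl force_boot   # allow this node to boot even if it was not the last to shut down
rabbitmqctl start_app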

@denis111
Contributor Author

@miguelaeh Thank you for answering. No, we can't know the order; this is just an autoscaling group scheduled to scale to 0 instances at night when nobody is working. It's unacceptable for us if the cluster can't recover from this sort of "disaster" (an "unexpected" shutdown of all nodes), so in that case we'd rather not use persistence, since we prefer availability over integrity.

I hope to try playing with force_boot this Friday, and I will let you know if it works.

@miguelaeh
Collaborator

Thank you @denis111,
let me know what happens when you try this option.

@denis111
Contributor Author

Well, first, I can't execute rabbitmqctl force_boot because it says "Error: this command requires the 'rabbit' app to be stopped on the target node. Stop it with 'rabbitmqctl stop_app'." But if we stop the app, the pod just restarts without giving us a chance to execute "rabbitmqctl force_boot"...
So I created a force_load file in "/opt/bitnami/rabbitmq/var/lib/rabbitmq/mnesia/rabbit@rabbitmq-pre-0.rabbitmq-pre-headless.pre.svc.cluster.local" (in my case), and it worked!
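For anyone else hitting this, a minimal sketch of that manual workaround run from outside the pod; the pod name, namespace (pre) and Bitnami data path are taken from the directory quoted above, so adjust them to your release:

# Sketch only: pod name, namespace and mnesia directory come from the example above.
kubectl exec -n pre rabbitmq-pre-0 -- \
  touch "/opt/bitnami/rabbitmq/var/lib/rabbitmq/mnesia/rabbit@rabbitmq-pre-0.rabbitmq-pre-headless.pre.svc.cluster.local/force_load"

On its next boot the node should then stop waiting for its peers' Mnesia tables.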

@miguelaeh
Collaborator

I'm glad it worked.
That is a cool solution.

@denis111
Contributor Author

Yes, but how do we automate it? I mean the creation of the force_load file, in some init container maybe...

@miguelaeh
Collaborator

You could try mounting the file in an init container via a ConfigMap, but if you don't need to execute any command before the main container starts, I guess you could just mount the file in the main container (also with a ConfigMap).

@denis111
Contributor Author

I can't find the "helm way" to do it; the existing ConfigMap template in the rabbitmq chart doesn't allow adding extra files, and the StatefulSet template doesn't allow adding an extra init container or an extra volume mount...

@miguelaeh
Collaborator

The chart does not support that at the moment. You would have to add it manually (you can clone the repository and modify the chart to fit your needs).

@denis111
Contributor Author

Well, we'd like to use the mainstream chart, so I'll see if I can make a pull request.

@denis111
Contributor Author

denis111 commented Jun 4, 2019

I've found that if we enable the forceBoot option on a new install without an existing PVC (i.e. with a clean new volume), RabbitMQ is unable to start and fails with Error: enoent. I'm creating a PR to address this issue.
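For context, force_load only makes sense when a Mnesia database already exists; on a clean volume there is no node directory to create the file in. A rough sketch of the kind of guard that avoids the enoent failure (the directory path mirrors the Bitnami layout quoted earlier; the chart's actual variable names may differ):

# Only request a forced boot when a previous database directory exists;
# on a brand new volume there is nothing to force-load yet.
MNESIA_DIR="/opt/bitnami/rabbitmq/var/lib/rabbitmq/mnesia/rabbit@rabbitmq-pre-0.rabbitmq-pre-headless.pre.svc.cluster.local"
if [ -d "${MNESIA_DIR}" ]; then
  touch "${MNESIA_DIR}/force_load"
fi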

alemorcuq pushed a commit to alemorcuq/charts-1 that referenced this issue Jun 6, 2019

* [stable/rabbitmq] fix Error: enoent with forceBoot on new install (see helm#13485)

Signed-off-by: Denis Kalgushkin <[email protected]>

* [stable/rabbitmq] Bump chart version for PR 14491 (see helm#13485)

Signed-off-by: Denis Kalgushkin <[email protected]>
anasinnyk pushed a commit to MacPaw/charts that referenced this issue Jun 29, 2019

* [stable/rabbitmq] fix Error: enoent with forceBoot on new install (see helm#13485)

Signed-off-by: Denis Kalgushkin <[email protected]>

* [stable/rabbitmq] Bump chart version for PR 14491 (see helm#13485)

Signed-off-by: Denis Kalgushkin <[email protected]>
Signed-off-by: Andrii Nasinnyk <[email protected]>
@mhyousefi

mhyousefi commented Jan 28, 2020

Hey @denis111, I'm having this issue with the latest version of stable/rabbitmq-ha:

[warning] <0.311.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_durable_queue]}

Do I need to make any modifications to the values.yaml to make use of your adjustments?

@mhyousefi

Actually, removing the pvcs before redeploying my rabbit resolved the problem for me.
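If you go this route, note that it throws away the persisted message store. A hedged sketch (the PVC names are illustrative; StatefulSets name claims <template>-<statefulset>-<ordinal>, so list yours first):

# WARNING: this deletes the persisted RabbitMQ data.
kubectl get pvc
# Illustrative claim names for a release called my-release; substitute your own.
kubectl delete pvc data-my-release-rabbitmq-ha-0 data-my-release-rabbitmq-ha-1 data-my-release-rabbitmq-ha-2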

@akrepon

akrepon commented Feb 5, 2020

I think the easiest solution is to have an init container which deletes the mnesia folder during startup.

@andylippitt

If the RabbitMQ node is waiting for the other nodes to come up, and they aren't coming up because the StatefulSet is booting them sequentially, how about podManagementPolicy: Parallel?

Parallel: "will create pods in parallel to match the desired scale without waiting, and on scale down will delete all pods at once" - https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.11/#statefulsetspec-v1-apps

@andylippitt

Some basic testing with podManagementPolicy: Parallel:

As expected, all nodes start simultaneously and the cluster seems to recover correctly. Further, with missing volumes, the node discovery/cluster init seems to work as expected; however, the cluster name was randomized, if that matters to you.

@vkasgit

vkasgit commented Feb 12, 2020

@andylippitt smart idea to use podManagementPolicy: Parallel.
However, I was wondering what would happen if you upgrade your RMQ cluster, e.g. to a new RMQ image. In that case, won't it take down all the pods at once and recreate all of them in parallel with the new image?

@vkasgit

vkasgit commented Feb 13, 2020

Ignore my comment above; updateStrategy: RollingUpdate will take care of that situation.

Do you think there is any edge case where podManagementPolicy: Parallel will create an outage, since it takes down all 3 pods?

@andylippitt

I have found a problem with podManagementPolicy: Parallel. There's a race condition on initialization if you don't specify rabbitmqErlangCookie. I now have a condition where a single pod is running with a RABBITMQ_ERLANG_COOKIE which is different from the current value of the secret. I suspect this was a result of concurrent initialization and will try to reinstall with an explicit value.

@vkasgit

vkasgit commented Feb 18, 2020

Yes, I did test with podManagementPolicy: Parallel but faced issues; sometimes the pods did not come back healthy. Instead, setting the force_boot flag to true, which was suggested earlier in this thread, worked for me. I tested multiple times, bringing all the pods down at once and bringing pods down one at a time within 2-5 minutes to create some sort of a mess, and with that flag on all the pods came back healthy.

Additional setting in our custom values.yaml:
We are also setting a lifecycle hook so the pods terminate gracefully when taken down.
lifecycle:
  preStop:
    exec:
      command: ["rabbitmqctl", "shutdown"]

@andylippitt

andylippitt commented Feb 18, 2020

@vkasgit were you specifying an explicit value for rabbitmqErlangCookie in your failed testing?

Edit: I think the issue is not a concurrency issue; rather, in our case we just ran into this: https://github.com/helm/charts/issues/5167. tl;dr: specify rabbitmqErlangCookie in your prod installs.
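A small sketch of doing that at install/upgrade time, assuming the rabbitmqErlangCookie value name used above (the release and chart names are placeholders):

# Generate one cookie, keep it somewhere safe, and reuse the same value on every upgrade.
COOKIE=$(openssl rand -hex 16)
helm upgrade my-rabbitmq stable/rabbitmq-ha --reuse-values \
  --set rabbitmqErlangCookie="${COOKIE}"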

@vkasgit

vkasgit commented Feb 19, 2020

In our installation that was failing, I had the secret pre-created with the Erlang cookie.

@zffocussss

zffocussss commented Mar 31, 2020

Actually, removing the pvcs before redeploying my rabbit resolved the problem for me.

is it safe to remove the pvc?

@hakanozanlagan

Actually, removing the pvcs before redeploying my rabbit resolved the problem for me.

is it safe to remove the pvc?

No, it's not safe. These mounts keep important files (DB files, etc.):
volumeMounts:
  - mountPath: /var/lib/rabbitmq
    name: data
  - mountPath: /etc/rabbitmq
    name: config
  - mountPath: /etc/definitions
    name: definitions
    readOnly: true
dnsPolicy: ClusterFirst

My problem was solved with the method below (replace clustername with your own). Run the exec command while the pod state is Running:
kubectl exec -ti clustername-rbmq-rabbitmq-ha-0 /bin/sh
cd /var/lib/rabbitmq/mnesia/rabbit@perfx-rbmq-rabbitmq-ha-0.clustername-rbmq-rabbitmq-ha-discovery.hrnext-prod.svc.cluster.local
touch force_load

watch for pod statuses

@rakeshnambiar

@hakanozanlagan first of all, thanks for posting your solution. I tried the same steps, but unfortunately I am running as the rabbitmq user, which doesn't have permission on that folder, and I don't have any sudo access. Is there any alternative solution?


@vkasgit

vkasgit commented May 14, 2020

@rakeshnambiar In your Helm chart values.yml, did you explicitly try setting the force_boot flag to true? Try that option. Also check your user permissions; those can also be set through the values.yml.

@rakeshnambiar

Hi @vkasgit, thanks, the force_boot option solved the issue, and I can also see runAsUser etc. in the values yaml. By the way, by default it creates 3 pods and I can see 3 PVCs as well. Is this expected?

@ytjohn

ytjohn commented May 19, 2020

@vkasgit We occasionally run into this Mnesia table issue ourselves (which we have been fixing by deleting the PVC). I was curious whether setting updateStrategy to RollingUpdate (instead of the default OnDelete) eliminates the need for force_boot? Our podManagementPolicy is the default OrderedReady. In fact, other than basic passwords and policies, our values are otherwise default.

@vkasgit

vkasgit commented May 19, 2020

@ytjohn Do you have the forceBoot: true flag on and still occasionally run into the Mnesia table issue?

The following are some settings I have, and I haven't run into the Mnesia issue so far (knock on wood). Try adding the lifecycle hook and see if that helps. What it does is: when an RMQ node is forcefully taken down, the preStop command kicks in and performs a graceful termination.

podManagementPolicy: OrderedReady
updateStrategy: RollingUpdate
forceBoot: true

lifecycle:
  preStop:
    exec:
      command: [rabbitmqctl, shutdown]

EDIT: Please double-check the indentation of lifecycle in your own values file.

@ytjohn

ytjohn commented May 19, 2020

I haven't tried forceBoot: true yet, but it seemed a rolling update would pretty much take care of the need for forceBoot. That said, I don't think it will hurt either, so we will go ahead and set them both, and if that keeps the Mnesia table issue from popping up, we'll call it good. Thank you.

@Davidrjx

Davidrjx commented Jun 6, 2020

@akrepon deleting the Mnesia database would bring the crashed pod back as a standalone or blank node.

@Davidrjx

Davidrjx commented Jun 7, 2020


@hakanozanlagan I do not fully understand your solution about the cluster_name change; is it just touching a force_load file in the mnesia data dir?

@minhnguyenvan95


For RabbitMQ installed from the rabbitmq-ha Helm chart, the equivalent directory (where @denis111 created the force_load file) is /var/lib/rabbitmq/mnesia/rabbit@rabbitmq-ha-0.rabbitmq-ha-discovery.acbo-queues.svc.cluster.local
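A hedged way to avoid hard-coding the node directory at all, assuming the rabbitmq-ha layout above and that the node name follows the rabbit@<pod FQDN> convention seen in this thread:

# Run inside the pod; the node directory name is assumed to be rabbit@<pod FQDN>.
touch "/var/lib/rabbitmq/mnesia/rabbit@$(hostname -f)/force_load"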
