[stable/rabbitmq] Recover from "Waiting for Mnesia tables" after all nodes forced shutdown #13485
Hi @denis111,
Link to the documentation: https://www.rabbitmq.com/rabbitmqctl.8.html#force_boot Could you try whether the force_boot command helps in your case?
@miguelaeh Thank you for answering. No, we can't know the order; this is just an autoscaling group scheduled to scale to 0 instances at night, when nobody is working. It's unacceptable for us if the cluster cannot recover from such a "disaster" (a sort of "unexpected" shutdown of all nodes), so in that case we would rather not use persistence, because we prefer availability over integrity. I hope to try playing with force_boot this Friday, and I will let you know if it worked.
Thank you @denis111,
Well, first, I can't execute rabbitmqctl force_boot because it says "Error: this command requires the 'rabbit' app to be stopped on the target node. Stop it with 'rabbitmqctl stop_app'." But if we stop the app, the pod just restarts without giving us a chance to execute rabbitmqctl force_boot...
I'm glad it worked. |
Yes, but how to automate it? I mean the creation of the force_load file, in some init container maybe...
You could try mounting the file in the init container via a ConfigMap, but if you don't need to execute any command before the main container starts, I guess you could just mount the file in the main container (also with a ConfigMap).
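For reference, a minimal sketch of that ConfigMap idea, assuming you patch the StatefulSet yourself (as noted below, the chart had no hooks for extra volumes at the time); all names and the mount path are illustrative, and the real mnesia directory includes the node name:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: rabbitmq-force-load     # illustrative name
data:
  force_load: ""                # RabbitMQ only checks that the file exists
---
# Fragment that would have to be merged into the StatefulSet pod spec by hand
# (or in a fork of the chart):
# spec:
#   template:
#     spec:
#       volumes:
#         - name: force-load
#           configMap:
#             name: rabbitmq-force-load
#       containers:
#         - name: rabbitmq
#           volumeMounts:
#             - name: force-load
#               # the real mnesia dir includes the node name, e.g.
#               # /var/lib/rabbitmq/mnesia/rabbit@<pod>.<service>... ,
#               # so this static path is only illustrative
#               mountPath: /var/lib/rabbitmq/mnesia/force_load
#               subPath: force_load
```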
I can't find the "helm way" to do it; the existing ConfigMap template in the rabbitmq chart doesn't allow adding extra files, and the StatefulSet template doesn't allow adding an extra init container or an extra volume mount...
The chart does not support that at the moment. You would have to add it manually (you can clone the repository and modify the chart to your needs).
Well, we'd like to use the mainstream chart; I'll see if I can make a pull request then.
I've found that if we enable the forceBoot option on a new install without an existing PVC (with a clean new volume), then RabbitMQ is unable to start, failing with "Error: enoent". I'm creating a PR to address this issue.
…helm#13485) (helm#14491)
* [stable/rabbitmq] fix Error: enoent with forceBoot on new install (see helm#13485). Signed-off-by: Denis Kalgushkin <[email protected]>
* [stable/rabbitmq] Bump chart version for PR 14491 (see helm#13485). Signed-off-by: Denis Kalgushkin <[email protected]>
…helm#13485) (helm#14491)
* [stable/rabbitmq] fix Error: enoent with forceBoot on new install (see helm#13485). Signed-off-by: Denis Kalgushkin <[email protected]>
* [stable/rabbitmq] Bump chart version for PR 14491 (see helm#13485). Signed-off-by: Denis Kalgushkin <[email protected]>
Signed-off-by: Andrii Nasinnyk <[email protected]>
Hey @denis111, so I'm having this issue with the latest version of
Do I need to make any modifications to the
Actually, removing the
I think the easiest solution is to have an init container which deletes the mnesia folder during startup.
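A rough sketch of what such an init container could look like (an assumption, not something the chart supports out of the box; note that this wipes all persisted RabbitMQ state, which later comments point out brings nodes back blank):

```yaml
# Hypothetical StatefulSet fragment; "data" is assumed to be the name of the
# chart's persistent volume claim template.
initContainers:
  - name: clear-mnesia
    image: busybox:1.31
    # WARNING: discards queues, messages and users persisted in mnesia.
    command: ["sh", "-c", "rm -rf /var/lib/rabbitmq/mnesia/*"]
    volumeMounts:
      - name: data
        mountPath: /var/lib/rabbitmq/mnesia
```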
If the rabbit node is waiting for the other nodes to come up, and they're not coming up because the StatefulSet is booting them sequentially, how about Parallel: "will create pods in parallel to match the desired scale without waiting, and on scale down will delete all pods at once" - https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.11/#statefulsetspec-v1-apps
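A minimal values.yaml fragment for this suggestion, assuming the chart exposes podManagementPolicy as a value (as later comments in this thread indicate):

```yaml
# Start (and delete) all RabbitMQ pods at once instead of one by one,
# so no node blocks waiting for peers that are never scheduled.
podManagementPolicy: Parallel
```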
I did some basic testing with this: as expected, all nodes start simultaneously and the cluster seems to recover correctly. Further, with missing volumes, node discovery / cluster init seems to work as expected; however, the cluster name was randomized, if that matters to you.
@andylippitt smart idea to use podManagementPolicy: Parallel |
Ignore my comment above; updateStrategy: RollingUpdate will take care of that situation. Do you think there is any edge-case scenario where having podManagementPolicy: Parallel will create an outage, since it takes down all 3 pods at once?
I have found a problem with podManagementPolicy: Parallel. There's a race condition on initialization if you don't specify
Yes, I did test with it. Additional setting in custom values.yaml:
@vkasgit were you specifying an explicit value for
Edit: I think the issue is not a concurrency issue; rather, in our case we just ran into this: https://github.com/helm/charts/issues/5167
tl;dr: specify rabbitmqErlangCookie in your prod installs
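For reference, a hedged values.yaml fragment for pinning the cookie; the exact key name depends on the chart (rabbitmqErlangCookie in stable/rabbitmq-ha, as used in the linked issue, versus rabbitmq.erlangCookie in stable/rabbitmq), and the value below is a placeholder:

```yaml
# stable/rabbitmq-ha style:
rabbitmqErlangCookie: "replace-with-a-long-random-string"

# stable/rabbitmq style (pick the one matching your chart):
# rabbitmq:
#   erlangCookie: "replace-with-a-long-random-string"
```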
In our installation that was failing, I had the secret pre-created with the Erlang cookie.
is it safe to remove the pvc? |
No, it's not safe; this mount keeps important files (DB files, etc.). My problem was solved with the method below (changing the cluster name): run the exec command when the pod state is Running, and watch the pod statuses.
@hakanozanlagan first of all, thanks for posting your solution. I tried the same steps and unfortunately, I am using the user
@rakeshnambiar In your Helm chart values.yml, did you explicitly try setting the force_boot flag to true? Try that option. Also check your user permissions as well; those can also be set through the values.yml.
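For anyone following along, the chart-level flag referred to here looks roughly like this in values.yaml; the exact key can differ between chart versions, so check your chart's values.yaml:

```yaml
# Enables the workaround from this issue: the node is told not to wait for its
# peers on boot (RabbitMQ's force_load / force_boot mechanism).
forceBoot:
  enabled: true
```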
Hi @vkasgit, thanks, the
@vkasgit We are occasionally running into this mnesia table issue ourselves (which we have been fixing by deleting the PVC). I was curious whether setting updateStrategy to RollingUpdate (instead of the default OnDelete) eliminates the need for force_boot? Our podManagementPolicy is the default of OrderedReady. In fact, other than a basic password and policies, our values are otherwise default.
@ytjohn Do you have the
The following are some settings I have, and I haven't run into the mnesia issue so far (knock on wood). Try adding a lifecycle hook and see if that helps. What that does is: when the RMQ node is forcefully taken down, the preStop command kicks in and performs a graceful termination.
podManagementPolicy: OrderedReady
lifecycle:
EDIT: please check the indents of lifecycle; I was unable to indent it properly in this comment.
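Roughly, the settings described above would look like this in values.yaml; the preStop command is an assumption (a common graceful-shutdown hook), since the original comment's YAML did not survive formatting, and whether the chart passes a lifecycle value through to the container depends on the chart version:

```yaml
podManagementPolicy: OrderedReady
updateStrategy:
  type: RollingUpdate      # key shape varies between chart versions
lifecycle:
  preStop:
    exec:
      # Assumed hook: stop the RabbitMQ application cleanly before the pod is
      # terminated, so the node does not go down as an "unexpected" stop.
      command: ["/bin/sh", "-c", "rabbitmqctl stop_app"]
```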
I haven't tried forceBoot: true yet, but it seemed a rolling update would pretty much take care of the need for forceBoot. That said, I don't think it will hurt either, so we will go ahead and set them both, and if that keeps the mnesia table issue from popping up, we'll call it good. Thank you.
@akrepon deleting the mnesia database would bring the crashed pod back as a standalone or blank node.
I do not fully understand your solution about the cluster_name change; is it just touching the force_load file in the mnesia data dir?
For a rabbitmq install from the Helm chart, it is /var/lib/rabbitmq/mnesia/rabbit@rabbitmq-ha-0.rabbitmq-ha-discovery.acbo-queues.svc.cluster.local
Is your feature request related to a problem? Please describe.
Yes. In our dev/staging environment in AWS we turn off the EKS cluster nodes at night (set the scaling group to 0) to save costs, and we have persistence enabled for rabbitmq, so this acts like an unexpected shutdown of all rabbitmq nodes, and the cluster can't recover: it hangs on "Waiting for Mnesia tables". I tried setting podManagementPolicy and the service.alpha.kubernetes.io/tolerate-unready-endpoints: "true" annotation, but it had no effect; it still keeps looping with "Waiting for Mnesia tables".
Describe the solution you'd like
Could the solution from #9645 (comment) be applied as an option? We'd really prefer availability over integrity.
Describe alternatives you've considered
Don't use persistence.