[bitnami/rabbitmq]: Error while waiting for Mnesia tables #2868
Since chart version 7.0.0 I can't even install the chart because of this error. After 15 minutes, only 2 of 3 pods are up and constantly restarting.
EDIT: I uninstalled the chart, reverted the values, and installed chart v6.28.1 - everything works fine.
This is probably related to this issue, which was already addressed and fixed in this PR. Sorry for the inconvenience. Could you please try the latest chart version?
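For reference, picking up the latest chart version is usually just (the release name here is illustrative):

$ helm repo update
$ helm upgrade my-release bitnami/rabbitmq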
Hi @juan131, thanks for the reply! Unfortunately, it appears I'm getting a YAML parse error when attempting the upgrade.
The command I used was:
Edit: it looks like this happens with a fresh install as well.
@juan131 This issue happens with 7.0.3 as well.
@juan131 Helm chart v7.0.3 fixed the problem during an install from scratch. All pods were ready after 5 minutes, with no restarts after 20 minutes. I didn't test upgrading, though.
Hi, we are having the same issue. It seems DNS resolution is lost between the nodes, which is why they cannot connect after the update. It looks like the names are not resolved:
The error is:
@juan131 I upgraded to image docker.io/bitnami/rabbitmq:3.8.5-debian-10-r14 with chart 7.1.1
@juan131 I tried a workaround - I added
Did you specify
@josefschabasser Of course. I made sure it was reflected in the secret, and I can also see the values in the environment variables inside the pod.
Hi, an easy way to get into this situation is this:
It is unsafe to try to change the user/password after deployment: RabbitMQ only inserts these users if no data is present, so the health checks start to fail and the nodes cannot recover. If this functionality is needed, maybe the API could be used to insert/update the user instead of the config file (see the sketch below). I am not sure if the original issue is still present, as I could not reproduce it now.
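For example, a minimal sketch of what that could look like, using rabbitmqctl against the live node (the user name and password here are hypothetical):

# changes the password on the running node; no config file edit or restart involved
$ rabbitmqctl change_password user newsecret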
@luos That's not what happened.
I've just tried 7.1.1 with a single node; the only config I specified was enabling ingress and the hostname. I can't log into the management console using the user and the password saved in the secret, and the readiness/liveness probes both return 401, which eventually kills the container. I've tried playing around by specifying the username and password with a clean install, but to no effect.
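For anyone comparing credentials, this is roughly how I read the generated password back out of the secret (the secret name and key are assumptions based on the chart's usual naming):

$ kubectl get secret my-release-rabbitmq -o jsonpath="{.data.rabbitmq-password}" | base64 --decode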
Hi everyone, as @josefschabasser mentioned, it's necessary to specify those values. I know this can be very painful, and it's even more problematic when you manually change the password through the Management console after installing the chart.

Another thing you must take into account is that PVCs are not removed when you uninstall a chart release; they must be removed manually. Therefore, if you install a chart using the same release name as a previous one, you can run into conflicts, since it will reuse the existing PVCs. See:

$ helm install foo bitnami/rabbitmq
$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
data-foo-rabbitmq-0 Bound pvc-06f78040-6bbb-4bdb-a14a-5a3c5392685a 8Gi RWO standard 4s
$ helm uninstall foo
$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
data-foo-rabbitmq-0 Bound pvc-06f78040-6bbb-4bdb-a14a-5a3c5392685a 8Gi RWO standard 45s

This is a known issue I reported a long time ago (see helm/helm#5156), so please ensure you remove PVCs after uninstalling your charts.
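For the example above, the cleanup would be (PVC name taken from the kubectl get pvc output):

$ kubectl delete pvc data-foo-rabbitmq-0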
I seem to be encountering this error when my cluster hits its disk alarm. I just ran a stress test where I loaded messages in until I hit the alarm:
Followed by what appears to be a reboot sequence:
And finally culminating in this Mnesia table issue:
I also see some other errors below this, but I am not sure if they are relevant:
This is pretty concerning as it seems to indicate that, if I ever hit my disk alarm, the entire cluster is going to crash and enter an unrecoverable state that requires a complete reinstall of the chart. If this is the case, this would mean the system is completely unusable for my needs. Am I missing something obvious here?
In the current implementation, the health check endpoint returns a failure when an alarm is active, which causes the node to be restarted. An active alarm is usually not a reason to restart RabbitMQ, so maybe a different endpoint could be used?
I think @luos is right and it could be related to K8s restarting the pod due to failing liveness probes. @logicbomb421 could you please run the same test with the probes disabled?
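A sketch of that test, using the probe toggles this chart exposes (the release name is illustrative):

$ helm upgrade my-release bitnami/rabbitmq \
    --set livenessProbe.enabled=false \
    --set readinessProbe.enabled=false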
This hypothesis was indeed correct: disabling the probes causes RabbitMQ to handle the disk alarm as I would expect. Thanks again for all the attention here!
Thanks so much for helping with this. Maybe we can try using:

$ helm install my-release bitnami/rabbitmq --set image.repository=juanariza131/rabbitmq --set image.tag=3.8-debian-10

This image just contains the changes below:

diff --git a/3.8/debian-10/rootfs/opt/bitnami/scripts/rabbitmq/healthcheck.sh b/3.8/debian-10/rootfs/opt/bitnami/scripts/rabbitmq/healthcheck.sh
index 7ef705e..a05ca56 100755
--- a/3.8/debian-10/rootfs/opt/bitnami/scripts/rabbitmq/healthcheck.sh
+++ b/3.8/debian-10/rootfs/opt/bitnami/scripts/rabbitmq/healthcheck.sh
@@ -14,7 +14,7 @@ set -o pipefail
eval "$(rabbitmq_env)"
if [[ -f "${RABBITMQ_LIB_DIR}/.start" ]]; then
- rabbitmqctl node_health_check
+ rabbitmqctl ping
RESULT=$?
if [[ $RESULT -ne 0 ]]; then
rabbitmqctl status
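If you want to exercise the lighter check by hand, something like this should work (the release name is illustrative):

# rabbitmqctl ping only verifies the node is up and responding
$ kubectl exec my-release-rabbitmq-0 -- rabbitmqctl ping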
Hi @juan131, it appears there is still a call to the deprecated health check somewhere. I have installed the chart with the suggested image.
Further, when I do hit the disk alarm, all RMQ nodes still end up restarting, though somehow I'm not hitting the Mnesia table issue this time. Unfortunately, since all nodes end up restarting, any existing connections are lost, and I believe there is also a gap in message persistence to disk (it crashed at 26GiB left; when it rebooted, it showed 29GiB available, leading me to believe it crashed before fully persisting some messages). Here are the logs for one of the nodes from the time I began the test until the crash:
Thanks again for the attention on this! Please let me know if there is anything else I can do to assist.
Thanks for helping debug this. Now it seems the issue is related to the API check. We're expecting the
Let's try a different approach. Please create a custom-values.yaml like the one below:

readinessProbe:
enabled: false
customReadinessProbe:
exec:
command:
- sh
- -c
- rabbitmq-diagnostics -q check_running
initialDelaySeconds: 10
timeoutSeconds: 20
periodSeconds: 30
failureThreshold: 3
successThreshold: 1
livenessProbe:
enabled: false
customLivenessProbe:
exec:
command:
- sh
- -c
- rabbitmq-diagnostics -q check_running
initialDelaySeconds: 120
timeoutSeconds: 20
periodSeconds: 30
failureThreshold: 6
  successThreshold: 1

Then, try installing the chart using:
Let's see if that diagnostic check is less problematic. In case it's still giving problems, we can relax the check and use
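For example, something along these lines, passing the values file created above (the release name is illustrative):

$ helm install my-release bitnami/rabbitmq -f custom-values.yaml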
@juan131 is there any update on the Mnesia tables issue?
Hi @ransoor2, I was not able to reproduce that issue on my side. Could you please provide this information?
@juan131 So, it seems that with a new deployment, the Mnesia tables issue resolves when disabling the readiness probe.
@juan131, I tried the custom probes and they seem to be more appropriate. From my tests, the probes properly wait for a fully booted node and do not seem to be affected by alarms. They should be the default in the chart.
@mboutet that's great!! Glad to hear that. Let's wait for other users to confirm that they don't find issues with these probes and, if that's the case, I agree with making them the default ones.
@ransoor2 as we mentioned in the chart README.md, upgrades from version 6 are not supported due to the number of breaking changes included in the 7.0.0 major version: https://github.com/bitnami/charts/tree/master/bitnami/rabbitmq#to-700
I just created a PR to make these probes the default ones. Thanks so much for helping debugging this.
Which chart:
bitnami/[email protected]
Describe the bug
It seems whenever the StatefulSet needs to update, but still has an existing PVC, we get stuck in a loop during initialization with the following error:
To Reproduce
Steps to reproduce the behavior:
1. Install the chart (using values-production.yaml) and wait for stabilization
2. Add prometheus.return_per_object_metrics = true to extraConfiguration (see the sketch below), then run helm upgrade
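A sketch of the values change from step 2, assuming extraConfiguration is a chart value that accepts raw rabbitmq.conf lines:

extraConfiguration: |-
  ## return per-object metrics from the Prometheus plugin
  prometheus.return_per_object_metrics = true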
Expected behavior
The chart is able to update and rollout successfully.
Version of Helm and Kubernetes:
helm version:
kubectl version:
Additional context
I found helm/charts#13485 in the original stable/rabbitmq chart repo, which led me to read the section on restarting clustered nodes in the RabbitMQ docs. This seems to suggest that the fix for this is enabling force_boot, which I see this chart supports. I am new to running a RabbitMQ cluster myself, so I am slightly unclear on whether this is okay to enable. It also seems to suggest that, with force_boot enabled, if a node in my cluster goes down, it will not synchronize with the master at all when it comes back up? I could be completely wrong here, though.

Please let me know if any additional information is needed. Thanks!
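For reference, my understanding is that this maps to RabbitMQ's own force-boot mechanism, which can also be triggered by hand on a node (a sketch; as I understand it, this only tells the next boot not to wait for its peers' Mnesia tables, rather than disabling synchronization permanently):

# marks the node so its next start does not wait for cluster peers
$ rabbitmqctl force_boot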