[bitnami/rabbitmq]: Error while waiting for Mnesia tables #2868

Closed
logicbomb421 opened this issue Jun 18, 2020 · 30 comments · Fixed by #3016

@logicbomb421
Contributor

Which chart:
bitnami/[email protected]

Describe the bug
It seems that whenever the StatefulSet needs to update but still has an existing PVC, the pods get stuck in a loop during initialization with the following error:

rabbitmq-2 rabbitmq 2020-06-18 03:41:58.837 [info] <0.268.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
rabbitmq-2 rabbitmq 2020-06-18 03:41:58.837 [warning] <0.268.0> Error while waiting for Mnesia tables: {failed_waiting_for_tables,['[email protected]','[email protected]','[email protected]'],{node_not_running,'[email protected]'}}

To Reproduce
Steps to reproduce the behavior:

  1. Install the chart (I used values-production.yaml) and wait for it to stabilize.
  2. Make a change to the chart values (I added prometheus.return_per_object_metrics = true to extraConfiguration; see the snippet after this list) and run helm upgrade.
  3. The StatefulSet pods start to roll out, but the first one that restarts loops continuously with the above error for 10 retries, then restarts.
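
For reference, the change in step 2 was roughly the following values snippet (a sketch; extraConfiguration is the chart's free-form RabbitMQ configuration block):

extraConfiguration: |-
  prometheus.return_per_object_metrics = true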

Expected behavior
The chart is able to update and rollout successfully.

Version of Helm and Kubernetes:

  • Output of helm version:
Client: &version.Version{SemVer:"v2.16.7", GitCommit:"5f2584fd3d35552c4af26036f0c464191287986b", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.16.7", GitCommit:"5f2584fd3d35552c4af26036f0c464191287986b", GitTreeState:"clean"}
  • Output of kubectl version:
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2", GitCommit:"52c56ce7a8272c798dbc29846288d7cd9fbae032", GitTreeState:"clean", BuildDate:"2020-04-16T23:35:15Z", GoVersion:"go1.14.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.10-gke.36", GitCommit:"34a615f32e9a0c9e97cdb9f749adb392758349a6", GitTreeState:"clean", BuildDate:"2020-04-06T16:33:17Z", GoVersion:"go1.12.12b4", Compiler:"gc", Platform:"linux/amd64"}

Additional context
I found helm/charts#13485 in the original stable/rabbitmq chart repo, which led me to read the section on restarting clustered nodes in the RabbitMQ docs. That section seems to suggest that the fix for this is enabling force_boot, which I see this chart supports. I am new to running a RabbitMQ cluster myself, so I am slightly unclear on whether this is okay to enable:

Alternatively force_boot rabbitmqctl command can be used on a node to make it boot without trying to sync with any peers (as if they were last to shut down)

This seems to suggest that, with force_boot enabled, if a node in my cluster goes down, it will not synchronize with the master at all when it comes back up? I could be completely wrong here, though.
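
If it is safe, I assume it would be enabled with a values snippet roughly like this (based on the chart's clustering.forceBoot parameter; not yet verified on my side):

clustering:
  forceBoot: true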

Please let me know if any additional information is needed. Thanks!

@josefschabasser

josefschabasser commented Jun 18, 2020

Since chart version 7.0.0 I can't even install the chart because of this error. After 15 minutes, only 2 of 3 pods are up and constantly restarting.

$ kubectl get pods
NAME         READY   STATUS    RESTARTS   AGE
rabbitmq-0   0/1     Running   3          15m
rabbitmq-1   0/1     Running   3          12m
# values.yaml
metrics:
  enabled: false
persistence:
  enabled: true
  size: 1Gi
auth:
  erlangCookie: [MY_ERLANG_COOKIE]
  password: [MY_PASSWORD]
  username: [MY_USERNAME]
rbac:
  create: true
replicaCount: 3
resources:
  requests:
    cpu: 100m
    memory: 256Mi
updateStrategyType: RollingUpdate

EDIT: I uninstalled the chart, reverted the values and installed chart v6.28.1 - everything works fine.

@juan131
Contributor

juan131 commented Jun 19, 2020

Hi @logicbomb421

This is probably related to this issue, which was already addressed and fixed in this PR. Sorry for the inconvenience.

Could you please give the latest chart version a try?

@logicbomb421
Contributor Author

logicbomb421 commented Jun 19, 2020

Hi @juan131, thanks for the reply!

Unfortunately it appears I'm getting a YAML parse error when attempting to upgrade to 7.0.2:

UPGRADE FAILED
Error: YAML parse error on rabbitmq/templates/statefulset.yaml: error converting YAML to JSON: yaml: line 111: did not find expected '-' indicator
Error: UPGRADE FAILED: YAML parse error on rabbitmq/templates/statefulset.yaml: error converting YAML to JSON: yaml: line 111: did not find expected '-' indicator

The command I used was:

helm upgrade rabbitmq bitnami/rabbitmq --version 7.0.2 \
  --values values.yaml \
  --set auth.password=$(kubectl get secret --namespace mhill rabbitmq -o jsonpath="{.data.rabbitmq-password}" | base64 --decode) \
  --set auth.earlangCookie=$(kubectl get secret --namespace mhill rabbitmq -o jsonpath="{.data.rabbitmq-password}" | base64 --decode) \
  --reuse-values

The values.yaml I am using is a copy of values-production.yaml with the configuration tweaks I have made in my environment. If you need to see these, please let me know and I'll be happy to provide. Thanks!


Edit: looks like this happens with a fresh install of 7.0.2 as well:

helm install -n rabbitmq bitnami/rabbitmq --version 7.0.2 --values values.yaml

@ransoor2

@juan131 This issue happens with 7.0.3 as well.
To reproduce the issue, I created a new deployment from scratch, which works fine.
Then I deleted the entire StatefulSet, and when upgrading again I get the "Error while waiting for Mnesia tables" error.
Trying the clustering.forceBoot option does not work either (a different error appears).
Waiting for an update on this issue.

@josefschabasser

@juan131 Helm chart v7.0.3 fixed the problem during an install from scratch. All pods were ready after 5 minutes, with no restarts after 20 minutes. I didn't test upgrading, though.

@luos

luos commented Jun 22, 2020

Hi,

We are having the same issue. It seems DNS resolution is lost between the nodes, which is why they cannot connect after updating replicaCount in the values. Tested with version 7.1.0.

It seems the names are not resolved:

2> erl_epmd:names("192.168.58.188").
{ok,[{"rabbit",25672}]}
3> erl_epmd:names("my-rmqgh-rabbitmq-0.my-rmqgh-rabbitmq-headless.default.svc.cluster.local").
{error,nxdomain}

The error is:

2020-06-22 15:51:44.517 [warning] <0.268.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['rabbit@my-rmqgh-rabbitmq-2.my-rmqgh-rabbitmq-headless.default.svc.cluster.local','rabbit@my-rmqgh-rabbitmq-1.my-rmqgh-rabbitmq-headless.default.svc.cluster.local','rabbit@my-rmqgh-rabbitmq-0.my-rmqgh-rabbitmq-headless.default.svc.cluster.local'],[rabbit_durable_queue]}
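
A quick way to confirm whether the headless-service records resolve is to query DNS from inside one of the pods; a rough sketch using the names from the release above (adjust to your deployment, and assuming getent is available in the container):

$ kubectl exec -it my-rmqgh-rabbitmq-0 -- getent hosts my-rmqgh-rabbitmq-1.my-rmqgh-rabbitmq-headless.default.svc.cluster.local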

@juan131
Contributor

juan131 commented Jun 23, 2020

Hi everyone! Thanks for reporting this and sorry for the inconvenience.

I think the issue should be solved with the fix done by @arodus at #2894

Could you please give it a try?

@ransoor2

@juan131 I upgraded to image docker.io/bitnami/rabbitmq:3.8.5-debian-10-r14 with chart 7.1.1.
The issue persists.

@ransoor2

@juan131 I tried a workaround - I added
podManagementPolicy: Parallel
and after some restarts it seems to be working, but now I get:
PLAIN login refused: user '' - invalid credentials
Why was the password changed?

@josefschabasser

@juan131 I tried a workaround - I added
podManagementPolicy: Parallel
and after some restarts it seems to be working, but now I get:
PLAIN login refused: user '' - invalid credentials
Why was the password changed?

Did you specify auth.username, auth.password, and auth.erlangCookie during the upgrade? If not, new random values are generated. And the username shown in the error hints at no auth data at all.
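
For example, a rough sketch of an upgrade that reuses the credentials already stored in the release secret (the rabbitmq-erlang-cookie key name is an assumption here and should be checked against your release; the password key appears earlier in this thread):

$ helm upgrade rabbitmq bitnami/rabbitmq --reuse-values \
    --set auth.username=user \
    --set auth.password=$(kubectl get secret rabbitmq -o jsonpath="{.data.rabbitmq-password}" | base64 --decode) \
    --set auth.erlangCookie=$(kubectl get secret rabbitmq -o jsonpath="{.data.rabbitmq-erlang-cookie}" | base64 --decode)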

@ransoor2

@josefschabasser Of course. I made sure they were reflected in the secret, and I can also see them in the environment variables inside the pod.

@luos

luos commented Jun 23, 2020

Hi,

An easy way to get into this situation is this:

  1. deploy the cluster with a fixed user / password / cookie.
  2. change the username to something else
  3. upgrade the cluster, realize that this was a bad idea because now healthchecks are failing
  4. change the user back to the original
  5. upgrade the cluster to try to revert
  6. delete the pods one-by-one to recover, starting with -0.

It is unsafe to try to change the user/password after deployment: RabbitMQ only inserts these users if no data is present, so health checks start to fail and the nodes cannot recover.

If this functionality is needed, maybe the API could be used to insert/update the user instead of the config file.
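
To make that concrete, a rough sketch of updating the password through RabbitMQ itself rather than the chart values, assuming the chart's default user name (user):

$ kubectl exec -it rabbitmq-0 -- rabbitmqctl change_password user "$NEW_PASSWORD"
# or via the management HTTP API (port 15672), again assuming user "user":
$ curl -u user:"$OLD_PASSWORD" -X PUT http://localhost:15672/api/users/user \
    -H "content-type: application/json" \
    -d '{"password":"'"$NEW_PASSWORD"'","tags":"administrator"}'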

I am not sure if the original issue is still present as I could not reproduce it now.

@ransoor2

@luos That's not what happened.
When I first upgraded the chart and had the issues above, I tried the podManagementPolicy: Parallel workaround; that's when I first encountered the password issues.
After that I did remove the password from the YAML file and let it generate a random password, but that was only after the first password issue.
I was now able to cancel the health checks and change the password, but it seems all the data was deleted from RabbitMQ and the cluster was reset. This sucks.
Anyway, this seems not relevant to this thread, but my guess would be that it happened when I first tried to upgrade the chart and forgot to rename the rabbitmq section of my values to auth.

@Danh4

Danh4 commented Jun 23, 2020

I've just tried 7.1.1 with a single node; the only config I specified was enabling ingress and setting the hostname.

I can't log into the management console using user and the password saved in the secret, and the readiness/liveness probes both return 401, which eventually kills the container. I've tried playing around by specifying the username and password on a clean install, but to no effect.

@juan131
Contributor

juan131 commented Jun 24, 2020

Hi everyone,

As @josefschabasser mentioned, it's necessary to specify auth.username, auth.password, and auth.erlangCookie during upgrades. This is documented in the chart README.

I know this can be very painful. And it's even more problematic when you manually change the password through the Management console after installing the chart.

Another thing you must take into account is that PVCs are not removed when you uninstall a chart release; they must be removed manually. Therefore, if you install a chart using the same release name as a previous one, you can run into conflicts since it will reuse the existing PVCs. See:

$ helm install foo bitnami/rabbitmq
$ kubectl get pvc
NAME                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
data-foo-rabbitmq-0   Bound    pvc-06f78040-6bbb-4bdb-a14a-5a3c5392685a   8Gi        RWO            standard       4s
$ helm uninstall foo
$ kubectl get pvc
NAME                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
data-foo-rabbitmq-0   Bound    pvc-06f78040-6bbb-4bdb-a14a-5a3c5392685a   8Gi        RWO            standard       45s

This is a known issue I reported a long time ago (see helm/helm#5156), so please ensure you remove PVCs after uninstalling your charts.
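
As an illustration, something along these lines removes the leftover PVC from the example above; double-check the names first, since this deletes the data:

$ kubectl delete pvc data-foo-rabbitmq-0
# or, assuming the chart's standard labels, remove everything left behind by the release:
$ kubectl delete pvc -l app.kubernetes.io/instance=foo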

@logicbomb421
Contributor Author

logicbomb421 commented Jun 24, 2020

I seem to be encountering this error when my cluster hits its disk_free_limit now, too (chart v7.0.3).

I just ran a stress test where I loaded messages in until I hit the disk_free_limit (25GB limit, 128GB total disk). When the limit was hit, I saw the blocked message in the logs:

rabbitmq-1 rabbitmq 2020-06-24 21:50:47.355 [info] <0.384.0> Free disk space is insufficient. Free bytes: 26536415232. Limit: 26843545600
rabbitmq-1 rabbitmq 2020-06-24 21:50:47.355 [warning] <0.380.0> disk resource limit alarm set on node '[email protected]'.
rabbitmq-1 rabbitmq
rabbitmq-1 rabbitmq **********************************************************
rabbitmq-1 rabbitmq *** Publishers will be blocked until this alarm clears ***
rabbitmq-1 rabbitmq **********************************************************

Followed by what appears to be a reboot sequence:

rabbitmq-1 rabbitmq 2020-06-24 21:50:47.355 [info] <0.268.0> Running boot step code_server_cache defined by app rabbit
rabbitmq-1 rabbitmq 2020-06-24 21:50:47.355 [info] <0.268.0> Running boot step file_handle_cache defined by app rabbit
rabbitmq-1 rabbitmq 2020-06-24 21:50:47.355 [info] <0.387.0> Limiting to approx 1048479 file handles (943629 sockets)
rabbitmq-1 rabbitmq 2020-06-24 21:50:47.355 [info] <0.388.0> FHC read buffering:  OFF
rabbitmq-1 rabbitmq 2020-06-24 21:50:47.355 [info] <0.388.0> FHC write buffering: ON
rabbitmq-1 rabbitmq 2020-06-24 21:50:47.356 [info] <0.268.0> Running boot step worker_pool defined by app rabbit
rabbitmq-1 rabbitmq 2020-06-24 21:50:47.356 [info] <0.377.0> Will use 8 processes for default worker pool
rabbitmq-1 rabbitmq 2020-06-24 21:50:47.356 [info] <0.377.0> Starting worker pool 'worker_pool' with 8 processes in it
rabbitmq-1 rabbitmq 2020-06-24 21:50:47.356 [info] <0.268.0> Running boot step database defined by app rabbit

And finally culminating in this Mnesia table issue:

rabbitmq-1 rabbitmq 2020-06-24 21:50:47.364 [info] <0.268.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
rabbitmq-1 rabbitmq 2020-06-24 21:51:17.365 [warning] <0.268.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['[email protected]','[email protected]'],[rabbit_durable_queue]}
rabbitmq-1 rabbitmq 2020-06-24 21:51:17.365 [info] <0.268.0> Waiting for Mnesia tables for 30000 ms, 8 retries left
rabbitmq-1 rabbitmq 2020-06-24 21:51:47.366 [warning] <0.268.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['[email protected]','[email protected]'],[rabbit_durable_queue]}
rabbitmq-1 rabbitmq 2020-06-24 21:51:47.366 [info] <0.268.0> Waiting for Mnesia tables for 30000 ms, 7 retries left
rabbitmq-1 rabbitmq 2020-06-24 21:52:17.367 [warning] <0.268.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['[email protected]','[email protected]'],[rabbit_durable_queue]}
rabbitmq-1 rabbitmq 2020-06-24 21:52:17.367 [info] <0.268.0> Waiting for Mnesia tables for 30000 ms, 6 retries left
rabbitmq-1 rabbitmq 2020-06-24 21:52:47.368 [warning] <0.268.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['[email protected]','[email protected]'],[rabbit_durable_queue]}
rabbitmq-1 rabbitmq 2020-06-24 21:52:47.368 [info] <0.268.0> Waiting for Mnesia tables for 30000 ms, 5 retries left
rabbitmq-1 rabbitmq 2020-06-24 21:53:17.369 [warning] <0.268.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['[email protected]','[email protected]'],[rabbit_durable_queue]}
rabbitmq-1 rabbitmq 2020-06-24 21:53:17.369 [info] <0.268.0> Waiting for Mnesia tables for 30000 ms, 4 retries left
rabbitmq-1 rabbitmq 2020-06-24 21:53:47.370 [warning] <0.268.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['[email protected]','[email protected]'],[rabbit_durable_queue]}
rabbitmq-1 rabbitmq 2020-06-24 21:53:47.370 [info] <0.268.0> Waiting for Mnesia tables for 30000 ms, 3 retries left
rabbitmq-1 rabbitmq 2020-06-24 21:54:08.744 [info] <0.60.0> SIGTERM received - shutting down
rabbitmq-1 rabbitmq 2020-06-24 21:54:08.745 [warning] <0.268.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['[email protected]','[email protected]'],[rabbit_durable_queue]}
rabbitmq-1 rabbitmq 2020-06-24 21:54:08.745 [info] <0.268.0> Waiting for Mnesia tables for 30000 ms, 2 retries left
rabbitmq-1 rabbitmq 2020-06-24 21:54:08.745 [warning] <0.268.0> Error while waiting for Mnesia tables: {failed_waiting_for_tables,['[email protected]','[email protected]'],{node_not_running,'[email protected]'}}
rabbitmq-1 rabbitmq 2020-06-24 21:54:08.746 [info] <0.268.0> Waiting for Mnesia tables for 30000 ms, 1 retries left
rabbitmq-1 rabbitmq 2020-06-24 21:54:08.746 [warning] <0.268.0> Error while waiting for Mnesia tables: {failed_waiting_for_tables,['[email protected]','[email protected]'],{node_not_running,'[email protected]'}}
rabbitmq-1 rabbitmq 2020-06-24 21:54:08.746 [info] <0.268.0> Waiting for Mnesia tables for 30000 ms, 0 retries left

I also see some other errors below this, but I am not sure if they are relevant:

rabbitmq-1 rabbitmq 2020-06-24 21:54:08.746 [error] <0.268.0> Feature flag `quorum_queue`: migration function crashed: {error,{failed_waiting_for_tables,['[email protected]','[email protected]'],{node_not_running,'[email protected]'}}}
rabbitmq-1 rabbitmq [{rabbit_table,wait,3,[{file,"src/rabbit_table.erl"},{line,120}]},{rabbit_core_ff,quorum_queue_migration,3,[{file,"src/rabbit_core_ff.erl"},{line,60}]},{rabbit_feature_flags,run_migration_fun,3,[{file,"src/rabbit_feature_flags.erl"},{line,1611}]},{rabbit_feature_flags,'-verify_which_feature_flags_are_actually_enabled/0-fun-2-',3,[{file,"src/rabbit_feature_flags.erl"},{line,2278}]},{maps,fold_1,3,[{file,"maps.erl"},{line,232}]},{rabbit_feature_flags,verify_which_feature_flags_are_actually_enabled,0,[{file,"src/rabbit_feature_flags.erl"},{line,2276}]},{rabbit_feature_flags,sync_feature_flags_with_cluster,3,[{file,"src/rabbit_feature_flags.erl"},{line,2091}]},{rabbit_mnesia,ensure_feature_flags_are_in_sync,2,[{file,"src/rabbit_mnesia.erl"},{line,656}]}]
rabbitmq-1 rabbitmq 2020-06-24 21:54:08.746 [error] <0.268.0> Feature flag `virtual_host_metadata`: migration function crashed: {aborted,{no_exists,rabbit_vhost,attributes}}
rabbitmq-1 rabbitmq [{mnesia,abort,1,[{file,"mnesia.erl"},{line,355}]},{rabbit_core_ff,virtual_host_metadata_migration,3,[{file,"src/rabbit_core_ff.erl"},{line,123}]},{rabbit_feature_flags,run_migration_fun,3,[{file,"src/rabbit_feature_flags.erl"},{line,1611}]},{rabbit_feature_flags,'-verify_which_feature_flags_are_actually_enabled/0-fun-2-',3,[{file,"src/rabbit_feature_flags.erl"},{line,2278}]},{maps,fold_1,3,[{file,"maps.erl"},{line,232}]},{rabbit_feature_flags,verify_which_feature_flags_are_actually_enabled,0,[{file,"src/rabbit_feature_flags.erl"},{line,2276}]},{rabbit_feature_flags,sync_feature_flags_with_cluster,3,[{file,"src/rabbit_feature_flags.erl"},{line,2091}]},{rabbit_mnesia,ensure_feature_flags_are_in_sync,2,[{file,"src/rabbit_mnesia.erl"},{line,656}]}]

This is pretty concerning, as it seems to indicate that if I ever hit my disk alarm, the entire cluster will crash and enter an unrecoverable state that requires a complete reinstall of the chart. If that is the case, the system would be completely unusable for my needs.

Am I missing something obvious here?

@luos

luos commented Jun 25, 2020

In the current implementation, the health check endpoint returns a failure when an alarm is active, which causes the node to be restarted. Usually an alarm is not a reason to restart RabbitMQ, so maybe a different endpoint could be used?

@juan131
Contributor

juan131 commented Jun 25, 2020

I think @luos is right and it could be related to K8s restarting the pod due to failing liveness probes.

@logicbomb421 could you please run the same test with the probes disabled (--set livenessProbe.enabled=false,readinessProbe.enabled=false)? Just to confirm the probes are the ones causing this.
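
For example, applied to the existing release this would look something like the following (release name taken from earlier in the thread):

$ helm upgrade rabbitmq bitnami/rabbitmq --reuse-values \
    --set livenessProbe.enabled=false,readinessProbe.enabled=false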

@logicbomb421
Contributor Author

Hi @juan131 @luos,

This hypothesis was indeed correct -- disabling the probes causes RabbitMQ to handle the disk alarm as I would expect.

I've also noticed the rabbitmqctl command currently used in those probes is apparently deprecated. What is the preferred way to perform health checks now? Are there plans to update this chart to use whatever that may be?

Thanks again for all the attention here!

@juan131
Contributor

juan131 commented Jun 26, 2020

Hi @logicbomb421

Thanks so much for helping with this.

Maybe we can try using rabbitmqctl ping (which is less intrusive) instead of rabbitmqctl node_health_check, which is deprecated as you mentioned. Could you try re-enabling the probes and using the custom image I just created for you, juanariza131/rabbitmq:3.8-debian-10?

$ helm install my-release bitnami/rabbitmq --set image.repository=juanariza131/rabbitmq --set image.tag=3.8-debian-10

This image just contains the changes below:

diff --git a/3.8/debian-10/rootfs/opt/bitnami/scripts/rabbitmq/healthcheck.sh b/3.8/debian-10/rootfs/opt/bitnami/scripts/rabbitmq/healthcheck.sh
index 7ef705e..a05ca56 100755
--- a/3.8/debian-10/rootfs/opt/bitnami/scripts/rabbitmq/healthcheck.sh
+++ b/3.8/debian-10/rootfs/opt/bitnami/scripts/rabbitmq/healthcheck.sh
@@ -14,7 +14,7 @@ set -o pipefail
 eval "$(rabbitmq_env)"

 if [[ -f "${RABBITMQ_LIB_DIR}/.start" ]]; then
-    rabbitmqctl node_health_check
+    rabbitmqctl ping
     RESULT=$?
     if [[ $RESULT -ne 0 ]]; then
         rabbitmqctl status

@logicbomb421
Contributor Author

logicbomb421 commented Jun 27, 2020

Hi @juan131,

It appears there is still a call to the deprecated health check somewhere.

I have installed the chart with the suggested image.repository and image.tag; however, I'm still noticing the following line in the logs:

rabbitmq-0 rabbitmq 2020-06-27 00:13:03.414 [warning] <0.2177.0> rabbitmqctl node_health_check and its HTTP API counterpart are DEPRECATED. See https://www.rabbitmq.com/monitoring.html#health-checks for replacement options.

Further, when I do hit the disk alarm, all RMQ nodes still end up restarting, though somehow I'm not hitting the Mnesia table issue this time. Unfortunately, since all nodes end up restarting, any existing connections are lost, and I believe there is also a gap in message persistence to disk (it crashed with 26GiB left; when it rebooted, it showed 29GiB available, leading me to believe it crashed before fully persisting some messages).

Here are the logs for one of the nodes from the time I began the test until the crash:

rabbitmq-0 rabbitmq 2020-06-27 00:14:40.299 [info] <0.2419.0> accepting AMQP connection <0.2419.0> (10.60.3.79:35828 -> 10.60.0.3:5672)
rabbitmq-0 rabbitmq 2020-06-27 00:14:40.333 [info] <0.2419.0> Connection <0.2419.0> (10.60.3.79:35828 -> 10.60.0.3:5672) has a client-provided name: perf-test-configuration-0
rabbitmq-0 rabbitmq 2020-06-27 00:14:40.339 [info] <0.2419.0> connection <0.2419.0> (10.60.3.79:35828 -> 10.60.0.3:5672 - perf-test-configuration-0): user 'user' authenticated and granted access to vhost '/'
rabbitmq-0 rabbitmq 2020-06-27 00:14:40.353 [error] <0.2431.0> Channel error on connection <0.2419.0> (10.60.3.79:35828 -> 10.60.0.3:5672, vhost: '/', user: 'user'), channel 2:
rabbitmq-0 rabbitmq operation exchange.declare caused a channel exception not_found: no exchange 'direct' in vhost '/'
rabbitmq-0 rabbitmq 2020-06-27 00:14:40.362 [error] <0.2438.0> Channel error on connection <0.2419.0> (10.60.3.79:35828 -> 10.60.0.3:5672, vhost: '/', user: 'user'), channel 2:
rabbitmq-0 rabbitmq operation queue.declare caused a channel exception not_found: no queue 'trigger-disk-alarm-02' in vhost '/'
rabbitmq-0 rabbitmq 2020-06-27 00:14:40.400 [info] <0.2443.0> Mirrored queue 'trigger-disk-alarm-02' in vhost '/': Adding mirror on node '[email protected]': <46899.2066.0>
rabbitmq-0 rabbitmq 2020-06-27 00:14:40.429 [info] <0.2443.0> Mirrored queue 'trigger-disk-alarm-02' in vhost '/': Synchronising: 0 messages to synchronise
rabbitmq-0 rabbitmq 2020-06-27 00:14:40.429 [info] <0.2443.0> Mirrored queue 'trigger-disk-alarm-02' in vhost '/': Synchronising: batch size: 4096
rabbitmq-0 rabbitmq 2020-06-27 00:14:40.434 [info] <0.2454.0> Mirrored queue 'trigger-disk-alarm-02' in vhost '/': Synchronising: all slaves already synced
rabbitmq-0 rabbitmq 2020-06-27 00:14:40.454 [info] <0.2463.0> accepting AMQP connection <0.2463.0> (10.60.3.79:35834 -> 10.60.0.3:5672)
rabbitmq-0 rabbitmq 2020-06-27 00:14:40.456 [info] <0.2463.0> Connection <0.2463.0> (10.60.3.79:35834 -> 10.60.0.3:5672) has a client-provided name: perf-test-producer-0
rabbitmq-0 rabbitmq 2020-06-27 00:14:40.459 [info] <0.2463.0> connection <0.2463.0> (10.60.3.79:35834 -> 10.60.0.3:5672 - perf-test-producer-0): user 'user' authenticated and granted access to vhost '/'
rabbitmq-0 rabbitmq 2020-06-27 00:15:33.581 [warning] <0.440.0> disk resource limit alarm set on node '[email protected]'.
rabbitmq-0 rabbitmq
rabbitmq-0 rabbitmq **********************************************************
rabbitmq-0 rabbitmq *** Publishers will be blocked until this alarm clears ***
rabbitmq-0 rabbitmq **********************************************************
rabbitmq-0 rabbitmq 2020-06-27 00:15:33.823 [info] <0.444.0> Free disk space is insufficient. Free bytes: 26819485696. Limit: 26843545600rabbitmq-0 rabbitmq 2020-06-27 00:15:33.823 [warning] <0.440.0> disk resource limit alarm set on node '[email protected]'.
rabbitmq-0 rabbitmq
rabbitmq-0 rabbitmq **********************************************************
rabbitmq-0 rabbitmq *** Publishers will be blocked until this alarm clears ***
rabbitmq-0 rabbitmq **********************************************************
rabbitmq-0 rabbitmq 2020-06-27 00:16:52.163 [info] <0.2443.0> Mirrored queue 'trigger-disk-alarm-02' in vhost '/': Synchronising: 22420 messages to synchronise
rabbitmq-0 rabbitmq 2020-06-27 00:16:52.163 [info] <0.2443.0> Mirrored queue 'trigger-disk-alarm-02' in vhost '/': Synchronising: batch size: 4096
rabbitmq-0 rabbitmq 2020-06-27 00:16:52.163 [info] <0.2828.0> Mirrored queue 'trigger-disk-alarm-02' in vhost '/': Synchronising: all slaves already synced
rabbitmq-0 rabbitmq 2020-06-27 00:16:52.177 [info] <0.2448.0> Mirrored queue 'trigger-disk-alarm-02' in vhost '/': Master <rabbit@rabbitmq-0.rabbitmq-headless.mhill.svc.cluster.local.2.2443.0> saw deaths of mirrors <rabbit@rabbitmq-1.rabbitmq-headless.mhill.svc.cluster.local.3.2066.0>
rabbitmq-0 rabbitmq 2020-06-27 00:16:52.479 [info] <0.481.0> rabbit on node '[email protected]' down
rabbitmq-0 rabbitmq 2020-06-27 00:16:52.489 [info] <0.481.0> Keeping [email protected] listeners: the node is already back
rabbitmq-0 rabbitmq 2020-06-27 00:16:52.490 [warning] <0.440.0> disk resource limit alarm cleared for dead node '[email protected]'
rabbitmq-0 rabbitmq 2020-06-27 00:16:52.640 [info] <0.481.0> node '[email protected]' down: connection_closed
rabbitmq-0 rabbitmq 2020-06-27 00:16:57.295 [info] <0.844.0> k8s endpoint listing returned nodes not yet ready: rabbitmq-1
rabbitmq-0 rabbitmq 2020-06-27 00:16:57.295 [warning] <0.844.0> Peer discovery: node [email protected] is unreachable
rabbitmq-0 rabbitmq 2020-06-27 00:17:02.535 [info] <0.481.0> node '[email protected]' up
rabbitmq-0 rabbitmq 2020-06-27 00:17:03.551 [info] <0.60.0> SIGTERM received - shutting down
rabbitmq-0 rabbitmq 2020-06-27 00:17:03.555 [warning] <0.642.0> HTTP listener registry could not find context rabbitmq_prometheus_tls
rabbitmq-0 rabbitmq 2020-06-27 00:17:03.565 [warning] <0.642.0> HTTP listener registry could not find context rabbitmq_management_tls
rabbitmq-0 rabbitmq 2020-06-27 00:17:03.571 [info] <0.268.0> Will unregister with peer discovery backend rabbit_peer_discovery_k8s
rabbitmq-0 rabbitmq 2020-06-27 00:17:03.573 [info] <0.581.0> stopped TLS (SSL) listener on [::]:5671
rabbitmq-0 rabbitmq 2020-06-27 00:17:03.575 [info] <0.565.0> stopped TCP listener on [::]:5672
rabbitmq-0 rabbitmq 2020-06-27 00:17:03.576 [error] <0.2419.0> Error on AMQP connection <0.2419.0> (10.60.3.79:35828 -> 10.60.0.3:5672 - perf-test-configuration-0, vhost: '/', user: 'user', state: running), channel 0:
rabbitmq-0 rabbitmq  operation none caused a connection exception connection_forced: "broker forced connection closure with reason 'shutdown'"
rabbitmq-0 rabbitmq 2020-06-27 00:17:03.576 [error] <0.2463.0> Error on AMQP connection <0.2463.0> (10.60.3.79:35834 -> 10.60.0.3:5672 - perf-test-producer-0, vhost: '/', user: 'user', state: blocked), channel 0:
rabbitmq-0 rabbitmq  operation none caused a connection exception connection_forced: "broker forced connection closure with reason 'shutdown'"
rabbitmq-0 rabbitmq 2020-06-27 00:17:03.578 [info] <0.502.0> Closing all connections in vhost '/' on node '[email protected]' because the vhost is stopping
rabbitmq-0 rabbitmq 2020-06-27 00:17:03.592 [warning] <0.2443.0> Mirrored queue 'trigger-disk-alarm-02' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available
rabbitmq-0 rabbitmq 2020-06-27 00:17:03.608 [info] <0.518.0> Stopping message store for directory '/bitnami/rabbitmq/mnesia/[email protected]/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent'
rabbitmq-0 rabbitmq 2020-06-27 00:17:03.652 [info] <0.518.0> Message store for directory '/bitnami/rabbitmq/mnesia/[email protected]/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent' is stopped
rabbitmq-0 rabbitmq 2020-06-27 00:17:03.652 [info] <0.514.0> Stopping message store for directory '/bitnami/rabbitmq/mnesia/[email protected]/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_transient'
rabbitmq-0 rabbitmq 2020-06-27 00:17:03.657 [info] <0.514.0> Message store for directory '/bitnami/rabbitmq/mnesia/[email protected]/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_transient' is stopped

Thanks again for the attention on this! Please let me know if there is anything else I can do to assist.

@juan131
Contributor

juan131 commented Jun 29, 2020

Hi @logicbomb421

Thanks for helping debug this. Now it seems the issue is related to the API check: we expect its curl command to return {"status":"ok"}.

Let's try a different approach. Please create a custom-values.yaml like the one below:

readinessProbe:
  enabled: false
customReadinessProbe:
  exec:
    command:
      - sh
      - -c
      - rabbitmq-diagnostics -q check_running
  initialDelaySeconds: 10
  timeoutSeconds: 20
  periodSeconds: 30
  failureThreshold: 3
  successThreshold: 1
livenessProbe:
  enabled: false
customLivenessProbe:
  exec:
    command:
      - sh
      - -c
      - rabbitmq-diagnostics -q check_running
  initialDelaySeconds: 120
  timeoutSeconds: 20
  periodSeconds: 30
  failureThreshold: 6
  successThreshold: 1

Then, try installing the chart using:

$ helm install -f custom-values.yaml rabbitmq bitnami/rabbitmq

Let's see if that diagnostic check is less problematic. In case it still gives problems, we can relax the check and use rabbitmq-diagnostics -q status instead.
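
If needed, that relaxation would simply swap the command in the probe definitions above, e.g. for the readiness probe:

customReadinessProbe:
  exec:
    command:
      - sh
      - -c
      - rabbitmq-diagnostics -q status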

@ransoor2

@juan131 is there any update on the Mnesia tables issue?
Currently, if the deployment goes down or K8s reschedules the pods for some reason, RabbitMQ hits the Mnesia issue and the data is deleted.
We really want to use this chart, but if this issue is not solved soon we will need to look for other solutions.

@juan131
Contributor

juan131 commented Jun 30, 2020

Hi @ransoor2

Currently, if the deployment goes down or K8s reschedules the pods for some reason, RabbitMQ hits the Mnesia issue and the data is deleted.

I was not able to reproduce that issue on my side. Could you please provide the following information?

  • What specific steps should I follow to reproduce it? Please share the exact commands so it's easier for me.
  • What kind of persistent volume are you using?
  • What K8s cluster distro are you using?
  • Are new pods scheduled on the same node?

@ransoor2

@juan131 So, it seems that with a new deployment, the Mnesia tables issue resolves when the readiness probe is disabled.
We experienced the data loss when upgrading the chart from version 6.
Thank you for all your help.

@mboutet
Contributor

mboutet commented Jun 30, 2020

@juan131, I tried the custom probes and they seem to be more appropriate. From my tests, the probes properly wait for a fully booted node and do not seem to be affected by alarms. They should be the default in values.yaml, in my opinion.

@juan131
Contributor

juan131 commented Jul 1, 2020

@mboutet that's great!! Glad to hear that. Let's wait for other users to confirm that they don't find issues with these probes and, if that's the case, I agree with making them the default ones.

@juan131
Contributor

juan131 commented Jul 1, 2020

We experienced the data loss when upgrading the chart from version 6.

@ransoor2 as we mentioned in the chart README.md, upgrades from version 6 are not supported due to the number of breaking changes included in the 7.0.0 major version: https://github.com/bitnami/charts/tree/master/bitnami/rabbitmq#to-700

@x0day

x0day commented Jul 3, 2020

@juan131 I also tried the same approach as @mboutet, and the problem is solved.

@juan131
Contributor

juan131 commented Jul 3, 2020

I just created a PR to make these probes the default ones. Thanks so much for helping debug this.
