
[bitnami/redis-cluster] Cluster init script hangs when external access is enabled in an EKS cluster using AWS classic load balancers #16242

Closed
lauerdev opened this issue Apr 26, 2023 · 22 comments · Fixed by #18028
Labels: redis-cluster, solved, tech-issues

@lauerdev

Name and Version

bitnami/redis-cluster 8.4.4

What architecture are you using?

None

What steps will reproduce the bug?

I've been following this guide to enable external access to redis-cluster in an Amazon EKS cluster: https://docs.bitnami.com/tutorials/deploy-redis-cluster-tmc-helm-chart/#step-5-deploy-the-bitnami-redis-cluster-helm-chart-by-enabling-external-access
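For context, the initial install was roughly the following (a sketch; the release name and password handling are stand-ins rather than my exact command):

helm install redis-cluster-jl bitnami/redis-cluster \
  --set cluster.externalAccess.enabled=true \
  --set cluster.externalAccess.service.type=LoadBalancer \
  --set password=$REDIS_PASSWORD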

The first step completes successfully and I am able to see that the load balancers are created with ports 6379 and 16379 exposed:

redis-cluster-jl                  ClusterIP      172.20.48.94     <none>                    6379/TCP                         18h
redis-cluster-jl-0-svc            LoadBalancer   172.20.243.179   *****.elb.amazonaws.com   6379:32112/TCP,16379:30316/TCP   18h
redis-cluster-jl-1-svc            LoadBalancer   172.20.229.93    *****.elb.amazonaws.com   6379:32357/TCP,16379:31750/TCP   18h
redis-cluster-jl-2-svc            LoadBalancer   172.20.101.213   *****.elb.amazonaws.com   6379:32219/TCP,16379:30617/TCP   18h
redis-cluster-jl-3-svc            LoadBalancer   172.20.19.236    *****.elb.amazonaws.com   6379:32177/TCP,16379:32397/TCP   18h
redis-cluster-jl-4-svc            LoadBalancer   172.20.90.157    *****.elb.amazonaws.com   6379:31838/TCP,16379:30064/TCP   18h
redis-cluster-jl-5-svc            LoadBalancer   172.20.224.185   *****.elb.amazonaws.com   6379:30587/TCP,16379:30976/TCP   18h
redis-cluster-jl-headless         ClusterIP      None             <none>                    6379/TCP,16379/TCP               18h

After adding the load balancer addresses to the cluster.externalAccess.service.loadBalancerIP array and performing the helm upgrade, the StatefulSet is created and, after some time, all 6 nodes appear to come up and report healthy:

(screenshot showing all 6 redis-cluster pods running and reporting healthy)

However, on closer inspection of the logs on pod 0, it appears that the cluster init script is hanging on the following message:

>>> Sending CLUSTER MEET messages to join the cluster
Waiting for the cluster to join

In addition, cluster info reports that the cluster status is fail and that not all 6 nodes have joined successfully:

cluster_state:fail
cluster_slots_assigned:10923
cluster_slots_ok:10923
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:2
cluster_size:2
cluster_current_epoch:2
cluster_my_epoch:2
cluster_stats_messages_ping_sent:9005
cluster_stats_messages_meet_sent:1
cluster_stats_messages_sent:9006
cluster_stats_messages_pong_received:2336
cluster_stats_messages_received:2336
total_cluster_links_buffer_limit_exceeded:0

I have engaged aws cloud support to rule out connectivity issues, and we were able to successfully telnet into all of the loadbalancers on port 6379 as well as 16379 from a test pod within the k8s cluster.
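The checks from within the cluster were along these lines (a sketch; the load balancer hostname is a placeholder):

kubectl run net-test --rm -it --image=busybox --restart=Never -- telnet <load-balancer-0-hostname> 6379
kubectl run net-test --rm -it --image=busybox --restart=Never -- telnet <load-balancer-0-hostname> 16379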

Are you using any custom parameters or values?

cluster:
  externalAccess:
    enabled: true
    service:
      type: LoadBalancer
    loadBalancerIP:
    - load-balancer-0-ip
    - load-balancer-1-ip
    - load-balancer-2-ip
    - load-balancer-3-ip
    - load-balancer-4-ip
    - load-balancer-5-ip
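With those values in a file, the upgrade itself was essentially the following (a sketch; the values file name is a placeholder and the password is whatever was set at install time):

helm upgrade redis-cluster-jl bitnami/redis-cluster -f values-external.yaml --set password=$REDIS_PASSWORD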

What is the expected behavior?

The cluster boots up successfully and reports that the cluster status is ok. All 6 nodes are joined to the cluster and external clients are able to connect, auth, read and write keys.
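For example, from outside the cluster something like the following should succeed (a sketch; the hostname is a placeholder):

redis-cli -c -h <load-balancer-0-hostname> -a "$REDIS_PASSWORD" set foo bar
redis-cli -c -h <load-balancer-0-hostname> -a "$REDIS_PASSWORD" get foo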

What do you see instead?

Cluster init script hangs on the message:

>>> Sending CLUSTER MEET messages to join the cluster
Waiting for the cluster to join

And cluster info reports that the cluster status is fail. Not all 6 nodes have joined successfully:

cluster_state:fail
cluster_slots_assigned:10923
cluster_slots_ok:10923
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:2
cluster_size:2
cluster_current_epoch:2
cluster_my_epoch:2
cluster_stats_messages_ping_sent:9005
cluster_stats_messages_meet_sent:1
cluster_stats_messages_sent:9006
cluster_stats_messages_pong_received:2336
cluster_stats_messages_received:2336
total_cluster_links_buffer_limit_exceeded:0

Additional information

EKS cluster k8s version - 1.24.10
AWS load balancer controller version - 2.4.5

@lauerdev lauerdev added the tech-issues label Apr 26, 2023
@github-actions github-actions bot added the triage label Apr 26, 2023
@jmturwy

jmturwy commented Apr 26, 2023

Having the same exact issue

@github-actions github-actions bot added the in-progress label and removed the triage label May 3, 2023
@bitnami-bot bitnami-bot assigned jotamartos and unassigned carrodher May 3, 2023
@jotamartos
Contributor

Sorry for the delay here, we are going to work on reproducing this issue to get more information.

@jotamartos
Contributor

jotamartos commented May 8, 2023

I just tried to reproduce the issue in a local environment using minikube and everything worked as expected. I waited for the IPs to be ready

$ k get svc
NAME                          TYPE           CLUSTER-IP       EXTERNAL-IP      PORT(S)                          AGE
jota-redis-cluster            ClusterIP      10.97.77.222     <none>           6379/TCP                         20m
jota-redis-cluster-0-svc      LoadBalancer   10.105.127.214   10.105.127.214   6379:30710/TCP,16379:30059/TCP   20m
jota-redis-cluster-1-svc      LoadBalancer   10.102.1.60      10.102.1.60      6379:30810/TCP,16379:31186/TCP   20m
jota-redis-cluster-2-svc      LoadBalancer   10.109.203.53    10.109.203.53    6379:30581/TCP,16379:31015/TCP   20m
jota-redis-cluster-3-svc      LoadBalancer   10.106.104.164   10.106.104.164   6379:32320/TCP,16379:31245/TCP   20m
jota-redis-cluster-4-svc      LoadBalancer   10.106.10.60     10.106.10.60     6379:30768/TCP,16379:30978/TCP   20m
jota-redis-cluster-5-svc      LoadBalancer   10.103.211.126   10.103.211.126   6379:31037/TCP,16379:31152/TCP   20m
jota-redis-cluster-headless   ClusterIP      None             <none>           6379/TCP,16379/TCP               20m
kubernetes                    ClusterIP      10.96.0.1        <none>           443/TCP                          83d

And upgraded the deployment with those IPs

$ helm upgrade --namespace default jota-redis-cluster --set "cluster.externalAccess.enabled=true,cluster.externalAccess.service.type=LoadBalancer,cluster.externalAccess.service.loadBalancerIP[0]=10.105.127.214,cluster.externalAccess.service.loadBalancerIP[1]=10.102.1.60,cluster.externalAccess.service.loadBalancerIP[2]=10.109.203.53,cluster.externalAccess.service.loadBalancerIP[3]=10.106.104.164,cluster.externalAccess.service.loadBalancerIP[4]=10.106.10.60,cluster.externalAccess.service.loadBalancerIP[5]=10.103.211.126" --set password=$REDIS_PASSWORD bitnami/redis-cluster

After that, pods became available

$ k get pods
NAME                   READY   STATUS    RESTARTS   AGE
jota-redis-cluster-0   1/1     Running   0          18m
jota-redis-cluster-1   1/1     Running   0          18m
jota-redis-cluster-2   1/1     Running   0          18m
jota-redis-cluster-3   1/1     Running   0          18m
jota-redis-cluster-4   1/1     Running   0          18m
jota-redis-cluster-5   1/1     Running   0          18m

and confirmed in the logs that the cluster was created correctly:

>>> Check slots coverage...
[OK] All 16384 slots covered.
Cluster correctly created
redis-server "${ARGS[@]}"
45:M 08 May 2023 08:57:23.023 * Background saving terminated with success
45:M 08 May 2023 08:57:23.023 * Synchronization with replica 10.244.1.42:6379 succeeded
45:M 08 May 2023 08:57:25.833 # Cluster state changed: ok

and that the cluster info output is correct as well:

127.0.0.1:6379> cluster info
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size:3
cluster_current_epoch:6
cluster_my_epoch:1
cluster_stats_messages_ping_sent:338
cluster_stats_messages_pong_sent:342
cluster_stats_messages_sent:680
cluster_stats_messages_ping_received:337
cluster_stats_messages_pong_received:338
cluster_stats_messages_meet_received:5
cluster_stats_messages_received:680
total_cluster_links_buffer_limit_exceeded:0
127.0.0.1:6379>

Environment:

$ helm version
version.BuildInfo{Version:"v3.9.2", GitCommit:"1addefbfe665c350f4daf868a9adc5600cc064fd", GitTreeState:"clean", GoVersion:"go1.18.4"}

$ kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.1", GitCommit:"8f94681cd294aa8cfd3407b8191f6c70214973a4", GitTreeState:"clean", BuildDate:"2023-01-18T15:58:16Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"darwin/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.1", GitCommit:"8f94681cd294aa8cfd3407b8191f6c70214973a4", GitTreeState:"clean", BuildDate:"2023-01-18T15:51:25Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}

@ntarora-audiohook

I am also having the same issue. When I test the connection to all pods from redis-0, it works, even when using the IPs from the logs.

@jotamartos
Contributor

jotamartos commented May 15, 2023

Hi @ntarora-audiohook,

As I mentioned above, I couldn't reproduce the issue in my environment. Could you please try deploying the solution in a different environment, make sure you are deploying the latest version of the chart, and confirm your environment uses a stable version of all the components in the cluster?

@lauerdev
Author

Hey - I just want to make sure everybody realizes that the issue is happening in an EKS cluster. The behavior is very specific to Amazon/AWS and its handling of load balancers; comparing against a minikube cluster is not an apples-to-apples comparison.

@jotamartos
Contributor

Correct, it seems to be a specific problem with the Amazon/AWS cluster networking configuration. We will try to reproduce the issue in the same environment to get more information, but in the meantime, could you try to debug the issue on your side?

You can try to connect to the different nodes from one of the pods and confirm what the issue may be. You can use the redis-cli command line utility and try the following:

  • Try connecting from one node to another using the pod's IP,
  • the private IP of the service,
  • and the domain name of the service. It's really important to confirm the domain resolves correctly.

If the domain resolves properly but the nodes simply can't reach each other, I think we should contact AWS to learn more about what's happening with the cluster. A sketch of those checks is below.
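For example, from one of the pods (the addresses are placeholders, and this assumes redis-cli is available in the container and a password is set):

redis-cli -h <other-pod-ip> -a "$REDIS_PASSWORD" ping
redis-cli -h <service-cluster-ip> -a "$REDIS_PASSWORD" ping
redis-cli -h <service-name>.<namespace>.svc.cluster.local -a "$REDIS_PASSWORD" ping
nslookup <service-name>.<namespace>.svc.cluster.local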

@lauerdev
Author

AWS support was engaged prior to submitting this issue. This is a summary of their findings:

Thank you for reaching out to AWS Premium Support. Below you will find a quick summary of the conversation on our call today.

You have reached out to us inquiring about an issue you were seeing with Redis pods not being able to communicate with other Redis servers via an external kubernetes service IP. You were following the documentation here [1] to deploy a scalable Redis cluster with Bitnami and Helm.

We were able to inspect the EKS cluster setup and understand that there were no networking issues as such between the pods, however the Redis was showing the following messages:

Sending CLUSTER MEET messages to join the cluster, Waiting for the cluster to join

We were then able to further discuss and get to a conclusion that being able to check Redis activity and understand what it takes to register a node could help in this case. Since the issue was more Redis related and not very much related to EKS, you were going to consider reaching out to the database support team to get their help with troubleshooting this issue further.

Please feel free to write back to us if you have any other questions from an EKS standpoint and I'll be glad to assist further.

Have a nice day!

As part of our troubleshooting, we were able to:

  • resolve the IP of the external load balancer address using a tool such as nslookup
  • telnet from one redis node to another over 6379 using both the internal cluster IP as well as the external load balancer address
  • telnet from a machine outside of the cluster to any of the redis nodes individually over 6379 using the external load balancer address

Therefore, I don't believe it to be an issue with DNS resolution or network connectivity.

@jotamartos
Contributor

If domains and IPs are accessible and they resolve correctly, we need to confirm you can connect to the different nodes using the redis-cli command line utility. Can you try to access the different nodes using that tool and confirm they accept the connection? You can also try sending a CLUSTER MEET message to manually join the nodes and see whether you get an error there (try using both the domain and the IP).

https://redis.io/commands/cluster-meet/

@lauerdev
Author

lauerdev commented Jun 2, 2023

I was able to use redis-cli to connect from one node to the other using the external load balancer hostname. However, it appears that it may not be possible to issue CLUSTER MEET commands using a hostname:

I have no name!@redis-cluster-jl-0:/$ redis-cli -h <node-5-hostname>

<node-5-hostname>:6379> get foo
(error) CLUSTERDOWN Hash slot not served

<node-5-hostname>:6379> cluster meet <node-0-hostname> 6379
(error) ERR Invalid node address specified:<node-0-hostname>:6379

<node-5-hostname>:6379> cluster meet <node-0-ip-address> 6379
OK

After some digging, it appears that there are a few existing issues relating to problems with using hostnames.

Are you sure this functionality works for you in EKS clusters? I'm curious what steps were taken in order to get this to succeed? I am able to reproduce this consistently using the steps in my initial post.
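For what it's worth, the only way I could get two nodes to meet manually was to resolve the load balancer hostname to an IP first, along these lines (run from inside a pod; the hostnames are placeholders):

ip=$(getent hosts <node-0-hostname> | awk '{ print $1 }')
redis-cli -h <node-5-hostname> cluster meet "$ip" 6379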

@bitnami-bot bitnami-bot assigned corico44 and unassigned jotamartos Jun 27, 2023
@github-actions github-actions bot added the triage label and removed the in-progress label Jun 27, 2023
@github-actions github-actions bot added the bitnami label and removed the triage label Jun 27, 2023
@yo-ga
Contributor

yo-ga commented Jun 27, 2023

Here is our new workaround:

  1. Set cluster.externalAccess.enabled to true along with the other required values, then deploy it.
  2. The first installation only creates 6 LoadBalancer services, one for each Redis® node. Once you have the NLB hostname of each service, perform an upgrade passing those hostnames to the cluster.externalAccess.service.loadBalancerIP array.
  3. Because only an IP can be announced for the NLB hostname (CLUSTER MEET rejects hostnames), perform an upgrade passing the following value to redis.args, which resolves each hostname to its current IP at startup:
         - |
              # Backwards compatibility change
              if ! [[ -f /opt/bitnami/redis/etc/redis.conf ]]; then
                  cp /opt/bitnami/redis/etc/redis-default.conf /opt/bitnami/redis/etc/redis.conf
              fi
              # Derive this pod's ordinal from its name (e.g. redis-cluster-0 -> 0)
              pod_index=($(echo "$POD_NAME" | tr "-" "\n"))
              pod_index="${pod_index[-1]}"
              # Render the loadBalancerIP list (the NLB hostnames) into a bash array
              hosts=($(echo "{{ .Values.cluster.externalAccess.service.loadBalancerIP }}" | cut -d [ -f2 | cut -d ] -f 1))
              # Resolve this pod's NLB hostname to an IP, since CLUSTER MEET only accepts IP addresses
              ip=$(getent hosts ${hosts[$pod_index]} | awk '{ print $1 }')
              export REDIS_CLUSTER_ANNOUNCE_IP="${ip}"
              export REDIS_CLUSTER_ANNOUNCE_HOSTNAME="${hosts[$pod_index]}"
              export REDIS_CLUSTER_PREFERRED_ENDPOINT_TYPE=hostname
              export REDIS_NODES="${hosts[@]}"
              {{- if .Values.cluster.init }}
              rm -rf /bitnami/redis/data
              if [[ "$pod_index" == "0" ]]; then
                export REDIS_CLUSTER_CREATOR="yes"
                export REDIS_CLUSTER_REPLICAS="{{ .Values.cluster.replicas }}"
              fi
              {{- end }}
              /opt/bitnami/scripts/redis-cluster/entrypoint.sh /opt/bitnami/scripts/redis-cluster/run.sh
  4. Confirm that the cluster is healthy, then perform an upgrade passing false to the cluster.init value (see the sketch below).
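For that final upgrade, the command is roughly the following (a sketch; the release and values file names are placeholders):

helm upgrade <release-name> bitnami/redis-cluster -f <values-file>.yaml --set cluster.init=false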

@github-actions github-actions bot removed the stale label Jun 28, 2023
@corico44
Contributor

Thank you @yo-ga. Did you finally open an issue in the official Redis repo so that they can modify it on their side and the changes can be picked up in new releases?

@lauerdev
Author

@corico44 Is this not a bitnami chart issue? I could be mistaken, but wouldn't the change that @yo-ga is suggesting be made inside the redis-cluster chart here:

https://github.com/bitnami/charts/blob/main/bitnami/redis-cluster/templates/redis-statefulset.yaml#L110

@corico44
Contributor

Thank you both. Would you like to open a PR to apply those changes, @yo-ga? We will be happy to review it!

@yo-ga
Contributor

yo-ga commented Jul 13, 2023

Hi @corico44, the above is the workaround. I would also need to confirm that all the actions are covered, including adding nodes and failover, so it would take some time if you can wait for the patch.

@github-actions

github-actions bot commented Aug 4, 2023

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

@github-actions github-actions bot added the stale label Aug 4, 2023
@corico44 corico44 removed the stale label Aug 7, 2023
@carrodher
Member

Thank you for submitting the associated Pull Request. Our team will review and provide feedback. Once the PR is merged, the issue will automatically close.

Your contribution is greatly appreciated!

@morttrager

Hi,

I am facing the same issue.
