
[bitnami/redis-cluster] Cluster init script hangs when external access is enabled in an EKS cluster using AWS classic load balancers #16242

Closed
lauerdev opened this issue Apr 26, 2023 · 22 comments · Fixed by #18028
Labels: redis-cluster, solved, tech-issues

@lauerdev

Name and Version

bitnami/redis-cluster 8.4.4

What architecture are you using?

None

What steps will reproduce the bug?

I've been following this guide to enable external access to redis-cluster in an Amazon EKS cluster: https://docs.bitnami.com/tutorials/deploy-redis-cluster-tmc-helm-chart/#step-5-deploy-the-bitnami-redis-cluster-helm-chart-by-enabling-external-access
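For context, the initial install was roughly the following (a sketch; the release name and password handling are stand-ins rather than my exact command):

helm install redis-cluster-jl bitnami/redis-cluster \
  --set cluster.externalAccess.enabled=true \
  --set cluster.externalAccess.service.type=LoadBalancer \
  --set password=$REDIS_PASSWORD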

The first step completes successfully and I am able to see that the load balancers are created with ports 6379 and 16379 exposed:

redis-cluster-jl                  ClusterIP      172.20.48.94     <none>                    6379/TCP                         18h
redis-cluster-jl-0-svc            LoadBalancer   172.20.243.179   *****.elb.amazonaws.com   6379:32112/TCP,16379:30316/TCP   18h
redis-cluster-jl-1-svc            LoadBalancer   172.20.229.93    *****.elb.amazonaws.com   6379:32357/TCP,16379:31750/TCP   18h
redis-cluster-jl-2-svc            LoadBalancer   172.20.101.213   *****.elb.amazonaws.com   6379:32219/TCP,16379:30617/TCP   18h
redis-cluster-jl-3-svc            LoadBalancer   172.20.19.236    *****.elb.amazonaws.com   6379:32177/TCP,16379:32397/TCP   18h
redis-cluster-jl-4-svc            LoadBalancer   172.20.90.157    *****.elb.amazonaws.com   6379:31838/TCP,16379:30064/TCP   18h
redis-cluster-jl-5-svc            LoadBalancer   172.20.224.185   *****.elb.amazonaws.com   6379:30587/TCP,16379:30976/TCP   18h
redis-cluster-jl-headless         ClusterIP      None             <none>                    6379/TCP,16379/TCP               18h

After adding the load balancer addresses to the cluster.externalAccess.service.loadBalancerIP array and performing the helm upgrade, the StatefulSet is created and, after some time, all 6 nodes appear to come up and report healthy:

(screenshot showing all 6 redis-cluster pods running and reporting healthy)

However, on closer inspection of the logs on pod 0, it appears that the cluster init script is hanging on the following message:

>>> Sending CLUSTER MEET messages to join the cluster
Waiting for the cluster to join

In addition, cluster info reports that the cluster status is fail and that not all 6 nodes have joined successfully:

cluster_state:fail
cluster_slots_assigned:10923
cluster_slots_ok:10923
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:2
cluster_size:2
cluster_current_epoch:2
cluster_my_epoch:2
cluster_stats_messages_ping_sent:9005
cluster_stats_messages_meet_sent:1
cluster_stats_messages_sent:9006
cluster_stats_messages_pong_received:2336
cluster_stats_messages_received:2336
total_cluster_links_buffer_limit_exceeded:0

I have engaged aws cloud support to rule out connectivity issues, and we were able to successfully telnet into all of the loadbalancers on port 6379 as well as 16379 from a test pod within the k8s cluster.
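The checks from within the cluster were along these lines (a sketch; the load balancer hostname is a placeholder):

kubectl run net-test --rm -it --image=busybox --restart=Never -- telnet <load-balancer-0-hostname> 6379
kubectl run net-test --rm -it --image=busybox --restart=Never -- telnet <load-balancer-0-hostname> 16379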

Are you using any custom parameters or values?

cluster:
  externalAccess:
    enabled: true
    service:
      type: LoadBalancer
    loadBalancerIP:
    - load-balancer-0-ip
    - load-balancer-1-ip
    - load-balancer-2-ip
    - load-balancer-3-ip
    - load-balancer-4-ip
    - load-balancer-5-ip
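With those values in a file, the upgrade itself was essentially the following (a sketch; the values file name is a placeholder and the password is whatever was set at install time):

helm upgrade redis-cluster-jl bitnami/redis-cluster -f values-external.yaml --set password=$REDIS_PASSWORD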

What is the expected behavior?

The cluster boots up successfully and reports that the cluster status is ok. All 6 nodes are joined to the cluster and external clients are able to connect, auth, read and write keys.
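For example, from outside the cluster something like the following should succeed (a sketch; the hostname is a placeholder):

redis-cli -c -h <load-balancer-0-hostname> -a "$REDIS_PASSWORD" set foo bar
redis-cli -c -h <load-balancer-0-hostname> -a "$REDIS_PASSWORD" get foo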

What do you see instead?

Cluster init script hangs on the message:

>>> Sending CLUSTER MEET messages to join the cluster
Waiting for the cluster to join

And cluster info reports that the cluster status is fail. Not all 6 nodes have joined successfully:

cluster_state:fail
cluster_slots_assigned:10923
cluster_slots_ok:10923
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:2
cluster_size:2
cluster_current_epoch:2
cluster_my_epoch:2
cluster_stats_messages_ping_sent:9005
cluster_stats_messages_meet_sent:1
cluster_stats_messages_sent:9006
cluster_stats_messages_pong_received:2336
cluster_stats_messages_received:2336
total_cluster_links_buffer_limit_exceeded:0

Additional information

EKS cluster k8s version - 1.24.10
AWS load balancer controller version - 2.4.5

@lauerdev lauerdev added the tech-issues label Apr 26, 2023
@github-actions github-actions bot added the triage label Apr 26, 2023
@jmturwy

jmturwy commented Apr 26, 2023

Having the same exact issue

@github-actions github-actions bot added the in-progress label and removed the triage label May 3, 2023
@bitnami-bot bitnami-bot assigned jotamartos and unassigned carrodher May 3, 2023
@jotamartos
Contributor

Sorry for the delay here, we are going to work on reproducing this issue to get more information.

@jotamartos
Contributor

jotamartos commented May 8, 2023

I just tried to reproduce the issue in a local environment using minikube and everything worked as expected. I waited for the IPs to be ready

$ k get svc
NAME                          TYPE           CLUSTER-IP       EXTERNAL-IP      PORT(S)                          AGE
jota-redis-cluster            ClusterIP      10.97.77.222     <none>           6379/TCP                         20m
jota-redis-cluster-0-svc      LoadBalancer   10.105.127.214   10.105.127.214   6379:30710/TCP,16379:30059/TCP   20m
jota-redis-cluster-1-svc      LoadBalancer   10.102.1.60      10.102.1.60      6379:30810/TCP,16379:31186/TCP   20m
jota-redis-cluster-2-svc      LoadBalancer   10.109.203.53    10.109.203.53    6379:30581/TCP,16379:31015/TCP   20m
jota-redis-cluster-3-svc      LoadBalancer   10.106.104.164   10.106.104.164   6379:32320/TCP,16379:31245/TCP   20m
jota-redis-cluster-4-svc      LoadBalancer   10.106.10.60     10.106.10.60     6379:30768/TCP,16379:30978/TCP   20m
jota-redis-cluster-5-svc      LoadBalancer   10.103.211.126   10.103.211.126   6379:31037/TCP,16379:31152/TCP   20m
jota-redis-cluster-headless   ClusterIP      None             <none>           6379/TCP,16379/TCP               20m
kubernetes                    ClusterIP      10.96.0.1        <none>           443/TCP                          83d

And upgraded the deployment with those IPs

$ helm upgrade --namespace default jota-redis-cluster --set "cluster.externalAccess.enabled=true,cluster.externalAccess.service.type=LoadBalancer,cluster.externalAccess.service.loadBalancerIP[0]=10.105.127.214,cluster.externalAccess.service.loadBalancerIP[1]=10.102.1.60,cluster.externalAccess.service.loadBalancerIP[2]=10.109.203.53,cluster.externalAccess.service.loadBalancerIP[3]=10.106.104.164,cluster.externalAccess.service.loadBalancerIP[4]=10.106.10.60,cluster.externalAccess.service.loadBalancerIP[5]=10.103.211.126" --set password=$REDIS_PASSWORD bitnami/redis-cluster

After that, pods became available

$ k get pods
NAME                   READY   STATUS    RESTARTS   AGE
jota-redis-cluster-0   1/1     Running   0          18m
jota-redis-cluster-1   1/1     Running   0          18m
jota-redis-cluster-2   1/1     Running   0          18m
jota-redis-cluster-3   1/1     Running   0          18m
jota-redis-cluster-4   1/1     Running   0          18m
jota-redis-cluster-5   1/1     Running   0          18m

and confirmed in the logs that the cluster was created correctly:

>>> Check slots coverage...
[OK] All 16384 slots covered.
Cluster correctly created
redis-server "${ARGS[@]}"
45:M 08 May 2023 08:57:23.023 * Background saving terminated with success
45:M 08 May 2023 08:57:23.023 * Synchronization with replica 10.244.1.42:6379 succeeded
45:M 08 May 2023 08:57:25.833 # Cluster state changed: ok

and that the cluster info output is correct as well:

127.0.0.1:6379> cluster info
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size:3
cluster_current_epoch:6
cluster_my_epoch:1
cluster_stats_messages_ping_sent:338
cluster_stats_messages_pong_sent:342
cluster_stats_messages_sent:680
cluster_stats_messages_ping_received:337
cluster_stats_messages_pong_received:338
cluster_stats_messages_meet_received:5
cluster_stats_messages_received:680
total_cluster_links_buffer_limit_exceeded:0
127.0.0.1:6379>

Environment:

$ helm version
version.BuildInfo{Version:"v3.9.2", GitCommit:"1addefbfe665c350f4daf868a9adc5600cc064fd", GitTreeState:"clean", GoVersion:"go1.18.4"}

$ kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.1", GitCommit:"8f94681cd294aa8cfd3407b8191f6c70214973a4", GitTreeState:"clean", BuildDate:"2023-01-18T15:58:16Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"darwin/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.1", GitCommit:"8f94681cd294aa8cfd3407b8191f6c70214973a4", GitTreeState:"clean", BuildDate:"2023-01-18T15:51:25Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}

@ntarora-audiohook

I am also having the same issue. When I test the connection to all pods from redis-0, it works, even when using the IPs from the logs.

@jotamartos
Contributor

jotamartos commented May 15, 2023

Hi @ntarora-audiohook,

As I mentioned above, I couldn't reproduce the issue in my environment. Could you please try deploying the solution in a different environment, make sure you are deploying the latest version of the chart, and confirm your environment uses a stable version of all the components in the cluster?

@lauerdev
Author

Hey - I just want to make sure everybody realizes that the issue is happening in an EKS cluster. The behavior is very specific to Amazon/AWS and its handling of load balancers; comparing against a minikube cluster is not an apples-to-apples comparison.

@jotamartos
Contributor

Correct, it seems to be a specific problem with the Amazon/AWS cluster networking configuration. We will try to reproduce the issue in the same environment to get more information, but in the meantime, could you try to debug the issue on your side?

You can try to connect to the different nodes from one of the pods and confirm what the issue may be. You can use the redis-cli command line utility and try the following:

  • Try connecting from one node to another using the pod's IP,
  • the private IP of the service,
  • and the domain name of the service. It's really important to confirm the domain resolves correctly.

If the domain resolves properly but the nodes simply can't reach each other, I think we should contact AWS to learn more about what's happening with the cluster. A sketch of those checks is below.
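For example, from one of the pods (the addresses are placeholders, and this assumes redis-cli is available in the container and a password is set):

redis-cli -h <other-pod-ip> -a "$REDIS_PASSWORD" ping
redis-cli -h <service-cluster-ip> -a "$REDIS_PASSWORD" ping
redis-cli -h <service-name>.<namespace>.svc.cluster.local -a "$REDIS_PASSWORD" ping
nslookup <service-name>.<namespace>.svc.cluster.local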

@lauerdev
Author

AWS support was engaged prior to submitting this issue. This is a summary of their findings:

Thank you for reaching out to AWS Premium Support. Below you will find a quick summary of the conversation on our call today.

You have reached out to us inquiring about an issue you were seeing with Redis pods not being able to communicate with other Redis servers via an external kubernetes service IP. You were following the documentation here [1] to deploy a scalable Redis cluster with Bitnami and Helm.

We were able to inspect the EKS cluster setup and understand that there were no networking issues as such between the pods, however the Redis was showing the following messages:

Sending CLUSTER MEET messages to join the cluster, Waiting for the cluster to join

We were then able to further discuss and get to a conclusion that being able to check Redis activity and understand what it takes to register a node could help in this case. Since the issue was more Redis related and not very much related to EKS, you were going to consider reaching out to the database support team to get their help with troubleshooting this issue further.

Please feel free to write back to us if you have any other questions from an EKS standpoint and I'll be glad to assist further.

Have a nice day!

As part of our troubleshooting, we were able to:

  • resolve the IP of the external load balancer address using a tool such as nslookup
  • telnet from one redis node to another over 6379 using both the internal cluster IP as well as the external load balancer address
  • telnet from a machine outside of the cluster to any of the redis nodes individually over 6379 using the external load balancer address

Therefore, I don't believe it to be an issue with DNS resolution or network connectivity.

@jotamartos
Contributor

If domains and IPs are accessible and they resolve correctly, we need to confirm you can connect to the different nodes using the redis-cli command line utility. Can you try to access the different nodes using that tool and confirm they accept the connection? You can also try sending a CLUSTER MEET message to manually join the nodes and see whether you get an error there (try using both the domain and the IP).

https://redis.io/commands/cluster-meet/

@lauerdev
Author

lauerdev commented Jun 2, 2023

I was able to use redis-cli to connect from one node to the other using the external load balancer hostname. However, it appears that it may not be possible to issue CLUSTER MEET commands using a hostname:

I have no name!@redis-cluster-jl-0:/$ redis-cli -h <node-5-hostname>

<node-5-hostname>:6379> get foo
(error) CLUSTERDOWN Hash slot not served

<node-5-hostname>:6379> cluster meet <node-0-hostname> 6379
(error) ERR Invalid node address specified:<node-0-hostname>:6379

<node-5-hostname>:6379> cluster meet <node-0-ip-address> 6379
OK

After some digging, it appears that there are a few existing issues relating to problems with using hostnames.

Are you sure this functionality works for you in EKS clusters? I'm curious what steps were taken in order to get this to succeed? I am able to reproduce this consistently using the steps in my initial post.
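For what it's worth, the only way I could get two nodes to meet manually was to resolve the load balancer hostname to an IP first, along these lines (run from inside a pod; the hostnames are placeholders):

ip=$(getent hosts <node-0-hostname> | awk '{ print $1 }')
redis-cli -h <node-5-hostname> cluster meet "$ip" 6379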

@bitnami-bot bitnami-bot assigned corico44 and unassigned jotamartos Jun 27, 2023
@github-actions github-actions bot added the triage label and removed the in-progress label Jun 27, 2023
@github-actions github-actions bot added the bitnami label and removed the triage label Jun 27, 2023
@yo-ga
Contributor

yo-ga commented Jun 27, 2023

Here is our new workaround:

  1. Set cluster.externalAccess.enabled to true along with the other required values, then deploy it.
  2. The first installation only creates 6 LoadBalancer services, one for each Redis® node. Once you have the NLB hostname of each service, perform an upgrade passing those hostnames to the cluster.externalAccess.service.loadBalancerIP array.
  3. Because only an IP can be announced for the NLB hostname (CLUSTER MEET rejects hostnames), perform an upgrade passing the following value to redis.args, which resolves each hostname to its current IP at startup:
         - |
              # Backwards compatibility change
              if ! [[ -f /opt/bitnami/redis/etc/redis.conf ]]; then
                  cp /opt/bitnami/redis/etc/redis-default.conf /opt/bitnami/redis/etc/redis.conf
              fi
              # Derive this pod's ordinal from its name (e.g. redis-cluster-0 -> 0)
              pod_index=($(echo "$POD_NAME" | tr "-" "\n"))
              pod_index="${pod_index[-1]}"
              # Render the loadBalancerIP list (the NLB hostnames) into a bash array
              hosts=($(echo "{{ .Values.cluster.externalAccess.service.loadBalancerIP }}" | cut -d [ -f2 | cut -d ] -f 1))
              # Resolve this pod's NLB hostname to an IP, since CLUSTER MEET only accepts IP addresses
              ip=$(getent hosts ${hosts[$pod_index]} | awk '{ print $1 }')
              export REDIS_CLUSTER_ANNOUNCE_IP="${ip}"
              export REDIS_CLUSTER_ANNOUNCE_HOSTNAME="${hosts[$pod_index]}"
              export REDIS_CLUSTER_PREFERRED_ENDPOINT_TYPE=hostname
              export REDIS_NODES="${hosts[@]}"
              {{- if .Values.cluster.init }}
              rm -rf /bitnami/redis/data
              if [[ "$pod_index" == "0" ]]; then
                export REDIS_CLUSTER_CREATOR="yes"
                export REDIS_CLUSTER_REPLICAS="{{ .Values.cluster.replicas }}"
              fi
              {{- end }}
              /opt/bitnami/scripts/redis-cluster/entrypoint.sh /opt/bitnami/scripts/redis-cluster/run.sh
  4. Confirm that the cluster is healthy, then perform an upgrade passing false to the cluster.init value (see the sketch below).
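For that final upgrade, the command is roughly the following (a sketch; the release and values file names are placeholders):

helm upgrade <release-name> bitnami/redis-cluster -f <values-file>.yaml --set cluster.init=false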

@github-actions github-actions bot removed the stale label Jun 28, 2023
@corico44
Contributor

Thank you @yo-ga. Did you finally open an issue in the official Redis repo so that they can modify it on their side and the changes can be picked up in new releases?

@lauerdev
Author

@corico44 Is this not a bitnami chart issue? I could be mistaken, but wouldn't the change that @yo-ga is suggesting be made inside the redis-cluster chart here:

https://github.com/bitnami/charts/blob/main/bitnami/redis-cluster/templates/redis-statefulset.yaml#L110

@corico44
Contributor

Thank you both. Would you like to open a PR to apply those changes, @yo-ga? We will be happy to review it!

@yo-ga
Contributor

yo-ga commented Jul 13, 2023

Hi @corico44, the above is the workaround. I would also need to confirm that all the actions are covered, including adding nodes and failover, so it would take some time if you can wait for the patch.

@github-actions

github-actions bot commented Aug 4, 2023

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

@github-actions github-actions bot added the stale label Aug 4, 2023
@corico44 corico44 removed the stale label Aug 7, 2023
@carrodher
Member

Thank you for submitting the associated Pull Request. Our team will review and provide feedback. Once the PR is merged, the issue will automatically close.

Your contribution is greatly appreciated!

@morttrager

Hi,

I am facing the same issue.
