[bitnami/redis-cluster] update job fails to get new node IP address #3876

Closed
tpolekhin opened this issue Oct 2, 2020 · 15 comments
Labels
stale 15 days without activity

Comments

@tpolekhin

Which chart:
redis-cluster-3.2.4.tgz

Describe the bug
The wait_for_dns_lookup bash function from https://github.com/bitnami/bitnami-docker-redis-cluster/blob/master/6.0/debian-10/prebuildfs/opt/bitnami/scripts/libnet.sh#L33 fails to return the correct IP address of the new pod.
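
For context, here is a rough sketch of the two helpers involved, reconstructed from the trace further down (this is not the verbatim libnet.sh source, so details may differ):

dns_lookup() {
    local host="${1:?host is missing}"
    # Prints the resolved address, or nothing at all if the lookup fails
    getent ahosts "$host" | awk '/STREAM/ {print $1 }'
}

wait_for_dns_lookup() {
    local host="${1:?host is missing}"
    local retries="${2:-5}"
    local sleep_time="${3:-1}"
    local i=1
    # Retry until the name resolves at least once
    while (( i <= retries )); do
        [[ -n "$(dns_lookup "$host")" ]] && break
        sleep "$sleep_time"
        (( i+=1 ))
    done
    # The IP the caller receives comes from this second, independent lookup,
    # which can still come back empty if DNS resolution is flaky
    dns_lookup "$host"
}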

To Reproduce

  1. helm install redis-cluster bitnami/redis-cluster --set 'cluster.replicas=0'
  2. helm upgrade redis-cluster bitnami/redis-cluster --set 'cluster.nodes=6,cluster.replicas=0,cluster.init=false,cluster.update.addNodes=true,cluster.update.currentNumberOfNodes=3'
  3. tail the logs of the cluster-update pod

Expected behavior
All new pods are discovered and added to the existing cluster.

Version of Helm and Kubernetes:

  • Output of helm version:
version.BuildInfo{Version:"v3.3.3", GitCommit:"55e3ca022e40fe200fbc855938995f40b2a68ce0", GitTreeState:"dirty", GoVersion:"go1.15.2"}
  • Output of kubectl version:
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", GitCommit:"dff82dc0de47299ab66c83c626e08b245ab19037", GitTreeState:"clean", BuildDate:"2020-07-15T23:30:39Z", GoVersion:"go1.14.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.12+IKS", GitCommit:"d09005b98837bb6061c0f643a27383c02b003205", GitTreeState:"clean", BuildDate:"2020-09-16T21:47:16Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

Additional context
I've added set -x to the update job script to trace the issue.
Here's the output of pod/redis-cluster-cluster-update-9x4nt.
You can see that the DNS lookup succeeds once, which breaks out of the wait loop,
but the next dns_lookup call, the one used to return the IP address, fails to find any IP address.

++ sleep 5
++ (( i+=1  ))
++ (( i <= retries  ))
++ check_host redis-cluster-3.redis-cluster-headless
+++ dns_lookup redis-cluster-3.redis-cluster-headless
+++ local host=redis-cluster-3.redis-cluster-headless
+++ getent ahosts redis-cluster-3.redis-cluster-headless
+++ awk '/STREAM/ {print $1 }'
++ [[ 172.21.166.44 == '' ]]
++ true
++ return_value=0
++ break
++ return 0
++ dns_lookup redis-cluster-3.redis-cluster-headless
++ local host=redis-cluster-3.redis-cluster-headless
++ getent ahosts redis-cluster-3.redis-cluster-headless
++ awk '/STREAM/ {print $1 }'
+ new_node_ip=
++ redis-cli -h '' -p 6379 ping
Could not connect to Redis at :6379: Name or service not known
+ [[ '' != \P\O\N\G ]]
+ echo 'Node  not ready, waiting for all the nodes to be ready...'
+ sleep 5
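
So the retry loop inside wait_for_dns_lookup saw the name resolve once, but the final dns_lookup that actually prints the IP returned nothing, leaving new_node_ip empty and making every redis-cli -h '' ping fail. Purely as an illustration (this is not the chart's actual update-job template), a more defensive loop would re-resolve the name on every iteration instead of trusting a single lookup:

node="redis-cluster-3.redis-cluster-headless"
new_node_ip=""
# Keep re-resolving until we both have an IP and the node answers PING
until [[ -n "$new_node_ip" && "$(redis-cli -h "$new_node_ip" -p 6379 ping)" == "PONG" ]]; do
    new_node_ip="$(dns_lookup "$node")"
    echo "Node ${node} not ready, waiting for all the nodes to be ready..."
    sleep 5
done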
@EswarRams

I also faced the same issue when I tried to do an upgrade, and I see another issue as well.
Did you try with 3 masters and 4 slaves by any chance? When I restart a master I see a slave become the master, but the new pod that comes up with a new IP doesn't join the cluster. A pod can restart at any time, so it should rejoin the cluster. Is that something you faced, or haven't you tried it yet?

@javsalgar
Contributor

Hi,

@rafariossaa is checking an issue with the upgrade, pinging him in case it is related.

@rafariossaa
Contributor

Hi,
I am looking into this. I will come back as soon as I have news.

@tpolekhin
Author

@EswarRams I did try the default setup with 3 masters and 3 followers and everything was okay for me. I deleted a master pod and a follower got promoted. When the old master came back it joined the cluster and became a follower of the new master.
Make sure you're running with persistence enabled, otherwise the config is lost when you kill a pod and it will not rejoin the cluster. Possibly an issue? I don't know what the developers' intent was here. Since there's an environment variable with all the cluster pods' DNS names, one would hope the PVC is not that important, but who knows.
Maybe the devs can comment on this.

@rafariossaa
Contributor

Hi,
@EswarRams, your issue seems to be different from this one, which is related to the deployment upgrade.
@tpolekhin, joining the cluster should not be related to persistence. As you indicated, the node names are provided to the pod in an environment variable.
Please, if you are experiencing a different issue, open a new issue and link the issues you think are related.

@tpolekhin
Author

@rafariossaa I've created a new issue as you suggested: #3933

@rafariossaa
Contributor

@tpolekhin Thanks.

@EswarRams

EswarRams commented Oct 8, 2020 via email

@stale

stale bot commented Oct 24, 2020

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

@stale stale bot added the stale 15 days without activity label Oct 24, 2020
@tpolekhin
Author

Hello,
I can see this issue received a stale label. Is anyone still looking at this, or were you unable to reproduce the issue?

@stale stale bot removed the stale 15 days without activity label Oct 26, 2020
@rafariossaa
Contributor

Hi,
Sorry for the delay.
On one hand, a new release of the chart was made that fixes some issues with deciding a node's role when it restarts (or when scaling), so it may be worth giving it a try and checking whether the issue persists.
I was also waiting for some feedback from @EswarRams.

@tpolekhin
Author

Hello @rafariossaa,
I'm currently stuck on another issue with the cluster upgrade: #4064.
Hopefully this issue will be resolved as well once that one is fixed.

@rafariossaa
Contributor

Hi,
Please let me know when #4064 is solved and you can continue with your upgrade.

@stale

stale bot commented Nov 21, 2020

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

@stale stale bot added the stale 15 days without activity label Nov 21, 2020
@stale

stale bot commented Nov 29, 2020

Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.

@stale stale bot closed this as completed Nov 29, 2020