
Fix 2797 on Kubernetes - Cluster goes down because all IPs become unreachable #3149

Merged: 16 commits on Nov 14, 2017

Conversation

@bricef (Contributor) commented Oct 23, 2017

PR to fix #2797. Includes changes from #3022.

@bricef (Contributor, Author) commented Oct 23, 2017

Not just a location change for the API, but also a functional change. Will need to refactor the code to use the new API methods.

@bricef bricef requested review from brb, awh and bboreham October 24, 2017 15:50
@bricef changed the title from "Fix 2797 - Clusters go down because IPs cannot be allocated due to unreachable nodes" to "Fix 2797 on Kubernetes - Cluster goes down because all IPs become unreachable" on Oct 24, 2017
- apiVersion: rbac.authorization.k8s.io/v1beta1
  kind: Role
  metadata:
    name: weave-net2

@@ -47,6 +47,44 @@ items:
- kind: ServiceAccount
name: weave-net
namespace: kube-system
- apiVersion: rbac.authorization.k8s.io/v1beta1

@brb (Contributor) left a comment


A few more things not mentioned in my comments:

  1. The commit 71bc6f7 message mentions "should break up this commit".
  2. It would be useful to have some short explanations, in the form of commit messages, for the non-trivial commits, explaining why each change is needed.
  3. How safe do we feel about this change? Have we inspected the code coverage (you can find it among the CircleCI artifacts) to check that all branches of reclaimRemovedPeers are tested?
  4. What happens when some nodes in a cluster are running an older version of weave-kube? Have we tested that upgrade works?
  5. A few sentences (probably in the form of a package doc) about how this all works would be helpful.


greyly echo "Setting up kubernetes cluster"
tear_down_kubeadm;

# Make an ipset, so we can check it doesn't get wiped out by Weave Net

function check_no_lost_ip_addresses {
    for host in $HOSTS; do
        unreachable_count=$(run_on $host "sudo weave status ipam" | grep "unreachable" | wc -l)
        if [ "$unreachable_count" -gt "0" ]; then
            return 1 # a peer's IPs were never reclaimed
        fi
    done
}

@@ -0,0 +1,122 @@
#! /bin/bash


check_no_lost_ip_addresses;

force_drop_node;

)

func (cml *configMapAnnotations) Init() error {
	for { // Loop only if we call Create() and it's already there

		return
	}
	err = f()
	if err != nil && kubeErrors.IsConflict(err) {
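For context, the fragment above looks like part of a retry-on-conflict helper around ConfigMap updates. A minimal, self-contained sketch of that pattern (assuming the standard apimachinery errors package; names here are illustrative, not the PR's exact code):

package kubepeers // hypothetical package name

import (
	kubeErrors "k8s.io/apimachinery/pkg/api/errors"
)

const maxUpdateRetries = 3 // assumed retry budget

// loopUpdate re-reads state via refresh and retries f while the API server
// reports an optimistic-concurrency conflict (someone else updated the
// ConfigMap between our read and our write).
func loopUpdate(refresh func() error, f func() error) error {
	var err error
	for i := 0; i < maxUpdateRetries; i++ {
		if err = refresh(); err != nil {
			return err
		}
		err = f()
		if err == nil || !kubeErrors.IsConflict(err) {
			return err
		}
		// Conflict: loop round and retry against freshly-read state.
	}
	return err
}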


// Step 3-5 is to protect against two simultaneous rmpeers of X
// Step 4 is to pick up again after a restart between step 5 and step 7b
// If the peer doing the reclaim disappears between steps 5 and 7a, then someone will clean it up in step 7aa

	return cml.UpdateAnnotation(KubePeersAnnotationKey, string(recordBytes))
}

func (cml *configMapAnnotations) GetAnnotation(key string) (string, bool) {
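To make the annotation mechanism above concrete: the peer list appears to be serialised as JSON into a ConfigMap annotation value. An illustrative round-trip follows (type, field and key names here are assumptions for the example, not the PR's actual definitions):

package main

import (
	"encoding/json"
	"fmt"
)

// peerInfo is a hypothetical stand-in for the record stored per peer.
type peerInfo struct {
	PeerName string `json:"peername"` // e.g. "0e:e4:4e:dd:92:ca"
	NodeName string `json:"nodename"` // e.g. "brya-2"
}

const kubePeersAnnotationKey = "kube-peers.weave.works/peers" // assumed key name

func main() {
	peers := []peerInfo{{PeerName: "0e:e4:4e:dd:92:ca", NodeName: "brya-2"}}

	// Write: marshal the peer list and store it as an annotation value,
	// in the spirit of UpdateAnnotation(KubePeersAnnotationKey, ...) above.
	recordBytes, _ := json.Marshal(peers)
	annotations := map[string]string{kubePeersAnnotationKey: string(recordBytes)}

	// Read: a consumer of GetAnnotation would decode the same value.
	var decoded []peerInfo
	_ = json.Unmarshal([]byte(annotations[kubePeersAnnotationKey]), &decoded)
	fmt.Println(decoded)
}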

@bricef bricef self-assigned this Oct 31, 2017
@bricef force-pushed the fix-2797-loosing-ipam-ips branch 3 times, most recently from df1a05f to 05ac956, on October 31, 2017 14:55
@skny5 commented Nov 3, 2017

Any thoughts on when this will be merged into a release? This is an extremely critical piece of functionality when running K8S on a dynamic infrastructure; Weave becomes pretty much unusable when this error (#2797) occurs.

@bboreham (Contributor) commented Nov 3, 2017

@skny5 right now we're trying to clear up all the points and give it another thorough review; hopefully days rather than weeks.
As you may understand, it is a rather destructive operation if we get it wrong.

@bricef (Contributor, Author) commented Nov 6, 2017

@skny5 Working on it as we speak. Hoping this will pass review and hit mainline this week.

@bboreham (Contributor) commented Nov 8, 2017

Some notes from my testing:

I fired up a 3-node Kubernetes cluster using the test 840 script.
Checked logs from all three weave containers.
Checked weave status ipam
Deleted a node with kubectl delete node brya-2
Initially the weave network still had three peers, but then the eviction manager kicked in on brya-2 and deleted all pods, killing all containers including Weave.

Next I restarted one of the remaining weave-net pods, so it would notice the node had gone; this all appeared to go to plan:

INFO: 2017/11/08 12:14:09.239399 Added myself to peer list &{[{76:a1:dd:6c:12:51 brya-0} {0e:e4:4e:dd:92:ca brya-2} {ba:23:5a:0c:f3:10 brya-1}]}
DEBU: 2017/11/08 12:14:09.239596 Nodes that have disappeared: map[brya-2:{0e:e4:4e:dd:92:ca brya-2}]
DEBU: 2017/11/08 12:14:09.239643 Preparing to remove disappeared peer {0e:e4:4e:dd:92:ca brya-2}
DEBU: 2017/11/08 12:14:09.239665 Noting I plan to remove  0e:e4:4e:dd:92:ca
DEBU: 2017/11/08 12:14:09.245074 Nodes that have disappeared: map[brya-2:{0e:e4:4e:dd:92:ca brya-2}]
DEBU: 2017/11/08 12:14:09.245149 Preparing to remove disappeared peer {0e:e4:4e:dd:92:ca brya-2}
DEBU: 2017/11/08 12:14:09.245171 Existing annotation 76:a1:dd:6c:12:51
DEBU: 2017/11/08 12:14:09.245191 weave DELETE to http://127.0.0.1:6784/peer/0e:e4:4e:dd:92:ca with map[]
INFO: 2017/11/08 12:14:09.252523 rmpeer of 0e:e4:4e:dd:92:ca : 393216 IPs taken over from 0e:e4:4e:dd:92:ca

I attempted to rejoin brya-2 using kubeadm reset then kubeadm join [...]
This resulted in the peer picking up the previous /var/lib/weave/weave-netdata.db, so now it cannot join the cluster because it has an incompatible update. This was previously noted.

I manually deleted the persistence file, then went through kubectl delete node, kubeadm reset, kubeadm join again. Weave Net fired up OK.

Slightly puzzling ipam state, presumably because it reclaimed the bridge address and nothing further:

# weave status ipam
76:a1:dd:6c:12:51(brya-0)               917503 IPs (87.5% of total) (1 active)
0e:e4:4e:dd:92:ca(brya-2)                    1 IPs (00.0% of total) 
ba:23:5a:0c:f3:10(brya-1)               131072 IPs (12.5% of total) 

After moving one of the nettest pods onto brya-2 it is more reasonable:

# weave status ipam
76:a1:dd:6c:12:51(brya-0)               655359 IPs (62.5% of total) (1 active)
0e:e4:4e:dd:92:ca(brya-2)               262145 IPs (25.0% of total) 
ba:23:5a:0c:f3:10(brya-1)               131072 IPs (12.5% of total) 

Next I am going to abruptly shut down brya-1 and see what happens.

[24 hours later...]

Node status goes to NotReady; pods on that node go to status NodeLost, but the node is not deleted and the code in this PR does not reclaim its IPs. This is all as expected.
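For anyone reproducing the above by hand, the sequence boils down to roughly the following commands (a sketch only; node and pod names such as brya-2 and weave-net-xxxxx are placeholders from this particular test cluster):

# Check each peer's view of IPAM (run on each host)
sudo weave status ipam

# Check the weave container logs on each node
kubectl -n kube-system logs weave-net-xxxxx -c weave

# Delete a node from the API server; this is what should trigger the reclaim
kubectl delete node brya-2

# Restart one of the remaining weave-net pods so it notices the node has gone
kubectl -n kube-system delete pod weave-net-yyyyy

# Confirm the deleted peer's IPs were taken over
sudo weave status ipam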

@bricef (Contributor, Author) commented Nov 8, 2017

> I attempted to rejoin brya-2 using kubeadm reset then kubeadm join [...]
> This resulted in the peer picking up the previous /var/lib/weave/weave-netdata.db, so now it cannot join the cluster because it has an incompatible update. This was previously noted.
>
> I manually deleted the persistence file, then went through kubectl delete node, kubeadm reset, kubeadm join again. Weave Net fired up OK.

@bboreham I haven't encountered this. Running kubeadm reset then kubeadm join ... leaves me with an (eventually) clean IPAM table across the cluster. I'm not doing kubectl delete though.

I am getting the following from somewhere when watching the logs of the k8s_weave_weave-net-... container:

INFO: 2017/11/08 16:13:55.261671 Removed unreachable peer e6:78:31:8a:e6:e1(vagr-1)

Although apparently not from the weave-net codebase? (I ran a search for "Removed" and "unreachable", and git couldn't find it!)

Is it at this point (or after some retries) that we should be removing the IP allocations for that peer?

This is the log I get when removing a peer with kubeadm reset:

INFO: 2017/11/08 16:26:14.364090 ->[10.128.0.19:56777|e6:78:31:8a:e6:e1(vagr-1)]: connection shutting down due to error: read tcp4 10.128.0.20:6783->10.128.0.19:56777: read: connection reset by peer
INFO: 2017/11/08 16:26:14.364386 ->[10.128.0.19:56777|e6:78:31:8a:e6:e1(vagr-1)]: connection deleted
INFO: 2017/11/08 16:26:14.365670 Removed unreachable peer e6:78:31:8a:e6:e1(vagr-1)

But sudo weave status ipam still shows unreachable addresses.

9a:33:6c:88:6b:5f(vagr-0)               393216 IPs (37.5% of total) (1 active)
2e:eb:78:2c:d6:d8(vagr-2)               131072 IPs (12.5% of total) 
e6:78:31:8a:e6:e1(vagr-1)               524288 IPs (50.0% of total) - unreachable!

Pretty sure I'm running your latest version (I get the [kube-peers]-tagged logs), but I never get the Preparing to remove disappeared peer message, and the 860 test is (correctly) failing.

@bboreham (Contributor) commented Nov 8, 2017

kubeadm reset removes the k8s installation and stops running processes, but doesn't delete the node from the api-server. So this PR doesn't reclaim those IPs.

@bricef (Contributor, Author) commented Nov 9, 2017

Ok, I'll amend the test to delete the node using kubectl delete node instead.
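A hypothetical shape for the amended helper (force_drop_node and run_on are names taken from the snippets in this PR; the body is a sketch rather than the actual test code, and $HOST1 standing in for the master is an assumption):

function force_drop_node {
    local node=$1
    # Remove the node from the API server so the remaining peers reclaim its IPs
    run_on $HOST1 "kubectl delete node $node"
    # Then tear down Kubernetes on the dropped host
    run_on $node "sudo kubeadm reset"
}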

ipam_status=$(run_on $host "sudo weave status ipam")
echo $ipam_status
unreachable_count=$(echo $ipam_status | grep "unreachable" | wc -l)
if [ "$unreachable_count" != "0" ]; then
    return 1
fi

@bricef force-pushed the fix-2797-loosing-ipam-ips branch 4 times, most recently from 01c78de to e12d9de, on November 13, 2017 14:34
@bboreham bboreham added this to the 2.1 milestone Nov 13, 2017
@bboreham (Contributor) commented

I think this is in fairly good shape; I am minded to merge it and raise new issues for follow-up.

One area we need to cover off is what happens when you upgrade from a previous release: the current code will delete persisted data, which is OK if you do it one node at a time, but not ideal.

Also we need to update the cloud.weave.works config generator.


Successfully merging this pull request may close these issues.

Remove deleted k8s nodes from Weave Net
5 participants