
Fix 2797 on Kubernetes - Cluster goes down because all IPs become unreachable #3149

Merged: 16 commits on Nov 14, 2017

Conversation

@bricef (Contributor) commented Oct 23, 2017

PR to fix #2797. Includes changes from #3022.

@bricef (Contributor, Author) commented Oct 23, 2017

Not just a location change for the API, but also a functional change. Will need to refactor the code to use the new API methods.

@bricef bricef requested review from brb, awh and bboreham October 24, 2017 15:50
@bricef changed the title from "Fix 2797 - Clusters go down because IPs cannot be allocated due to unreachable nodes" to "Fix 2797 on Kubernetes - Cluster goes down because all IPs become unreachable" on Oct 24, 2017
- apiVersion: rbac.authorization.k8s.io/v1beta1
  kind: Role
  metadata:
    name: weave-net2

@@ -47,6 +47,44 @@ items:
- kind: ServiceAccount
name: weave-net
namespace: kube-system
- apiVersion: rbac.authorization.k8s.io/v1beta1

@brb (Contributor) left a comment


A few more things not mentioned in my comments:

  1. The commit 71bc6f7 message mentions "should break up this commit".
  2. It would be useful to have some short explanations, in the form of commit messages, for the non-trivial commits, explaining why each change is needed.
  3. How safe do we feel about this change? Have we inspected the code coverage (you can find it among the CircleCI artifacts) to check that all branches of reclaimRemovedPeers are tested?
  4. What happens when some nodes in a cluster are running an older version of weave-kube? Have we tested that upgrade works?
  5. A few sentences (probably in the form of a package doc) about how this all works would be helpful.


greyly echo "Setting up kubernetes cluster"
tear_down_kubeadm;

# Make an ipset, so we can check it doesn't get wiped out by Weave Net

function check_no_lost_ip_addresses {
    for host in $HOSTS; do
        unreachable_count=$(run_on $host "sudo weave status ipam" | grep "unreachable" | wc -l)
        if [ "$unreachable_count" -gt "0" ]; then
            return 1 # a peer's IPs were never reclaimed
        fi
    done
}

@@ -0,0 +1,122 @@
#! /bin/bash


check_no_lost_ip_addresses;

force_drop_node;

)

func (cml *configMapAnnotations) Init() error {
	for { // Loop only if we call Create() and it's already there

		return
	}
	err = f()
	if err != nil && kubeErrors.IsConflict(err) {
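For context, the fragment above looks like part of a retry-on-conflict helper around ConfigMap updates. A minimal, self-contained sketch of that pattern (assuming the standard apimachinery errors package; names here are illustrative, not the PR's exact code):

package kubepeers // hypothetical package name

import (
	kubeErrors "k8s.io/apimachinery/pkg/api/errors"
)

const maxUpdateRetries = 3 // assumed retry budget

// loopUpdate re-reads state via refresh and retries f while the API server
// reports an optimistic-concurrency conflict (someone else updated the
// ConfigMap between our read and our write).
func loopUpdate(refresh func() error, f func() error) error {
	var err error
	for i := 0; i < maxUpdateRetries; i++ {
		if err = refresh(); err != nil {
			return err
		}
		err = f()
		if err == nil || !kubeErrors.IsConflict(err) {
			return err
		}
		// Conflict: loop round and retry against freshly-read state.
	}
	return err
}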


// Step 3-5 is to protect against two simultaneous rmpeers of X
// Step 4 is to pick up again after a restart between step 5 and step 7b
// If the peer doing the reclaim disappears between steps 5 and 7a, then someone will clean it up in step 7aa

	return cml.UpdateAnnotation(KubePeersAnnotationKey, string(recordBytes))
}

func (cml *configMapAnnotations) GetAnnotation(key string) (string, bool) {
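To make the annotation mechanism above concrete: the peer list appears to be serialised as JSON into a ConfigMap annotation value. An illustrative round-trip follows (type, field and key names here are assumptions for the example, not the PR's actual definitions):

package main

import (
	"encoding/json"
	"fmt"
)

// peerInfo is a hypothetical stand-in for the record stored per peer.
type peerInfo struct {
	PeerName string `json:"peername"` // e.g. "0e:e4:4e:dd:92:ca"
	NodeName string `json:"nodename"` // e.g. "brya-2"
}

const kubePeersAnnotationKey = "kube-peers.weave.works/peers" // assumed key name

func main() {
	peers := []peerInfo{{PeerName: "0e:e4:4e:dd:92:ca", NodeName: "brya-2"}}

	// Write: marshal the peer list and store it as an annotation value,
	// in the spirit of UpdateAnnotation(KubePeersAnnotationKey, ...) above.
	recordBytes, _ := json.Marshal(peers)
	annotations := map[string]string{kubePeersAnnotationKey: string(recordBytes)}

	// Read: a consumer of GetAnnotation would decode the same value.
	var decoded []peerInfo
	_ = json.Unmarshal([]byte(annotations[kubePeersAnnotationKey]), &decoded)
	fmt.Println(decoded)
}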

@bricef bricef self-assigned this Oct 31, 2017
@bricef force-pushed the fix-2797-loosing-ipam-ips branch 3 times, most recently from df1a05f to 05ac956, on October 31, 2017 14:55
@skny5 commented Nov 3, 2017

Any thoughts on when this will be merged into a release? This is an extremely critical piece of functionality when running K8S on a dynamic infrastructure; Weave becomes pretty much unusable when this error (#2797) occurs.

@bboreham (Contributor) commented Nov 3, 2017

@skny5 right now we're trying to clear up all the points and give it another thorough review; hopefully days rather than weeks.
As you may understand, it is a rather destructive operation if we get it wrong.

@bricef (Contributor, Author) commented Nov 6, 2017

@skny5 Working on it as we speak. Hoping this will pass review and hit mainline this week.

@bboreham (Contributor) commented Nov 8, 2017

Some notes from my testing:

I fired up a 3-node Kubernetes cluster using the test 840 script.
Checked logs from all three weave containers.
Checked weave status ipam
Deleted a node with kubectl delete node brya-2
Initially the weave network still had three peers, but then the eviction manager kicked in on brya-2 and deleted all pods, killing all containers including Weave.

Next I restarted one of the remaining weave-net pods, so it would notice the node had gone; this all appeared to go to plan:

INFO: 2017/11/08 12:14:09.239399 Added myself to peer list &{[{76:a1:dd:6c:12:51 brya-0} {0e:e4:4e:dd:92:ca brya-2} {ba:23:5a:0c:f3:10 brya-1}]}
DEBU: 2017/11/08 12:14:09.239596 Nodes that have disappeared: map[brya-2:{0e:e4:4e:dd:92:ca brya-2}]
DEBU: 2017/11/08 12:14:09.239643 Preparing to remove disappeared peer {0e:e4:4e:dd:92:ca brya-2}
DEBU: 2017/11/08 12:14:09.239665 Noting I plan to remove  0e:e4:4e:dd:92:ca
DEBU: 2017/11/08 12:14:09.245074 Nodes that have disappeared: map[brya-2:{0e:e4:4e:dd:92:ca brya-2}]
DEBU: 2017/11/08 12:14:09.245149 Preparing to remove disappeared peer {0e:e4:4e:dd:92:ca brya-2}
DEBU: 2017/11/08 12:14:09.245171 Existing annotation 76:a1:dd:6c:12:51
DEBU: 2017/11/08 12:14:09.245191 weave DELETE to http://127.0.0.1:6784/peer/0e:e4:4e:dd:92:ca with map[]
INFO: 2017/11/08 12:14:09.252523 rmpeer of 0e:e4:4e:dd:92:ca : 393216 IPs taken over from 0e:e4:4e:dd:92:ca

I attempted to rejoin brya-2 using kubeadm reset then kubeadm join [...]
This resulted in the peer picking up the previous /var/lib/weave/weave-netdata.db, so now it cannot join the cluster because it has an incompatible update. This was previously noted.

I manually deleted the persistence file, then went through kubectl delete node, kubeadm reset, kubeadm join again. Weave Net fired up OK.

Slightly puzzling ipam state, presumably because it reclaimed the bridge address and nothing further:

# weave status ipam
76:a1:dd:6c:12:51(brya-0)               917503 IPs (87.5% of total) (1 active)
0e:e4:4e:dd:92:ca(brya-2)                    1 IPs (00.0% of total) 
ba:23:5a:0c:f3:10(brya-1)               131072 IPs (12.5% of total) 

After moving one of the nettest pods onto brya-2 it is more reasonable:

# weave status ipam
76:a1:dd:6c:12:51(brya-0)               655359 IPs (62.5% of total) (1 active)
0e:e4:4e:dd:92:ca(brya-2)               262145 IPs (25.0% of total) 
ba:23:5a:0c:f3:10(brya-1)               131072 IPs (12.5% of total) 

Next I am going to abruptly shut down brya-1 and see what happens.

[24 hours later...]

Node status goes to NotReady; pods on that node go to status NodeLost, but the node is not deleted and the code in this PR does not reclaim its IPs. This is all as expected.
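For anyone reproducing the above by hand, the sequence boils down to roughly the following commands (a sketch only; node and pod names such as brya-2 and weave-net-xxxxx are placeholders from this particular test cluster):

# Check each peer's view of IPAM (run on each host)
sudo weave status ipam

# Check the weave container logs on each node
kubectl -n kube-system logs weave-net-xxxxx -c weave

# Delete a node from the API server; this is what should trigger the reclaim
kubectl delete node brya-2

# Restart one of the remaining weave-net pods so it notices the node has gone
kubectl -n kube-system delete pod weave-net-yyyyy

# Confirm the deleted peer's IPs were taken over
sudo weave status ipam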

@bricef (Contributor, Author) commented Nov 8, 2017

> I attempted to rejoin brya-2 using kubeadm reset then kubeadm join [...]
> This resulted in the peer picking up the previous /var/lib/weave/weave-netdata.db, so now it cannot join the cluster because it has an incompatible update. This was previously noted.
>
> I manually deleted the persistence file, then went through kubectl delete node, kubeadm reset, kubeadm join again. Weave Net fired up OK.

@bboreham I haven't encountered this. Running kubeadm reset then kubeadm join ... leaves me with an (eventually) clean IPAM table across the cluster. I'm not doing kubectl delete though.

I am getting the following from somewhere when watching the logs of the k8s_weave_weave-net-... container:

INFO: 2017/11/08 16:13:55.261671 Removed unreachable peer e6:78:31:8a:e6:e1(vagr-1)

Although apparently not from the weave-net codebase? (I ran a search for "Removed" and "unreachable", and git couldn't find it!)

Is it at this point (or after some retries) that we should be removing the IP allocations for that peer?

This is the log I get when removing a peer with kubeadm reset:

INFO: 2017/11/08 16:26:14.364090 ->[10.128.0.19:56777|e6:78:31:8a:e6:e1(vagr-1)]: connection shutting down due to error: read tcp4 10.128.0.20:6783->10.128.0.19:56777: read: connection reset by peer
INFO: 2017/11/08 16:26:14.364386 ->[10.128.0.19:56777|e6:78:31:8a:e6:e1(vagr-1)]: connection deleted
INFO: 2017/11/08 16:26:14.365670 Removed unreachable peer e6:78:31:8a:e6:e1(vagr-1)

But sudo weave status ipam still shows unreachable addresses.

9a:33:6c:88:6b:5f(vagr-0)               393216 IPs (37.5% of total) (1 active)
2e:eb:78:2c:d6:d8(vagr-2)               131072 IPs (12.5% of total) 
e6:78:31:8a:e6:e1(vagr-1)               524288 IPs (50.0% of total) - unreachable!

Pretty sure I'm running your latest version (I get the [kube-peers]-tagged logs), but I never get the Preparing to remove disappeared peer message, and the 860 test is (correctly) failing.

@bboreham (Contributor) commented Nov 8, 2017

kubeadm reset removes the k8s installation and stops running processes, but doesn't delete the node from the api-server. So this PR doesn't reclaim those IPs.

@bricef (Contributor, Author) commented Nov 9, 2017

Ok, I'll amend the test to delete the node using kubectl delete node instead.
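A hypothetical shape for the amended helper (force_drop_node and run_on are names taken from the snippets in this PR; the body is a sketch rather than the actual test code, and $HOST1 standing in for the master is an assumption):

function force_drop_node {
    local node=$1
    # Remove the node from the API server so the remaining peers reclaim its IPs
    run_on $HOST1 "kubectl delete node $node"
    # Then tear down Kubernetes on the dropped host
    run_on $node "sudo kubeadm reset"
}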

ipam_status=$(run_on $host "sudo weave status ipam")
echo $ipam_status
unreachable_count=$(echo $ipam_status | grep "unreachable" | wc -l)
if [ "$unreachable_count" != "0" ]; then
    return 1
fi

@bricef force-pushed the fix-2797-loosing-ipam-ips branch 4 times, most recently from 01c78de to e12d9de, on November 13, 2017 14:34
@bboreham bboreham added this to the 2.1 milestone Nov 13, 2017
@bboreham (Contributor) commented

I think this is in fairly good shape; I am minded to merge it and raise new issues for follow-up.

One area we need to cover off is what happens when you upgrade from a previous release: the current code will delete persisted data, which is OK if you do it one node at a time, but not ideal.

Also we need to update the cloud.weave.works config generator.


Successfully merging this pull request may close these issues.

Remove deleted k8s nodes from Weave Net
5 participants