Flannel not starting on a new node. #770

Closed
doronoffir opened this issue Jul 13, 2017 · 2 comments

Comments

@doronoffir

Hi Guys,

We have a k8s 1.6.2 cluster running on AWS. We used KOPS 1.6.2 to create the cluster.

We have several node pools for the different application roles, such as DBs, workers, and proxies.
We also use spot instances for the workers, managed by Spotinst with their customized k8s autoscaling pod.
We ran our tests on staging k8s clusters of versions 1.5.6 and 1.6.0, and we did not encounter the issue I'll describe.
The issue presented itself in the production k8s cluster (k8s 1.6.2).
Due to some performance issues at the beginning, we scaled out the cluster to the point where it had almost 120 nodes; that's when the problems started.

Expected Behavior

The worker node pool consists of spot instances managed by Spotinst services; a customized autoscaler pod sends scaling notices to Spotinst.
When a new node is added to the pool, it should start serving the cluster's resource needs.

Current Behavior

About 40% of newly added nodes seem to have a Flannel issue.
The node connects to the cluster and is reported healthy, but when new pods are scheduled to it, they get stuck in "ContainerCreating". Examining the pod log shows the following error:

Warning FailedSync Error syncing pod, skipping: failed to "CreatePodSandbox" for "analyzer-315055607-kdn0x_default(ac80c569-670c-11e7-bcbc-0a7044fda3e6)" with CreatePodSandboxError: "CreatePodSandbox for pod \"analyzer-315055607-kdn0x_default(ac80c569-670c-11e7-bcbc-0a7044fda3e6)\" failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod \"analyzer-315055607-kdn0x_default\" network: open /run/flannel/subnet.env: no such file or directory"

Restarting/deleting the pod did not solve this.
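
That error points to a missing /run/flannel/subnet.env on the node. As a minimal check, assuming SSH access to an affected node (the values below are only illustrative; the actual CIDRs and MTU depend on the cluster configuration):

# Fails with "No such file or directory" on a broken node:
cat /run/flannel/subnet.env

# On a healthy node it contains something like:
# FLANNEL_NETWORK=100.64.0.0/10
# FLANNEL_SUBNET=100.96.1.1/24
# FLANNEL_MTU=8951
# FLANNEL_IPMASQ=true

The CNI plugin reads this file to set up the pod network, so until flannel writes it, every pod sandbox creation on that node fails with the error above.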

Possible Solution

Our initial mitigation was rebooting the problematic node. That helped for the first few nodes, but not for most of the failed ones; we added a large number of nodes at once, about 50, and ended up with about 20 failed nodes.
Our next step was terminating the failed nodes. This had a similar effect: it helped for some, but in most cases the replacement node had the same issue.
To cut a long story short, we ended up with the workaround of restarting the Flannel pod on the failed nodes, and this solved the problem.
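
For reference, the workaround boils down to deleting the flannel pod on the failed node so its DaemonSet recreates it; a sketch, assuming flannel runs as a DaemonSet in kube-system, with <node-name> and <flannel-pod> as placeholders:

# Find the flannel pod scheduled on the failed node:
kubectl get pods -n kube-system -o wide | grep flannel | grep <node-name>

# Delete it; the DaemonSet starts a replacement, which should write /run/flannel/subnet.env:
kubectl delete pod <flannel-pod> -n kube-system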

Steps to Reproduce (for bugs)

For us, adding a batch of 30 servers to the cluster reproduces it; at least 10 of them will exhibit this issue.

Context

Since we need to monitor node creation because of this, I cannot really trust the autoscaler or any other "healing" procedure that requires a node restart or replacement.

Your Environment

A k8s cluster version 1.6.2, created by KOPS 1.6.2, running on AWS.

Thank you!

@tomdee
Contributor

tomdee commented Jul 13, 2017

Hi @Doron-offir, sorry you hit these issues. I suspect you were hitting the problem where flannel wouldn't start with kube-subnet-mgr and >100 nodes (#719). This is fixed in v0.8.0 (https://github.com/coreos/flannel/releases/tag/v0.8.0), so I suggest you raise an issue with Kops to get it updated to the new flannel release.
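
If you want to confirm which flannel image the cluster is currently running, something like the following should work (assuming flannel is deployed as a DaemonSet in kube-system; <flannel-ds> is a placeholder for whatever name kops gave it):

# List flannel DaemonSets and their images:
kubectl get ds -n kube-system -o wide | grep flannel

# Or read the image field directly:
kubectl get ds <flannel-ds> -n kube-system -o jsonpath='{.spec.template.spec.containers[0].image}'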

tomdee closed this as completed Jul 13, 2017
@doronoffir
Author

doronoffir commented Jul 16, 2017 via email
