Flannel not starting on a new node. #770

Closed
doronoffir opened this issue Jul 13, 2017 · 2 comments

Comments

@doronoffir

Hi Guys,

We have a k8s 1.6.2 cluster running on AWS. We used KOPS 1.6.2 to create the cluster.

We have several node pools for the different application roles, such as DBs, workers, and proxies.
We also use spot instances for the workers, managed by Spotinst with their customized k8s autoscaling pod.
We ran our tests on staging k8s clusters of versions 1.5.6 and 1.6.0, and we did not encounter the issue I'll describe.
The issue presented itself in the production k8s cluster (k8s 1.6.2).
Due to some performance issues at the beginning, we scaled out the cluster to the point where it had almost 120 nodes; that's when the problems started.

Expected Behavior

The worker node pool consists of spot instances managed by Spotinst services; a customized autoscaler pod sends scaling notices to Spotinst.
When a new node is added to the pool, it should start serving the cluster's resource needs.

Current Behavior

About 40% of newly added nodes seem to have a Flannel issue.
The node connects to the cluster and is reported healthy, but when new pods are scheduled to it, they get stuck in "ContainerCreating". Examining the pod log shows the following error:

Warning FailedSync Error syncing pod, skipping: failed to "CreatePodSandbox" for "analyzer-315055607-kdn0x_default(ac80c569-670c-11e7-bcbc-0a7044fda3e6)" with CreatePodSandboxError: "CreatePodSandbox for pod \"analyzer-315055607-kdn0x_default(ac80c569-670c-11e7-bcbc-0a7044fda3e6)\" failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod \"analyzer-315055607-kdn0x_default\" network: open /run/flannel/subnet.env: no such file or directory"

Restarting/deleting the pod did not solve this.
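
That error points to a missing /run/flannel/subnet.env on the node. As a minimal check, assuming SSH access to an affected node (the values below are only illustrative; the actual CIDRs and MTU depend on the cluster configuration):

# Fails with "No such file or directory" on a broken node:
cat /run/flannel/subnet.env

# On a healthy node it contains something like:
# FLANNEL_NETWORK=100.64.0.0/10
# FLANNEL_SUBNET=100.96.1.1/24
# FLANNEL_MTU=8951
# FLANNEL_IPMASQ=true

The CNI plugin reads this file to set up the pod network, so until flannel writes it, every pod sandbox creation on that node fails with the error above.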

Possible Solution

Our initial mitigation was rebooting the problematic node. That helped for the first few nodes, but not for most of the failed ones; we added a large number of nodes at once, about 50, and ended up with about 20 failed nodes.
Our next step was terminating the failed nodes. This had a similar effect: it helped for some, but in most cases the replacement node had the same issue.
To cut a long story short, we ended up with the workaround of restarting the Flannel pod on the failed nodes, and this solved the problem.
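
For reference, the workaround boils down to deleting the flannel pod on the failed node so its DaemonSet recreates it; a sketch, assuming flannel runs as a DaemonSet in kube-system, with <node-name> and <flannel-pod> as placeholders:

# Find the flannel pod scheduled on the failed node:
kubectl get pods -n kube-system -o wide | grep flannel | grep <node-name>

# Delete it; the DaemonSet starts a replacement, which should write /run/flannel/subnet.env:
kubectl delete pod <flannel-pod> -n kube-system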

Steps to Reproduce (for bugs)

For us, adding a batch of 30 servers to the cluster reproduces it; at least 10 of them will exhibit this issue.

Context

Since we need to monitor node creation because of this, I cannot really trust the autoscaler or any other "healing" procedure that requires a node restart or replacement.

Your Environment

A k8s cluster version 1.6.2, created by KOPS 1.6.2, running on AWS.

Thank you!

@tomdee
Contributor

tomdee commented Jul 13, 2017

Hi @Doron-offir, sorry you hit these issues. I suspect you were hitting the problem where flannel wouldn't start with kube-subnet-mgr and >100 nodes (#719). This is fixed in v0.8.0 (https://github.com/coreos/flannel/releases/tag/v0.8.0), so I suggest you raise an issue with Kops to get it updated to the new flannel release.
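
If you want to confirm which flannel image the cluster is currently running, something like the following should work (assuming flannel is deployed as a DaemonSet in kube-system; <flannel-ds> is a placeholder for whatever name kops gave it):

# List flannel DaemonSets and their images:
kubectl get ds -n kube-system -o wide | grep flannel

# Or read the image field directly:
kubectl get ds <flannel-ds> -n kube-system -o jsonpath='{.spec.template.spec.containers[0].image}'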

tomdee closed this as completed Jul 13, 2017
@doronoffir
Author

doronoffir commented Jul 16, 2017 via email
