“kube-subnet-mgr” doesn't work over 100 nodes #719

Closed
drinktee opened this issue May 14, 2017 · 5 comments
@drinktee

Expected Behavior

I have set up a 170-node Kubernetes cluster. I used this daemonset yaml to deploy flannel. Network was set to 172.17.0.0/16. I found that only about 100 nodes work; the other nodes didn't have a flannel0 or flannel.1 interface.

Current Behavior

The log prints 'Waiting %s for node controller to sync', and the code seems to hang at this point.
After I removed '--kube-subnet-mgr' and used etcd to configure the network, the cluster works. Every node has a flannel interface.

Your Environment

  • Flannel version: 0.7.1
  • Backend used (e.g. vxlan or udp): vxlan
  • Etcd version: 3.1.7
  • Kubernetes version (if used): 1.6.2
  • Operating System and version: CentOS 6
@tomdee
Contributor

tomdee commented May 17, 2017

@drinktee Do you have the logs from the API server when this was happening?

@tomdee
Contributor

tomdee commented May 17, 2017

It's possible (though I can't see how) that it could be related to the "100" here - https://github.com/coreos/flannel/blob/master/subnet/kube/kube.go#L129

@Capitrium

I'm running into the same issue - kube-flannel seems to hang on new nodes once the cluster has reached 100 running nodes. The kube-flannel logs show "Waiting 10m0s for node controller to sync", but that timeout never seems to expire. I don't see any red flags in the logs myself, but I've included them below.

Logs from a broken kube-flannel pod, which has been running for ~30 minutes now:

$ kubectl logs -n kube-system kube-flannel-ds-0763k -f -c kube-flannel
I0518 20:21:27.924866       1 kube.go:111] Waiting 10m0s for node controller to sync
I0518 20:21:27.924939       1 kube.go:315] starting kube subnet manager

API server logs:

$ kubectl logs -n kube-system kube-apiserver-ip-10-0-15-238.ec2.internal -f
I0518 19:41:16.120565       1 aws.go:762] Building AWS cloudprovider
I0518 19:41:16.120654       1 aws.go:725] Zone not specified in configuration file; querying AWS metadata service
I0518 19:41:16.365252       1 tags.go:76] AWS cloud filtering on ClusterID: dev-cluster
E0518 19:41:16.840996       1 reflector.go:201] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:70: Failed to list *api.LimitRange: Get https://localhost:443/api/v1/limitranges?resourceVersion=0: dial tcp [::1]:443: getsockopt: connection refused
E0518 19:41:16.841107       1 reflector.go:201] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:70: Failed to list *api.Secret: Get https://localhost:443/api/v1/secrets?resourceVersion=0: dial tcp [::1]:443: getsockopt: connection refused
E0518 19:41:16.841199       1 reflector.go:201] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:70: Failed to list *rbac.Role: Get https://localhost:443/apis/rbac.authorization.k8s.io/v1beta1/roles?resourceVersion=0: dial tcp [::1]:443: getsockopt: connection refused
E0518 19:41:16.841308       1 reflector.go:201] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:70: Failed to list *rbac.RoleBinding: Get https://localhost:443/apis/rbac.authorization.k8s.io/v1beta1/rolebindings?resourceVersion=0: dial tcp [::1]:443: getsockopt: connection refused
E0518 19:41:16.841308       1 reflector.go:201] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:70: Failed to list *rbac.ClusterRole: Get https://localhost:443/apis/rbac.authorization.k8s.io/v1beta1/clusterroles?resourceVersion=0: dial tcp [::1]:443: getsockopt: connection refused
E0518 19:41:16.841451       1 reflector.go:201] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:70: Failed to list *rbac.ClusterRoleBinding: Get https://localhost:443/apis/rbac.authorization.k8s.io/v1beta1/clusterrolebindings?resourceVersion=0: dial tcp [::1]:443: getsockopt: connection refused
E0518 19:41:16.841489       1 reflector.go:201] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:70: Failed to list *api.Namespace: Get https://localhost:443/api/v1/namespaces?resourceVersion=0: dial tcp [::1]:443: getsockopt: connection refused
E0518 19:41:16.841519       1 reflector.go:201] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:70: Failed to list *api.ResourceQuota: Get https://localhost:443/api/v1/resourcequotas?resourceVersion=0: dial tcp [::1]:443: getsockopt: connection refused
E0518 19:41:16.841592       1 reflector.go:201] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:70: Failed to list *api.ServiceAccount: Get https://localhost:443/api/v1/serviceaccounts?resourceVersion=0: dial tcp [::1]:443: getsockopt: connection refused
[restful] 2017/05/18 19:41:16 log.go:30: [restful/swagger] listing is available at https://204.236.223.202/swaggerapi/
[restful] 2017/05/18 19:41:16 log.go:30: [restful/swagger] https://204.236.223.202/swaggerui/ is mapped to folder /swagger-ui/
I0518 19:41:16.922771       1 serve.go:79] Serving securely on 0.0.0.0:443
I0518 19:41:16.922907       1 serve.go:94] Serving insecurely on 127.0.0.1:8080
I0518 19:41:17.863623       1 trace.go:61] Trace "Create /api/v1/namespaces/kube-system/serviceaccounts" (started 2017-05-18 19:41:17.058770376 +0000 UTC):
[39.912µs] [39.912µs] About to convert to expected version
[97.171µs] [57.259µs] Conversion done
[801.502671ms] [801.4055ms] About to store object in database
[804.695623ms] [3.192952ms] Object stored in database
[804.698656ms] [3.033µs] Self-link added
"Create /api/v1/namespaces/kube-system/serviceaccounts" [804.801614ms] [102.958µs] END
I0518 19:41:17.877272       1 trace.go:61] Trace "Create /api/v1/namespaces/default/services" (started 2017-05-18 19:41:16.958558439 +0000 UTC):
[37.004µs] [37.004µs] About to convert to expected version
[104.49µs] [67.486µs] Conversion done
[905.520287ms] [905.415797ms] About to store object in database
[918.659621ms] [13.139334ms] Object stored in database
[918.664693ms] [5.072µs] Self-link added
"Create /api/v1/namespaces/default/services" [918.690463ms] [25.77µs] END
I0518 19:41:17.937227       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/cluster-admin
I0518 19:41:17.943815       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/system:discovery
I0518 19:41:17.943925       1 trace.go:61] Trace "Create /api/v1/namespaces/kube-system/configmaps" (started 2017-05-18 19:41:16.938161583 +0000 UTC):
[25.066µs] [25.066µs] About to convert to expected version
[89.957µs] [64.891µs] Conversion done
[1.001978266s] [1.001888309s] About to store object in database
[1.005687379s] [3.709113ms] Object stored in database
[1.005692485s] [5.106µs] Self-link added
"Create /api/v1/namespaces/kube-system/configmaps" [1.00572877s] [36.285µs] END
I0518 19:41:17.950363       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/system:basic-user
I0518 19:41:17.964645       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/admin
I0518 19:41:17.970782       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/edit
I0518 19:41:17.977394       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/view
I0518 19:41:17.991109       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/system:heapster
I0518 19:41:17.996917       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/system:node
I0518 19:41:18.002949       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/system:node-problem-detector
I0518 19:41:18.011007       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/system:node-proxier
I0518 19:41:18.017204       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/system:node-bootstrapper
I0518 19:41:18.023128       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/system:auth-delegator
I0518 19:41:18.029558       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/system:kube-aggregator
I0518 19:41:18.035836       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/system:kube-controller-manager
I0518 19:41:18.042544       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/system:kube-scheduler
I0518 19:41:18.048470       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/system:kube-dns
I0518 19:41:18.054258       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/system:persistent-volume-provisioner
I0518 19:41:18.060100       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/system:controller:attachdetach-controller
I0518 19:41:18.066027       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/system:controller:cronjob-controller
I0518 19:41:18.072763       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/system:controller:daemon-set-controller
I0518 19:41:18.078582       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/system:controller:deployment-controller
I0518 19:41:18.084329       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/system:controller:disruption-controller
I0518 19:41:18.094122       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/system:controller:endpoint-controller
I0518 19:41:18.101085       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/system:controller:generic-garbage-collector
I0518 19:41:18.107027       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/system:controller:horizontal-pod-autoscaler
I0518 19:41:18.113418       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/system:controller:job-controller
I0518 19:41:18.119160       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/system:controller:namespace-controller
I0518 19:41:18.125149       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/system:controller:node-controller
I0518 19:41:18.130977       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/system:controller:persistent-volume-binder
I0518 19:41:18.136699       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/system:controller:pod-garbage-collector
I0518 19:41:18.142724       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/system:controller:replicaset-controller
I0518 19:41:18.148499       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/system:controller:replication-controller
I0518 19:41:18.154469       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/system:controller:resourcequota-controller
I0518 19:41:18.160132       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/system:controller:route-controller
I0518 19:41:18.165782       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/system:controller:service-account-controller
I0518 19:41:18.171829       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/system:controller:service-controller
I0518 19:41:18.179999       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/system:controller:statefulset-controller
I0518 19:41:18.185971       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/system:controller:ttl-controller
I0518 19:41:18.192247       1 storage_rbac.go:168] created clusterrole.rbac.authorization.k8s.io/system:controller:certificate-controller
I0518 19:41:18.198330       1 storage_rbac.go:196] created clusterrolebinding.rbac.authorization.k8s.io/cluster-admin
I0518 19:41:18.204286       1 storage_rbac.go:196] created clusterrolebinding.rbac.authorization.k8s.io/system:discovery
I0518 19:41:18.214992       1 storage_rbac.go:196] created clusterrolebinding.rbac.authorization.k8s.io/system:basic-user
I0518 19:41:18.221462       1 storage_rbac.go:196] created clusterrolebinding.rbac.authorization.k8s.io/system:node
I0518 19:41:18.227672       1 storage_rbac.go:196] created clusterrolebinding.rbac.authorization.k8s.io/system:node-proxier
I0518 19:41:18.233582       1 storage_rbac.go:196] created clusterrolebinding.rbac.authorization.k8s.io/system:kube-controller-manager
I0518 19:41:18.239339       1 storage_rbac.go:196] created clusterrolebinding.rbac.authorization.k8s.io/system:kube-dns
I0518 19:41:18.245567       1 storage_rbac.go:196] created clusterrolebinding.rbac.authorization.k8s.io/system:kube-scheduler
I0518 19:41:18.251685       1 storage_rbac.go:196] created clusterrolebinding.rbac.authorization.k8s.io/system:controller:attachdetach-controller
I0518 19:41:18.262448       1 storage_rbac.go:196] created clusterrolebinding.rbac.authorization.k8s.io/system:controller:cronjob-controller
I0518 19:41:18.271834       1 storage_rbac.go:196] created clusterrolebinding.rbac.authorization.k8s.io/system:controller:daemon-set-controller
I0518 19:41:18.278787       1 storage_rbac.go:196] created clusterrolebinding.rbac.authorization.k8s.io/system:controller:deployment-controller
I0518 19:41:18.284864       1 storage_rbac.go:196] created clusterrolebinding.rbac.authorization.k8s.io/system:controller:disruption-controller
I0518 19:41:18.290780       1 storage_rbac.go:196] created clusterrolebinding.rbac.authorization.k8s.io/system:controller:endpoint-controller
I0518 19:41:18.296790       1 storage_rbac.go:196] created clusterrolebinding.rbac.authorization.k8s.io/system:controller:generic-garbage-collector
I0518 19:41:18.303179       1 storage_rbac.go:196] created clusterrolebinding.rbac.authorization.k8s.io/system:controller:horizontal-pod-autoscaler
I0518 19:41:18.308858       1 storage_rbac.go:196] created clusterrolebinding.rbac.authorization.k8s.io/system:controller:job-controller
I0518 19:41:18.314994       1 storage_rbac.go:196] created clusterrolebinding.rbac.authorization.k8s.io/system:controller:namespace-controller
I0518 19:41:18.328195       1 storage_rbac.go:196] created clusterrolebinding.rbac.authorization.k8s.io/system:controller:node-controller
I0518 19:41:18.333903       1 storage_rbac.go:196] created clusterrolebinding.rbac.authorization.k8s.io/system:controller:persistent-volume-binder
I0518 19:41:18.367297       1 storage_rbac.go:196] created clusterrolebinding.rbac.authorization.k8s.io/system:controller:pod-garbage-collector
I0518 19:41:18.407192       1 storage_rbac.go:196] created clusterrolebinding.rbac.authorization.k8s.io/system:controller:replicaset-controller
I0518 19:41:18.448378       1 storage_rbac.go:196] created clusterrolebinding.rbac.authorization.k8s.io/system:controller:replication-controller
I0518 19:41:18.487373       1 storage_rbac.go:196] created clusterrolebinding.rbac.authorization.k8s.io/system:controller:resourcequota-controller
I0518 19:41:18.528101       1 storage_rbac.go:196] created clusterrolebinding.rbac.authorization.k8s.io/system:controller:route-controller
I0518 19:41:18.567229       1 storage_rbac.go:196] created clusterrolebinding.rbac.authorization.k8s.io/system:controller:service-account-controller
I0518 19:41:18.607665       1 storage_rbac.go:196] created clusterrolebinding.rbac.authorization.k8s.io/system:controller:service-controller
I0518 19:41:18.647557       1 storage_rbac.go:196] created clusterrolebinding.rbac.authorization.k8s.io/system:controller:statefulset-controller
I0518 19:41:18.687547       1 storage_rbac.go:196] created clusterrolebinding.rbac.authorization.k8s.io/system:controller:ttl-controller
I0518 19:41:18.727327       1 storage_rbac.go:196] created clusterrolebinding.rbac.authorization.k8s.io/system:controller:certificate-controller
I0518 19:41:18.771112       1 storage_rbac.go:227] created role.rbac.authorization.k8s.io/system:controller:bootstrap-signer in kube-public 
I0518 19:41:18.807424       1 storage_rbac.go:227] created role.rbac.authorization.k8s.io/extension-apiserver-authentication-reader in kube-system 
I0518 19:41:18.847425       1 storage_rbac.go:227] created role.rbac.authorization.k8s.io/system:controller:bootstrap-signer in kube-system 
I0518 19:41:18.887679       1 storage_rbac.go:227] created role.rbac.authorization.k8s.io/system:controller:token-cleaner in kube-system 
I0518 19:41:18.927976       1 storage_rbac.go:257] created rolebinding.rbac.authorization.k8s.io/system:controller:bootstrap-signer in kube-system
I0518 19:41:18.967479       1 storage_rbac.go:257] created rolebinding.rbac.authorization.k8s.io/system:controller:token-cleaner in kube-system
I0518 19:41:19.007231       1 storage_rbac.go:257] created rolebinding.rbac.authorization.k8s.io/system:controller:bootstrap-signer in kube-public
I0518 19:51:16.638058       1 compact.go:159] etcd: compacted rev (175), endpoints ([https://etcd-events.k8s:2379])
I0518 19:51:16.647254       1 compact.go:159] etcd: compacted rev (421), endpoints ([https://etcd.k8s:2379])
I0518 19:56:16.642789       1 compact.go:159] etcd: compacted rev (272), endpoints ([https://etcd-events.k8s:2379])
I0518 19:56:16.663078       1 compact.go:159] etcd: compacted rev (813), endpoints ([https://etcd.k8s:2379])
E0518 19:58:00.454192       1 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
I0518 20:01:16.660297       1 compact.go:159] etcd: compacted rev (344), endpoints ([https://etcd-events.k8s:2379])
I0518 20:01:16.673474       1 compact.go:159] etcd: compacted rev (1233), endpoints ([https://etcd.k8s:2379])
I0518 20:06:16.671250       1 compact.go:159] etcd: compacted rev (6057), endpoints ([https://etcd-events.k8s:2379])
I0518 20:06:16.680355       1 compact.go:159] etcd: compacted rev (4858), endpoints ([https://etcd.k8s:2379])
I0518 20:11:16.678306       1 compact.go:159] etcd: compacted rev (9157), endpoints ([https://etcd-events.k8s:2379])
I0518 20:11:16.687403       1 compact.go:159] etcd: compacted rev (9120), endpoints ([https://etcd.k8s:2379])
E0518 20:13:23.553856       1 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
I0518 20:16:16.693806       1 compact.go:159] etcd: compacted rev (9158), endpoints ([https://etcd-events.k8s:2379])
I0518 20:16:16.695677       1 compact.go:159] etcd: compacted rev (12387), endpoints ([https://etcd.k8s:2379])
I0518 20:21:16.705018       1 compact.go:159] etcd: compacted rev (11916), endpoints ([https://etcd-events.k8s:2379])
I0518 20:21:16.713789       1 compact.go:159] etcd: compacted rev (16688), endpoints ([https://etcd.k8s:2379])
I0518 20:26:16.715901       1 compact.go:159] etcd: compacted rev (25686), endpoints ([https://etcd-events.k8s:2379])
I0518 20:26:16.722670       1 compact.go:159] etcd: compacted rev (22112), endpoints ([https://etcd.k8s:2379])
E0518 20:27:09.623635       1 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
I0518 20:31:16.725031       1 compact.go:159] etcd: compacted rev (42694), endpoints ([https://etcd-events.k8s:2379])
I0518 20:31:16.730585       1 compact.go:159] etcd: compacted rev (27279), endpoints ([https://etcd.k8s:2379])

@tomdee
Contributor

tomdee commented May 19, 2017

Thanks @Capitrium - I managed to repro the problem. I've added a fix in #729

@tomdee
Contributor

tomdee commented May 19, 2017

See the flannel-git repo on quay if you want an image to try out.
