eirini-loggregator-bridge crashing after a while #4
@mudler @jimmykarily Any ideas?
I am observing the same behavior with the latest kubecf. As a consequence of this, if I do ...
@stoyanr - @jimmykarily is debugging that. Thanks for confirming.
I created a cluster and got that error. For reference:
Seeing the logs 90 times is "normal". It's just that the watcher was not supposed to die, so whenever the pod starts we request all the logs from kube and send them over to the loggregator. We can decide whether that's a good strategy or not once we find out why the watcher dies. Initially we thought it might have something to do with inotify watcher limits, because that was the problem we were facing in the past when we tried to push many apps on scf running on a local cluster. I intentionally lowered the limits by a lot to see if that would cause this error. E.g.
These made the bridge fail, but with different errors:
and
So, AFAICT, it is not the limits that are causing the issue. I also read this issue: kubernetes/client-go#547, and looking at our code (https://github.com/SUSE/eirinix/blob/master/manager.go#L428) we make no effort to reconnect to the channel if it's closed (we just return an error: https://github.com/SUSE/eirinix/blob/master/manager.go#L448). That means that if for some reason the connection times out, the channel is closed and we exit, although reconnecting could be an option. The question is why the connection is lost, and why this wasn't happening on scf. @mudler should we try to add a retry block and see how that behaves?
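A minimal sketch of what such a retry block could look like, assuming a plain client-go typed pod watch (function and namespace names are illustrative, not the actual eirinix code):

```go
package watchretry

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// watchPodsWithRetry re-establishes the pod watch whenever the result
// channel is closed (e.g. on a server-side timeout) instead of exiting.
func watchPodsWithRetry(ctx context.Context, client kubernetes.Interface, namespace string) error {
	for {
		w, err := client.CoreV1().Pods(namespace).Watch(ctx, metav1.ListOptions{})
		if err != nil {
			return fmt.Errorf("starting pod watch: %w", err)
		}

		// Drain events until the server closes the channel.
		for ev := range w.ResultChan() {
			fmt.Printf("event: %s\n", ev.Type)
		}

		// Channel closed: back off briefly and reconnect rather than returning an error.
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(5 * time.Second):
		}
	}
}
```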
Or we can increase the watcher timeout to see if it helps: https://github.com/kubernetes/client-go/blob/master/kubernetes/typed/core/v1/pod.go#L102. Btw, this issue could be a reason why we get timeouts here and not on (a previous version of) scf: cloudfoundry/loggregator-release#401 (too many requests flooding the network, or something like that?)
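That timeout is the one driven by ListOptions.TimeoutSeconds when the watch is started. A minimal sketch of raising it, assuming the typed client-go pod API (the value is arbitrary; the server will still close the watch eventually, so a reconnect path is needed regardless):

```go
package watchretry

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// startLongWatch starts a pod watch with a longer server-side timeout.
// 3600 is an arbitrary illustrative value; callers still need to handle
// the result channel being closed when the timeout expires.
func startLongWatch(ctx context.Context, client kubernetes.Interface, namespace string) (watch.Interface, error) {
	timeout := int64(3600)
	return client.CoreV1().Pods(namespace).Watch(ctx, metav1.ListOptions{
		TimeoutSeconds: &timeout,
	})
}
```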
We now return a special error when the channel closes (cloudfoundry-incubator/eirinix#19) and we plan to consume it like this: a849314. Now we need some time to test with these changes and see whether this fixes the problem or creates more.
Did it happen on SCF as well? I didn't observe this behavior on SCF with a cluster left running overnight. I'll check whether it is reproducible there.
@jandubois Have you noticed this behaviour with SCF?
@f0rmiga I have no idea; I only ever deploy Eirini for some quick tests.
I can confirm this applies to SCF as well. I was able to reproduce it with the latest public release by leaving the cluster running overnight. The problem is hidden because monit silently restarts the process, so no pod restart can be observed.
The process list (right before deploying):
After a while:
It seems to be a known problem with watchers - there are already helpers in client-go to work around the issue: https://github.com/kubernetes/client-go/blob/master/tools/watch/retrywatcher.go#L59. Probably we should consume those (or offer a way to opt in) in EiriniX instead.
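A rough sketch of what consuming that helper could look like, assuming a plain pod ListWatch (names are illustrative, not the actual EiriniX integration):

```go
package watchretry

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	watchtools "k8s.io/client-go/tools/watch"
)

// watchPodsResiliently uses client-go's RetryWatcher, which re-establishes
// the watch from the last seen resourceVersion on transient failures instead
// of closing the channel. resourceVersion must come from a previous List
// call ("" or "0" are rejected by NewRetryWatcher).
func watchPodsResiliently(client kubernetes.Interface, namespace, resourceVersion string) error {
	lw := cache.NewListWatchFromClient(
		client.CoreV1().RESTClient(), "pods", namespace, fields.Everything())

	rw, err := watchtools.NewRetryWatcher(resourceVersion, lw)
	if err != nil {
		return err
	}
	defer rw.Stop()

	for ev := range rw.ResultChan() {
		if pod, ok := ev.Object.(*corev1.Pod); ok {
			fmt.Printf("%s: %s/%s\n", ev.Type, pod.Namespace, pod.Name)
		}
	}
	return nil
}
```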
With the patches linked above, I don't see any more pod restarts, but there is #6 to take care of (discovered while testing).
I'm closing this issue for now, but we have to make sure we tag and consume a new version of the eirini-bosh-release in KubeCF (that's another story).
I left a kubecf cluster running overnight and I got:
And the number of restarts:
It's not clear why it crashes.