This repository has been archived by the owner on Oct 22, 2021. It is now read-only.

eirini-loggregator-bridge crashing after a while #4

Closed
f0rmiga opened this issue Jan 22, 2020 · 12 comments
Labels
bug Something isn't working

Comments

@f0rmiga
Member

f0rmiga commented Jan 22, 2020

I left a kubecf cluster running over-night and I got:

{"level":"info","ts":1579721769.5985267,"caller":"config/getter.go:53","msg":"Using in-cluster kube config"}
{"level":"info","ts":1579721769.5985787,"caller":"config/checker.go:36","msg":"Checking kube config"}
Watcher channel closed

And the number of restarts:

NAME              READY   STATUS    RESTARTS   AGE
kubecf-eirini-0   9/9     Running   19         15h

It's not clear why it crashes.

@f0rmiga added the bug label on Jan 22, 2020
@f0rmiga
Member Author

f0rmiga commented Jan 22, 2020

@mudler @jimmykarily Any ideas?

@stoyanr
Member

stoyanr commented Jan 23, 2020

I am observing the same behavior with the latest kubecf. The eirini-loggregator-bridge crashed around 90 times in 15 hours. The only difference in the logs between a running container and a previously crashed one is the line Watcher channel closed at the end. The eirini container image is docker.io/cfcontainerization/eirini:SLE_15_SP1-15.1-7.0.0_374.gb8e8e6af-0.0.25.

As a consequence of this, if I do cf logs <app> --recent, I see the same logs repeated 90 times.

@f0rmiga
Member Author

f0rmiga commented Jan 23, 2020

@stoyanr - @jimmykarily is debugging that. Thanks for confirming.

@jimmykarily
Contributor

jimmykarily commented Jan 24, 2020

I created a cluster and got the same error. For reference:

Seeing the logs 90 times is "normal": the watcher was not supposed to die, so whenever the pod starts we request all the logs from kube again and send them over to the loggregator. We can decide whether that's a good strategy or not once we find out why the watcher dies.

Initially we thought it might have something to do with inotify watcher limits because that was the problem we were facing in the past when we tried to push many apps on scf running on a local cluster. I intentionally lowered the limits by a lot to see if that would cause this error. E.g.

sudo sysctl fs.inotify.max_user_instances=20
sudo sysctl fs.inotify.max_user_watches=20

These made the bridge fail, but with different errors:

failed to create fsnotify watcher: too many open files

and

failed to watch file "/var/log/pods/kubecf_kubecf-eirini-0_c4004c18-382c-4e7d-8a53-fac163f3cc69/eirini-loggregator-bridge-eirini-loggregator-bridge/3.log": no space left on device

So afaict it is not the limits that are causing the issue. I also read this issue: kubernetes/client-go#547

Looking at our code (https://github.com/SUSE/eirinix/blob/master/manager.go#L428), we make no effort to reconnect to the channel if it's closed (we just return an error: https://github.com/SUSE/eirinix/blob/master/manager.go#L448).

That means that if the connection times out for some reason, the channel is closed and we exit, although reconnecting to the channel could be an option. The question is: why is the connection lost, and why wasn't this happening on scf?

@mudler should we try to add a retry block and see how that behaves?
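For reference, a minimal sketch of what such a retry block could look like (this is not the bridge's actual code; the function and handler names are illustrative and the Watch signature follows current client-go):

package bridge

import (
	"context"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	corev1client "k8s.io/client-go/kubernetes/typed/core/v1"
)

// watchPodsWithRetry keeps the watch alive: when the result channel closes
// (e.g. because the API server ends the watch), it re-establishes the watch
// instead of returning an error and letting the process exit.
func watchPodsWithRetry(ctx context.Context, pods corev1client.PodInterface, handle func(watch.Event)) error {
	for {
		w, err := pods.Watch(ctx, metav1.ListOptions{})
		if err != nil {
			log.Printf("watch failed, retrying in 5s: %v", err)
			select {
			case <-ctx.Done():
				return ctx.Err()
			case <-time.After(5 * time.Second):
			}
			continue
		}
		for ev := range w.ResultChan() {
			handle(ev)
		}
		// The channel was closed server-side; log and re-watch instead of exiting.
		log.Println("Watcher channel closed, re-establishing watch")
		if ctx.Err() != nil {
			return ctx.Err()
		}
	}
}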

@jimmykarily
Contributor

Or we can increase the watcher timeout to see if it helps: https://github.com/kubernetes/client-go/blob/master/kubernetes/typed/core/v1/pod.go#L102

btw this issue could be a reason why we get timeouts here and not on (previous version of) scf: cloudfoundry/loggregator-release#401 (too many requests flooding the network or something like that?)
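A minimal sketch of the timeout idea, assuming the watch is built through ListOptions (TimeoutSeconds is the field the generated pod.go code linked above honours); the one-hour value is purely illustrative:

package bridge

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	corev1client "k8s.io/client-go/kubernetes/typed/core/v1"
)

func watchWithLongerTimeout(ctx context.Context, pods corev1client.PodInterface) (watch.Interface, error) {
	// Ask the API server to keep the watch open for up to an hour before it
	// closes the connection on its side.
	timeout := int64(3600)
	return pods.Watch(ctx, metav1.ListOptions{TimeoutSeconds: &timeout})
}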

@jimmykarily
Contributor

We now return a special error when the channel closes (cloudfoundry-incubator/eirinix#19) and we are planning to consume it like this: a849314

Now we need some time to test with these changes and see whether they fix the problem or create new ones.
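Roughly, the consuming side would then look like the sketch below. ErrWatcherChannelClosed, the manager interface and runWatcher are placeholders for whatever eirinix#19 and the commit above actually provide, not the real API:

package bridge

import (
	"errors"
	"log"
)

// Hypothetical placeholders: the real sentinel error and manager type come from eirinix.
var ErrWatcherChannelClosed = errors.New("watcher channel closed")

type manager interface{ Watch() error }

// runWatcher restarts the watch whenever the sentinel error is returned,
// instead of exiting and relying on kube/monit to restart the process.
func runWatcher(mgr manager) error {
	for {
		err := mgr.Watch()
		if errors.Is(err, ErrWatcherChannelClosed) {
			log.Println("watcher channel closed, restarting the watch")
			continue
		}
		return err // nil or any other error ends the loop
	}
}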

@mudler
Member

mudler commented Jan 28, 2020

I left a kubecf cluster running over-night and I got:

{"level":"info","ts":1579721769.5985267,"caller":"config/getter.go:53","msg":"Using in-cluster kube config"}
{"level":"info","ts":1579721769.5985787,"caller":"config/checker.go:36","msg":"Checking kube config"}
Watcher channel closed

And the number of restarts:

NAME              READY   STATUS    RESTARTS   AGE
kubecf-eirini-0   9/9     Running   19         15h

It's not clear why it crashes.

Did this happen on SCF as well? I haven't observed this behavior on SCF with a cluster left running overnight. I'll check whether it is reproducible there.

@f0rmiga
Member Author

f0rmiga commented Jan 28, 2020

@jandubois Have you noticed this behaviour with SCF?

@jandubois
Member

@jandubois Have you noticed this behaviour with SCF?

@f0rmiga I have no idea; I only ever deploy Eirini for some quick tests.

@mudler
Member

mudler commented Jan 29, 2020

I can confirm this applies to SCF as well. I was able to reproduce it with the latest public release by leaving the cluster running overnight. The problem is hidden because monit silently restarts the process, so no pod restarts can be observed.

eirini/0:/var/vcap/sys# cat log/eirini-loggregator-bridge/eirini-loggregator-bridge.stdout.log
Watcher channel closed
Watcher channel closed
Watcher channel closed
Watcher channel closed
Watcher channel closed
Watcher channel closed
Watcher channel closed
Watcher channel closed
Watcher channel closed
Watcher channel closed
Watcher channel closed
Watcher channel closed
Watcher channel closed
Watcher channel closed
Watcher channel closed
Watcher channel closed
Watcher channel closed
Watcher channel closed
Watcher channel closed
Watcher channel closed
Watcher channel closed
Watcher channel closed

The process list (right before deploying):

$> ps aux
root         1  0.0  0.0    244    48 ?        Ss   14:03   0:00 /usr/bin/dumb-init /opt/fissile/run.sh
root         7  0.0  0.0  23652  6532 ?        Ss   14:03   0:00 /bin/bash /opt/fissile/run.sh
root       306  0.0  0.0  93572  2700 ?        Sl   14:07   0:03 monit -I
root       441  0.0  0.0  26832  2788 ?        S    14:07   0:00 cron -n
syslog     457  0.0  0.0 290916  5820 ?        Sl   14:07   0:00 /usr/sbin/rsyslogd -i /var/run/rsyslogd.monit.pid -n
vcap      1082  0.0  0.0   4200   744 ?        Ss   14:12   0:00 /var/vcap/packages/bpm/bin/tini -w -s -- /var/vcap/packages/eirini/bin/opi connect --config=/var/vcap/jobs/opi/config/opi.yml
vcap      1090  0.3  0.2 781748 40196 ?        Sl   14:12   0:18 /var/vcap/packages/eirini/bin/opi connect --config=/var/vcap/jobs/opi/config/opi.yml
vcap      3174  0.0  0.0   4200   744 ?        Ss   14:45   0:00 /var/vcap/packages/bpm/bin/tini -w -s -- /var/vcap/packages/eirini-loggregator-bridge/bin/eirini-loggregator-bridge --config /var/vcap/jobs/eirini-loggregator-bridge/confi
vcap      3183  0.7  0.2 704916 39476 ?        Sl   14:45   0:19 /var/vcap/packages/eirini-loggregator-bridge/bin/eirini-loggregator-bridge --config /var/vcap/jobs/eirini-loggregator-bridge/config/eirini-loggregator-bridge.yaml
root      5022  0.0  0.0  24452  7524 pts/0    Ss   15:15   0:00 bash
root      5331  0.0  0.0   4288   820 pts/0    S+   15:19   0:00 tail -f log/eirini-loggregator-bridge/bpm.log log/eirini-loggregator-bridge/eirini-loggregator-bridge.stderr.log log/eirini-loggregator-bridge/eirini-loggregator-bridge.st
root      5722  0.2  0.0  24452  7556 pts/1    Ss   15:26   0:00 bash
root      5898  0.0  0.0  12308  3632 pts/1    S+   15:27   0:00 watch ps aux
root      6148  0.0  0.0  12304  1776 pts/1    S+   15:29   0:00 watch ps aux
root      6149  4.3  0.0  41464  3376 pts/1    R+   15:29   0:00 ps aux

After a while:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0    244    48 ?        Ss   Jan28   0:00 /usr/bin/dumb-init /opt/fissile/run.sh
root         7  0.0  0.0  23652  6532 ?        Ss   Jan28   0:00 /bin/bash /opt/fissile/run.sh
root       306  0.0  0.0  93572  2704 ?        Sl   Jan28   0:34 monit -I
root       441  0.0  0.0  26832  2788 ?        S    Jan28   0:00 cron -n
syslog     457  0.0  0.0 290916  5820 ?        Sl   Jan28   0:00 /usr/sbin/rsyslogd -i /var/run/rsyslogd.monit.pid -n
vcap      1082  0.0  0.0   4200   744 ?        Ss   Jan28   0:01 /var/vcap/packages/bpm/bin/tini -w -s -- /var/vcap/packages/eirini/bin/opi connect --config=/var/vcap/jobs/opi/config/opi.yml
vcap      1090  0.4  0.2 781748 42572 ?        Sl   Jan28   4:48 /var/vcap/packages/eirini/bin/opi connect --config=/var/vcap/jobs/opi/config/opi.yml
root      5022  0.0  0.0  24452  7524 pts/0    Ss   Jan28   0:00 bash
root      5331  0.0  0.0   4288   820 pts/0    S+   Jan28   0:01 tail -f log/eirini-loggregator-bridge/bpm.log log/eirini-loggregator-bridge/eirini-loggregator-bridge.stderr.log log/eirini-loggregator-bridge/eirini-loggregator-bridge.st
root      5722  0.0  0.0  24452  7556 pts/1    Ss   Jan28   0:00 bash
root      5898  0.0  0.0  12328  3644 pts/1    S+   Jan28   0:46 watch ps aux
vcap     13693  0.0  0.0   4200   776 ?        Ss   06:21   0:00 /var/vcap/packages/bpm/bin/tini -w -s -- /var/vcap/packages/eirini-loggregator-bridge/bin/eirini-loggregator-bridge --config /var/vcap/jobs/eirini-loggregator-bridge/confi
vcap     13703  1.6  0.3 598624 57336 ?        Sl   06:22   0:32 /var/vcap/packages/eirini-loggregator-bridge/bin/eirini-loggregator-bridge --config /var/vcap/jobs/eirini-loggregator-bridge/config/eirini-loggregator-bridge.yaml
root     17404  0.0  0.0  12324  1788 pts/1    S+   06:53   0:00 watch ps aux
root     17405 50.0  0.0  41464  3408 pts/1    R+   06:53   0:00 ps aux

@mudler
Member

mudler commented Jan 29, 2020

It seems to be a known problem with watchers. I see there are already helpers to work around the issue: https://github.com/kubernetes/client-go/blob/master/tools/watch/retrywatcher.go#L59. We should probably consume those (or provide a way to opt in) in EiriniX instead.
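A minimal sketch of what consuming RetryWatcher could look like (assumptions: a *kubernetes.Clientset, a namespace and an initial resourceVersion from a prior List are available from the bridge's config; names are illustrative and signatures follow current client-go):

package bridge

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	watchtools "k8s.io/client-go/tools/watch"
)

func watchPods(clientset *kubernetes.Clientset, namespace, resourceVersion string) error {
	// cache.ListWatch adapts the typed client to the cache.Watcher interface
	// that NewRetryWatcher expects.
	lw := &cache.ListWatch{
		WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
			return clientset.CoreV1().Pods(namespace).Watch(context.TODO(), options)
		},
	}
	// NewRetryWatcher needs a concrete resourceVersion (e.g. from a previous
	// List call); it rejects "" and "0".
	rw, err := watchtools.NewRetryWatcher(resourceVersion, lw)
	if err != nil {
		return err
	}
	defer rw.Stop()
	// RetryWatcher transparently re-establishes the underlying watch when the
	// server closes it, so this channel only closes on Stop() or a fatal error.
	for ev := range rw.ResultChan() {
		log.Printf("event: %s", ev.Type)
	}
	return nil
}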

@mudler
Member

mudler commented Mar 25, 2020

With the patches linked above, I don't see pod restarts anymore, but there is #6 to take care of (discovered in the meantime while testing):

susecf-scf-eirini-0                    9/9     Running   0          22h

I'm closing this issue for now, but we have to make sure we tag and consume a new version of the eirini-bosh-release in KubeCF (that's another story).
