This repository has been archived by the owner on Oct 22, 2021. It is now read-only.

eirini-loggregator-bridge crashing after a while #4

Closed
f0rmiga opened this issue Jan 22, 2020 · 12 comments
Labels
bug Something isn't working

Comments

@f0rmiga
Member

f0rmiga commented Jan 22, 2020

I left a kubecf cluster running over-night and I got:

{"level":"info","ts":1579721769.5985267,"caller":"config/getter.go:53","msg":"Using in-cluster kube config"}
{"level":"info","ts":1579721769.5985787,"caller":"config/checker.go:36","msg":"Checking kube config"}
Watcher channel closed

And the number of restarts:

NAME              READY   STATUS    RESTARTS   AGE
kubecf-eirini-0   9/9     Running   19         15h

It's not clear why it crashes.

@f0rmiga added the bug label on Jan 22, 2020
@f0rmiga
Member Author

f0rmiga commented Jan 22, 2020

@mudler @jimmykarily Any ideas?

@stoyanr
Member

stoyanr commented Jan 23, 2020

I am observing the same behavior with the latest kubecf. The eirini-loggregator-bridge crashed around 90 times in 15 hours. The only difference in the logs between a running container and a previously crashed one is the line Watcher channel closed at the end. The eirini container image is docker.io/cfcontainerization/eirini:SLE_15_SP1-15.1-7.0.0_374.gb8e8e6af-0.0.25.

As a consequence of this, if I do cf logs <app> --recent, I see the same logs repeated 90 times.

@f0rmiga
Member Author

f0rmiga commented Jan 23, 2020

@stoyanr - @jimmykarily is debugging that. Thanks for confirming.

@jimmykarily
Contributor

jimmykarily commented Jan 24, 2020

I created a cluster and got the same error. For reference:

Seeing the logs 90 times is "normal": the watcher was not supposed to die, so whenever the pod starts we request all the logs from kube again and send them over to the loggregator. We can decide whether that's a good strategy or not once we find out why the watcher dies.

Initially we thought it might have something to do with inotify watcher limits because that was the problem we were facing in the past when we tried to push many apps on scf running on a local cluster. I intentionally lowered the limits by a lot to see if that would cause this error. E.g.

sudo sysctl fs.inotify.max_user_instances=20
sudo sysctl fs.inotify.max_user_watches=20

These made the bridge fail, but with different errors:

failed to create fsnotify watcher: too many open files

and

failed to watch file "/var/log/pods/kubecf_kubecf-eirini-0_c4004c18-382c-4e7d-8a53-fac163f3cc69/eirini-loggregator-bridge-eirini-loggregator-bridge/3.log": no space left on device

So afaict it is not the limits that are causing the issue. I also read this issue: kubernetes/client-go#547

Looking at our code (https://github.com/SUSE/eirinix/blob/master/manager.go#L428), we make no effort to reconnect to the channel if it's closed (we just return an error: https://github.com/SUSE/eirinix/blob/master/manager.go#L448).

That means that if the connection times out for some reason, the channel is closed and we exit, although reconnecting to the channel could be an option. The question is: why is the connection lost, and why wasn't this happening on scf?

@mudler should we try to add a retry block and see how that behaves?
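For reference, a minimal sketch of what such a retry block could look like (this is not the bridge's actual code; the function and handler names are illustrative and the Watch signature follows current client-go):

package bridge

import (
	"context"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	corev1client "k8s.io/client-go/kubernetes/typed/core/v1"
)

// watchPodsWithRetry keeps the watch alive: when the result channel closes
// (e.g. because the API server ends the watch), it re-establishes the watch
// instead of returning an error and letting the process exit.
func watchPodsWithRetry(ctx context.Context, pods corev1client.PodInterface, handle func(watch.Event)) error {
	for {
		w, err := pods.Watch(ctx, metav1.ListOptions{})
		if err != nil {
			log.Printf("watch failed, retrying in 5s: %v", err)
			select {
			case <-ctx.Done():
				return ctx.Err()
			case <-time.After(5 * time.Second):
			}
			continue
		}
		for ev := range w.ResultChan() {
			handle(ev)
		}
		// The channel was closed server-side; log and re-watch instead of exiting.
		log.Println("Watcher channel closed, re-establishing watch")
		if ctx.Err() != nil {
			return ctx.Err()
		}
	}
}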

@jimmykarily
Contributor

Or we can increase the watcher timeout to see if it helps: https://github.com/kubernetes/client-go/blob/master/kubernetes/typed/core/v1/pod.go#L102

btw this issue could be a reason why we get timeouts here and not on (previous version of) scf: cloudfoundry/loggregator-release#401 (too many requests flooding the network or something like that?)
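A minimal sketch of the timeout idea, assuming the watch is built through ListOptions (TimeoutSeconds is the field the generated pod.go code linked above honours); the one-hour value is purely illustrative:

package bridge

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	corev1client "k8s.io/client-go/kubernetes/typed/core/v1"
)

func watchWithLongerTimeout(ctx context.Context, pods corev1client.PodInterface) (watch.Interface, error) {
	// Ask the API server to keep the watch open for up to an hour before it
	// closes the connection on its side.
	timeout := int64(3600)
	return pods.Watch(ctx, metav1.ListOptions{TimeoutSeconds: &timeout})
}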

@jimmykarily
Contributor

We now return a special error when the channel closes (cloudfoundry-incubator/eirinix#19) and we are planning to consume it like this: a849314

Now we need some time to test with these changes and see whether they fix the problem or create new ones.
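Roughly, the consuming side would then look like the sketch below. ErrWatcherChannelClosed, the manager interface and runWatcher are placeholders for whatever eirinix#19 and the commit above actually provide, not the real API:

package bridge

import (
	"errors"
	"log"
)

// Hypothetical placeholders: the real sentinel error and manager type come from eirinix.
var ErrWatcherChannelClosed = errors.New("watcher channel closed")

type manager interface{ Watch() error }

// runWatcher restarts the watch whenever the sentinel error is returned,
// instead of exiting and relying on kube/monit to restart the process.
func runWatcher(mgr manager) error {
	for {
		err := mgr.Watch()
		if errors.Is(err, ErrWatcherChannelClosed) {
			log.Println("watcher channel closed, restarting the watch")
			continue
		}
		return err // nil or any other error ends the loop
	}
}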

@mudler
Member

mudler commented Jan 28, 2020

I left a kubecf cluster running over-night and I got:

{"level":"info","ts":1579721769.5985267,"caller":"config/getter.go:53","msg":"Using in-cluster kube config"}
{"level":"info","ts":1579721769.5985787,"caller":"config/checker.go:36","msg":"Checking kube config"}
Watcher channel closed

And the number of restarts:

NAME              READY   STATUS    RESTARTS   AGE
kubecf-eirini-0   9/9     Running   19         15h

It's not clear why it crashes.

Did this happen on SCF as well? I haven't observed this behavior on SCF with a cluster left running overnight. I'll check whether it is reproducible there.

@f0rmiga
Member Author

f0rmiga commented Jan 28, 2020

@jandubois Have you noticed this behaviour with SCF?

@jandubois
Member

@jandubois Have you noticed this behaviour with SCF?

@f0rmiga I have no idea; I only ever deploy Eirini for some quick tests.

@mudler
Member

mudler commented Jan 29, 2020

I can confirm this applies to SCF as well. I was able to reproduce it with the latest public release by leaving the cluster running overnight. The problem is hidden because monit silently restarts the process, so no pod restarts can be observed.

eirini/0:/var/vcap/sys# cat log/eirini-loggregator-bridge/eirini-loggregator-bridge.stdout.log
Watcher channel closed
Watcher channel closed
Watcher channel closed
Watcher channel closed
Watcher channel closed
Watcher channel closed
Watcher channel closed
Watcher channel closed
Watcher channel closed
Watcher channel closed
Watcher channel closed
Watcher channel closed
Watcher channel closed
Watcher channel closed
Watcher channel closed
Watcher channel closed
Watcher channel closed
Watcher channel closed
Watcher channel closed
Watcher channel closed
Watcher channel closed
Watcher channel closed

The process list (right before deploying):

$> ps aux
root         1  0.0  0.0    244    48 ?        Ss   14:03   0:00 /usr/bin/dumb-init /opt/fissile/run.sh
root         7  0.0  0.0  23652  6532 ?        Ss   14:03   0:00 /bin/bash /opt/fissile/run.sh
root       306  0.0  0.0  93572  2700 ?        Sl   14:07   0:03 monit -I
root       441  0.0  0.0  26832  2788 ?        S    14:07   0:00 cron -n
syslog     457  0.0  0.0 290916  5820 ?        Sl   14:07   0:00 /usr/sbin/rsyslogd -i /var/run/rsyslogd.monit.pid -n
vcap      1082  0.0  0.0   4200   744 ?        Ss   14:12   0:00 /var/vcap/packages/bpm/bin/tini -w -s -- /var/vcap/packages/eirini/bin/opi connect --config=/var/vcap/jobs/opi/config/opi.yml
vcap      1090  0.3  0.2 781748 40196 ?        Sl   14:12   0:18 /var/vcap/packages/eirini/bin/opi connect --config=/var/vcap/jobs/opi/config/opi.yml
vcap      3174  0.0  0.0   4200   744 ?        Ss   14:45   0:00 /var/vcap/packages/bpm/bin/tini -w -s -- /var/vcap/packages/eirini-loggregator-bridge/bin/eirini-loggregator-bridge --config /var/vcap/jobs/eirini-loggregator-bridge/confi
vcap      3183  0.7  0.2 704916 39476 ?        Sl   14:45   0:19 /var/vcap/packages/eirini-loggregator-bridge/bin/eirini-loggregator-bridge --config /var/vcap/jobs/eirini-loggregator-bridge/config/eirini-loggregator-bridge.yaml
root      5022  0.0  0.0  24452  7524 pts/0    Ss   15:15   0:00 bash
root      5331  0.0  0.0   4288   820 pts/0    S+   15:19   0:00 tail -f log/eirini-loggregator-bridge/bpm.log log/eirini-loggregator-bridge/eirini-loggregator-bridge.stderr.log log/eirini-loggregator-bridge/eirini-loggregator-bridge.st
root      5722  0.2  0.0  24452  7556 pts/1    Ss   15:26   0:00 bash
root      5898  0.0  0.0  12308  3632 pts/1    S+   15:27   0:00 watch ps aux
root      6148  0.0  0.0  12304  1776 pts/1    S+   15:29   0:00 watch ps aux
root      6149  4.3  0.0  41464  3376 pts/1    R+   15:29   0:00 ps aux

After a while:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0    244    48 ?        Ss   Jan28   0:00 /usr/bin/dumb-init /opt/fissile/run.sh
root         7  0.0  0.0  23652  6532 ?        Ss   Jan28   0:00 /bin/bash /opt/fissile/run.sh
root       306  0.0  0.0  93572  2704 ?        Sl   Jan28   0:34 monit -I
root       441  0.0  0.0  26832  2788 ?        S    Jan28   0:00 cron -n
syslog     457  0.0  0.0 290916  5820 ?        Sl   Jan28   0:00 /usr/sbin/rsyslogd -i /var/run/rsyslogd.monit.pid -n
vcap      1082  0.0  0.0   4200   744 ?        Ss   Jan28   0:01 /var/vcap/packages/bpm/bin/tini -w -s -- /var/vcap/packages/eirini/bin/opi connect --config=/var/vcap/jobs/opi/config/opi.yml
vcap      1090  0.4  0.2 781748 42572 ?        Sl   Jan28   4:48 /var/vcap/packages/eirini/bin/opi connect --config=/var/vcap/jobs/opi/config/opi.yml
root      5022  0.0  0.0  24452  7524 pts/0    Ss   Jan28   0:00 bash
root      5331  0.0  0.0   4288   820 pts/0    S+   Jan28   0:01 tail -f log/eirini-loggregator-bridge/bpm.log log/eirini-loggregator-bridge/eirini-loggregator-bridge.stderr.log log/eirini-loggregator-bridge/eirini-loggregator-bridge.st
root      5722  0.0  0.0  24452  7556 pts/1    Ss   Jan28   0:00 bash
root      5898  0.0  0.0  12328  3644 pts/1    S+   Jan28   0:46 watch ps aux
vcap     13693  0.0  0.0   4200   776 ?        Ss   06:21   0:00 /var/vcap/packages/bpm/bin/tini -w -s -- /var/vcap/packages/eirini-loggregator-bridge/bin/eirini-loggregator-bridge --config /var/vcap/jobs/eirini-loggregator-bridge/confi
vcap     13703  1.6  0.3 598624 57336 ?        Sl   06:22   0:32 /var/vcap/packages/eirini-loggregator-bridge/bin/eirini-loggregator-bridge --config /var/vcap/jobs/eirini-loggregator-bridge/config/eirini-loggregator-bridge.yaml
root     17404  0.0  0.0  12324  1788 pts/1    S+   06:53   0:00 watch ps aux
root     17405 50.0  0.0  41464  3408 pts/1    R+   06:53   0:00 ps aux

@mudler
Member

mudler commented Jan 29, 2020

It seems to be a known problem with watchers. I see there are already helpers to work around the issue: https://github.com/kubernetes/client-go/blob/master/tools/watch/retrywatcher.go#L59. We should probably consume those (or provide a way to opt in) in EiriniX instead.
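A minimal sketch of what consuming RetryWatcher could look like (assumptions: a *kubernetes.Clientset, a namespace and an initial resourceVersion from a prior List are available from the bridge's config; names are illustrative and signatures follow current client-go):

package bridge

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	watchtools "k8s.io/client-go/tools/watch"
)

func watchPods(clientset *kubernetes.Clientset, namespace, resourceVersion string) error {
	// cache.ListWatch adapts the typed client to the cache.Watcher interface
	// that NewRetryWatcher expects.
	lw := &cache.ListWatch{
		WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
			return clientset.CoreV1().Pods(namespace).Watch(context.TODO(), options)
		},
	}
	// NewRetryWatcher needs a concrete resourceVersion (e.g. from a previous
	// List call); it rejects "" and "0".
	rw, err := watchtools.NewRetryWatcher(resourceVersion, lw)
	if err != nil {
		return err
	}
	defer rw.Stop()
	// RetryWatcher transparently re-establishes the underlying watch when the
	// server closes it, so this channel only closes on Stop() or a fatal error.
	for ev := range rw.ResultChan() {
		log.Printf("event: %s", ev.Type)
	}
	return nil
}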

@mudler
Member

mudler commented Mar 25, 2020

With the patches linked above, I don't see pod restarts anymore, but there is #6 to take care of (discovered in the meantime while testing):

susecf-scf-eirini-0                    9/9     Running   0          22h

I'm closing this issue for now, but we have to make sure we tag and consume a new version of the eirini-bosh-release in KubeCF (that's another story).
