Bump eirini-loggregator-bridge and eirini-ssh #74

Merged
merged 2 commits into from
Mar 24, 2020


mudler
Collaborator

@mudler mudler commented Feb 4, 2020

Both components consume the new version of EiriniX, which includes the RetryWatcher implementation instead of plain Kubernetes watchers.

See issue cloudfoundry-incubator/eirini-loggregator-bridge#4 and EiriniX PR: cloudfoundry-incubator/eirinix#21
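
For readers unfamiliar with the pattern: a retry watcher re-opens the watch when the underlying stream ends (for example on a timeout) and resumes from the last resource version it saw. The following is a minimal Python sketch of that general idea only. It is not the EiriniX or client-go code, and `retry_watch`, `make_flaky_source`, and the event shape are hypothetical names chosen for illustration.

```python
from typing import Callable, Iterator, List, Tuple

# Hypothetical event shape: (resource_version, payload).
Event = Tuple[int, str]


def retry_watch(
    open_watch: Callable[[int], Iterator[Event]],
    start_version: int = 0,
    max_restarts: int = 3,
) -> Iterator[Event]:
    """Yield events, re-opening the watch each time the stream ends.

    Each new stream is opened from the last version already delivered,
    so the consumer does not notice the restarts.
    """
    last_seen = start_version
    for _ in range(max_restarts + 1):
        for version, payload in open_watch(last_seen):
            last_seen = version
            yield version, payload


def make_flaky_source(events: List[Event], per_stream: int = 2):
    """Simulated watch source whose streams close after `per_stream` events."""
    def open_watch(after_version: int) -> Iterator[Event]:
        pending = [e for e in events if e[0] > after_version]
        return iter(pending[:per_stream])
    return open_watch


EVENTS = [(i, f"event-{i}") for i in range(1, 7)]
received = list(retry_watch(make_flaky_source(EVENTS), 0, max_restarts=3))
versions = [version for version, _ in received]
print(versions)
```

In this simulation every stream closes early, yet no event is skipped, because each restart resumes from `last_seen`. Whether the missing log lines reported in this thread are related to how the real watcher resumes is not established here.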

@mudler
Collaborator Author

mudler commented Mar 24, 2020

I've tested the bumps locally (note: the 0.99 version refers to this one 😃), which includes the bumped components at their latest versions:

  eirini-loggregator-bridge-eirini-loggregator-bridge:
    Container ID:  containerd://a750b8d888f79a05022d2c48163107a595602934b1617d004ccb00684fa4a414
    Image:         docker.io/cfcontainerization/eirini:SLE_15_SP1-15.1-7.0.0_374.gb8e8e6af-0.99
    Image ID:      sha256:cee3030ee2c86798501e4fac5bae0c4a3b3d0cfc2027a6aa34b98ff6b041cdf1

There seems to be another issue that I think is not related to the EiriniX bump (which includes the RetryWatcher mechanism) but is probably related to the kernel watcher limits. I pushed the sample-ticking app and scaled it to 10 instances:

   2020-03-24T12:27:06.00+0000 [APP/PROC/WEB/6] OUT [0e7a5668-2a3c-4650-ae06-737b6d42a877:6] Ticking 2020-03-24 12:27:06
   2020-03-24T12:27:07.00+0000 [APP/PROC/WEB/4] OUT [267fd49a-a829-448f-b5fd-00387263e940:4] Ticking 2020-03-24 12:27:07                                                                     
   2020-03-24T12:27:07.00+0000 [APP/PROC/WEB/5] OUT [bca85f4d-a745-4539-a13f-297136c703fe:5] Ticking 2020-03-24 12:27:07                                                                     
   2020-03-24T12:27:08.00+0000 [APP/PROC/WEB/0] OUT [695cfbda-e685-415d-b304-8a43f0877904:0] Ticking 2020-03-24 12:27:08                                                                     
   2020-03-24T12:27:08.00+0000 [APP/PROC/WEB/6] OUT [0e7a5668-2a3c-4650-ae06-737b6d42a877:6] Ticking 2020-03-24 12:27:08                                                                     
   2020-03-24T12:27:09.00+0000 [APP/PROC/WEB/5] OUT [bca85f4d-a745-4539-a13f-297136c703fe:5] Ticking 2020-03-24 12:27:09                                                                     
   2020-03-24T12:27:09.00+0000 [APP/PROC/WEB/4] OUT [267fd49a-a829-448f-b5fd-00387263e940:4] Ticking 2020-03-24 12:27:09                                                                     
   2020-03-24T12:27:10.00+0000 [APP/PROC/WEB/0] OUT [695cfbda-e685-415d-b304-8a43f0877904:0] Ticking 2020-03-24 12:27:10                                                                     
   2020-03-24T12:27:10.00+0000 [APP/PROC/WEB/6] OUT [0e7a5668-2a3c-4650-ae06-737b6d42a877:6] Ticking 2020-03-24 12:27:10                                                                     
   2020-03-24T12:27:11.00+0000 [APP/PROC/WEB/4] OUT [267fd49a-a829-448f-b5fd-00387263e940:4] Ticking 2020-03-24 12:27:11                                                                     
   2020-03-24T12:27:11.00+0000 [APP/PROC/WEB/5] OUT [bca85f4d-a745-4539-a13f-297136c703fe:5] Ticking 2020-03-24 12:27:11                                                                     
   2020-03-24T12:27:12.00+0000 [APP/PROC/WEB/6] OUT [0e7a5668-2a3c-4650-ae06-737b6d42a877:6] Ticking 2020-03-24 12:27:12                                                                     
   2020-03-24T12:27:12.00+0000 [APP/PROC/WEB/0] OUT [695cfbda-e685-415d-b304-8a43f0877904:0] Ticking 2020-03-24 12:27:12                                                                     
   2020-03-24T12:27:13.00+0000 [APP/PROC/WEB/5] OUT [bca85f4d-a745-4539-a13f-297136c703fe:5] Ticking 2020-03-24 12:27:13                                                                     
   2020-03-24T12:27:13.00+0000 [APP/PROC/WEB/4] OUT [267fd49a-a829-448f-b5fd-00387263e940:4] Ticking 2020-03-24 12:27:13                                                                     
   2020-03-24T12:27:14.00+0000 [APP/PROC/WEB/0] OUT [695cfbda-e685-415d-b304-8a43f0877904:0] Ticking 2020-03-24 12:27:14                                                                     
   2020-03-24T12:27:14.00+0000 [APP/PROC/WEB/6] OUT [0e7a5668-2a3c-4650-ae06-737b6d42a877:6] Ticking 2020-03-24 12:27:14                                                                     
   2020-03-24T12:27:15.00+0000 [APP/PROC/WEB/5] OUT [bca85f4d-a745-4539-a13f-297136c703fe:5] Ticking 2020-03-24 12:27:15                                                                     
   2020-03-24T12:27:15.00+0000 [APP/PROC/WEB/4] OUT [267fd49a-a829-448f-b5fd-00387263e940:4] Ticking 2020-03-24 12:27:15                                                                     
   2020-03-24T12:27:16.00+0000 [APP/PROC/WEB/0] OUT [695cfbda-e685-415d-b304-8a43f0877904:0] Ticking 2020-03-24 12:27:16                                                                     
   2020-03-24T12:27:16.00+0000 [APP/PROC/WEB/6] OUT [0e7a5668-2a3c-4650-ae06-737b6d42a877:6] Ticking 2020-03-24 12:27:16                                                                     
   2020-03-24T12:27:17.00+0000 [APP/PROC/WEB/4] OUT [267fd49a-a829-448f-b5fd-00387263e940:4] Ticking 2020-03-24 12:27:17                                                                     
   2020-03-24T12:27:17.00+0000 [APP/PROC/WEB/5] OUT [bca85f4d-a745-4539-a13f-297136c703fe:5] Ticking 2020-03-24 12:27:17                                                                     
   2020-03-24T12:27:17.00+0000 [APP/PROC/WEB/3] OUT [ab874e81-367a-4645-a87b-5618de5a9ca5:3] Ticking 2020-03-24 12:27:17                                                                     
   2020-03-24T12:27:18.00+0000 [APP/PROC/WEB/0] OUT [695cfbda-e685-415d-b304-8a43f0877904:0] Ticking 2020-03-24 12:27:18                                                                     
   2020-03-24T12:27:18.00+0000 [APP/PROC/WEB/6] OUT [0e7a5668-2a3c-4650-ae06-737b6d42a877:6] Ticking 2020-03-24 12:27:18                                                                     
   2020-03-24T12:27:19.00+0000 [APP/PROC/WEB/4] OUT [267fd49a-a829-448f-b5fd-00387263e940:4] Ticking 2020-03-24 12:27:19                                                                     
   2020-03-24T12:27:19.00+0000 [APP/PROC/WEB/5] OUT [bca85f4d-a745-4539-a13f-297136c703fe:5] Ticking 2020-03-24 12:27:19                                                                     
   2020-03-24T12:27:20.00+0000 [APP/PROC/WEB/0] OUT [695cfbda-e685-415d-b304-8a43f0877904:0] Ticking 2020-03-24 12:27:20                                                                     
   2020-03-24T12:27:20.00+0000 [APP/PROC/WEB/6] OUT [0e7a5668-2a3c-4650-ae06-737b6d42a877:6] Ticking 2020-03-24 12:27:20                                                                     
   2020-03-24T12:27:21.00+0000 [APP/PROC/WEB/4] OUT [267fd49a-a829-448f-b5fd-00387263e940:4] Ticking 2020-03-24 12:27:21                                                                     
   2020-03-24T12:27:21.00+0000 [APP/PROC/WEB/5] OUT [bca85f4d-a745-4539-a13f-297136c703fe:5] Ticking 2020-03-24 12:27:21                                                                     
   2020-03-24T12:27:22.00+0000 [APP/PROC/WEB/0] OUT [695cfbda-e685-415d-b304-8a43f0877904:0] Ticking 2020-03-24 12:27:22
   2020-03-24T12:27:22.00+0000 [APP/PROC/WEB/6] OUT [0e7a5668-2a3c-4650-ae06-737b6d42a877:6] Ticking 2020-03-24 12:27:22
   2020-03-24T12:27:23.00+0000 [APP/PROC/WEB/4] OUT [267fd49a-a829-448f-b5fd-00387263e940:4] Ticking 2020-03-24 12:27:23
   2020-03-24T12:27:23.00+0000 [APP/PROC/WEB/5] OUT [bca85f4d-a745-4539-a13f-297136c703fe:5] Ticking 2020-03-24 12:27:23
   2020-03-24T12:27:24.00+0000 [APP/PROC/WEB/0] OUT [695cfbda-e685-415d-b304-8a43f0877904:0] Ticking 2020-03-24 12:27:24
   2020-03-24T12:27:24.00+0000 [APP/PROC/WEB/6] OUT [0e7a5668-2a3c-4650-ae06-737b6d42a877:6] Ticking 2020-03-24 12:27:24
   2020-03-24T12:27:25.00+0000 [APP/PROC/WEB/5] OUT [bca85f4d-a745-4539-a13f-297136c703fe:5] Ticking 2020-03-24 12:27:25
   2020-03-24T12:27:25.00+0000 [APP/PROC/WEB/4] OUT [267fd49a-a829-448f-b5fd-00387263e940:4] Ticking 2020-03-24 12:27:25
   2020-03-24T12:27:25.00+0000 [APP/PROC/WEB/3] OUT [ab874e81-367a-4645-a87b-5618de5a9ca5:3] Ticking 2020-03-24 12:27:25
   2020-03-24T12:27:26.00+0000 [APP/PROC/WEB/6] OUT [0e7a5668-2a3c-4650-ae06-737b6d42a877:6] Ticking 2020-03-24 12:27:26
   2020-03-24T12:27:26.00+0000 [APP/PROC/WEB/0] OUT [695cfbda-e685-415d-b304-8a43f0877904:0] Ticking 2020-03-24 12:27:26

As you can see, instance 3 got fewer messages through, and each second we get at most 2 or 3 batches of messages, even though we should see all 10 instances.

The eirini pod got no restarts so far:

susecf-scf-eirini-0                    9/9     Running   0          55m

So it seems we ignore watcher timeout errors, but we don't dispatch all the messages. To support this:

   2020-03-24T12:31:45.00+0000 [APP/PROC/WEB/4] OUT [267fd49a-a829-448f-b5fd-00387263e940:4] Ticking 2020-03-24 12:31:45
   2020-03-24T12:31:45.00+0000 [APP/PROC/WEB/5] OUT [bca85f4d-a745-4539-a13f-297136c703fe:5] Ticking 2020-03-24 12:31:45                                                                     
   2020-03-24T12:31:45.00+0000 [APP/PROC/WEB/3] OUT [ab874e81-367a-4645-a87b-5618de5a9ca5:3] Ticking 2020-03-24 12:31:45                                                                     
   2020-03-24T12:31:45.00+0000 [APP/PROC/WEB/1] OUT [dfc607c5-88fe-4305-a3dc-9dc6bc241613:1] Ticking 2020-03-24 12:31:45                                                                     
   2020-03-24T12:31:45.00+0000 [APP/PROC/WEB/8] OUT [57d26343-354b-444b-a748-d55997a4683c:8] Ticking 2020-03-24 12:31:45                                                                     
   2020-03-24T12:31:46.00+0000 [APP/PROC/WEB/0] OUT [695cfbda-e685-415d-b304-8a43f0877904:0] Ticking 2020-03-24 12:31:46                                                                     
   2020-03-24T12:31:46.00+0000 [APP/PROC/WEB/6] OUT [0e7a5668-2a3c-4650-ae06-737b6d42a877:6] Ticking 2020-03-24 12:31:46                                                                     
   2020-03-24T12:31:47.00+0000 [APP/PROC/WEB/5] OUT [bca85f4d-a745-4539-a13f-297136c703fe:5] Ticking 2020-03-24 12:31:47                                                                     
   2020-03-24T12:31:47.00+0000 [APP/PROC/WEB/4] OUT [267fd49a-a829-448f-b5fd-00387263e940:4] Ticking 2020-03-24 12:31:47                                                                     
   2020-03-24T12:31:48.00+0000 [APP/PROC/WEB/0] OUT [695cfbda-e685-415d-b304-8a43f0877904:0] Ticking 2020-03-24 12:31:48                                                                     
   2020-03-24T12:31:48.00+0000 [APP/PROC/WEB/6] OUT [0e7a5668-2a3c-4650-ae06-737b6d42a877:6] Ticking 2020-03-24 12:31:48                                                                     
   2020-03-24T12:31:49.00+0000 [APP/PROC/WEB/4] OUT [267fd49a-a829-448f-b5fd-00387263e940:4] Ticking 2020-03-24 12:31:49                                                                     
   2020-03-24T12:31:49.00+0000 [APP/PROC/WEB/5] OUT [bca85f4d-a745-4539-a13f-297136c703fe:5] Ticking 2020-03-24 12:31:49                                                                     
   2020-03-24T12:31:50.00+0000 [APP/PROC/WEB/0] OUT [695cfbda-e685-415d-b304-8a43f0877904:0] Ticking 2020-03-24 12:31:50                                                                     
   2020-03-24T12:31:50.00+0000 [APP/PROC/WEB/6] OUT [0e7a5668-2a3c-4650-ae06-737b6d42a877:6] Ticking 2020-03-24 12:31:50                                                                     
   2020-03-24T12:31:51.00+0000 [APP/PROC/WEB/4] OUT [267fd49a-a829-448f-b5fd-00387263e940:4] Ticking 2020-03-24 12:31:51                                                                     
   2020-03-24T12:31:51.00+0000 [APP/PROC/WEB/5] OUT [bca85f4d-a745-4539-a13f-297136c703fe:5] Ticking 2020-03-24 12:31:51                                                                     
   2020-03-24T12:31:52.00+0000 [APP/PROC/WEB/0] OUT [695cfbda-e685-415d-b304-8a43f0877904:0] Ticking 2020-03-24 12:31:52      

As you can see, we now got messages from instance 8, but only once. It looks like they are racing for slots.
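
To make this kind of observation less dependent on eyeballing the output, the lines can be tallied per instance index. The sketch below assumes the log line format shown above (`[APP/PROC/WEB/<index>]`); the function name `count_per_instance` and the short `SAMPLE` excerpt (copied from the output above) are illustrative, not part of any project tooling.

```python
import re

# Matches the instance index in lines such as "[APP/PROC/WEB/6] OUT ...".
LINE_RE = re.compile(r"\[APP/PROC/WEB/(\d+)\]")


def count_per_instance(log_text, expected_instances):
    """Count log lines per instance index, including instances never seen."""
    counts = {i: 0 for i in range(expected_instances)}
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            index = int(match.group(1))
            counts[index] = counts.get(index, 0) + 1
    return counts


# Short excerpt of the output above; a real check would use the full capture.
SAMPLE = """\
2020-03-24T12:31:45.00+0000 [APP/PROC/WEB/4] OUT [267fd49a-a829-448f-b5fd-00387263e940:4] Ticking 2020-03-24 12:31:45
2020-03-24T12:31:45.00+0000 [APP/PROC/WEB/5] OUT [bca85f4d-a745-4539-a13f-297136c703fe:5] Ticking 2020-03-24 12:31:45
2020-03-24T12:31:45.00+0000 [APP/PROC/WEB/3] OUT [ab874e81-367a-4645-a87b-5618de5a9ca5:3] Ticking 2020-03-24 12:31:45
2020-03-24T12:31:45.00+0000 [APP/PROC/WEB/1] OUT [dfc607c5-88fe-4305-a3dc-9dc6bc241613:1] Ticking 2020-03-24 12:31:45
2020-03-24T12:31:45.00+0000 [APP/PROC/WEB/8] OUT [57d26343-354b-444b-a748-d55997a4683c:8] Ticking 2020-03-24 12:31:45
2020-03-24T12:31:46.00+0000 [APP/PROC/WEB/0] OUT [695cfbda-e685-415d-b304-8a43f0877904:0] Ticking 2020-03-24 12:31:46
2020-03-24T12:31:46.00+0000 [APP/PROC/WEB/6] OUT [0e7a5668-2a3c-4650-ae06-737b6d42a877:6] Ticking 2020-03-24 12:31:46
"""

counts = count_per_instance(SAMPLE, expected_instances=10)
silent = sorted(i for i, n in counts.items() if n == 0)
print("lines per instance:", counts)
print("instances with no lines:", silent)
```

Since the app ticks at a regular interval, a healthy stream captured over a long enough window should give roughly equal counts for all 10 instances; large gaps between counts point at dropped messages rather than delayed ones.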

@mudler
Collaborator Author

mudler commented Mar 24, 2020

I've tried lowering the inotify watcher limits:

sudo sysctl fs.inotify.max_user_instances=20
sudo sysctl fs.inotify.max_user_watches=20

In the loggregator bridge I can then observe:

failed to create fsnotify watcher: too many open files

I can see this message only when I lower the limit, so as far as I can tell that's not the root cause.

Now the bridge doesn't crash anymore, and I can see no pod restarts:

susecf-scf-eirini-0                    9/9     Running   0          71m

But whereas before we were restarting and re-streaming all the logs, now we probably miss the logs when the watcher times out. I can't see why this happens, as I would have expected messages to be delayed, not to have just "vanished".
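
Before and after changing these settings it helps to record the values actually in effect. On Linux the two sysctl keys used above are exposed as files under `/proc/sys/fs/inotify`. The following is a small sketch for reading them; the helper name `read_inotify_limits` is hypothetical, and it returns `None` for a value it cannot read (for example on a non-Linux host).

```python
from pathlib import Path

# Linux exposes fs.inotify.* sysctl keys as files in this directory.
INOTIFY_DIR = Path("/proc/sys/fs/inotify")


def read_inotify_limits(base=INOTIFY_DIR):
    """Return the current inotify limits, or None where a value is unreadable."""
    limits = {}
    for name in ("max_user_instances", "max_user_watches"):
        try:
            limits[name] = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            limits[name] = None
    return limits


if __name__ == "__main__":
    for key, value in read_inotify_limits().items():
        print(f"fs.inotify.{key} = {value}")
```

Running it on the node that hosts the bridge pod (not inside an unrelated container) gives the numbers that matter for the `too many open files` error.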

@mudler
Collaborator Author

mudler commented Mar 24, 2020

Extracted the issue found into a separate one in the eirini-loggregator-bridge project: cloudfoundry-incubator/eirini-loggregator-bridge#6

@mudler mudler requested a review from jimmykarily March 24, 2020 13:51
@mudler
Collaborator Author

mudler commented Mar 24, 2020

I would merge this as-is now. There is no reason to delay it, as it doesn't introduce a new issue but fixes the pod restarts; the issue found was already known and has been extracted into a separate one.


2 participants