[Elastic Agent] Monitoring filebeat and metricbeat not connecting to Agent over GRPC #23833

Closed
blakerouse opened this issue Feb 3, 2021 · 5 comments · Fixed by #23843

@blakerouse
Contributor

Overview

Elastic Agent spawns a filebeat and a metricbeat to collect logs and metrics about Elastic Agent itself. These are separate from the filebeat and metricbeat that are spawned for the system integration.

It seems that the monitoring filebeat and metricbeat are not connecting back to Elastic Agent. As a result they never receive their configuration, and they also time out because they never check in.

I believe this is related to #23776 and the certificate work.

Logs

Below are the Elastic Agent logs, which show that filebeat and metricbeat are restarted because they never connect.

-8.0.0-SNAPSHOT--36643631373035623733363936343635[83c931c0-6636-11eb-b009-55417aeb58b5]: State changed to STARTING: Starting
2021-02-03T10:51:20.128-0500	INFO	log/reporter.go:40	2021-02-03T10:51:20-05:00: type: 'STATE': sub_type: 'STARTING' message: Application: filebeat--8.0.0-SNAPSHOT--36643631373035623733363936343635[83c931c0-6636-11eb-b009-55417aeb58b5]: State changed to RESTARTING: Restarting
2021-02-03T10:51:20.135-0500	ERROR	log/reporter.go:36	2021-02-03T10:51:20-05:00: type: 'ERROR': sub_type: 'FAILED' message: Application: metricbeat--8.0.0-SNAPSHOT--36643631373035623733363936343635[83c931c0-6636-11eb-b009-55417aeb58b5]: State changed to FAILED: Missed two check-ins
2021-02-03T10:51:20.135-0500	INFO	log/reporter.go:40	2021-02-03T10:51:20-05:00: type: 'STATE': sub_type: 'STARTING' message: Application: metricbeat--8.0.0-SNAPSHOT--36643631373035623733363936343635[83c931c0-6636-11eb-b009-55417aeb58b5]: State changed to STARTING: Starting
2021-02-03T10:51:20.135-0500	INFO	log/reporter.go:40	2021-02-03T10:51:20-05:00: type: 'STATE': sub_type: 'STARTING' message: Application: metricbeat--8.0.0-SNAPSHOT--36643631373035623733363936343635[83c931c0-6636-11eb-b009-55417aeb58b5]: State changed to RESTARTING: Restarting
2021-02-03T10:52:20.140-0500	INFO	log/reporter.go:40	2021-02-03T10:52:20-05:00: type: 'STATE': sub_type: 'RUNNING' message: Application: metricbeat--8.0.0-SNAPSHOT--36643631373035623733363936343635[83c931c0-6636-11eb-b009-55417aeb58b5]: State changed to DEGRADED: Missed last check-in
2021-02-03T10:52:20.140-0500	INFO	log/reporter.go:40	2021-02-03T10:52:20-05:00: type: 'STATE': sub_type: 'RUNNING' message: Application: filebeat--8.0.0-SNAPSHOT--36643631373035623733363936343635[83c931c0-6636-11eb-b009-55417aeb58b5]: State changed to DEGRADED: Missed last check-in
2021-02-03T10:53:20.145-0500	ERROR	log/reporter.go:36	2021-02-03T10:53:20-05:00: type: 'ERROR': sub_type: 'FAILED' message: Application: filebeat--8.0.0-SNAPSHOT--36643631373035623733363936343635[83c931c0-6636-11eb-b009-55417aeb58b5]: State changed to FAILED: Missed two check-ins
2021-02-03T10:53:20.146-0500	INFO	log/reporter.go:40	2021-02-03T10:53:20-05:00: type: 'STATE': sub_type: 'STARTING' message: Application: filebeat--8.0.0-SNAPSHOT--36643631373035623733363936343635[83c931c0-6636-11eb-b009-55417aeb58b5]: State changed to STARTING: Starting
2021-02-03T10:53:20.146-0500	INFO	log/reporter.go:40	2021-02-03T10:53:20-05:00: type: 'STATE': sub_type: 'STARTING' message: Application: filebeat--8.0.0-SNAPSHOT--36643631373035623733363936343635[83c931c0-6636-11eb-b009-55417aeb58b5]: State changed to RESTARTING: Restarting
2021-02-03T10:53:20.147-0500	ERROR	log/reporter.go:36	2021-02-03T10:53:20-05:00: type: 'ERROR': sub_type: 'FAILED' message: Application: metricbeat--8.0.0-SNAPSHOT--36643631373035623733363936343635[83c931c0-6636-11eb-b009-55417aeb58b5]: State changed to FAILED: Missed two check-ins
2021-02-03T10:53:20.147-0500	INFO	log/reporter.go:40	2021-02-03T10:53:20-05:00: type: 'STATE': sub_type: 'STARTING' message: Application: metricbeat--8.0.0-SNAPSHOT--36643631373035623733363936343635[83c931c0-6636-11eb-b009-55417aeb58b5]: State changed to STARTING: Starting
2021-02-03T10:53:20.147-0500	INFO	log/reporter.go:40	2021-02-03T10:53:20-05:00: type: 'STATE': sub_type: 'STARTING' message: Application: metricbeat--8.0.0-SNAPSHOT--36643631373035623733363936343635[83c931c0-6636-11eb-b009-55417aeb58b5]: State changed to RESTARTING: Restarting
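
To make the restart loop in these logs easier to see, here is a minimal sketch that tallies state transitions per application. The regular expression is an assumption inferred from the log lines above, not a format documented by Elastic Agent, and the names `LINE_RE` and `summarize` are hypothetical.

```python
import re
from collections import Counter

# Assumed layout, inferred from the log lines above:
#   "Application: <name>--<version/id>: State changed to <STATE>: <reason>"
LINE_RE = re.compile(
    r"Application: (?P<app>[a-z]+)--\S+: "
    r"State changed to (?P<state>[A-Z]+): (?P<reason>.*)$"
)

def summarize(lines):
    """Count (application, new state) pairs found in the given log lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[(match.group("app"), match.group("state"))] += 1
    return counts

# Three lines copied from the logs above (whitespace normalized).
SAMPLE = [
    "2021-02-03T10:51:20.135-0500 ERROR log/reporter.go:36 2021-02-03T10:51:20-05:00: type: 'ERROR': sub_type: 'FAILED' message: Application: metricbeat--8.0.0-SNAPSHOT--36643631373035623733363936343635[83c931c0-6636-11eb-b009-55417aeb58b5]: State changed to FAILED: Missed two check-ins",
    "2021-02-03T10:52:20.140-0500 INFO log/reporter.go:40 2021-02-03T10:52:20-05:00: type: 'STATE': sub_type: 'RUNNING' message: Application: metricbeat--8.0.0-SNAPSHOT--36643631373035623733363936343635[83c931c0-6636-11eb-b009-55417aeb58b5]: State changed to DEGRADED: Missed last check-in",
    "2021-02-03T10:53:20.145-0500 ERROR log/reporter.go:36 2021-02-03T10:53:20-05:00: type: 'ERROR': sub_type: 'FAILED' message: Application: filebeat--8.0.0-SNAPSHOT--36643631373035623733363936343635[83c931c0-6636-11eb-b009-55417aeb58b5]: State changed to FAILED: Missed two check-ins",
]

if __name__ == "__main__":
    for (app, state), n in sorted(summarize(SAMPLE).items()):
        print(f"{app}: {state} x{n}")
```

Run against the full log, a tally like this should show both applications cycling through DEGRADED, FAILED, and STARTING roughly once per check-in window.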

Below is the netstat output. It should show 2 filebeat and 2 metricbeat connections, but as you can see it shows only 1 of each.

root@blake-ubnt-20-vm:/# netstat -natp | grep :6789
tcp        0      0 127.0.0.1:6789          0.0.0.0:*               LISTEN      26625/elastic-agent 
tcp        0      0 127.0.0.1:6789          127.0.0.1:39460         ESTABLISHED 26625/elastic-agent 
tcp        0      0 127.0.0.1:39444         127.0.0.1:6789          ESTABLISHED 26758/filebeat      
tcp        0      0 127.0.0.1:6789          127.0.0.1:39444         ESTABLISHED 26625/elastic-agent 
tcp        0      0 127.0.0.1:39460         127.0.0.1:6789          ESTABLISHED 26799/metricbeat  
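
The check against the netstat output above can be sketched as a small script that counts, per program, the established client connections to the Agent's port. This is only an illustration: it assumes the seven-column `netstat -natp` layout shown above, and the helper name `client_connections` is hypothetical.

```python
from collections import Counter

# Output copied from the netstat listing above.
NETSTAT_OUTPUT = """\
tcp        0      0 127.0.0.1:6789          0.0.0.0:*               LISTEN      26625/elastic-agent
tcp        0      0 127.0.0.1:6789          127.0.0.1:39460         ESTABLISHED 26625/elastic-agent
tcp        0      0 127.0.0.1:39444         127.0.0.1:6789          ESTABLISHED 26758/filebeat
tcp        0      0 127.0.0.1:6789          127.0.0.1:39444         ESTABLISHED 26625/elastic-agent
tcp        0      0 127.0.0.1:39460         127.0.0.1:6789          ESTABLISHED 26799/metricbeat
"""

def client_connections(output, port=6789):
    """Count ESTABLISHED connections whose remote end is the given port."""
    counts = Counter()
    for line in output.splitlines():
        fields = line.split()
        # Expected columns: proto, recv-q, send-q, local, foreign, state, pid/program
        if len(fields) < 7 or fields[5] != "ESTABLISHED":
            continue
        if fields[4].endswith(f":{port}"):
            # "26758/filebeat" -> "filebeat"; fall back to the raw field if no slash
            program = fields[6].partition("/")[2] or fields[6]
            counts[program] += 1
    return counts

if __name__ == "__main__":
    found = client_connections(NETSTAT_OUTPUT)
    for program in ("filebeat", "metricbeat"):
        print(f"{program}: {found[program]} connection(s), expected 2")
```

Only the client side of each connection is counted (rows whose foreign address is port 6789), so the Agent's own rows for the same sockets are ignored.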
  • Version: 8.0.0-SNAPSHOT
  • Operating System: Linux (don't think it matters, but reproduced it on Linux)
@blakerouse blakerouse added bug Team:Elastic-Agent Label for the Agent team labels Feb 3, 2021
@elasticmachine
Collaborator

Pinging @elastic/agent (Team:Agent)

@EricDavisX EricDavisX added the impacts_automation used by teams to indicate an automated test relates to the issue label Feb 3, 2021
@EricDavisX
Contributor

EricDavisX commented Feb 3, 2021

The failure was found both in the e2e-testing automated tests and in the Demo/Test environment deploy, which confirmed it on Windows as well. e2e-testing job link:
https://beats-ci.elastic.co/blue/organizations/jenkins/e2e-tests%2Fe2e-testing-mbp%2Fmaster/detail/master/274/pipeline/

@mdelapenya and @michalpristas I see that the e2e-testing run for the potentially related PR passed, so I'm curious how the tests missed it, or whether the failure is in fact unrelated to that change.
Related test run: https://beats-ci.elastic.co/blue/organizations/jenkins/e2e-tests%2Fe2e-testing-mbp/detail/master/259/pipeline

@mdelapenya
Contributor

I'm adding my notes here on tracing the root cause of the error we see:

The possible culprit commits are:

* 333edd8e6 - Elastic Agent - Endpoint e2e test fix (#23776) (2 days ago) <Michal Pristas>
* 1f8a2e6b5 - o365: Fix processing of ModifiedProperties field (#23777) (2 days ago) <Adrian Serrano>
* d59f78075 - [Elastic Agent] Add the ability to run the Fleet Server (#23736) (2 days ago) <Blake Rouse>
* 5cb370eaf - [Filebeat] add RFC6587 framing support (#23724) (2 days ago) <Lee Hinman>
* 4a44facd9 - [Auditbeat] Determine event.action based on diff against state (#22170) (2 days ago) <Andrew Kroh>
* a974f4f9e - Fix Zoom module config for url and basic auth (#23779) (2 days ago) <Andrew Kroh>
* 6515ad553 - [elastic-agent] fix: cpu cgroup values (#23714) (2 days ago) <Silvia Mitter>

The table below lists the commits oldest first and describes what happened on the CI side for each PR: which job triggered which, and with what result.

| PR | Beats CI merge build ID | Packaging build ID | E2E tests job | Failing tests |
|---|---|---|---|---|
| #23714 | 136 | ➡️ 198 | ➡️ 258 🔴 | 3 errors, none of them related to the standalone mode tests we are seeing right now |
| #23779 | 137 🔴 | ➡️ 203 | ➡️ 262 🔴 | 7 failing tests, existing 3 + the 4 tests for the standalone mode |
| #22170 | 138 | ➡️ 203 | ➡️ 262 🔴 | 7 failing tests, existing 3 + the 4 tests for the standalone mode |
| #23724 | 139 🔴 | ➡️ 203 | ➡️ 262 🔴 | 7 failing tests, existing 3 + the 4 tests for the standalone mode |
| #23736 | 140 🔴 | ➡️ 203 | ➡️ 262 🔴 | 7 failing tests, existing 3 + the 4 tests for the standalone mode |

I bisected this change set, building the elastic-agent artifacts for each commit and running the tests against the local binaries, and the results are exactly the same as on CI.

After checking that #23779, #22170, #23724, and #23736 were bundled in the same packaging job and triggered the same E2E job, I'd say that the culprit is in that set of commits. Given the changes, I'd say that either #23724 or #23736 is the root cause.
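
The narrowing step above can be sketched as follows: group the PRs by the packaging build that bundled them, then take the first build where the failing-test count went up. The tuples are transcribed from the CI table in this comment; the function name `first_regressing_group` is hypothetical.

```python
# (PR, merge build ID, packaging build ID, E2E job ID, failing tests),
# transcribed from the CI table in this comment, oldest first.
ROWS = [
    ("#23714", 136, 198, 258, 3),
    ("#23779", 137, 203, 262, 7),
    ("#22170", 138, 203, 262, 7),
    ("#23724", 139, 203, 262, 7),
    ("#23736", 140, 203, 262, 7),
]

def first_regressing_group(rows):
    """Return (packaging build ID, PRs) for the first build with more failures."""
    groups = {}  # packaging build ID -> (failing count, [PRs]); insertion-ordered
    for pr, _merge, packaging, _e2e, failing in rows:
        groups.setdefault(packaging, (failing, []))[1].append(pr)

    previous = None
    for packaging, (failing, prs) in groups.items():
        if previous is not None and failing > previous:
            return packaging, prs
        previous = failing
    return None

if __name__ == "__main__":
    print(first_regressing_group(ROWS))
```

Because CI packaged the four PRs together, this grouping can only narrow the regression to that set; separating them still requires the per-commit local builds described above.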

@michalpristas
Contributor

I dug deep into the Beat-to-Agent communication, and logically nothing seemed wrong: the TLS config is fed in a way that should work correctly.
After a long time I got to thinking that if TLS were wrong, the configuration would not reach the Beats at all. So I changed the output to some garbage value, ran the Agent, and grepped the logs for that string, and it was there. That rules out TLS as the issue.

I tried dropping the fleet-server PR and the issue disappeared. I don't see a reason, code-wise, why that PR should cause this, but without it I had the Agent running for minutes, multiple times, without a single missed check-in (with the PR, the issue manifested consistently).

I need to go to sleep now. @blakerouse, if you have time in the meantime, it would be great if you could take a look at it. I will try picking up where you left off in the morning.

@EricDavisX
Contributor

We can test this with the next 8.0 snapshot. Hopefully it will be available to us on Friday, Feb 5.
