[Deployment]: Hosted fleet server gets offline inconsistently on 8.8 Snapshot. #2460

Closed
amolnater-qasource opened this issue Mar 31, 2023 · 12 comments
Assignees
Labels
bug Something isn't working impact:high Short-term priority; add to current release, or definitely next. Team:Fleet Label for the Fleet team

Comments

@amolnater-qasource
Collaborator

Deployment Links:

Description:
The hosted Fleet Server intermittently goes offline and then returns to healthy on the 8.8 Snapshot.

  • Agents installed afterwards also alternate between offline and healthy.

Screenshots:
[6 screenshots not preserved]

@amolnater-qasource amolnater-qasource added bug Something isn't working Team:Fleet Label for the Fleet team impact:high Short-term priority; add to current release, or definitely next. labels Mar 31, 2023
@amolnater-qasource
Collaborator Author

@karanbirsingh-qasource Please review.

@ghost

ghost commented Mar 31, 2023

Secondary review for this ticket is Done

@cmacknz
Member

cmacknz commented Mar 31, 2023

I just observed this on one of my test clusters.

@jen-huang

@michel-laterman Could you take a look?

@michel-laterman
Contributor

Are there any logs from fleet-server available?

@amolnater-qasource
Collaborator Author

Hi @michel-laterman

Thank you for looking into this.
Please find below attached collected Diagnostics for Hosted Fleet-Server:

Deployment 1:
elastic-agent-diagnostics-2023-04-04T08-11-11Z-00.zip

Deployment 2:
elastic-agent-diagnostics-2023-04-04T08-18-25Z-00.zip

Please let us know if anything else is required from our end.
Thanks!

@michel-laterman
Contributor

Both diagnostics have logs like:

{"log.level":"info","@timestamp":"2023-04-04T05:28:46.693Z","message":"New policy found on update and added","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-es-containerhost","type":"fleet-server"},"log":{"source":"fleet-server-es-containerhost"},"service.name":"fleet-server","rev":2,"ctx":"policy agent monitor","fleet.policy.id":"9d4af4a0-d2a7-11ed-b058-e9b546b147b4","coord":1,"ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-04T05:32:24.887Z","message":"applying new components data","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-es-containerhost","type":"fleet-server"},"log":{"source":"fleet-server-es-containerhost"},"ecs.version":"1.6.0","fleet.access.apikey.id":"TUyjSocBPiM0FoVkwiQC","service.name":"fleet-server","http.request.id":"01GX5C11XSD892ZV0MV2TYRGWE","server.address":"","fleet.agent.id":"9dd1a52e-e27d-4b7f-b07d-beef4de0489f","req.Components":[{"id":"system/metrics-default","message":"Healthy: communicating with pid '1211'","status":"HEALTHY","type":"system/metrics","units":[{"id":"system/metrics-default-system/metrics-system-79b88a00-d2a4-11ed-b058-e9b546b147b4","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"system/metrics-default","message":"Healthy","status":"HEALTHY","type":"output"}]},{"id":"log-default","message":"Healthy: communicating with pid '1219'","status":"HEALTHY","type":"log","units":[{"id":"log-default-logfile-system-79b88a00-d2a4-11ed-b058-e9b546b147b4","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"log-default","message":"Healthy","status":"HEALTHY","type":"output"}]},{"id":"http/metrics-monitoring","message":"Healthy: communicating with pid '1226'","status":"HEALTHY","type":"http/metrics","units":[{"id":"http/metrics-monitoring-metrics-monitoring-agent","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"http/metrics-monitoring","message":"Healthy","status":"HEALTHY","type":"output"}]},{"id":"filestream-monitoring","message":"Healthy: communicating with pid '1235'","status":"HEALTHY","type":"filestream","units":[{"id":"filestream-monitoring-filestream-monitoring-agent","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"filestream-monitoring","message":"Healthy","status":"HEALTHY","type":"output"}]},{"id":"beat/metrics-monitoring","message":"Healthy: communicating with pid 
'1243'","status":"HEALTHY","type":"beat/metrics","units":[{"id":"beat/metrics-monitoring-metrics-monitoring-beats","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"beat/metrics-monitoring","message":"Healthy","status":"HEALTHY","type":"output"}]}],"ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2023-04-04T05:38:46.148Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":194},"message":"Possible transient error during checkin with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"fail to checkin to fleet-server: all hosts failed: 1 error occurred:\n\t* requester 0/1 to host https://localhost:8221/ errored: Post \"https://localhost:8221/api/fleet/agents/8d1bdffd-e6d9-471e-8465-c0bd1be3459a/checkin?\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)\n\n"},"request_duration_ns":600001306356,"failed_checkins":2,"retry_after_ns":217830189836,"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-04-04T05:52:24.203Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":198},"message":"Cannot checkin in with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"fail to checkin to fleet-server: all hosts failed: 1 error occurred:\n\t* requester 0/1 to host https://localhost:8221/ errored: Post \"https://localhost:8221/api/fleet/agents/8d1bdffd-e6d9-471e-8465-c0bd1be3459a/checkin?\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)\n\n"},"request_duration_ns":600002008790,"failed_checkins":3,"retry_after_ns":391392729550,"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-04-04T06:08:55.816Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":198},"message":"Cannot checkin in with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"fail to checkin to fleet-server: all hosts failed: 1 error occurred:\n\t* requester 0/1 to host https://localhost:8221/ errored: Post \"https://localhost:8221/api/fleet/agents/8d1bdffd-e6d9-471e-8465-c0bd1be3459a/checkin?\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)\n\n"},"request_duration_ns":600000725609,"failed_checkins":4,"retry_after_ns":431338267924,"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-04T06:13:31.327Z","message":"applying new components data","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-es-containerhost","type":"fleet-server"},"log":{"source":"fleet-server-es-containerhost"},"http.request.id":"01GX5ECAHAXQCA9SZE0TY5PH8P","ecs.version":"1.6.0","service.name":"fleet-server","server.address":"","fleet.agent.id":"9dd1a52e-e27d-4b7f-b07d-beef4de0489f","fleet.access.apikey.id":"TUyjSocBPiM0FoVkwiQC","req.Components":[{"id":"system/metrics-default","message":"Healthy: communicating with pid '1211'","status":"HEALTHY","type":"system/metrics","units":[{"id":"system/metrics-default-system/metrics-system-79b88a00-d2a4-11ed-b058-e9b546b147b4","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"system/metrics-default","message":"Healthy","status":"HEALTHY","type":"output"}]},{"id":"log-default","message":"Healthy: communicating with pid '1219'","status":"HEALTHY","type":"log","units":[{"id":"log-default-logfile-system-79b88a00-d2a4-11ed-b058-e9b546b147b4","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"log-default","message":"Healthy","status":"HEALTHY","type":"output"}]},{"id":"http/metrics-monitoring","message":"Healthy: communicating with pid '1226'","status":"HEALTHY","type":"http/metrics","units":[{"id":"http/metrics-monitoring-metrics-monitoring-agent","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"http/metrics-monitoring","message":"Healthy","status":"HEALTHY","type":"output"}]},{"id":"filestream-monitoring","message":"Healthy: communicating with pid '1235'","status":"HEALTHY","type":"filestream","units":[{"id":"filestream-monitoring-filestream-monitoring-agent","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"filestream-monitoring","message":"Healthy","status":"HEALTHY","type":"output"}]},{"id":"beat/metrics-monitoring","message":"Healthy: communicating with pid 
'1243'","status":"HEALTHY","type":"beat/metrics","units":[{"id":"beat/metrics-monitoring","message":"Healthy","status":"HEALTHY","type":"output"},{"id":"beat/metrics-monitoring-metrics-monitoring-beats","message":"Healthy","status":"HEALTHY","type":"input"}]}],"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-04-04T06:26:07.377Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":198},"message":"Cannot checkin in with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"fail to checkin to fleet-server: all hosts failed: 1 error occurred:\n\t* requester 0/1 to host https://localhost:8221/ errored: Post \"https://localhost:8221/api/fleet/agents/8d1bdffd-e6d9-471e-8465-c0bd1be3459a/checkin?\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)\n\n"},"request_duration_ns":600000938429,"failed_checkins":5,"retry_after_ns":553705204129,"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-04-04T06:45:21.306Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":198},"message":"Cannot checkin in with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"fail to checkin to fleet-server: all hosts failed: 1 error occurred:\n\t* requester 0/1 to host https://localhost:8221/ errored: Post \"https://localhost:8221/api/fleet/agents/8d1bdffd-e6d9-471e-8465-c0bd1be3459a/checkin?\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)\n\n"},"request_duration_ns":600001252744,"failed_checkins":6,"retry_after_ns":472784364737,"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-04T06:48:54.953Z","message":"applying new components data","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-es-containerhost","type":"fleet-server"},"log":{"source":"fleet-server-es-containerhost"},"ecs.version":"1.6.0","server.address":"","fleet.agent.id":"9dd1a52e-e27d-4b7f-b07d-beef4de0489f","req.Components":[{"id":"system/metrics-default","message":"Healthy: communicating with pid '1211'","status":"HEALTHY","type":"system/metrics","units":[{"id":"system/metrics-default-system/metrics-system-79b88a00-d2a4-11ed-b058-e9b546b147b4","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"system/metrics-default","message":"Healthy","status":"HEALTHY","type":"output"}]},{"id":"log-default","message":"Healthy: communicating with pid '1219'","status":"HEALTHY","type":"log","units":[{"id":"log-default-logfile-system-79b88a00-d2a4-11ed-b058-e9b546b147b4","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"log-default","message":"Healthy","status":"HEALTHY","type":"output"}]},{"id":"http/metrics-monitoring","message":"Healthy: communicating with pid '1226'","status":"HEALTHY","type":"http/metrics","units":[{"id":"http/metrics-monitoring-metrics-monitoring-agent","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"http/metrics-monitoring","message":"Healthy","status":"HEALTHY","type":"output"}]},{"id":"filestream-monitoring","message":"Healthy: communicating with pid '1235'","status":"HEALTHY","type":"filestream","units":[{"id":"filestream-monitoring-filestream-monitoring-agent","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"filestream-monitoring","message":"Healthy","status":"HEALTHY","type":"output"}]},{"id":"beat/metrics-monitoring","message":"Healthy: communicating with pid 
'1243'","status":"HEALTHY","type":"beat/metrics","units":[{"id":"beat/metrics-monitoring-metrics-monitoring-beats","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"beat/metrics-monitoring","message":"Healthy","status":"HEALTHY","type":"output"}]}],"service.name":"fleet-server","http.request.id":"01GX5GD4CGJGZZA541Y8SPA45Y","fleet.access.apikey.id":"TUyjSocBPiM0FoVkwiQC","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-04-04T07:03:14.314Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":198},"message":"Cannot checkin in with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"fail to checkin to fleet-server: all hosts failed: 1 error occurred:\n\t* requester 0/1 to host https://localhost:8221/ errored: Post \"https://localhost:8221/api/fleet/agents/8d1bdffd-e6d9-471e-8465-c0bd1be3459a/checkin?\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)\n\n"},"request_duration_ns":600001327778,"failed_checkins":7,"retry_after_ns":369868042188,"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-04T07:09:24.563Z","message":"applying new components data","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-es-containerhost","type":"fleet-server"},"log":{"source":"fleet-server-es-containerhost"},"server.address":"","fleet.agent.id":"8d1bdffd-e6d9-471e-8465-c0bd1be3459a","ecs.version":"1.6.0","service.name":"fleet-server","http.request.id":"01GX5HJNJ3JQSMVZYHN8EXWJHE","fleet.access.apikey.id":"O1S7SocBlXtIZ4N9fxEa","req.Components":[{"id":"fleet-server-es-containerhost","message":"Healthy: communicating with pid '250'","status":"HEALTHY","type":"fleet-server","units":[{"id":"fleet-server-es-containerhost-fleet-server-fleet_server-elastic-cloud-fleet-server","message":"Running on policy with Fleet Server integration: policy-elastic-agent-on-cloud","status":"HEALTHY","type":"input"},{"id":"fleet-server-es-containerhost","message":"Running on policy with Fleet Server integration: policy-elastic-agent-on-cloud","status":"HEALTHY","type":"output"}]},{"id":"apm-es-containerhost","message":"Healthy: communicating with pid '265'","status":"HEALTHY","type":"apm","units":[{"id":"apm-es-containerhost","message":"Healthy","status":"HEALTHY","type":"output"},{"id":"apm-es-containerhost-elastic-cloud-apm","message":"Healthy","status":"HEALTHY","type":"input"}]}],"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-04-04T07:19:24.420Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":198},"message":"Cannot checkin in with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"fail to checkin to fleet-server: all hosts failed: 1 error occurred:\n\t* requester 0/1 to host https://localhost:8221/ errored: Post \"https://localhost:8221/api/fleet/agents/8d1bdffd-e6d9-471e-8465-c0bd1be3459a/checkin?\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)\n\n"},"request_duration_ns":600001188498,"failed_checkins":8,"retry_after_ns":550974383773,"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-04T07:28:35.886Z","message":"applying new components data","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-es-containerhost","type":"fleet-server"},"log":{"source":"fleet-server-es-containerhost"},"ecs.version":"1.6.0","http.request.id":"01GX5JNSS0WQSK0WT2G42QJ0EZ","fleet.access.apikey.id":"O1S7SocBlXtIZ4N9fxEa","service.name":"fleet-server","server.address":"","fleet.agent.id":"8d1bdffd-e6d9-471e-8465-c0bd1be3459a","req.Components":[{"id":"fleet-server-es-containerhost","message":"Healthy: communicating with pid '250'","status":"HEALTHY","type":"fleet-server","units":[{"id":"fleet-server-es-containerhost-fleet-server-fleet_server-elastic-cloud-fleet-server","message":"Running on policy with Fleet Server integration: policy-elastic-agent-on-cloud","status":"HEALTHY","type":"input"},{"id":"fleet-server-es-containerhost","message":"Running on policy with Fleet Server integration: policy-elastic-agent-on-cloud","status":"HEALTHY","type":"output"}]},{"id":"apm-es-containerhost","message":"Healthy: communicating with pid '265'","status":"HEALTHY","type":"apm","units":[{"id":"apm-es-containerhost-elastic-cloud-apm","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"apm-es-containerhost","message":"Healthy","status":"HEALTHY","type":"output"}]}],"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-04-04T07:38:35.616Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":198},"message":"Cannot checkin in with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"fail to checkin to fleet-server: all hosts failed: 1 error occurred:\n\t* requester 0/1 to host https://localhost:8221/ errored: Post \"https://localhost:8221/api/fleet/agents/8d1bdffd-e6d9-471e-8465-c0bd1be3459a/checkin?\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)\n\n"},"request_duration_ns":600000385060,"failed_checkins":9,"retry_after_ns":838374796602,"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-04T07:45:48.278Z","message":"applying new components data","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-es-containerhost","type":"fleet-server"},"log":{"source":"fleet-server-es-containerhost"},"http.request.id":"01GX5KN9TF3JP7F20SKTKXPXC4","server.address":"","fleet.agent.id":"9dd1a52e-e27d-4b7f-b07d-beef4de0489f","fleet.access.apikey.id":"TUyjSocBPiM0FoVkwiQC","ecs.version":"1.6.0","service.name":"fleet-server","req.Components":[{"id":"system/metrics-default","message":"Healthy: communicating with pid '1211'","status":"HEALTHY","type":"system/metrics","units":[{"id":"system/metrics-default-system/metrics-system-79b88a00-d2a4-11ed-b058-e9b546b147b4","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"system/metrics-default","message":"Healthy","status":"HEALTHY","type":"output"}]},{"id":"log-default","message":"Healthy: communicating with pid '1219'","status":"HEALTHY","type":"log","units":[{"id":"log-default-logfile-system-79b88a00-d2a4-11ed-b058-e9b546b147b4","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"log-default","message":"Healthy","status":"HEALTHY","type":"output"}]},{"id":"http/metrics-monitoring","message":"Healthy: communicating with pid '1226'","status":"HEALTHY","type":"http/metrics","units":[{"id":"http/metrics-monitoring-metrics-monitoring-agent","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"http/metrics-monitoring","message":"Healthy","status":"HEALTHY","type":"output"}]},{"id":"filestream-monitoring","message":"Healthy: communicating with pid '1235'","status":"HEALTHY","type":"filestream","units":[{"id":"filestream-monitoring","message":"Healthy","status":"HEALTHY","type":"output"},{"id":"filestream-monitoring-filestream-monitoring-agent","message":"Healthy","status":"HEALTHY","type":"input"}]},{"id":"beat/metrics-monitoring","message":"Healthy: communicating with pid 
'1243'","status":"HEALTHY","type":"beat/metrics","units":[{"id":"beat/metrics-monitoring-metrics-monitoring-beats","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"beat/metrics-monitoring","message":"Healthy","status":"HEALTHY","type":"output"}]}],"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-04-04T08:02:34.215Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":198},"message":"Cannot checkin in with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"fail to checkin to fleet-server: all hosts failed: 1 error occurred:\n\t* requester 0/1 to host https://localhost:8221/ errored: Post \"https://localhost:8221/api/fleet/agents/8d1bdffd-e6d9-471e-8465-c0bd1be3459a/checkin?\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)\n\n"},"request_duration_ns":600000881470,"failed_checkins":10,"retry_after_ns":300755394517,"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-04T08:07:35.462Z","message":"applying new components data","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-es-containerhost","type":"fleet-server"},"log":{"source":"fleet-server-es-containerhost"},"fleet.agent.id":"8d1bdffd-e6d9-471e-8465-c0bd1be3459a","http.request.id":"01GX5MX6GT5693RQQA2V0TSXWW","ecs.version":"1.6.0","service.name":"fleet-server","server.address":"","fleet.access.apikey.id":"O1S7SocBlXtIZ4N9fxEa","req.Components":[{"id":"fleet-server-es-containerhost","message":"Healthy: communicating with pid '250'","status":"HEALTHY","type":"fleet-server","units":[{"id":"fleet-server-es-containerhost-fleet-server-fleet_server-elastic-cloud-fleet-server","message":"Running on policy with Fleet Server integration: policy-elastic-agent-on-cloud","status":"HEALTHY","type":"input"},{"id":"fleet-server-es-containerhost","message":"Running on policy with Fleet Server integration: policy-elastic-agent-on-cloud","status":"HEALTHY","type":"output"}]},{"id":"apm-es-containerhost","message":"Healthy: communicating with pid '265'","status":"HEALTHY","type":"apm","units":[{"id":"apm-es-containerhost","message":"Healthy","status":"HEALTHY","type":"output"},{"id":"apm-es-containerhost-elastic-cloud-apm","message":"Healthy","status":"HEALTHY","type":"input"}]}],"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-04T08:08:52.621Z","message":"applying new components data","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-es-containerhost","type":"fleet-server"},"log":{"source":"fleet-server-es-containerhost"},"server.address":"","fleet.agent.id":"9dd1a52e-e27d-4b7f-b07d-beef4de0489f","req.Components":[{"id":"system/metrics-default","message":"Healthy: communicating with pid '1211'","status":"HEALTHY","type":"system/metrics","units":[{"id":"system/metrics-default-system/metrics-system-79b88a00-d2a4-11ed-b058-e9b546b147b4","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"system/metrics-default","message":"Healthy","status":"HEALTHY","type":"output"}]},{"id":"log-default","message":"Healthy: communicating with pid '1219'","status":"HEALTHY","type":"log","units":[{"id":"log-default-logfile-system-79b88a00-d2a4-11ed-b058-e9b546b147b4","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"log-default","message":"Healthy","status":"HEALTHY","type":"output"}]},{"id":"http/metrics-monitoring","message":"Healthy: communicating with pid '1226'","status":"HEALTHY","type":"http/metrics","units":[{"id":"http/metrics-monitoring","message":"Healthy","status":"HEALTHY","type":"output"},{"id":"http/metrics-monitoring-metrics-monitoring-agent","message":"Healthy","status":"HEALTHY","type":"input"}]},{"id":"filestream-monitoring","message":"Healthy: communicating with pid '1235'","status":"HEALTHY","type":"filestream","units":[{"id":"filestream-monitoring-filestream-monitoring-agent","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"filestream-monitoring","message":"Healthy","status":"HEALTHY","type":"output"}]},{"id":"beat/metrics-monitoring","message":"Healthy: communicating with pid 
'1243'","status":"HEALTHY","type":"beat/metrics","units":[{"id":"beat/metrics-monitoring","message":"Healthy","status":"HEALTHY","type":"output"},{"id":"beat/metrics-monitoring-metrics-monitoring-beats","message":"Healthy","status":"HEALTHY","type":"input"}]}],"ecs.version":"1.6.0","http.request.id":"01GX5MZHM9AZNQG219MWWPA8XN","service.name":"fleet-server","fleet.access.apikey.id":"TUyjSocBPiM0FoVkwiQC","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-04T08:11:11.161Z","message":"Action delivered to agent on checkin","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-es-containerhost","type":"fleet-server"},"log":{"source":"fleet-server-es-containerhost"},"server.address":"","type":"REQUEST_DIAGNOSTICS","ecs.version":"1.6.0","http.request.id":"01GX5MX6GT5693RQQA2V0TSXWW","fleet.access.apikey.id":"O1S7SocBlXtIZ4N9fxEa","createdAt":"2023-04-04T08:11:10.024Z","timeout":0,"service.name":"fleet-server","fleet.agent.id":"8d1bdffd-e6d9-471e-8465-c0bd1be3459a","ackToken":"3ea0c35c-31ee-4774-8393-3c46acd4fbb2","id":"6277383d-812a-427c-86ad-deacb187eaa2","inputType":"","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-04-04T08:11:11.173Z","log.origin":

and near the end they have:

{"log.level":"error","@timestamp":"2023-04-04T08:11:11.173Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":221},"message":"Checkin request to fleet-server succeeded after 10 failures","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}

The "success" here comes right after the diagnostics action was dispatched to the agents, so the checkin returned immediately with that action, short-circuiting the long-poll duration.

I think what's happening is that we are getting the header timeout error (Client.Timeout exceeded while awaiting headers) because we've increased the long-poll duration in fleet-server for 8.8 (#2337), while the agent's timeout has not yet been increased to match (elastic/elastic-agent#2257).

There is an issue (#2387) to revert the poll change for fleet-server's 8.8 release once the branch is made.
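
To check a diagnostics bundle for this pattern, a small script along these lines could help. It is only a sketch and not part of fleet-server or elastic-agent: the function name is made up, and the sample entries are trimmed down to the relevant fields of the log lines quoted above. Every failing request lasts about 600 s (10 minutes) before the header timeout fires.

```python
import json

TIMEOUT_MARKER = "Client.Timeout exceeded while awaiting headers"


def summarize_failed_checkins(lines):
    """Return (failed_checkins, request_seconds, retry_after_seconds) per header-timeout entry."""
    rows = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip wrapped or truncated log lines
        if not isinstance(entry, dict):
            continue
        error = entry.get("error")
        message = error.get("message", "") if isinstance(error, dict) else ""
        if TIMEOUT_MARKER not in message:
            continue
        rows.append((
            entry.get("failed_checkins"),
            entry.get("request_duration_ns", 0) / 1e9,
            entry.get("retry_after_ns", 0) / 1e9,
        ))
    return rows


# Sample entries, trimmed from the logs above (illustrative subset of fields).
SAMPLE_LINES = [
    json.dumps({
        "log.level": "warn",
        "error": {"message": "context deadline exceeded (" + TIMEOUT_MARKER + ")"},
        "request_duration_ns": 600001306356,
        "failed_checkins": 2,
        "retry_after_ns": 217830189836,
    }),
    json.dumps({"log.level": "info", "message": "applying new components data"}),
    json.dumps({
        "log.level": "error",
        "error": {"message": "net/http: request canceled (" + TIMEOUT_MARKER + ")"},
        "request_duration_ns": 600000385060,
        "failed_checkins": 9,
        "retry_after_ns": 838374796602,
    }),
]

for failures, duration_s, retry_s in summarize_failed_checkins(SAMPLE_LINES):
    print(f"failure #{failures}: request took {duration_s:.1f}s, retry after {retry_s:.1f}s")
```

Lines that do not parse as JSON are skipped rather than treated as errors, since some of the pasted log lines are wrapped or cut off.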

@michel-laterman
Contributor

I've reverted the timeout changes on fleet-server; the snapshots built with this change should not have this issue anymore.

@cmacknz
Member

cmacknz commented Apr 4, 2023

Just to confirm: the Fleet Server appears offline here because each header timeout error causes the agent to back off for increasingly long durations?

The last error before it succeeds has a retry_after of 838374796602 ns / 1E9 / 60.0 = 13.97 minutes.
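
The same conversion can be applied to every retry_after_ns value in the log excerpt above. This is just a quick sketch: the helper name is made up, and the list is copied by hand from the failed checkin entries quoted earlier in this thread.

```python
def ns_to_minutes(ns: int) -> float:
    """Convert nanoseconds to minutes."""
    return ns / 1e9 / 60.0


# retry_after_ns values from the failed checkin entries quoted above,
# in log order (failed_checkins 2 through 10).
RETRY_AFTER_NS = [
    217830189836,
    391392729550,
    431338267924,
    553705204129,
    472784364737,
    369868042188,
    550974383773,
    838374796602,
    300755394517,
]

for failed_checkins, ns in enumerate(RETRY_AFTER_NS, start=2):
    print(f"failed_checkins={failed_checkins}: retry after {ns_to_minutes(ns):.2f} min")

print(f"longest wait: {max(ns_to_minutes(ns) for ns in RETRY_AFTER_NS):.2f} min")
```

Each of these waits comes on top of the roughly 10-minute request that timed out before it.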

@michel-laterman
Contributor

The agent's long-poll timeout is set to 10m on this build (https://github.com/elastic/elastic-agent/blob/main/internal/pkg/remote/config.go#L49), while fleet-server's was set to 30m, so when the agent that oversees fleet-server tries to check in, the request times out.

The eventual checkin success we see is caused by the diagnostics action being detected and returned.
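
As a rough illustration of the mismatch, here is a toy model of a single long-poll checkin. It is not the actual agent or fleet-server logic: the function, its parameters, and the 215-second action delay are all hypothetical, and only the 10m and 30m values come from this thread.

```python
from typing import Optional


def checkin_outcome(
    client_timeout_s: float,
    server_poll_s: float,
    action_after_s: Optional[float] = None,
) -> str:
    """Return which side ends a long-poll checkin first in this toy model."""
    # The server replies when an action arrives or when its poll window ends,
    # whichever comes first.
    if action_after_s is None:
        server_reply_s = server_poll_s
    else:
        server_reply_s = min(action_after_s, server_poll_s)
    # The client gives up if no response headers arrive before its timeout.
    if server_reply_s < client_timeout_s:
        return "response"
    return "client timeout"


AGENT_TIMEOUT_S = 10 * 60      # agent long-poll timeout on this build
FLEET_SERVER_POLL_S = 30 * 60  # fleet-server long-poll duration before the revert

# No action pending: the server holds the request longer than the client waits.
print(checkin_outcome(AGENT_TIMEOUT_S, FLEET_SERVER_POLL_S))
# An action (for example a diagnostics request) arrives early: the checkin returns.
print(checkin_outcome(AGENT_TIMEOUT_S, FLEET_SERVER_POLL_S, action_after_s=215))
# A server poll shorter than the client timeout: the checkin returns normally.
print(checkin_outcome(AGENT_TIMEOUT_S, 5 * 60))
```

In this model the checkin only succeeds when the server responds before the client timeout, which is why keeping the server's poll duration below the agent's timeout avoids the error.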

@jlind23
Contributor

jlind23 commented Apr 5, 2023

Closing this as fixed thanks to #2471

@jlind23 jlind23 closed this as completed Apr 5, 2023
@amolnater-qasource amolnater-qasource added the QA:Ready For Testing Code is merged and ready for QA to validate label Apr 5, 2023
@amolnater-qasource amolnater-qasource removed the QA:Ready For Testing Code is merged and ready for QA to validate label Jul 25, 2023
@harshitgupta-qasource

Bug Conversion:

We have updated 1 test case for this scenario in our Fleet test suite at:

Thanks!
