[Deployment]: Hosted fleet server gets offline inconsistently on 8.8 Snapshot. #2460

amolnater-qasource · 2023-03-31T11:31:12Z

Deployment Links:

Description:
The hosted Fleet Server intermittently goes offline and then returns to healthy on the 8.8 Snapshot.

Agents installed afterwards also flip between offline and healthy.

Screenshots:

amolnater-qasource · 2023-03-31T11:31:24Z

@karanbirsingh-qasource Please review.

ghost · 2023-03-31T11:31:56Z

Secondary review for this ticket is Done

cmacknz · 2023-03-31T15:28:41Z

I just observed this on one of my test clusters.

jen-huang · 2023-03-31T16:53:40Z

@michel-laterman Could you take a look?

michel-laterman · 2023-04-03T22:45:15Z

Are there any logs from fleet-server available?

amolnater-qasource · 2023-04-04T08:28:44Z

Hi @michel-laterman

Thank you for looking into this.
Please find attached below the diagnostics collected for the hosted Fleet Server:

Deployment 1:
elastic-agent-diagnostics-2023-04-04T08-11-11Z-00.zip

Deployment 2:
elastic-agent-diagnostics-2023-04-04T08-18-25Z-00.zip

Please let us know if anything else is required from our end.
Thanks!

michel-laterman · 2023-04-04T17:01:10Z

Both diagnostics have logs like:

{"log.level":"info","@timestamp":"2023-04-04T05:28:46.693Z","message":"New policy found on update and added","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-es-containerhost","type":"fleet-server"},"log":{"source":"fleet-server-es-containerhost"},"service.name":"fleet-server","rev":2,"ctx":"policy agent monitor","fleet.policy.id":"9d4af4a0-d2a7-11ed-b058-e9b546b147b4","coord":1,"ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-04T05:32:24.887Z","message":"applying new components data","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-es-containerhost","type":"fleet-server"},"log":{"source":"fleet-server-es-containerhost"},"ecs.version":"1.6.0","fleet.access.apikey.id":"TUyjSocBPiM0FoVkwiQC","service.name":"fleet-server","http.request.id":"01GX5C11XSD892ZV0MV2TYRGWE","server.address":"","fleet.agent.id":"9dd1a52e-e27d-4b7f-b07d-beef4de0489f","req.Components":[{"id":"system/metrics-default","message":"Healthy: communicating with pid '1211'","status":"HEALTHY","type":"system/metrics","units":[{"id":"system/metrics-default-system/metrics-system-79b88a00-d2a4-11ed-b058-e9b546b147b4","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"system/metrics-default","message":"Healthy","status":"HEALTHY","type":"output"}]},{"id":"log-default","message":"Healthy: communicating with pid '1219'","status":"HEALTHY","type":"log","units":[{"id":"log-default-logfile-system-79b88a00-d2a4-11ed-b058-e9b546b147b4","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"log-default","message":"Healthy","status":"HEALTHY","type":"output"}]},{"id":"http/metrics-monitoring","message":"Healthy: communicating with pid '1226'","status":"HEALTHY","type":"http/metrics","units":[{"id":"http/metrics-monitoring-metrics-monitoring-agent","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"http/metrics-monitoring","message":"Healthy","status":"HEALTHY","type":"output"}]},{"id":"filestream-monitoring","message":"Healthy: communicating with pid '1235'","status":"HEALTHY","type":"filestream","units":[{"id":"filestream-monitoring-filestream-monitoring-agent","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"filestream-monitoring","message":"Healthy","status":"HEALTHY","type":"output"}]},{"id":"beat/metrics-monitoring","message":"Healthy: communicating with pid 
'1243'","status":"HEALTHY","type":"beat/metrics","units":[{"id":"beat/metrics-monitoring-metrics-monitoring-beats","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"beat/metrics-monitoring","message":"Healthy","status":"HEALTHY","type":"output"}]}],"ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2023-04-04T05:38:46.148Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":194},"message":"Possible transient error during checkin with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"fail to checkin to fleet-server: all hosts failed: 1 error occurred:\n\t* requester 0/1 to host https://localhost:8221/ errored: Post \"https://localhost:8221/api/fleet/agents/8d1bdffd-e6d9-471e-8465-c0bd1be3459a/checkin?\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)\n\n"},"request_duration_ns":600001306356,"failed_checkins":2,"retry_after_ns":217830189836,"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-04-04T05:52:24.203Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":198},"message":"Cannot checkin in with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"fail to checkin to fleet-server: all hosts failed: 1 error occurred:\n\t* requester 0/1 to host https://localhost:8221/ errored: Post \"https://localhost:8221/api/fleet/agents/8d1bdffd-e6d9-471e-8465-c0bd1be3459a/checkin?\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)\n\n"},"request_duration_ns":600002008790,"failed_checkins":3,"retry_after_ns":391392729550,"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-04-04T06:08:55.816Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":198},"message":"Cannot checkin in with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"fail to checkin to fleet-server: all hosts failed: 1 error occurred:\n\t* requester 0/1 to host https://localhost:8221/ errored: Post \"https://localhost:8221/api/fleet/agents/8d1bdffd-e6d9-471e-8465-c0bd1be3459a/checkin?\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)\n\n"},"request_duration_ns":600000725609,"failed_checkins":4,"retry_after_ns":431338267924,"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-04T06:13:31.327Z","message":"applying new components data","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-es-containerhost","type":"fleet-server"},"log":{"source":"fleet-server-es-containerhost"},"http.request.id":"01GX5ECAHAXQCA9SZE0TY5PH8P","ecs.version":"1.6.0","service.name":"fleet-server","server.address":"","fleet.agent.id":"9dd1a52e-e27d-4b7f-b07d-beef4de0489f","fleet.access.apikey.id":"TUyjSocBPiM0FoVkwiQC","req.Components":[{"id":"system/metrics-default","message":"Healthy: communicating with pid '1211'","status":"HEALTHY","type":"system/metrics","units":[{"id":"system/metrics-default-system/metrics-system-79b88a00-d2a4-11ed-b058-e9b546b147b4","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"system/metrics-default","message":"Healthy","status":"HEALTHY","type":"output"}]},{"id":"log-default","message":"Healthy: communicating with pid '1219'","status":"HEALTHY","type":"log","units":[{"id":"log-default-logfile-system-79b88a00-d2a4-11ed-b058-e9b546b147b4","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"log-default","message":"Healthy","status":"HEALTHY","type":"output"}]},{"id":"http/metrics-monitoring","message":"Healthy: communicating with pid '1226'","status":"HEALTHY","type":"http/metrics","units":[{"id":"http/metrics-monitoring-metrics-monitoring-agent","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"http/metrics-monitoring","message":"Healthy","status":"HEALTHY","type":"output"}]},{"id":"filestream-monitoring","message":"Healthy: communicating with pid '1235'","status":"HEALTHY","type":"filestream","units":[{"id":"filestream-monitoring-filestream-monitoring-agent","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"filestream-monitoring","message":"Healthy","status":"HEALTHY","type":"output"}]},{"id":"beat/metrics-monitoring","message":"Healthy: communicating with pid 
'1243'","status":"HEALTHY","type":"beat/metrics","units":[{"id":"beat/metrics-monitoring","message":"Healthy","status":"HEALTHY","type":"output"},{"id":"beat/metrics-monitoring-metrics-monitoring-beats","message":"Healthy","status":"HEALTHY","type":"input"}]}],"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-04-04T06:26:07.377Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":198},"message":"Cannot checkin in with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"fail to checkin to fleet-server: all hosts failed: 1 error occurred:\n\t* requester 0/1 to host https://localhost:8221/ errored: Post \"https://localhost:8221/api/fleet/agents/8d1bdffd-e6d9-471e-8465-c0bd1be3459a/checkin?\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)\n\n"},"request_duration_ns":600000938429,"failed_checkins":5,"retry_after_ns":553705204129,"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-04-04T06:45:21.306Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":198},"message":"Cannot checkin in with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"fail to checkin to fleet-server: all hosts failed: 1 error occurred:\n\t* requester 0/1 to host https://localhost:8221/ errored: Post \"https://localhost:8221/api/fleet/agents/8d1bdffd-e6d9-471e-8465-c0bd1be3459a/checkin?\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)\n\n"},"request_duration_ns":600001252744,"failed_checkins":6,"retry_after_ns":472784364737,"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-04T06:48:54.953Z","message":"applying new components data","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-es-containerhost","type":"fleet-server"},"log":{"source":"fleet-server-es-containerhost"},"ecs.version":"1.6.0","server.address":"","fleet.agent.id":"9dd1a52e-e27d-4b7f-b07d-beef4de0489f","req.Components":[{"id":"system/metrics-default","message":"Healthy: communicating with pid '1211'","status":"HEALTHY","type":"system/metrics","units":[{"id":"system/metrics-default-system/metrics-system-79b88a00-d2a4-11ed-b058-e9b546b147b4","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"system/metrics-default","message":"Healthy","status":"HEALTHY","type":"output"}]},{"id":"log-default","message":"Healthy: communicating with pid '1219'","status":"HEALTHY","type":"log","units":[{"id":"log-default-logfile-system-79b88a00-d2a4-11ed-b058-e9b546b147b4","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"log-default","message":"Healthy","status":"HEALTHY","type":"output"}]},{"id":"http/metrics-monitoring","message":"Healthy: communicating with pid '1226'","status":"HEALTHY","type":"http/metrics","units":[{"id":"http/metrics-monitoring-metrics-monitoring-agent","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"http/metrics-monitoring","message":"Healthy","status":"HEALTHY","type":"output"}]},{"id":"filestream-monitoring","message":"Healthy: communicating with pid '1235'","status":"HEALTHY","type":"filestream","units":[{"id":"filestream-monitoring-filestream-monitoring-agent","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"filestream-monitoring","message":"Healthy","status":"HEALTHY","type":"output"}]},{"id":"beat/metrics-monitoring","message":"Healthy: communicating with pid 
'1243'","status":"HEALTHY","type":"beat/metrics","units":[{"id":"beat/metrics-monitoring-metrics-monitoring-beats","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"beat/metrics-monitoring","message":"Healthy","status":"HEALTHY","type":"output"}]}],"service.name":"fleet-server","http.request.id":"01GX5GD4CGJGZZA541Y8SPA45Y","fleet.access.apikey.id":"TUyjSocBPiM0FoVkwiQC","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-04-04T07:03:14.314Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":198},"message":"Cannot checkin in with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"fail to checkin to fleet-server: all hosts failed: 1 error occurred:\n\t* requester 0/1 to host https://localhost:8221/ errored: Post \"https://localhost:8221/api/fleet/agents/8d1bdffd-e6d9-471e-8465-c0bd1be3459a/checkin?\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)\n\n"},"request_duration_ns":600001327778,"failed_checkins":7,"retry_after_ns":369868042188,"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-04T07:09:24.563Z","message":"applying new components data","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-es-containerhost","type":"fleet-server"},"log":{"source":"fleet-server-es-containerhost"},"server.address":"","fleet.agent.id":"8d1bdffd-e6d9-471e-8465-c0bd1be3459a","ecs.version":"1.6.0","service.name":"fleet-server","http.request.id":"01GX5HJNJ3JQSMVZYHN8EXWJHE","fleet.access.apikey.id":"O1S7SocBlXtIZ4N9fxEa","req.Components":[{"id":"fleet-server-es-containerhost","message":"Healthy: communicating with pid '250'","status":"HEALTHY","type":"fleet-server","units":[{"id":"fleet-server-es-containerhost-fleet-server-fleet_server-elastic-cloud-fleet-server","message":"Running on policy with Fleet Server integration: policy-elastic-agent-on-cloud","status":"HEALTHY","type":"input"},{"id":"fleet-server-es-containerhost","message":"Running on policy with Fleet Server integration: policy-elastic-agent-on-cloud","status":"HEALTHY","type":"output"}]},{"id":"apm-es-containerhost","message":"Healthy: communicating with pid '265'","status":"HEALTHY","type":"apm","units":[{"id":"apm-es-containerhost","message":"Healthy","status":"HEALTHY","type":"output"},{"id":"apm-es-containerhost-elastic-cloud-apm","message":"Healthy","status":"HEALTHY","type":"input"}]}],"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-04-04T07:19:24.420Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":198},"message":"Cannot checkin in with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"fail to checkin to fleet-server: all hosts failed: 1 error occurred:\n\t* requester 0/1 to host https://localhost:8221/ errored: Post \"https://localhost:8221/api/fleet/agents/8d1bdffd-e6d9-471e-8465-c0bd1be3459a/checkin?\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)\n\n"},"request_duration_ns":600001188498,"failed_checkins":8,"retry_after_ns":550974383773,"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-04T07:28:35.886Z","message":"applying new components data","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-es-containerhost","type":"fleet-server"},"log":{"source":"fleet-server-es-containerhost"},"ecs.version":"1.6.0","http.request.id":"01GX5JNSS0WQSK0WT2G42QJ0EZ","fleet.access.apikey.id":"O1S7SocBlXtIZ4N9fxEa","service.name":"fleet-server","server.address":"","fleet.agent.id":"8d1bdffd-e6d9-471e-8465-c0bd1be3459a","req.Components":[{"id":"fleet-server-es-containerhost","message":"Healthy: communicating with pid '250'","status":"HEALTHY","type":"fleet-server","units":[{"id":"fleet-server-es-containerhost-fleet-server-fleet_server-elastic-cloud-fleet-server","message":"Running on policy with Fleet Server integration: policy-elastic-agent-on-cloud","status":"HEALTHY","type":"input"},{"id":"fleet-server-es-containerhost","message":"Running on policy with Fleet Server integration: policy-elastic-agent-on-cloud","status":"HEALTHY","type":"output"}]},{"id":"apm-es-containerhost","message":"Healthy: communicating with pid '265'","status":"HEALTHY","type":"apm","units":[{"id":"apm-es-containerhost-elastic-cloud-apm","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"apm-es-containerhost","message":"Healthy","status":"HEALTHY","type":"output"}]}],"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-04-04T07:38:35.616Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":198},"message":"Cannot checkin in with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"fail to checkin to fleet-server: all hosts failed: 1 error occurred:\n\t* requester 0/1 to host https://localhost:8221/ errored: Post \"https://localhost:8221/api/fleet/agents/8d1bdffd-e6d9-471e-8465-c0bd1be3459a/checkin?\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)\n\n"},"request_duration_ns":600000385060,"failed_checkins":9,"retry_after_ns":838374796602,"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-04T07:45:48.278Z","message":"applying new components data","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-es-containerhost","type":"fleet-server"},"log":{"source":"fleet-server-es-containerhost"},"http.request.id":"01GX5KN9TF3JP7F20SKTKXPXC4","server.address":"","fleet.agent.id":"9dd1a52e-e27d-4b7f-b07d-beef4de0489f","fleet.access.apikey.id":"TUyjSocBPiM0FoVkwiQC","ecs.version":"1.6.0","service.name":"fleet-server","req.Components":[{"id":"system/metrics-default","message":"Healthy: communicating with pid '1211'","status":"HEALTHY","type":"system/metrics","units":[{"id":"system/metrics-default-system/metrics-system-79b88a00-d2a4-11ed-b058-e9b546b147b4","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"system/metrics-default","message":"Healthy","status":"HEALTHY","type":"output"}]},{"id":"log-default","message":"Healthy: communicating with pid '1219'","status":"HEALTHY","type":"log","units":[{"id":"log-default-logfile-system-79b88a00-d2a4-11ed-b058-e9b546b147b4","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"log-default","message":"Healthy","status":"HEALTHY","type":"output"}]},{"id":"http/metrics-monitoring","message":"Healthy: communicating with pid '1226'","status":"HEALTHY","type":"http/metrics","units":[{"id":"http/metrics-monitoring-metrics-monitoring-agent","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"http/metrics-monitoring","message":"Healthy","status":"HEALTHY","type":"output"}]},{"id":"filestream-monitoring","message":"Healthy: communicating with pid '1235'","status":"HEALTHY","type":"filestream","units":[{"id":"filestream-monitoring","message":"Healthy","status":"HEALTHY","type":"output"},{"id":"filestream-monitoring-filestream-monitoring-agent","message":"Healthy","status":"HEALTHY","type":"input"}]},{"id":"beat/metrics-monitoring","message":"Healthy: communicating with pid 
'1243'","status":"HEALTHY","type":"beat/metrics","units":[{"id":"beat/metrics-monitoring-metrics-monitoring-beats","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"beat/metrics-monitoring","message":"Healthy","status":"HEALTHY","type":"output"}]}],"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-04-04T08:02:34.215Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":198},"message":"Cannot checkin in with fleet-server, retrying","log":{"source":"elastic-agent"},"error":{"message":"fail to checkin to fleet-server: all hosts failed: 1 error occurred:\n\t* requester 0/1 to host https://localhost:8221/ errored: Post \"https://localhost:8221/api/fleet/agents/8d1bdffd-e6d9-471e-8465-c0bd1be3459a/checkin?\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)\n\n"},"request_duration_ns":600000881470,"failed_checkins":10,"retry_after_ns":300755394517,"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-04T08:07:35.462Z","message":"applying new components data","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-es-containerhost","type":"fleet-server"},"log":{"source":"fleet-server-es-containerhost"},"fleet.agent.id":"8d1bdffd-e6d9-471e-8465-c0bd1be3459a","http.request.id":"01GX5MX6GT5693RQQA2V0TSXWW","ecs.version":"1.6.0","service.name":"fleet-server","server.address":"","fleet.access.apikey.id":"O1S7SocBlXtIZ4N9fxEa","req.Components":[{"id":"fleet-server-es-containerhost","message":"Healthy: communicating with pid '250'","status":"HEALTHY","type":"fleet-server","units":[{"id":"fleet-server-es-containerhost-fleet-server-fleet_server-elastic-cloud-fleet-server","message":"Running on policy with Fleet Server integration: policy-elastic-agent-on-cloud","status":"HEALTHY","type":"input"},{"id":"fleet-server-es-containerhost","message":"Running on policy with Fleet Server integration: policy-elastic-agent-on-cloud","status":"HEALTHY","type":"output"}]},{"id":"apm-es-containerhost","message":"Healthy: communicating with pid '265'","status":"HEALTHY","type":"apm","units":[{"id":"apm-es-containerhost","message":"Healthy","status":"HEALTHY","type":"output"},{"id":"apm-es-containerhost-elastic-cloud-apm","message":"Healthy","status":"HEALTHY","type":"input"}]}],"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-04T08:08:52.621Z","message":"applying new components data","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-es-containerhost","type":"fleet-server"},"log":{"source":"fleet-server-es-containerhost"},"server.address":"","fleet.agent.id":"9dd1a52e-e27d-4b7f-b07d-beef4de0489f","req.Components":[{"id":"system/metrics-default","message":"Healthy: communicating with pid '1211'","status":"HEALTHY","type":"system/metrics","units":[{"id":"system/metrics-default-system/metrics-system-79b88a00-d2a4-11ed-b058-e9b546b147b4","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"system/metrics-default","message":"Healthy","status":"HEALTHY","type":"output"}]},{"id":"log-default","message":"Healthy: communicating with pid '1219'","status":"HEALTHY","type":"log","units":[{"id":"log-default-logfile-system-79b88a00-d2a4-11ed-b058-e9b546b147b4","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"log-default","message":"Healthy","status":"HEALTHY","type":"output"}]},{"id":"http/metrics-monitoring","message":"Healthy: communicating with pid '1226'","status":"HEALTHY","type":"http/metrics","units":[{"id":"http/metrics-monitoring","message":"Healthy","status":"HEALTHY","type":"output"},{"id":"http/metrics-monitoring-metrics-monitoring-agent","message":"Healthy","status":"HEALTHY","type":"input"}]},{"id":"filestream-monitoring","message":"Healthy: communicating with pid '1235'","status":"HEALTHY","type":"filestream","units":[{"id":"filestream-monitoring-filestream-monitoring-agent","message":"Healthy","status":"HEALTHY","type":"input"},{"id":"filestream-monitoring","message":"Healthy","status":"HEALTHY","type":"output"}]},{"id":"beat/metrics-monitoring","message":"Healthy: communicating with pid 
'1243'","status":"HEALTHY","type":"beat/metrics","units":[{"id":"beat/metrics-monitoring","message":"Healthy","status":"HEALTHY","type":"output"},{"id":"beat/metrics-monitoring-metrics-monitoring-beats","message":"Healthy","status":"HEALTHY","type":"input"}]}],"ecs.version":"1.6.0","http.request.id":"01GX5MZHM9AZNQG219MWWPA8XN","service.name":"fleet-server","fleet.access.apikey.id":"TUyjSocBPiM0FoVkwiQC","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-04-04T08:11:11.161Z","message":"Action delivered to agent on checkin","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-es-containerhost","type":"fleet-server"},"log":{"source":"fleet-server-es-containerhost"},"server.address":"","type":"REQUEST_DIAGNOSTICS","ecs.version":"1.6.0","http.request.id":"01GX5MX6GT5693RQQA2V0TSXWW","fleet.access.apikey.id":"O1S7SocBlXtIZ4N9fxEa","createdAt":"2023-04-04T08:11:10.024Z","timeout":0,"service.name":"fleet-server","fleet.agent.id":"8d1bdffd-e6d9-471e-8465-c0bd1be3459a","ackToken":"3ea0c35c-31ee-4774-8393-3c46acd4fbb2","id":"6277383d-812a-427c-86ad-deacb187eaa2","inputType":"","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-04-04T08:11:11.173Z","log.origin":

and near the end they have:

{"log.level":"error","@timestamp":"2023-04-04T08:11:11.173Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":221},"message":"Checkin request to fleet-server succeeded after 10 failures","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}

The "success" here comes after the diagnostics action was dispatched to the agents, so the checkin returned immediately with that action, short-circuiting the long-poll duration.

I think we are getting the header timeout error (`Client.Timeout exceeded while awaiting headers`) because we increased the long-poll duration in fleet-server for 8.8 (#2337), while the agent's timeout has not yet been increased to match (elastic/elastic-agent#2257).

There is an issue to revert the poll change for fleet-server's 8.8 release (#2387) once the branch is made.
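To check this pattern against a diagnostics bundle, here is a minimal sketch that pulls the failed checkins out of newline-delimited JSON log lines. The function name `checkin_failures` and the `SAMPLE` list are hypothetical; the sample lines are trimmed to the fields the script reads, with values copied from the excerpt above, and the last entry mimics the truncated line at the end of that excerpt.

```python
import json
from typing import Iterable, List, Tuple


def checkin_failures(lines: Iterable[str]) -> List[Tuple[str, int, float]]:
    """Return (timestamp, failed_checkins, request_duration_seconds) per failed checkin."""
    failures = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if not isinstance(record, dict) or "failed_checkins" not in record:
            continue  # not a checkin failure record
        failures.append(
            (
                record.get("@timestamp", ""),
                record["failed_checkins"],
                record.get("request_duration_ns", 0) / 1e9,  # ns -> seconds
            )
        )
    return failures


# Trimmed sample lines (assumption: only the fields used here are kept).
SAMPLE = [
    '{"log.level":"warn","@timestamp":"2023-04-04T05:38:46.148Z",'
    '"request_duration_ns":600001306356,"failed_checkins":2,"retry_after_ns":217830189836}',
    '{"log.level":"error","@timestamp":"2023-04-04T05:52:24.203Z",'
    '"request_duration_ns":600002008790,"failed_checkins":3,"retry_after_ns":391392729550}',
    '{"log.level":"error","@timestamp":"2023-04-04T08:11:11.173Z","log.origin":',
]

if __name__ == "__main__":
    for timestamp, count, seconds in checkin_failures(SAMPLE):
        print(f"{timestamp} failed_checkins={count} request_duration={seconds:.1f}s")
```

In the excerpt above, every failed request lasts roughly 600 seconds, which is what one would expect if the client gives up at a fixed 10-minute timeout.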

michel-laterman · 2023-04-04T18:28:53Z

I've reverted the timeout changes on fleet-server; the snapshots built with this change should not have this issue anymore.

cmacknz · 2023-04-04T19:50:06Z

Just to confirm: the Fleet Server appears offline here because each header timeout error causes the agent to back off for increasingly long durations?

The last error before it succeeds has a retry_after of 838374796602 ns / 1e9 / 60.0 = 13.97 minutes.
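The conversion can be reproduced with a short script. This is only a sketch: the dictionary name and helper are hypothetical, and the values are the `retry_after_ns` fields copied from the checkin errors in the log excerpt earlier in this thread, keyed by their `failed_checkins` counter.

```python
# retry_after_ns values copied from the log excerpt in this thread,
# keyed by the failed_checkins counter on the same log line.
RETRY_AFTER_NS = {
    2: 217830189836,
    3: 391392729550,
    4: 431338267924,
    5: 553705204129,
    6: 472784364737,
    7: 369868042188,
    8: 550974383773,
    9: 838374796602,
    10: 300755394517,
}


def ns_to_minutes(ns: int) -> float:
    """Convert a duration in nanoseconds to minutes."""
    return ns / 1e9 / 60.0


if __name__ == "__main__":
    for count, ns in RETRY_AFTER_NS.items():
        print(f"failed_checkins={count}: retry after {ns_to_minutes(ns):.2f} min")
```

In that excerpt the 838374796602 ns value belongs to `failed_checkins` 9, and the delays are not strictly increasing from one failure to the next.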

michel-laterman · 2023-04-04T20:07:01Z

The agent's long-poll timeout is set to 10m on this build (https://github.com/elastic/elastic-agent/blob/main/internal/pkg/remote/config.go#L49), while fleet-server's was set to 30m, so when the agent that oversees fleet-server tries to check in, the request times out.

The eventual checkin success we see is caused by the diagnostics action being detected and returned.
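As a rough illustration of the mismatch, here is a simplified model, not the actual fleet-server or elastic-agent code. The function `checkin_outcome` and both constants are hypothetical names; only the 10m and 30m values come from this thread.

```python
from typing import Optional

AGENT_CLIENT_TIMEOUT_S = 10 * 60  # agent HTTP client timeout (10m, from this thread)
SERVER_LONG_POLL_S = 30 * 60      # fleet-server long-poll duration (30m, from this thread)


def checkin_outcome(
    client_timeout_s: float,
    server_poll_s: float,
    action_after_s: Optional[float] = None,
) -> str:
    """Return "ok" if the server answers before the client gives up, else "timeout".

    The server is modeled as holding the request until either an action
    becomes available (action_after_s) or the long poll expires.
    """
    if action_after_s is None:
        respond_at = server_poll_s
    else:
        respond_at = min(action_after_s, server_poll_s)
    return "ok" if respond_at < client_timeout_s else "timeout"


if __name__ == "__main__":
    # No pending action: the server holds the request longer than the client waits.
    print(checkin_outcome(AGENT_CLIENT_TIMEOUT_S, SERVER_LONG_POLL_S))
    # An action (e.g. a diagnostics request) arrives quickly: the checkin returns.
    print(checkin_outcome(AGENT_CLIENT_TIMEOUT_S, SERVER_LONG_POLL_S, action_after_s=1))
```

Under this model, any server long-poll duration at or above the client timeout fails whenever no action is pending, which is why the two values have to be changed together.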

jlind23 · 2023-04-05T05:47:24Z

Closing this as fixed thanks to #2471

harshitgupta-qasource · 2024-01-24T10:16:37Z

`Bug Conversion:`

We have updated 1 test case for this scenario in our Fleet test suite at:

https://elastic.testrail.io/index.php?/cases/view/149907

Thanks!

amolnater-qasource added the bug, Team:Fleet, and impact:high labels Mar 31, 2023

jen-huang assigned michel-laterman Mar 31, 2023

This was referenced Apr 4, 2023

Revert #2341 #2471

Merged

Increase long poll default to ~30 minutes #2337

Closed

jlind23 closed this as completed Apr 5, 2023

amolnater-qasource added the QA:Ready For Testing label Apr 5, 2023

amolnater-qasource removed the QA:Ready For Testing label Jul 25, 2023
