[Deployment]: Hosted fleet server gets offline inconsistently on 8.8 Snapshot. #2460
Comments
@karanbirsingh-qasource Please review. |
Secondary review for this ticket is Done |
I just observed this on one of my test clusters. |
@michel-laterman Could you take a look? |
Are there any logs from fleet-server available? |
Thank you for looking into this. Deployment 1: Deployment 2: Please let us know if anything else is required from our end. |
Both diagnostics have logs like:
and near the end they have:
The "success" here comes after the diagnostics action was dispatched to the agents, so the checkin returned immediately, short-circuiting the long-poll duration. I think what's happening is that we are getting the header timeout error ( There is an issue to revert the poll change for fleet-server's 8.8 release, #2387, once the branch is made. |
I've reverted the timeout changes on fleet-server; the snapshots built with this change should not have this issue anymore. |
Just to confirm: does the Fleet server go offline here because each header timeout error causes the agent to back off for increasingly long durations, making it appear offline? The last error before it succeeds has a retry_after of |
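To make the question above concrete, here is a minimal sketch of a capped exponential backoff schedule. The base delay, growth factor, and cap are hypothetical values chosen for illustration; the thread does not state the agent's actual backoff parameters, and the real implementation may also add jitter.

```python
def backoff_delays(base_s, factor, cap_s, attempts):
    """Return the wait (in seconds) before each retry.

    All parameters are illustrative assumptions, not the agent's
    real settings. Each delay grows by `factor` until it hits `cap_s`.
    """
    delays = []
    delay = base_s
    for _ in range(attempts):
        delays.append(min(delay, cap_s))
        delay *= factor
    return delays


def total_wait(delays):
    """Total time spent waiting across all retries."""
    return sum(delays)
```

With a schedule like this, a run of consecutive header timeouts stretches the gap between checkin attempts, which is consistent with an agent appearing offline for a while even though it is still running.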
The agent long-poll timeout is set to 10m on this build (https://github.com/elastic/elastic-agent/blob/main/internal/pkg/remote/config.go#L49), while fleet-server was set to 30m, so when the agent that oversees fleet-server tried to check in, the request timed out. The eventual checkin success we see was caused by the diagnostics action being detected and returned. |
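The mismatch described above can be sketched as a small model. This is not fleet-server or elastic-agent code; the function name and parameters are hypothetical, and only the 10m and 30m values come from the thread.

```python
from datetime import timedelta


def long_poll_outcome(client_timeout, server_poll, action_after=None):
    """Model how a single checkin long-poll ends.

    client_timeout: how long the client waits for a response.
    server_poll:    how long the server holds an idle request.
    action_after:   time until an action is queued for this client,
                    or None if no action arrives.
    """
    # The server responds at poll expiry, or earlier if an action arrives.
    if action_after is None:
        server_responds_at = server_poll
    else:
        server_responds_at = min(server_poll, action_after)
    # Assumption of this sketch: a tie counts as a client timeout.
    if server_responds_at < client_timeout:
        return "success"
    return "client timeout"


AGENT_TIMEOUT = timedelta(minutes=10)  # agent setting cited in the thread
SERVER_POLL = timedelta(minutes=30)    # fleet-server setting cited in the thread
```

Under these assumptions, `long_poll_outcome(AGENT_TIMEOUT, SERVER_POLL)` reports a client timeout, while passing `action_after=timedelta(minutes=2)` reports success, matching the observation that the checkin only succeeded once the diagnostics action was dispatched.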
Closing this as fixed thanks to #2471 |
Deployment Links:
Description:
The hosted Fleet Server intermittently goes offline and then returns to healthy on the 8.8 Snapshot.
Screenshots: