Support longer checkin intervals when the agent status has not changed #2257

joshdover · 2023-02-13T14:34:18Z

We've been doing scale testing over the past few months using a ~30 minute long poll duration (rather than the current default of 5m) and we are seeing much better results for very large clusters.

We're now ready to make this the default setting for Fleet Server and Agent. These changes can happen independently and do not necessarily need to land in the same release, though that would be preferred. The corresponding Fleet Server changes are tracked in:

Increase long poll default to ~30 minutes fleet-server#2337

There is some additional complexity to changing this on the Agent side, as we currently have an issue where Agent does not re-checkin with Fleet Server when its health status changes. If we update the long polling interval to 30 minutes, the agent status in the UI could be up to 30 minutes stale, rather than only 5 minutes stale.

To avoid this kind of regression, we need to update Agent to also cancel the current checkin and start a new one when its status changes. We will cap how often this happens to once every 5 minutes to avoid any extra load on large Fleets. We will investigate letting Agent report status changes more frequently separately from this change, see #1946.

Tasks

The client-side timeout in Agent should be longer than both Fleet Server's long poll (28m) and the proxy's timeout (30m 20s). We'll keep a similar buffer here, roughly 5 minutes over the proxy's timeout, so the client times out at 35 minutes.

Add the ability to cancel a checkin and start a new one when the status changes, with a 5 minute debounce
When there is a request error during checkin, the log message should link to the troubleshooting guide page
Update default for fleet.timeout to 35 minutes:

elastic-agent/internal/pkg/remote/config.go

Line 49 in c097697

transport.Timeout = 10 * time.Minute


michel-laterman · 2023-03-02T22:06:06Z

How will the action queue for scheduled actions be checked with a longer poll time?

EDIT: I've added a separate timer to dispatch scheduled actions in a managed agent #2344

cmacknz · 2023-04-24T19:39:59Z

Changed the description to "Support longer checkin intervals when the agent status has not changed" since we aren't going to increase the default timeout when this issue closes.

pchila · 2023-04-28T15:25:08Z

After a quick clarification with @cmacknz:

This change will not be in 8.8; we are targeting 8.9.
We need to add a migration for agents < 8.9 where we update the old default timeout value to 7 minutes in order to have checkin intervals of ~ 5 minutes
We need to set the elastic agent state debounce to a value of 7 minutes to avoid a race between fleet server and elastic agent at the end of a long poll
We need to be able to migrate the debounce value on upgrade as well.
Debounce settings should not be part of the fleet.enc in the first implementation (to avoid the problem of an older default value overriding a new one)

cmacknz · 2023-05-03T14:19:20Z

I added the timeout configuration migration to a separate issue in #2597 for tracking.

Also created #2598 to track updating horde to use the new checkin parameter in requests with a 7m timeout.

Allow migrating the stored value of the agent checkin long polling timeout #2597
Update the horde drones to use the agent checkin timeout request parameter #2598

elasticmachine · 2024-06-03T15:59:14Z

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

joshdover added the enhancement, Team:Elastic-Agent, and Project:FleetScaling labels Feb 13, 2023

joshdover mentioned this issue Feb 13, 2023

Investigate allowing the agent to check in more frequently when the agent status changes #1946

Open

pierrehilbert assigned pchila Feb 28, 2023

pchila mentioned this issue Mar 28, 2023

Feature/increase fleet long poll timeout #2408

Closed

8 tasks

This was referenced Apr 4, 2023

[Deployment]: Hosted fleet server gets offline inconsistently on 8.8 Snapshot. elastic/fleet-server#2460

Closed

Increase long poll default to ~30 minutes elastic/fleet-server#2337

Closed

pchila mentioned this issue Apr 21, 2023

Increase default elastic agent long poll duration to ~30 minutes #2536

Closed

1 task

kpollich mentioned this issue Apr 24, 2023

[Fleet] Create API endpoint to set fleet.timeout across all policies elastic/kibana#155654

Open

cmacknz changed the title ~~Increase Fleet long poll default to ~30 minutes~~ Allowing changing Fleet long poll timeout to ~30 minutes Apr 24, 2023

cmacknz changed the title ~~Allowing changing Fleet long poll timeout to ~30 minutes~~ Support longer checkin intervals when the agent status has not changed Apr 24, 2023

cmacknz mentioned this issue Apr 24, 2023

Increase default long poll duration to ~10 minutes #2544

Closed

2 tasks

pierrehilbert added the Team:Elastic-Agent-Control-Plane label Jun 3, 2024

