Support longer checkin intervals when the agent status has not changed #2257

Open
3 tasks
joshdover opened this issue Feb 13, 2023 · 5 comments
Labels: enhancement (New feature or request), Project:FleetScaling, Team:Elastic-Agent (Label for the Agent team), Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team)

Comments

@joshdover
Contributor

joshdover commented Feb 13, 2023

We've been doing scale testing over the past few months using a ~30 minute long poll duration (rather than the current default of 5m), and we are seeing much better results for very large clusters.

We're now ready to make this the default setting for Fleet Server and Agent. These changes can happen independently and do not necessarily need to land in the same release, though that would be preferred. The corresponding Fleet Server changes are tracked in:

There is some additional complexity to changing this on the Agent side, as we currently have an issue where Agent will not re-checkin with Fleet Server when its health status changes. If we update the long polling interval to 30 minutes, the agent status in the UI could be up to 30 minutes stale, rather than only 5 minutes stale.

To avoid this kind of regression, we need to update Agent to also cancel the current checkin and start a new one when its status changes. However, we will cap this at once every 5 minutes to avoid extra load on large Fleets. We will investigate letting Agent send these updates more frequently separately from this change; see #1946.

Tasks

The client-side timeout in Agent should be longer than both Fleet Server's timeout (28m) and the proxy's timeout (30m 20s). We'll keep a similar buffer here, roughly 5 minutes over the proxy's timeout, and time out at 35 minutes on the client.

  • Add the ability to cancel a checkin and start a new one when the status changes, with a 5 minute debounce
  • When there is a request error during checkin, the log message should link to the troubleshooting guide page
  • Update the default for fleet.timeout to 35 minutes. The current value is:
    transport.Timeout = 10 * time.Minute
@michel-laterman
Contributor

michel-laterman commented Mar 2, 2023

How will the action queue for scheduled actions be checked with a longer poll time?

EDIT: I've added a separate timer to dispatch scheduled actions in a managed agent #2344

cmacknz changed the title from "Increase Fleet long poll default to ~30 minutes" to "Allowing changing Fleet long poll timeout to ~30 minutes" on Apr 24, 2023
cmacknz changed the title from "Allowing changing Fleet long poll timeout to ~30 minutes" to "Support longer checkin intervals when the agent status has not changed" on Apr 24, 2023
@cmacknz
Member

cmacknz commented Apr 24, 2023

Changed the description to "Support longer checkin intervals when the agent status has not changed" since we aren't going to increase the default timeout when this issue closes.

@pchila
Member

pchila commented Apr 28, 2023

After a quick clarification with @cmacknz:

  • This change will not be in 8.8; we are targeting 8.9 only.
  • We need to add a migration for agents < 8.9 that updates the old default timeout value to 7 minutes, in order to have checkin intervals of ~5 minutes.
  • We need to set the Elastic Agent state debounce to 7 minutes to avoid a race between Fleet Server and Elastic Agent at the end of a long poll.
  • We need to be able to migrate the debounce value on upgrade as well.
  • Debounce settings should not be part of fleet.enc in the first implementation (to avoid the problem of an older default value overriding a new one).

@cmacknz
Member

cmacknz commented May 3, 2023

I added the timeout configuration migration to a separate issue in #2597 for tracking.

Also created #2598 to track updating horde to use the new checkin parameter in requests with a 7m timeout.

pierrehilbert added the Team:Elastic-Agent-Control-Plane label on Jun 3, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)


6 participants