managed-agent will periodically call dispatcher.Dispatch #2344

michel-laterman · 2023-03-02T22:59:31Z

What does this PR do?

Adds a timer to the goroutine that passes actions from the fleet-gateway to the dispatcher. When the timer fires, the goroutine calls dispatch with no actions.

Why is it important?

Scheduled actions are only run when the dispatch.Dispatch() method is called.
Currently this method is only called when an action list is received from the fleet-gateway.

If the gateway does not send any actions after a checkin, the dispatcher is not called and scheduled actions are not checked (until a new action is received).

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
~~I have made corresponding change to the default configuration files~~
I have added tests that prove my fix is effective or that my feature works
I have added an entry in ./changelog/fragments using the changelog tool

How to test this PR locally

Schedule an upgrade for a future time.
The upgrade should run even if no new action is sent after the scheduled time.

Related issues

Closes Fleet stuck in "Updating..." status when attempting to upgrade agents to v8.6.1 or v8.6.2 #2343

elasticmachine · 2023-03-02T23:21:25Z

💚 Build Succeeded

Build stats

Start Time: 2023-03-06T21:43:32.164+0000
Duration: 17 min 32 sec

Test stats 🧪

Test	Results
Failed	0
Passed	4975
Skipped	15
Total	4990

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

/test : Re-trigger the build.
/package : Generate the packages.
run integration tests : Run the Elastic Agent Integration tests.
run end-to-end tests : Generate the packages and run the E2E Tests.
run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

elasticmachine · 2023-03-02T23:21:34Z

🌐 Coverage report

Name	Metrics % (`covered/total`)	Diff
Packages	98.361% (`60/61`)	❕
Files	69.378% (`145/209`)	❕
Classes	68.329% (`274/401`)	❕
Methods	53.546% (`838/1565`)	❕
Lines	38.862% (`9187/23640`)	❕
Conditionals	100.0% (`0/0`)	💚

ycombinator · 2023-03-03T10:43:04Z

While this change will have the desired effect of scheduled actions being dequeued and dispatched even when no subsequent actions are received from Fleet Server, I have two concerns/comments:

1. With the proposed design, we will call the dispatch.Dispatch() method once every 1 - 1.5 seconds by default, since that's the default configuration for how often the agent checks in with Fleet Server. By comparison, we currently call dispatch.Dispatch() much less frequently, since most of the time the checkin response from Fleet Server will likely not contain any actions. I'm not sure whether this change will have any impact on the performance of the Agent, particularly CPU usage.
2. I think the design of calling dispatch.Dispatch() only when we have a successful checkin call to the Fleet Server might be exposing a more fundamental issue, particularly in relation to scheduled actions. What if the Fleet Server is temporarily unreachable for some reason? In the current and proposed designs, we will not execute any scheduled actions until the connection is restored and we have a successful checkin call again. In fact, because scheduled actions support expiration, we may never execute a scheduled action if the connection is restored only after the expiration period has elapsed. I don't know enough about the history of the project to know whether this behavior is desired. Today the only scheduled action possible is upgrade; I would wager that upgrades should be processed even if there's some problem connecting to the Fleet Server for the checkin call, but maybe it's by design that we don't do this?

An alternative design might be for the managedConfigManager.Run() method to start the dispatcher in its own goroutine so it can keep (efficiently) processing its action queue over time. We can then keep the current design of sending actions from the fleet gateway to the dispatcher only when the checkin call is successful and returns a non-empty list of actions. This way we don't incur any potential performance penalty from the dispatcher reprocessing the action queue too frequently, and we solve the potential problem of coupling Fleet Server checkin API availability to processing scheduled actions.

michel-laterman · 2023-03-06T15:16:48Z

@ycombinator
I thought the checkin was a long-poll checkin; however, you raised a good point about connectivity. I'll use an alternate mechanism instead.

ycombinator · 2023-03-06T16:03:21Z

internal/pkg/agent/application/managed_mode.go

+			case <-t.C: // periodically call the dispatcher to handle scheduled actions.
+				m.dispatcher.Dispatch(ctx, actionAcker)
+				t.Reset(dispatchFlushInterval)


👍. I think this is a slightly better approach because it at least decouples the processing of the dispatcher's action queue from the Fleet Checkin call.

I think we can have an even more efficient implementation of "when to re-dispatch" by deciding to process the action queue either:

when new actions are added to it (already handled by the case below this line), or

just-in-time for the next scheduled action (by looking at the next scheduled action's time and setting up a time.Timer accordingly) rather than on a fixed interval like we are doing here. But we can make this change in a separate PR if it's too complicated for this one.

cmacknz · 2023-03-06T20:20:46Z

> I thought the checkin was a long-poll checkin

Just to confirm, the check in requests do use long polling but fleet-server is responsible for holding the request open. The short 1 - 1.5s request durations on the agent only come into play when the server has terminated the long poll request.

Big thanks to @ycombinator for suggesting that we decouple the execution of scheduled actions from the checkin period regardless. This is a much better approach and avoids the possibility of hard-to-find side effects as we start increasing the checkin duration significantly for scalability reasons (from 5 to 30 minutes).

* Allow fleet-gateway to return empty action lists * Change fix to periodically call the dispatcher * Fix comment * Fix changelog, add unit tests (cherry picked from commit 2c8877e)

* Allow fleet-gateway to return empty action lists * Change fix to periodically call the dispatcher * Fix comment * Fix changelog, add unit tests (cherry picked from commit 2c8877e) Co-authored-by: Michel Laterman <[email protected]>

Allow fleet-gateway to return empty action lists

0e82e35

michel-laterman added bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team Team:Fleet Label for the Fleet team backport-v8.6.0 Automated backport with mergify backport-v8.7.0 Automated backport with mergify labels Mar 2, 2023

michel-laterman requested a review from a team as a code owner March 2, 2023 22:59

michel-laterman requested review from AndersonQ and blakerouse and removed request for a team March 2, 2023 22:59

mergify bot assigned michel-laterman Mar 2, 2023

jlind23 requested review from ycombinator and michalpristas March 3, 2023 08:01

Change fix to periodically call the dispatcher

791aa99

michel-laterman changed the title ~~Allow fleet-gateway to return empty action lists~~ managed-agent will periodically call dispatcher.Dispatch Mar 6, 2023

ycombinator reviewed Mar 6, 2023

View reviewed changes

ycombinator approved these changes Mar 6, 2023

View reviewed changes

ycombinator reviewed Mar 6, 2023

View reviewed changes

internal/pkg/agent/application/managed_mode.go Outdated Show resolved Hide resolved

Fix comment

d3a1ebd

cmacknz reviewed Mar 6, 2023

View reviewed changes

changelog/fragments/1677797191-dispatch-periodic-flush.yaml Outdated Show resolved Hide resolved

cmacknz reviewed Mar 6, 2023

View reviewed changes

internal/pkg/agent/application/managed_mode.go Outdated Show resolved Hide resolved

Fix changelog, add unit tests

e5c2476

michel-laterman merged commit 2c8877e into elastic:main Mar 7, 2023

michel-laterman deleted the gateway-fix branch March 7, 2023 19:43

This was referenced Mar 7, 2023

[8.6](backport #2344) managed-agent will periodically call dispatcher.Dispatch #2353

Merged

[8.7](backport #2344) managed-agent will periodically call dispatcher.Dispatch #2354

Merged

michel-laterman mentioned this pull request Mar 8, 2023

Support longer checkin intervals when the agent status has not changed #2257

Open

3 tasks

pierrehilbert mentioned this pull request Jun 8, 2023

8.6.2 agent upgrade fails when different version agents are selected for upgrade through bulk actions. elastic/kibana#159265

Closed
