Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

managed-agent will periodically call dispatcher.Dispatch #2344

Merged
merged 4 commits into from
Mar 7, 2023

Conversation

michel-laterman
Copy link
Contributor

@michel-laterman michel-laterman commented Mar 2, 2023

What does this PR do?

Add a timer in the goroutine that passes actions from fleet-gateway to the dispatcher that calls dispatch with no actions.

Why is it important?

Scheduled actions are only ran when the dispatch.Dispatch() method is called.
Currently this method is only called when an action list is received from the fleet-gateway.

If the gateway does not send any actions after a checkin, the dispatcher is not called and scheduled actions are not checked (until a new action is received).

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

How to test this PR locally

  • Schedule an upgrade for a future time
  • upgrade should run if no new action is sent after scheduled time

Related issues

@michel-laterman michel-laterman added bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team Team:Fleet Label for the Fleet team backport-v8.6.0 Automated backport with mergify backport-v8.7.0 Automated backport with mergify labels Mar 2, 2023
@michel-laterman michel-laterman requested a review from a team as a code owner March 2, 2023 22:59
@michel-laterman michel-laterman requested review from AndersonQ and blakerouse and removed request for a team March 2, 2023 22:59
@elasticmachine
Copy link
Contributor

elasticmachine commented Mar 2, 2023

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-03-06T21:43:32.164+0000

  • Duration: 17 min 32 sec

Test stats 🧪

Test Results
Failed 0
Passed 4975
Skipped 15
Total 4990

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages.

  • run integration tests : Run the Elastic Agent Integration tests.

  • run end-to-end tests : Generate the packages and run the E2E Tests.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@elasticmachine
Copy link
Contributor

elasticmachine commented Mar 2, 2023

🌐 Coverage report

Name Metrics % (covered/total) Diff
Packages 98.361% (60/61)
Files 69.378% (145/209)
Classes 68.329% (274/401)
Methods 53.546% (838/1565)
Lines 38.862% (9187/23640)
Conditionals 100.0% (0/0) 💚

@ycombinator
Copy link
Contributor

ycombinator commented Mar 3, 2023

While this change will have the desired effect of scheduled actions being dequed and dispatched even when no subsequent actions are received from Fleet Server, I have two concerns/comments:

  • With the proposed design, we will call the dispatch.Dispatch() method once every 1 - 1.5 seconds by default, since that's the default configuration for how often the agent is scheduled to checkin with Fleet Server. By comparison, currently we are calling dispatch.Dispatch() much less frequently since most of the times there will likely not be any actions being returned in the checkin response from the Fleet Server. I'm not sure if this change will have any impact on performance of the Agent, particularly CPU usage?
  • I think the design of calling dispatch.Dispatch() only when we have a successful checkin call to the Fleet Server might be exposing a more fundamental issue, particularly in relation to scheduled actions. What if the Fleet Server is temporarily unreachable for some reason? In the current and proposed designs, we will not execute any scheduled actions until the connection is restored and we have a successful checkin call again. In fact, because scheduled actions support expiration, we may never execute a scheduled action if the connection is restored only after the expiration period has elapsed. I don't know enough about the history of the project to know if this behavior is desired or not. Today the only scheduled action possible is upgrade; I would wager that upgrades should be processed even if there's some problem in connecting to the Fleet Server for the checkin call but maybe it's by design we don't want to do this?

An alternative design might be for the managedConfigManager.Run() method to start the dispatcher in its own goroutine so it can keep (efficiently) processing it's action queue over time. And then we can keep the current design of sending actions from the fleet gateway to the dispatcher only when the checkin call is successful and returns a non-empty list of actions. This way we don't incur any potential performance penalty of the dispatcher reprocessing the action queue too frequently and we solve the potential problem of coupling Fleet Server checkin API availability to processing scheduled actions.

@michel-laterman
Copy link
Contributor Author

@ycombinator
I thought the checkin was a long-poll checkin, however you raised a good point about connectivity. I'll use an alternate mechanism instead.

@michel-laterman michel-laterman changed the title Allow fleet-gateway to return empty action lists managed-agent will periodically call dispatcher.Dispatch Mar 6, 2023
Comment on lines 220 to 222
case <-t.C: // periodically call the dispatcher to handle scheduled actions.
m.dispatcher.Dispatch(ctx, actionAcker)
t.Reset(dispatchFlushInterval)
Copy link
Contributor

@ycombinator ycombinator Mar 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍. I think this is a slightly better approach because it at least decouples the processing of the dispatcher's action queue from the Fleet Checkin call.

Copy link
Contributor

@ycombinator ycombinator Mar 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can have an even more efficient implementation of "when to re-dispatch" by deciding to process the action queue either:

  • when new actions are added to it (already handled by the case below this line), or
  • just-in-time for the next scheduled action (by looking at the next scheduled action's time and setting up a time.Timer accordingly) rather than on a fixed interval like we are doing here. But we can make this change in a separate PR if it's too complicated for this one.

@cmacknz
Copy link
Member

cmacknz commented Mar 6, 2023

I thought the checkin was a long-poll checkin

Just to confirm, the check in requests do use long polling but fleet-server is responsible for holding the request open. The short 1 - 1.5s request durations on the agent only come into play when the server has terminated the long poll request.

Big thanks to @ycombinator for suggesting that we decouple the execution of scheduled actions from the checkin period regardless. This is a much better approach and avoids the possibility of hard to find side effects as we start increasing the checkin duration significantly for scalability reasons (from 5 to 30 minutes).

@michel-laterman michel-laterman merged commit 2c8877e into elastic:main Mar 7, 2023
@michel-laterman michel-laterman deleted the gateway-fix branch March 7, 2023 19:43
mergify bot pushed a commit that referenced this pull request Mar 7, 2023
* Allow fleet-gateway to return empty action lists

* Change fix to periodically call the dispatcher

* Fix comment

* Fix changelog, add unit tests

(cherry picked from commit 2c8877e)
mergify bot pushed a commit that referenced this pull request Mar 7, 2023
* Allow fleet-gateway to return empty action lists

* Change fix to periodically call the dispatcher

* Fix comment

* Fix changelog, add unit tests

(cherry picked from commit 2c8877e)
michel-laterman added a commit that referenced this pull request Mar 7, 2023
* Allow fleet-gateway to return empty action lists

* Change fix to periodically call the dispatcher

* Fix comment

* Fix changelog, add unit tests

(cherry picked from commit 2c8877e)

Co-authored-by: Michel Laterman <[email protected]>
michel-laterman added a commit that referenced this pull request Mar 7, 2023
* Allow fleet-gateway to return empty action lists

* Change fix to periodically call the dispatcher

* Fix comment

* Fix changelog, add unit tests

(cherry picked from commit 2c8877e)

Co-authored-by: Michel Laterman <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-v8.6.0 Automated backport with mergify backport-v8.7.0 Automated backport with mergify bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team Team:Fleet Label for the Fleet team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fleet stuck in "Updating..." status when attempting to upgrade agents to v8.6.1 or v8.6.2
4 participants