
Fix install/enroll command not failing when the daemon restart fails #3815

Merged
merged 25 commits into from
Dec 28, 2023

Conversation

belimawr
Contributor

@belimawr belimawr commented Nov 24, 2023

For the agent to be enrolled, it needs to restart after the enroll process completes in order to pick up the new config and "connect" to fleet-server.

This change makes the enroll command fail if it cannot restart the agent after enrolling in Fleet. A --skip-daemon-reload CLI flag is also added to the enroll command because the container command calls enroll even though there is no running daemon to restart. Skipping the daemon reload in this case allows the container command to succeed.
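The decision described above can be sketched as follows. This is a simplified, hypothetical illustration, not the actual agent code: the real logic lives in internal/pkg/agent/cmd/enroll_cmd.go and differs in naming and wiring; the reloadDaemon stand-in here always fails so the error path is visible.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// reloadDaemon stands in for the real daemon-restart call; in this
// sketch it always fails so the failure path is exercised.
func reloadDaemon() error {
	return errors.New("control socket not available")
}

// enroll sketches the decision: after a successful enrollment, a
// daemon reload failure is fatal unless --skip-daemon-reload was given.
func enroll(skipDaemonReload bool) error {
	// ... enrollment against Fleet would happen here ...
	if skipDaemonReload {
		return nil // container command: no running daemon to restart
	}
	if err := reloadDaemon(); err != nil {
		return fmt.Errorf("could not reload agent daemon: %w", err)
	}
	return nil
}

func main() {
	skip := flag.Bool("skip-daemon-reload", false, "skip restarting the daemon after enroll")
	flag.Parse()
	if err := enroll(*skip); err != nil {
		fmt.Println("enroll failed:", err)
		return
	}
	fmt.Println("enroll succeeded")
}
```

Run without flags this prints the failure path; with --skip-daemon-reload it succeeds, which is exactly what the container command relies on.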

This PR brings back the changes introduced by #3554 that were reverted due to the container command not working and fixes the issues with the container command.

665cbe9 does not belong to this PR but is required to get the integration tests to pass. I'll remove it once #3917 is merged.

What does this PR do?

This change makes the enroll command fail if it cannot restart the agent after enrolling in Fleet.

Why is it important?

For the agent to be enrolled, it needs to restart after the enroll process completes in order to pick up the new config and "connect" to fleet-server.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Author's Checklist

How to test this PR locally

Export the following environment variables:

export FLEET_ENROLL=1
export FLEET_URL=https://testing-this-PR.elastic-cloud.com:443
export FLEET_ENROLLMENT_TOKEN=my-secret-token

Then execute the container command:

./elastic-agent container -e -v -d "*"

The Elastic-Agent should enroll in Fleet and start working.

Related issues

Use cases

Screenshots

Logs

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...

@belimawr belimawr added the Team:Elastic-Agent Label for the Agent team label Nov 24, 2023
@belimawr belimawr requested a review from a team as a code owner November 24, 2023 18:44
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

Contributor

mergify bot commented Nov 24, 2023

This pull request does not have a backport label. Could you fix it @belimawr? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d is the label to automatically backport to the 8./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

@belimawr belimawr force-pushed the fix-docker-2 branch 3 times, most recently from 9dbfe0a to eb35ba8 Compare November 24, 2023 18:53
Member

@pchila pchila left a comment


A couple of nitpicks here and there, overall the change looks ok

Review comments left on:
internal/pkg/agent/cmd/enroll.go
internal/pkg/agent/cmd/enroll_cmd.go
internal/pkg/agent/cmd/enroll_cmd_test.go
@belimawr
Contributor Author

buildkite test this

@cmacknz
Member

cmacknz commented Nov 28, 2023

I tried this out by modifying the agent so that it unconditionally retried restarting the daemon, with the following patch applied:

diff --git a/internal/pkg/agent/cmd/enroll_cmd.go b/internal/pkg/agent/cmd/enroll_cmd.go
index e704990426..a1facdf185 100644
--- a/internal/pkg/agent/cmd/enroll_cmd.go
+++ b/internal/pkg/agent/cmd/enroll_cmd.go
@@ -452,10 +452,10 @@ func (c *enrollCmd) prepareFleetTLS() error {
 }

 func (c *enrollCmd) daemonReloadWithBackoff(ctx context.Context) error {
-       err := c.daemonReload(ctx)
-       if err == nil {
-               return nil
-       }
+       err := errors.New("this is a test error")
+       // if err == nil {
+       //      return nil
+       // }
        if errors.Is(err, context.DeadlineExceeded) ||
                errors.Is(err, context.Canceled) {
                return fmt.Errorf("could not reload daemon: %w", err)
@@ -470,9 +470,7 @@ func (c *enrollCmd) daemonReloadWithBackoff(ctx context.Context) error {
                c.log.Info("Retrying to restart...")

                err = c.daemonReload(ctx)
-               if err == nil {
-                       return nil
-               }
+
                if errors.Is(err, context.DeadlineExceeded) ||
                        errors.Is(err, context.Canceled) {
                        return fmt.Errorf("could not reload daemon after %d retries: %w",

I got the following output. Note that the existing implementation of the progress bar seems to buffer all logs until the step completes, so I didn't see any output until it failed completely and then everything was dumped out.

❯ sudo ./elastic-agent install --url=https://574feabc9fff4d4fbc4d6994e7f0d556.fleet.eastus2.staging.azure.foundit.no:443 --enrollment-token=d0tZZUFvd0J2M1UxSW91VlRDX3c6SFVTRlpJWXBTY200VXM3RlRlT1VZQQ==
Elastic Agent will be installed at /Library/Elastic/Agent and will run as a service. Do you want to continue? [Y/n]:y
[  ==] Service Started  [13s] Elastic Agent successfully installed, starting enrollment.
[==  ] Waiting For Enroll...  [14s] {"log.level":"info","@timestamp":"2023-11-28T15:27:18.441-0500","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":497},"message":"Starting enrollment to URL: https://574feabc9fff4d4fbc4d6994e7f0d556.fleet.eastus2.staging.azure.foundit.no:443/","ecs.version":"1.6.0"}
[==  ] Waiting For Enroll...  [36s] {"log.level":"info","@timestamp":"2023-11-28T15:27:40.866-0500","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":470},"message":"Retrying to restart...","ecs.version":"1.6.0"}
[=   ] Waiting For Enroll...  [1m16s] {"log.level":"info","@timestamp":"2023-11-28T15:28:20.870-0500","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":470},"message":"Retrying to restart...","ecs.version":"1.6.0"}
[=   ] Waiting For Enroll...  [2m16s] {"log.level":"info","@timestamp":"2023-11-28T15:29:20.873-0500","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":470},"message":"Retrying to restart...","ecs.version":"1.6.0"}
[=   ] Waiting For Enroll...  [3m16s] {"log.level":"info","@timestamp":"2023-11-28T15:30:20.886-0500","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":470},"message":"Retrying to restart...","ecs.version":"1.6.0"}
[=   ] Waiting For Enroll...  [4m16s] {"log.level":"info","@timestamp":"2023-11-28T15:31:20.887-0500","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":470},"message":"Retrying to restart...","ecs.version":"1.6.0"}
[=   ] Waiting For Enroll...  [4m16s] {"log.level":"error","@timestamp":"2023-11-28T15:31:20.895-0500","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":283},"message":"Elastic Agent might not be running; unable to trigger restart: could not reload agent's daemon, all retries failed. Last error: %!w(<nil>)","ecs.version":"1.6.0"}
Something went wrong while enrolling the Elastic Agent: could not reload agent's daemon, all retries failed. Last error: %!w(<nil>)
Error: could not reload agent daemon, unable to trigger restart: could not reload agent's daemon, all retries failed. Last error: %!w(<nil>)
For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.12/fleet-troubleshooting.html
[==  ] Uninstalled  [5m17s] Error: enroll command failed for unknown reason: exit status 1
For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.12/fleet-troubleshooting.html

There are several problems here:

  1. As already noted, we don't actually log the error we get on retry attempts: "message":"Retrying to restart...".
  2. The time between retries is too long. The exponential backoff is almost pointless because it maxes out after two iterations. The retry timestamps are at 14s, 36s, 1m16s, 2m16s, 3m16s, 4m16s. So we spend 4 minutes retrying but only retry 5 times. I think 4 minutes is too long to wait before giving up here, we could cut this in half to 2 minutes. I'm also not sure the exponential backoff is doing anything for us when trying to restart the agent, we could just use a constant backoff retrying every 1s with better results I think.
  3. The error formatter in could not reload agent's daemon, all retries failed. Last error: %!w(<nil>) is wrong.
  4. [== ] Uninstalled [5m17s] Error: enroll command failed for unknown reason: exit status 1 is wrong, we know the reason but say it is unknown.

Review comments left on:
internal/pkg/agent/cmd/enroll_cmd_test.go
testing/integration/container_cmd_test.go
@belimawr belimawr force-pushed the fix-docker-2 branch 3 times, most recently from 7fbbbf8 to ad50c63 Compare December 6, 2023 12:09
Member

@AndersonQ AndersonQ left a comment


you missed a check on one of the tests, but apart from that all good

Review comments left on:
internal/pkg/agent/cmd/enroll_cmd_test.go
testing/integration/container_cmd_test.go
@cmacknz cmacknz added backport-v8.12.0 Automated backport with mergify and removed backport-skip labels Dec 6, 2023
@belimawr
Contributor Author

@cmacknz

The time between retries is too long. The exponential backoff is almost pointless because it maxes out after two iterations. The retry timestamps are at 14s, 36s, 1m16s, 2m16s, 3m16s, 4m16s. So we spend 4 minutes retrying but only retry 5 times. I think 4 minutes is too long to wait before giving up here, we could cut this in half to 2 minutes. I'm also not sure the exponential backoff is doing anything for us when trying to restart the agent, we could just use a constant backoff retrying every 1s with better results I think.

I see your point there, I reduced the time so iterations are faster now. How many times do you think we should retry?

I don't know the details of the Elastic-Agent's shutdown process, so I'm really not sure what makes sense here. If there is any sort of communication with Fleet-Server, or waiting for Beats to finish flushing data, it might make sense to have longer wait times in case ES is overloaded.

[== ] Uninstalled [5m17s] Error: enroll command failed for unknown reason: exit status 1 is wrong, we know the reason but say it is unknown.

That comes from a totally different part of the codebase; the code that logs it is waiting for an external process to finish, hence it does not know the reason for the failure.

err = enrollCmd.Wait()
if err != nil {
	progBar.Describe("Failed to Enroll")
	// uninstall doesn't need to be performed here the defer above will
	// catch the error and perform the uninstall
	return fmt.Errorf("enroll command failed for unknown reason: %w", err)
}

Comment on lines 129 to 154
	// Stage 3: Make sure there are no errors in logs
	t.Log("Making sure there are no error logs")
	docs = queryESDocs(t, func() (estools.Documents, error) {
		return estools.CheckForErrorsInLogs(info.ESClient, info.Namespace, []string{
			// acceptable error messages (include reason)
			"Error dialing dial tcp 127.0.0.1:9200: connect: connection refused", // beat is running default config before its config gets updated
			"Global configuration artifact is not available", // Endpoint: failed to load user artifact due to connectivity issues
			"Failed to download artifact",
			"Failed to initialize artifact",
			"Failed to apply initial policy from on disk configuration",
			"elastic-agent-client error: rpc error: code = Canceled desc = context canceled", // can happen on restart
			"add_cloud_metadata: received error failed requesting openstack metadata: Get \\\"https://169.254.169.254/2009-04-04/meta-data/instance-id\\\": dial tcp 169.254.169.254:443: connect: connection refused", // okay for the openstack metadata to not work
			"add_cloud_metadata: received error failed requesting openstack metadata: Get \\\"https://169.254.169.254/2009-04-04/meta-data/hostname\\\": dial tcp 169.254.169.254:443: connect: connection refused", // okay for the cloud metadata to not work
			"add_cloud_metadata: received error failed requesting openstack metadata: Get \\\"https://169.254.169.254/2009-04-04/meta-data/placement/availability-zone\\\": dial tcp 169.254.169.254:443: connect: connection refused", // okay for the cloud metadata to not work
			"add_cloud_metadata: received error failed requesting openstack metadata: Get \\\"https://169.254.169.254/2009-04-04/meta-data/instance-type\\\": dial tcp 169.254.169.254:443: connect: connection refused", // okay for the cloud metadata to not work
			"add_cloud_metadata: received error failed with http status code 404", // okay for the cloud metadata to not work
			"add_cloud_metadata: received error failed fetching EC2 Identity Document: operation error ec2imds: GetInstanceIdentityDocument, http response error StatusCode: 404, request to EC2 IMDS failed", // okay for the cloud metadata to not work
		})
	})
	t.Logf("error logs: Got %d documents", len(docs.Hits.Hits))
	for _, doc := range docs.Hits.Hits {
		t.Logf("%#v", doc.Source)
	}
	require.Empty(t, docs.Hits.Hits)

	// Stage 4: Make sure we have message confirming central management is running
	// Stage 3: Make sure we have message confirming central management is running
Contributor Author


I'll remove this change once #3917 is merged. I kept it here so the integration tests will, hopefully, pass.

@belimawr
Contributor Author

Folks, I've resolved all comments: for some I implemented what was asked, for others I explained why it is implemented the way it is.

Once all tests are passing I'll ask for a re-review.

Contributor

mergify bot commented Dec 18, 2023

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b fix-docker-2 upstream/fix-docker-2
git merge upstream/main
git push upstream fix-docker-2

@cmacknz
Member

cmacknz commented Dec 18, 2023

How many times do you think we should retry?

In a situation where the control socket isn't available yet, most likely it will become available fairly quickly once the agent creates it. The problem is we previously only waited once, when really we probably needed to wait a few seconds.

I would say retrying every 1s is reasonable for 120 total retries over 2 minutes.

I don't know the details of the shutdown process from the Elastic-Agent so I'm really not sure what makes sense here. If there is any sort of communication with Fleet-Server or even waiting Beats finishing to flush some data, it might make sense to have longer wait times in case ES is overloaded.

Does this retry logic actually impact the shutdown process? This wasn't something I was expecting to have to consider.

Asserting there are no errors in the logs from Elastic-Agent and all
Beats is flaky and does not ensure the Elastic-Agent is working
correctly. The test already asserts the health of all components, so
there is no need to look in the logs.

The number of exceptions to this no-log-errors assertion is already
an example of how fragile it is. The Elastic-Agent life cycle is
complex and some transient errors are expected; as the code evolves
those errors or messages will change, making assertions on them flaky.
@AndersonQ
Member

buildkite test this


Quality Gate failed

Failed conditions

35.1% Coverage on New Code (required ≥ 40%)

See analysis details on SonarQube

@AndersonQ AndersonQ enabled auto-merge (squash) December 28, 2023 12:47
@jlind23
Contributor

jlind23 commented Dec 28, 2023

Sonarqube is failing due to code coverage below 40%, force merging it.

@jlind23 jlind23 merged commit 938f0b9 into elastic:main Dec 28, 2023
8 of 9 checks passed
mergify bot pushed a commit that referenced this pull request Dec 28, 2023
…3815)

* fix install/enroll cmd not failing when agent restart fails

* surface errors that might occur during enroll
* fail install command if agent cannot be restarted
* do not print success message if there was an enroll error. Print an error message and the error instead
* add logs to show the different enroll attempts
* add more context to errors
* refactor internal/pkg/agent/install/perms_unix.go and add more context to errors
restore main version
* ignore agent restart error on enroll tests as there is no agent to be restarted
* daemonReloadWithBackoff does not retry on context deadline exceeded

* Do not reload the Agent daemon if enrolling from a container

The enroll command would always try to restart the daemon, however
when enrolling as part of the container command, there is no running
daemon to reload.

This commit adds a CLI flag, --skip-daemon-reload, to the enroll
command to skip the reloading step, the container command now makes
use of this flag.

* Apply suggestions from code review

Co-authored-by: Paolo Chilà <[email protected]>

* PR improvements

* Add integration test

* make lint happy

* PR improvements

* Fix after rebase

* Fix some issues

* more PR improvements

* Fix enroll command

* Fix TestContainterCMD

* Fix implementation

* Remove flaky integration test assertion

Asserting there are no errors in the logs from Elastic-Agent and all

Co-authored-by: Anderson Queriroz <[email protected]>
Co-authored-by: Paolo Chilà <[email protected]>
Co-authored-by: Pierre HILBERT <[email protected]>
(cherry picked from commit 938f0b9)

# Conflicts:
#	internal/pkg/composable/providers/kubernetessecrets/kubernetes_secrets_test.go
AndersonQ added a commit that referenced this pull request Dec 28, 2023
…e daemon restart fails (#3962)

* Fix install/enroll command not  failing when the daemon restart fails (#3815)

* Do not reload the Agent daemon if enrolling from a container

The enroll command would always try to restart the daemon, however
when enrolling as part of the container command, there is no running
daemon to reload.

This commit adds a CLI flag, --skip-daemon-reload, to the enroll
command to skip the reloading step, the container command now makes
use of this flag.

Co-authored-by: Anderson Queriroz <[email protected]>
Co-authored-by: Paolo Chilà <[email protected]>
Co-authored-by: Pierre HILBERT <[email protected]>
(cherry picked from commit 938f0b9)

---------

Co-authored-by: Tiago Queiroz <[email protected]>
Co-authored-by: Anderson Queiroz <[email protected]>
cmacknz pushed a commit to cmacknz/elastic-agent that referenced this pull request Jan 8, 2024
…lastic#3815)

* fix install/enroll cmd not failing when agent restart fails

* surface errors that might occur during enroll
* fail install command if agent cannot be restarted
* do not print success message if there was an enroll error. Print an error message and the error instead
* add logs to show the different enroll attempts
* add more context to errors
* refactor internal/pkg/agent/install/perms_unix.go and add more context to errors
restore main version
* ignore agent restart error on enroll tests as there is no agent to be restarted
* daemonReloadWithBackoff does not retry on context deadline exceeded

* Do not reload the Agent daemon if enrolling from a container

The enroll command would always try to restart the daemon, however
when enrolling as part of the container command, there is no running
daemon to reload.

This commit adds a CLI flag, --skip-daemon-reload, to the enroll
command to skip the reloading step, the container command now makes
use of this flag.

* Apply suggestions from code review

Co-authored-by: Paolo Chilà <[email protected]>

* PR improvements

* Add integration test

* make lint happy

* PR improvements

* Fix after rebase

* Fix some issues

* more PR improvements

* Fix enroll command

* Fix TestContainterCMD

* Fix implementation

* Remove flaky integration test assertion

Asserting there are no errors in the logs from Elastic-Agent and all

Co-authored-by: Anderson Queriroz <[email protected]>
Co-authored-by: Paolo Chilà <[email protected]>
Co-authored-by: Pierre HILBERT <[email protected]>
cmacknz pushed a commit that referenced this pull request Jan 17, 2024
…e daemon restart fails (#3962)

* Fix install/enroll command not  failing when the daemon restart fails (#3815)

* Do not reload the Agent daemon if enrolling from a container

The enroll command would always try to restart the daemon, however
when enrolling as part of the container command, there is no running
daemon to reload.

This commit adds a CLI flag, --skip-daemon-reload, to the enroll
command to skip the reloading step, the container command now makes
use of this flag.

Co-authored-by: Anderson Queriroz <[email protected]>
Co-authored-by: Paolo Chilà <[email protected]>
Co-authored-by: Pierre HILBERT <[email protected]>
(cherry picked from commit 938f0b9)

---------

Co-authored-by: Tiago Queiroz <[email protected]>
Co-authored-by: Anderson Queiroz <[email protected]>
Labels
backport-v8.12.0 Automated backport with mergify Team:Elastic-Agent Label for the Agent team
Development

Successfully merging this pull request may close these issues.

Enrolling into Fleet can fail because the control socket doesn't exist
8 participants