
Fix install/enroll command not failing when the daemon restart fails #3815

Merged
merged 25 commits into from
Dec 28, 2023

Conversation

belimawr
Contributor

@belimawr belimawr commented Nov 24, 2023

For the agent to be enrolled, it needs to restart after the enroll process completes in order to pick up the new config and "connect" to fleet-server.

This change makes the enroll command fail if it cannot restart the agent after enrolling in Fleet. A --skip-daemon-reload CLI flag is also added to the enroll command because the container command calls enroll even though there is no running daemon to restart. Skipping the daemon reload in this case allows the container command to succeed.
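The decision described above can be sketched as follows. This is a simplified, hypothetical illustration, not the actual agent code: the real logic lives in internal/pkg/agent/cmd/enroll_cmd.go and differs in naming and wiring; the reloadDaemon stand-in here always fails so the error path is visible.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// reloadDaemon stands in for the real daemon-restart call; in this
// sketch it always fails so the failure path is exercised.
func reloadDaemon() error {
	return errors.New("control socket not available")
}

// enroll sketches the decision: after a successful enrollment, a
// daemon reload failure is fatal unless --skip-daemon-reload was given.
func enroll(skipDaemonReload bool) error {
	// ... enrollment against Fleet would happen here ...
	if skipDaemonReload {
		return nil // container command: no running daemon to restart
	}
	if err := reloadDaemon(); err != nil {
		return fmt.Errorf("could not reload agent daemon: %w", err)
	}
	return nil
}

func main() {
	skip := flag.Bool("skip-daemon-reload", false, "skip restarting the daemon after enroll")
	flag.Parse()
	if err := enroll(*skip); err != nil {
		fmt.Println("enroll failed:", err)
		return
	}
	fmt.Println("enroll succeeded")
}
```

Run without flags this prints the failure path; with --skip-daemon-reload it succeeds, which is exactly what the container command relies on.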

This PR brings back the changes introduced by #3554 that were reverted due to the container command not working and fixes the issues with the container command.

665cbe9 does not belong to this PR but is required to get the integration tests to pass. I'll remove it once #3917 is merged.

What does this PR do?

This change makes the enroll command fail if it cannot restart the agent after enrolling in Fleet.

Why is it important?

For the agent to be enrolled, it needs to restart after the enroll process completes in order to pick up the new config and "connect" to fleet-server.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Author's Checklist

How to test this PR locally

Export the following environment variables:

export FLEET_ENROLL=1
export FLEET_URL=https://testing-this-PR.elastic-cloud.com:443
export FLEET_ENROLLMENT_TOKEN=my-secret-token

Then execute the container command:

./elastic-agent container -e -v -d "*"

The Elastic-Agent should enroll in Fleet and start working.

Related issues

Use cases

Screenshots

Logs

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...

@belimawr belimawr added the Team:Elastic-Agent Label for the Agent team label Nov 24, 2023
@belimawr belimawr requested a review from a team as a code owner November 24, 2023 18:44
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

Contributor

mergify bot commented Nov 24, 2023

This pull request does not have a backport label. Could you fix it @belimawr? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d is the label to automatically backport to the 8./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

@belimawr belimawr force-pushed the fix-docker-2 branch 3 times, most recently from 9dbfe0a to eb35ba8 Compare November 24, 2023 18:53
Member

@pchila pchila left a comment


A couple of nitpicks here and there, overall the change looks ok

Review comments left on:
internal/pkg/agent/cmd/enroll.go
internal/pkg/agent/cmd/enroll_cmd.go
internal/pkg/agent/cmd/enroll_cmd_test.go
@belimawr
Contributor Author

buildkite test this

@cmacknz
Member

cmacknz commented Nov 28, 2023

I tried this out by modifying the agent so that it unconditionally retried restarting the daemon, with the following patch applied:

diff --git a/internal/pkg/agent/cmd/enroll_cmd.go b/internal/pkg/agent/cmd/enroll_cmd.go
index e704990426..a1facdf185 100644
--- a/internal/pkg/agent/cmd/enroll_cmd.go
+++ b/internal/pkg/agent/cmd/enroll_cmd.go
@@ -452,10 +452,10 @@ func (c *enrollCmd) prepareFleetTLS() error {
 }

 func (c *enrollCmd) daemonReloadWithBackoff(ctx context.Context) error {
-       err := c.daemonReload(ctx)
-       if err == nil {
-               return nil
-       }
+       err := errors.New("this is a test error")
+       // if err == nil {
+       //      return nil
+       // }
        if errors.Is(err, context.DeadlineExceeded) ||
                errors.Is(err, context.Canceled) {
                return fmt.Errorf("could not reload daemon: %w", err)
@@ -470,9 +470,7 @@ func (c *enrollCmd) daemonReloadWithBackoff(ctx context.Context) error {
                c.log.Info("Retrying to restart...")

                err = c.daemonReload(ctx)
-               if err == nil {
-                       return nil
-               }
+
                if errors.Is(err, context.DeadlineExceeded) ||
                        errors.Is(err, context.Canceled) {
                        return fmt.Errorf("could not reload daemon after %d retries: %w",

I got the following output. Note that the existing implementation of the progress bar seems to buffer all logs until the step completes, so I didn't see any output until it failed completely and then everything was dumped out.

❯ sudo ./elastic-agent install --url=https://574feabc9fff4d4fbc4d6994e7f0d556.fleet.eastus2.staging.azure.foundit.no:443 --enrollment-token=d0tZZUFvd0J2M1UxSW91VlRDX3c6SFVTRlpJWXBTY200VXM3RlRlT1VZQQ==
Elastic Agent will be installed at /Library/Elastic/Agent and will run as a service. Do you want to continue? [Y/n]:y
[  ==] Service Started  [13s] Elastic Agent successfully installed, starting enrollment.
[==  ] Waiting For Enroll...  [14s] {"log.level":"info","@timestamp":"2023-11-28T15:27:18.441-0500","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":497},"message":"Starting enrollment to URL: https://574feabc9fff4d4fbc4d6994e7f0d556.fleet.eastus2.staging.azure.foundit.no:443/","ecs.version":"1.6.0"}
[==  ] Waiting For Enroll...  [36s] {"log.level":"info","@timestamp":"2023-11-28T15:27:40.866-0500","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":470},"message":"Retrying to restart...","ecs.version":"1.6.0"}
[=   ] Waiting For Enroll...  [1m16s] {"log.level":"info","@timestamp":"2023-11-28T15:28:20.870-0500","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":470},"message":"Retrying to restart...","ecs.version":"1.6.0"}
[=   ] Waiting For Enroll...  [2m16s] {"log.level":"info","@timestamp":"2023-11-28T15:29:20.873-0500","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":470},"message":"Retrying to restart...","ecs.version":"1.6.0"}
[=   ] Waiting For Enroll...  [3m16s] {"log.level":"info","@timestamp":"2023-11-28T15:30:20.886-0500","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":470},"message":"Retrying to restart...","ecs.version":"1.6.0"}
[=   ] Waiting For Enroll...  [4m16s] {"log.level":"info","@timestamp":"2023-11-28T15:31:20.887-0500","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":470},"message":"Retrying to restart...","ecs.version":"1.6.0"}
[=   ] Waiting For Enroll...  [4m16s] {"log.level":"error","@timestamp":"2023-11-28T15:31:20.895-0500","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":283},"message":"Elastic Agent might not be running; unable to trigger restart: could not reload agent's daemon, all retries failed. Last error: %!w(<nil>)","ecs.version":"1.6.0"}
Something went wrong while enrolling the Elastic Agent: could not reload agent's daemon, all retries failed. Last error: %!w(<nil>)
Error: could not reload agent daemon, unable to trigger restart: could not reload agent's daemon, all retries failed. Last error: %!w(<nil>)
For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.12/fleet-troubleshooting.html
[==  ] Uninstalled  [5m17s] Error: enroll command failed for unknown reason: exit status 1
For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.12/fleet-troubleshooting.html

There are several problems here:

  1. As already noted, we don't actually log the error we get on retry attempts: "message":"Retrying to restart...".
  2. The time between retries is too long. The exponential backoff is almost pointless because it maxes out after two iterations. The retry timestamps are at 14s, 36s, 1m16s, 2m16s, 3m16s, 4m16s. So we spend 4 minutes retrying but only retry 5 times. I think 4 minutes is too long to wait before giving up here, we could cut this in half to 2 minutes. I'm also not sure the exponential backoff is doing anything for us when trying to restart the agent, we could just use a constant backoff retrying every 1s with better results I think.
  3. The error formatter in could not reload agent's daemon, all retries failed. Last error: %!w(<nil>) is wrong.
  4. [== ] Uninstalled [5m17s] Error: enroll command failed for unknown reason: exit status 1 is wrong, we know the reason but say it is unknown.

Review comments left on:
internal/pkg/agent/cmd/enroll_cmd_test.go
testing/integration/container_cmd_test.go
@belimawr belimawr force-pushed the fix-docker-2 branch 3 times, most recently from 7fbbbf8 to ad50c63 Compare December 6, 2023 12:09
Member

@AndersonQ AndersonQ left a comment


you missed a check on one of the tests, but apart from that all good

Review comments left on:
internal/pkg/agent/cmd/enroll_cmd_test.go
testing/integration/container_cmd_test.go
@cmacknz cmacknz added backport-v8.12.0 Automated backport with mergify and removed backport-skip labels Dec 6, 2023
@belimawr
Contributor Author

@cmacknz

The time between retries is too long. The exponential backoff is almost pointless because it maxes out after two iterations. The retry timestamps are at 14s, 36s, 1m16s, 2m16s, 3m16s, 4m16s. So we spend 4 minutes retrying but only retry 5 times. I think 4 minutes is too long to wait before giving up here, we could cut this in half to 2 minutes. I'm also not sure the exponential backoff is doing anything for us when trying to restart the agent, we could just use a constant backoff retrying every 1s with better results I think.

I see your point there, I reduced the time so iterations are faster now. How many times do you think we should retry?

I don't know the details of the Elastic-Agent's shutdown process, so I'm really not sure what makes sense here. If there is any sort of communication with Fleet-Server, or waiting for Beats to finish flushing data, it might make sense to have longer wait times in case ES is overloaded.

[== ] Uninstalled [5m17s] Error: enroll command failed for unknown reason: exit status 1 is wrong, we know the reason but say it is unknown.

That comes from a totally different part of the codebase; the code that logs it is waiting for an external process to finish, hence it does not know the reason for the failure.

err = enrollCmd.Wait()
if err != nil {
	progBar.Describe("Failed to Enroll")
	// uninstall doesn't need to be performed here the defer above will
	// catch the error and perform the uninstall
	return fmt.Errorf("enroll command failed for unknown reason: %w", err)
}

Comment on lines 129 to 154
	// Stage 3: Make sure there are no errors in logs
	t.Log("Making sure there are no error logs")
	docs = queryESDocs(t, func() (estools.Documents, error) {
		return estools.CheckForErrorsInLogs(info.ESClient, info.Namespace, []string{
			// acceptable error messages (include reason)
			"Error dialing dial tcp 127.0.0.1:9200: connect: connection refused", // beat is running default config before its config gets updated
			"Global configuration artifact is not available", // Endpoint: failed to load user artifact due to connectivity issues
			"Failed to download artifact",
			"Failed to initialize artifact",
			"Failed to apply initial policy from on disk configuration",
			"elastic-agent-client error: rpc error: code = Canceled desc = context canceled", // can happen on restart
			"add_cloud_metadata: received error failed requesting openstack metadata: Get \\\"https://169.254.169.254/2009-04-04/meta-data/instance-id\\\": dial tcp 169.254.169.254:443: connect: connection refused", // okay for the openstack metadata to not work
			"add_cloud_metadata: received error failed requesting openstack metadata: Get \\\"https://169.254.169.254/2009-04-04/meta-data/hostname\\\": dial tcp 169.254.169.254:443: connect: connection refused", // okay for the cloud metadata to not work
			"add_cloud_metadata: received error failed requesting openstack metadata: Get \\\"https://169.254.169.254/2009-04-04/meta-data/placement/availability-zone\\\": dial tcp 169.254.169.254:443: connect: connection refused", // okay for the cloud metadata to not work
			"add_cloud_metadata: received error failed requesting openstack metadata: Get \\\"https://169.254.169.254/2009-04-04/meta-data/instance-type\\\": dial tcp 169.254.169.254:443: connect: connection refused", // okay for the cloud metadata to not work
			"add_cloud_metadata: received error failed with http status code 404", // okay for the cloud metadata to not work
			"add_cloud_metadata: received error failed fetching EC2 Identity Document: operation error ec2imds: GetInstanceIdentityDocument, http response error StatusCode: 404, request to EC2 IMDS failed", // okay for the cloud metadata to not work
		})
	})
	t.Logf("error logs: Got %d documents", len(docs.Hits.Hits))
	for _, doc := range docs.Hits.Hits {
		t.Logf("%#v", doc.Source)
	}
	require.Empty(t, docs.Hits.Hits)

	// Stage 4: Make sure we have message confirming central management is running
	// Stage 3: Make sure we have message confirming central management is running
Contributor Author


I'll remove this change once #3917 is merged. I kept it here so the integration tests will, hopefully, pass.

@belimawr
Contributor Author

Folks, I've resolved all comments: for some I implemented what was asked, for others I explained why it is implemented the way it is.

Once all tests are passing I'll ask for a re-review.

Contributor

mergify bot commented Dec 18, 2023

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b fix-docker-2 upstream/fix-docker-2
git merge upstream/main
git push upstream fix-docker-2

@cmacknz
Member

cmacknz commented Dec 18, 2023

How many times do you think we should retry?

In a situation where the control socket isn't available yet, most likely it will become available fairly quickly once the agent creates it. The problem is we previously only waited once, when really we probably needed to wait a few seconds.

I would say retrying every 1s is reasonable for 120 total retries over 2 minutes.

I don't know the details of the shutdown process from the Elastic-Agent so I'm really not sure what makes sense here. If there is any sort of communication with Fleet-Server or even waiting Beats finishing to flush some data, it might make sense to have longer wait times in case ES is overloaded.

Does this retry logic actually impact the shutdown process? This wasn't something I was expecting to have to consider.

Asserting there are no errors in the logs from Elastic-Agent and all
Beats is flaky and does not ensure the Elastic-Agent is working
correctly. The test already asserts the health of all components, so
there is no need to look in the logs.

The number of exceptions to this no-log-errors assertion is already
an example of how fragile it is. The Elastic-Agent life cycle is
complex and some transient errors are expected; as the code evolves
those errors or messages will change, making assertions on them flaky.
@AndersonQ
Member

buildkite test this


Quality Gate failed

Failed conditions

35.1% Coverage on New Code (required ≥ 40%)

See analysis details on SonarQube

@AndersonQ AndersonQ enabled auto-merge (squash) December 28, 2023 12:47
@jlind23
Contributor

jlind23 commented Dec 28, 2023

Sonarqube is failing due to code coverage below 40%, force merging it.

@jlind23 jlind23 merged commit 938f0b9 into elastic:main Dec 28, 2023
8 of 9 checks passed
mergify bot pushed a commit that referenced this pull request Dec 28, 2023
…3815)

* fix install/enroll cmd not failing when agent restart fails

* surface errors that might occur during enroll
* fail install command if agent cannot be restarted
* do not print success message if there was an enroll error. Print an error message and the error instead
* add logs to show the different enroll attempts
* add more context to errors
* refactor internal/pkg/agent/install/perms_unix.go and add more context to errors
restore main version
* ignore agent restart error on enroll tests as there is no agent to be restarted
* daemonReloadWithBackoff does not retry on context deadline exceeded

* Do not reload the Agent daemon if enrolling from a container

The enroll command would always try to restart the daemon, however
when enrolling as part of the container command, there is no running
daemon to reload.

This commit adds a CLI flag, --skip-daemon-reload, to the enroll
command to skip the reloading step, the container command now makes
use of this flag.

* Apply suggestions from code review

Co-authored-by: Paolo Chilà <[email protected]>

* PR improvements

* Add integration test

* make lint happy

* PR improvements

* Fix after rebase

* Fix some issues

* more PR improvements

* Fix enroll command

* Fix TestContainterCMD

* Fix implementation

* Remove flaky integration test assertion

Asserting there are no errors in the logs from Elastic-Agent and all

Co-authored-by: Anderson Queriroz <[email protected]>
Co-authored-by: Paolo Chilà <[email protected]>
Co-authored-by: Pierre HILBERT <[email protected]>
(cherry picked from commit 938f0b9)

# Conflicts:
#	internal/pkg/composable/providers/kubernetessecrets/kubernetes_secrets_test.go
AndersonQ added a commit that referenced this pull request Dec 28, 2023
…e daemon restart fails (#3962)

* Fix install/enroll command not  failing when the daemon restart fails (#3815)

* Do not reload the Agent daemon if enrolling from a container

The enroll command would always try to restart the daemon, however
when enrolling as part of the container command, there is no running
daemon to reload.

This commit adds a CLI flag, --skip-daemon-reload, to the enroll
command to skip the reloading step, the container command now makes
use of this flag.

Co-authored-by: Anderson Queriroz <[email protected]>
Co-authored-by: Paolo Chilà <[email protected]>
Co-authored-by: Pierre HILBERT <[email protected]>
(cherry picked from commit 938f0b9)

---------

Co-authored-by: Tiago Queiroz <[email protected]>
Co-authored-by: Anderson Queiroz <[email protected]>
cmacknz pushed a commit to cmacknz/elastic-agent that referenced this pull request Jan 8, 2024
…lastic#3815)

* fix install/enroll cmd not failing when agent restart fails

* surface errors that might occur during enroll
* fail install command if agent cannot be restarted
* do not print success message if there was an enroll error. Print an error message and the error instead
* add logs to show the different enroll attempts
* add more context to errors
* refactor internal/pkg/agent/install/perms_unix.go and add more context to errors
restore main version
* ignore agent restart error on enroll tests as there is no agent to be restarted
* daemonReloadWithBackoff does not retry on context deadline exceeded

* Do not reload the Agent daemon if enrolling from a container

The enroll command would always try to restart the daemon, however
when enrolling as part of the container command, there is no running
daemon to reload.

This commit adds a CLI flag, --skip-daemon-reload, to the enroll
command to skip the reloading step, the container command now makes
use of this flag.

* Apply suggestions from code review

Co-authored-by: Paolo Chilà <[email protected]>

* PR improvements

* Add integration test

* make lint happy

* PR improvements

* Fix after rebase

* Fix some issues

* more PR improvements

* Fix enroll command

* Fix TestContainterCMD

* Fix implementation

* Remove flaky integration test assertion

Asserting there are no errors in the logs from Elastic-Agent and all

Co-authored-by: Anderson Queriroz <[email protected]>
Co-authored-by: Paolo Chilà <[email protected]>
Co-authored-by: Pierre HILBERT <[email protected]>
cmacknz pushed a commit that referenced this pull request Jan 17, 2024
…e daemon restart fails (#3962)

* Fix install/enroll command not  failing when the daemon restart fails (#3815)

* Do not reload the Agent daemon if enrolling from a container

The enroll command would always try to restart the daemon, however
when enrolling as part of the container command, there is no running
daemon to reload.

This commit adds a CLI flag, --skip-daemon-reload, to the enroll
command to skip the reloading step, the container command now makes
use of this flag.

Co-authored-by: Anderson Queriroz <[email protected]>
Co-authored-by: Paolo Chilà <[email protected]>
Co-authored-by: Pierre HILBERT <[email protected]>
(cherry picked from commit 938f0b9)

---------

Co-authored-by: Tiago Queiroz <[email protected]>
Co-authored-by: Anderson Queiroz <[email protected]>
Labels
backport-v8.12.0 Automated backport with mergify Team:Elastic-Agent Label for the Agent team
Development

Successfully merging this pull request may close these issues.

Enrolling into Fleet can fail because the control socket doesn't exist
8 participants