
Fleet stuck in "Updating..." status when attempting to upgrade agents to v8.6.1 or v8.6.2 #2343

Closed
renzedj opened this issue Feb 27, 2023 · 14 comments · Fixed by #2344
Labels: backport-v8.6.0, backport-v8.7.0, bug, Team:Elastic-Agent-Control-Plane, Team:Fleet

Comments


renzedj commented Feb 27, 2023

Issue

Since upgrading from Elastic Cloud 8.5.3, I have been unable to manage Elastic Agent integrations through Fleet in Kibana.

I initially upgraded from 8.5.3 to 8.6.1 and scheduled an agent upgrade. A few of the agents upgraded as expected; the rest remained stuck in "Updating..." status in Fleet for more than a week. I later updated to 8.6.2 and continued to experience the same issue.

I cancelled the "Updating..." status per the instructions in this post, which returned the agents to Healthy status as promised. A second upgrade attempt yielded the same results. After cancelling the "Updating..." status again, I found that an immediate upgrade would usually succeed as long as all selected agents were in "Healthy" status, but it would fail if an "Offline" or even a "Not Healthy" agent was included. If I scheduled the upgrade for a later time, or chose any maintenance window other than "immediately", the upgrade failed.

Additionally, when I checked the "Agent Activity" log in Fleet this morning, I noticed that multiple "Reassigning X of Y agents to [new policy]" activities that I executed approximately a week ago are still shown as in progress. In that case, however, all agents have actually been reassigned to the new policy and are reporting in as expected. I don't know whether the two items are related.

Deployment

My deployment is:

Steps to Reproduce

  1. Upgrade Elastic Cloud to 8.6.1 or 8.6.2
  2. Schedule an agent upgrade in Fleet to occur at a specific time.
@michel-laterman
Contributor

I've tried to recreate this:

My first attempt was with a newly deployed cluster on v8.6.2, where I enrolled an agent on v8.5.3 and upgraded it to v8.6.2, but I did not encounter the issue after a few repeated attempts.

My next attempt was to deploy a cluster on v8.5.3 with an agent on the same version, then upgrade the cluster to v8.6.2, then upgrade the agent; on a single run of this test I could not recreate the issue.
I then tried a few attempts of enrolling a v8.5.3 agent against the upgraded cluster and upgrading the agent to v8.6.2, but was unable to recreate it.

Can you provide logs from a failing/stuck agent?
@juliaElastic, I saw your comment on the discuss thread; do you have any additional findings?

@michel-laterman michel-laterman self-assigned this Feb 27, 2023

renzedj commented Feb 27, 2023

@michel-laterman - What log level do you want, and what specifically are you looking for?


renzedj commented Feb 27, 2023

...also, @michel-laterman, were you doing an "Immediate" upgrade or a scheduled upgrade? The immediate upgrade usually works just fine for me, but it's when I either schedule it for later or give it a maintenance window to complete in that it fails.


renzedj commented Feb 27, 2023

I scheduled an upgrade from 8.6.0->8.6.2 on a single agent. Ten minutes later, all I was getting was repeated harvester panics:

{
    "log.level": "error",
    "@timestamp": "2023-02-27T14:06:53.330-0600",
    "message": "Harvester crashed with: harvester panic with: close of closed channel\ngoroutine 404 [running]:\nruntime/debug.Stack()\n\truntime/debug/stack.go:24 +0x65\ngithub.com/elastic/beats/v7/filebeat/input/filestream/internal/input-logfile.startHarvester.func1.1()\n\tgithub.com/elastic/beats/v7/filebeat/input/filestream/internal/input-logfile/harvester.go:167 +0x78\npanic({0x56037e4e0780, 0x56037eaa7ab0})\n\truntime/panic.go:844 +0x258\ngithub.com/elastic/beats/v7/libbeat/processors/add_kubernetes_metadata.(*cache).stop(...)\n\tgithub.com/elastic/beats/v7/libbeat/processors/add_kubernetes_metadata/cache.go:97\ngithub.com/elastic/beats/v7/libbeat/processors/add_kubernetes_metadata.(*kubernetesAnnotator).Close(0xc000c64000?)\n\tgithub.com/elastic/beats/v7/libbeat/processors/add_kubernetes_metadata/kubernetes.go:311 +0x4f\ngithub.com/elastic/beats/v7/libbeat/processors.Close(...)\n\tgithub.com/elastic/beats/v7/libbeat/processors/processor.go:58\ngithub.com/elastic/beats/v7/libbeat/publisher/processing.(*group).Close(0x5?)\n\tgithub.com/elastic/beats/v7/libbeat/publisher/processing/processors.go:95 +0x159\ngithub.com/elastic/beats/v7/libbeat/processors.Close(...)\n\tgithub.com/elastic/beats/v7/libbeat/processors/processor.go:58\ngithub.com/elastic/beats/v7/libbeat/publisher/processing.(*group).Close(0x0?)\n\tgithub.com/elastic/beats/v7/libbeat/publisher/processing/processors.go:95 +0x159\ngithub.com/elastic/beats/v7/libbeat/processors.Close(...)\n\tgithub.com/elastic/beats/v7/libbeat/processors/processor.go:58\ngithub.com/elastic/beats/v7/libbeat/publisher/pipeline.(*client).Close.func1()\n\tgithub.com/elastic/beats/v7/libbeat/publisher/pipeline/client.go:167 +0x2df\nsync.(*Once).doSlow(0x0?, 0x0?)\n\tsync/once.go:68 +0xc2\nsync.(*Once).Do(...)\n\tsync/once.go:59\ngithub.com/elastic/beats/v7/libbeat/publisher/pipeline.(*client).Close(0x56037c4e8346?)\n\tgithub.com/elastic/beats/v7/libbeat/publisher/pipeline/client.go:148 +0x59\ngithub.com/elastic/beats/v7/filebeat/beater.(*countingClient).Close(0x56037c4e82bf?)\n\tgithub.com/elastic/beats/v7/filebeat/beater/channels.go:145 +0x22\ngithub.com/elastic/beats/v7/filebeat/input/filestream/internal/input-logfile.startHarvester.func1({0x56037eaecf68?, 0xc00096e600})\n\tgithub.com/elastic/beats/v7/filebeat/input/filestream/internal/input-logfile/harvester.go:219 +0x929\ngithub.com/elastic/go-concert/unison.(*TaskGroup).Go.func1()\n\tgithub.com/elastic/[email protected]/unison/taskgroup.go:163 +0xc3\ncreated by github.com/elastic/go-concert/unison.(*TaskGroup).Go\n\tgithub.com/elastic/[email protected]/unison/taskgroup.go:159 +0xca\n",
    "component":
    {
        "binary": "filebeat",
        "dataset": "elastic_agent.filebeat",
        "id": "filestream-monitoring",
        "type": "filestream"
    },
    "service.name": "filebeat",
    "id": "filestream-monitoring-agent",
    "ecs.version": "1.6.0",
    "log.logger": "input.filestream",
    "log.origin":
    {
        "file.line": 168,
        "file.name": "input-logfile/harvester.go"
    },
    "source_file": "filestream::filestream-monitoring-agent::native::17035104-64778"
}

I'm now at 20 minutes post-upgrade, and the agent is still showing as "Updating." If there's anything else I can pull for you, please let me know.


cmacknz commented Feb 27, 2023

The agent can show as Updating if the upgrade fails on the agent. Did the upgrade from 8.6.1 to 8.6.2 actually succeed?

The harvester panic is from elastic/beats#34219, which is usually just log noise on shutdown. That you are seeing this repeatedly suggests that Filebeat here might be starting and stopping quickly, which might be a symptom of another problem.

What we would want in order to investigate this further is the archive output from elastic-agent diagnostics, but it is possible for it to contain credentials and other information you may not want posted in a public GitHub issue.

The complete logs from the agent are a good substitute for this, if you can provide a zip of the logs sub-directory of your installed agent. Those can be found at data/elastic-agent-$hash/logs. If you have multiple data/elastic-agent-$hash directories, providing the logs for each of them is valuable, as these are the logs from the different versions of the agent involved in the upgrades here.


renzedj commented Feb 27, 2023

I assumed that the harvester panic was associated with that; I was just posting what was written to the log.

To answer your question, the update did not succeed. I will take a look at the contents of the complete agent logs and post them, if I'm able. If I'm not able to post them to the issue, is it possible to open a ticket through our Elastic Cloud subscription and get them to you that way?

@michel-laterman
Contributor

@drenze-athene, we just need the agent log output for when it tried to run an upgrade.

I'll also try to recreate with the scheduled upgrades. How likely is it to fail?


renzedj commented Feb 28, 2023

@drenze-athene, we just need the agent log output for when it tried to run an upgrade.

I'll also try to recreate with the scheduled upgrades. How likely is it to fail?

I've had 100% failure when scheduling an upgrade or selecting a maintenance window other than immediately.

@juliaElastic
Contributor

Our QA team has found a similar issue, probably the same root cause: #2508


renzedj commented Feb 28, 2023

@juliaElastic - Thanks. I was wondering whether upgrade failure was a contributing factor, but without getting a scheduled upgrade to succeed, I wasn't able to tell.

@michel-laterman
Contributor

I'm not able to recreate this by scheduling the upgrade.
Also, if I give a bad upgrade instruction to an 8.5.3 agent, it goes unhealthy as expected.

@drenze-athene, can you turn on debug logs for an agent, recreate the failure, and post the agent diagnostics?

@michel-laterman
Contributor

Alright, I was able to recreate and investigate; it looks like this affects scheduled actions for v8.6+.
It seems to be an agent-side issue.
I ran a policy with no integrations/monitoring, with debug logs on, and found:

{"log.level":"debug","@timestamp":"2023-03-02T21:49:54.495Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":153},"message":"FleetGateway calling Checkin API","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-03-02T21:49:54.495Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":183},"message":"Checking started","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-03-02T21:49:54.721Z","log.origin":{"file.name":"remote/client.go","file.line":162},"message":"Request method: POST, path: /api/fleet/agents/0fcc94ac-efa9-43f3-ba7a-dbff775734c4/checkin, reqID: 01GTJ4V6N1MJRSJKY02QZZABVV","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-03-02T21:49:54.721Z","log.origin":{"file.name":"remote/client.go","file.line":177},"message":"Creating new request to request URL https://c1f9d6b79a324cedb74231badfae3f89.fleet.us-central1.gcp.qa.cld.elstc.co:443/api/fleet/agents/0fcc94ac-efa9-43f3-ba7a-dbff775734c4/checkin?","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-03-02T21:54:35.314Z","log.origin":{"file.name":"store/state_store.go","file.line":282},"message":"save state on disk : {action:0xc00045c2d0 ackToken:dd59d7b0-b9cc-4f89-b1a0-608f4a3f5a63 queue:[]}","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-03-02T21:54:35.315Z","log.origin":{"file.name":"dispatcher/dispatcher.go","file.line":195},"message":"Adding action id: 9176f4f0-f27c-41b2-825a-ced6a9d22244 to queue.","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-03-02T21:54:35.315Z","log.origin":{"file.name":"dispatcher/dispatcher.go","file.line":116},"message":"Gathered 0 actions from queue, 0 actions expired","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-03-02T21:54:35.315Z","log.origin":{"file.name":"dispatcher/dispatcher.go","file.line":117},"message":"Expired actions: []","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-03-02T21:54:35.327Z","log.origin":{"file.name":"store/state_store.go","file.line":282},"message":"save state on disk : {action:0xc00045c2d0 ackToken:dd59d7b0-b9cc-4f89-b1a0-608f4a3f5a63 queue:[0xc000296000]}","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-03-02T21:54:35.328Z","log.origin":{"file.name":"dispatcher/dispatcher.go","file.line":125},"message":"No action to dispatch","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-03-02T21:54:36.378Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":153},"message":"FleetGateway calling Checkin API","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-03-02T21:54:36.379Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":183},"message":"Checking started","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-03-02T21:54:36.599Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":314},"message":"using previously saved ack token: dd59d7b0-b9cc-4f89-b1a0-608f4a3f5a63","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-03-02T21:54:36.599Z","log.origin":{"file.name":"remote/client.go","file.line":162},"message":"Request method: POST, path: /api/fleet/agents/0fcc94ac-efa9-43f3-ba7a-dbff775734c4/checkin, reqID: 01GTJ53SXQ9NBMR88T2HJSVQQT","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-03-02T21:54:36.599Z","log.origin":{"file.name":"remote/client.go","file.line":177},"message":"Creating new request to request URL https://c1f9d6b79a324cedb74231badfae3f89.fleet.us-central1.gcp.qa.cld.elstc.co:443/api/fleet/agents/0fcc94ac-efa9-43f3-ba7a-dbff775734c4/checkin?","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}

We can see the scheduled action (9176f4f0-f27c-41b2-825a-ced6a9d22244) is added to the local queue after a checkin, and the next checkin runs as expected.
However, when that checkin returns, the scheduled action is not removed from the action queue in:
https://github.com/elastic/elastic-agent/blob/main/internal/pkg/agent/application/dispatcher/dispatcher.go#L115

The dispatch.Dispatch() call that would trigger this is made by the application passing actions received from the fleet-gateway to the dispatcher:
https://github.com/elastic/elastic-agent/blob/main/internal/pkg/agent/application/managed_mode.go#L211-L220

However, if a checkin returns no actions, nothing is sent:
https://github.com/elastic/elastic-agent/blob/main/internal/pkg/agent/application/gateway/fleet/fleet_gateway.go#L168-L170

I think that if we get rid of the if statement in fleet_gateway.go and send an empty list instead, the dispatcher should be able to check the action queue (see the sketch below).

This should be a simple fix.
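
A minimal sketch of the interaction described above and the proposed change, using hypothetical names (the real code is in the fleet_gateway.go and dispatcher.go files linked above; this is only an illustration, not the actual implementation):

package main

import (
	"fmt"
	"time"
)

// action stands in for a Fleet action that may carry a scheduled start time.
type action struct {
	id      string
	startAt time.Time
}

// dispatcher keeps scheduled actions in a local queue; queued actions are
// only re-examined when Dispatch is called.
type dispatcher struct {
	queue []action
}

// Dispatch runs the actions received from a checkin plus any queued actions
// whose scheduled start time has passed.
func (d *dispatcher) Dispatch(received []action) {
	now := time.Now()
	due := append([]action(nil), received...)
	var remaining []action
	for _, a := range d.queue {
		if a.startAt.After(now) {
			remaining = append(remaining, a) // not due yet, keep it queued
			continue
		}
		due = append(due, a) // scheduled time reached, run it now
	}
	d.queue = remaining
	for _, a := range due {
		fmt.Println("running action", a.id)
	}
}

// onCheckin models the gateway handing checkin results to the dispatcher.
// With skipEmpty=true (the current guard), an empty checkin never reaches
// the dispatcher, so a queued scheduled action is never picked up even after
// its start time passes. With skipEmpty=false (the proposed change), the
// dispatcher runs on every checkin and can drain its queue.
func onCheckin(d *dispatcher, actions []action, skipEmpty bool) {
	if skipEmpty && len(actions) == 0 {
		return
	}
	d.Dispatch(actions)
}

func main() {
	d := &dispatcher{queue: []action{{id: "scheduled-upgrade", startAt: time.Now().Add(-time.Minute)}}}
	onCheckin(d, nil, true)  // guard in place: nothing runs, the action stays stuck in the queue
	onCheckin(d, nil, false) // guard removed: the due "scheduled-upgrade" action runs
}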
I'll move this issue to the agent repo.

@michel-laterman michel-laterman added bug Something isn't working Team:Fleet Label for the Fleet team Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels Mar 2, 2023
@michel-laterman michel-laterman transferred this issue from elastic/fleet-server Mar 2, 2023
@michel-laterman michel-laterman added backport-v8.6.0 Automated backport with mergify backport-v8.7.0 Automated backport with mergify labels Mar 2, 2023
@michel-laterman
Contributor

I've added the backport-8.6.0 label even though there is no further 8.6 release planned at this time.

@cmacknz I think we should add this as a known issue to the 8.6 release notes.


cmacknz commented Mar 6, 2023

To add this as a known issue, open a PR against the 8.6 release notes: https://github.com/elastic/observability-docs/blob/main/docs/en/ingest-management/release-notes/release-notes-8.6.asciidoc

Note that you'll need to copy the known issue to each of 8.6.2, 8.6.1, and 8.6.0 if they are all affected. There are plenty of examples there to use as a reference. You can add me as a reviewer.
