
Fleet stuck in "Updating..." status when attempting to upgrade agents to v8.6.1 or v8.6.2 #2343

Closed
renzedj opened this issue Feb 27, 2023 · 14 comments · Fixed by #2344
Labels: backport-v8.6.0, backport-v8.7.0, bug, Team:Elastic-Agent-Control-Plane, Team:Fleet

Comments


renzedj commented Feb 27, 2023

Issue

Since upgrading from Elastic Cloud 8.5.3, I have been unable to manage Elastic Agent integrations through Fleet in Kibana.

I initially upgraded from 8.5.3 to 8.6.1 and scheduled an agent upgrade. A few of the agents upgraded as expected; the rest remained stuck in "Updating..." status in Fleet for more than a week. I later updated to 8.6.2 and continued to experience the same issue.

I cancelled the "Updating..." status per the instructions in this post, which returned the agents to Healthy status as promised. A second upgrade attempt yielded the same results. After cancelling the "Updating..." status again, I found that an immediate upgrade would usually succeed as long as all selected agents were in "Healthy" status, but it would fail if an "Offline" or even a "Not Healthy" agent was included. If I scheduled the upgrade for a later time, or chose any maintenance window other than "immediately", the upgrade failed.

Additionally, when I checked the "Agent Activity" log in Fleet this morning, I noticed that multiple "Reassigning X of Y agents to [new policy]" activities that I executed approximately a week ago are still shown as in progress. In that case, however, all agents have actually been reassigned to the new policy and are reporting in as expected. I don't know whether the two items are related.

Deployment

My deployment is:

Steps to Reproduce

  1. Upgrade Elastic Cloud to 8.6.1 or 8.6.2
  2. Schedule an agent upgrade in Fleet to occur at a specific time.
@michel-laterman
Contributor

I've tried to recreate this:

My first attempt was with a newly deployed cluster on v8.6.2, where I enrolled an agent on v8.5.3 and upgraded it to v8.6.2, but I did not encounter the issue after a few repeated attempts.

My next attempt was to deploy a cluster on v8.5.3 with an agent on the same version, then upgrade the cluster to v8.6.2, then upgrade the agent; on a single run of this test I could not recreate the issue.
I then tried a few attempts of enrolling a v8.5.3 agent against the upgraded cluster and upgrading the agent to v8.6.2, but was unable to recreate it.

Can you provide logs from a failing/stuck agent?
@juliaElastic, I saw your comment on the discuss thread; do you have any additional findings?

@michel-laterman michel-laterman self-assigned this Feb 27, 2023

renzedj commented Feb 27, 2023

@michel-laterman - What log level do you want, and what specifically are you looking for?


renzedj commented Feb 27, 2023

...also, @michel-laterman, were you doing an "Immediate" upgrade or a scheduled upgrade? The immediate upgrade usually works just fine for me, but it's when I either schedule it for later or give it a maintenance window to complete in that it fails.


renzedj commented Feb 27, 2023

I scheduled an upgrade from 8.6.0->8.6.2 on a single agent. Ten minutes later, all I was getting was repeated harvester panics:

{
    "log.level": "error",
    "@timestamp": "2023-02-27T14:06:53.330-0600",
    "message": "Harvester crashed with: harvester panic with: close of closed channel\ngoroutine 404 [running]:\nruntime/debug.Stack()\n\truntime/debug/stack.go:24 +0x65\ngithub.com/elastic/beats/v7/filebeat/input/filestream/internal/input-logfile.startHarvester.func1.1()\n\tgithub.com/elastic/beats/v7/filebeat/input/filestream/internal/input-logfile/harvester.go:167 +0x78\npanic({0x56037e4e0780, 0x56037eaa7ab0})\n\truntime/panic.go:844 +0x258\ngithub.com/elastic/beats/v7/libbeat/processors/add_kubernetes_metadata.(*cache).stop(...)\n\tgithub.com/elastic/beats/v7/libbeat/processors/add_kubernetes_metadata/cache.go:97\ngithub.com/elastic/beats/v7/libbeat/processors/add_kubernetes_metadata.(*kubernetesAnnotator).Close(0xc000c64000?)\n\tgithub.com/elastic/beats/v7/libbeat/processors/add_kubernetes_metadata/kubernetes.go:311 +0x4f\ngithub.com/elastic/beats/v7/libbeat/processors.Close(...)\n\tgithub.com/elastic/beats/v7/libbeat/processors/processor.go:58\ngithub.com/elastic/beats/v7/libbeat/publisher/processing.(*group).Close(0x5?)\n\tgithub.com/elastic/beats/v7/libbeat/publisher/processing/processors.go:95 +0x159\ngithub.com/elastic/beats/v7/libbeat/processors.Close(...)\n\tgithub.com/elastic/beats/v7/libbeat/processors/processor.go:58\ngithub.com/elastic/beats/v7/libbeat/publisher/processing.(*group).Close(0x0?)\n\tgithub.com/elastic/beats/v7/libbeat/publisher/processing/processors.go:95 +0x159\ngithub.com/elastic/beats/v7/libbeat/processors.Close(...)\n\tgithub.com/elastic/beats/v7/libbeat/processors/processor.go:58\ngithub.com/elastic/beats/v7/libbeat/publisher/pipeline.(*client).Close.func1()\n\tgithub.com/elastic/beats/v7/libbeat/publisher/pipeline/client.go:167 +0x2df\nsync.(*Once).doSlow(0x0?, 0x0?)\n\tsync/once.go:68 +0xc2\nsync.(*Once).Do(...)\n\tsync/once.go:59\ngithub.com/elastic/beats/v7/libbeat/publisher/pipeline.(*client).Close(0x56037c4e8346?)\n\tgithub.com/elastic/beats/v7/libbeat/publisher/pipeline/client.go:148 +0x59\ngithub.com/elastic/beats/v7/filebeat/beater.(*countingClient).Close(0x56037c4e82bf?)\n\tgithub.com/elastic/beats/v7/filebeat/beater/channels.go:145 +0x22\ngithub.com/elastic/beats/v7/filebeat/input/filestream/internal/input-logfile.startHarvester.func1({0x56037eaecf68?, 0xc00096e600})\n\tgithub.com/elastic/beats/v7/filebeat/input/filestream/internal/input-logfile/harvester.go:219 +0x929\ngithub.com/elastic/go-concert/unison.(*TaskGroup).Go.func1()\n\tgithub.com/elastic/[email protected]/unison/taskgroup.go:163 +0xc3\ncreated by github.com/elastic/go-concert/unison.(*TaskGroup).Go\n\tgithub.com/elastic/[email protected]/unison/taskgroup.go:159 +0xca\n",
    "component":
    {
        "binary": "filebeat",
        "dataset": "elastic_agent.filebeat",
        "id": "filestream-monitoring",
        "type": "filestream"
    },
    "service.name": "filebeat",
    "id": "filestream-monitoring-agent",
    "ecs.version": "1.6.0",
    "log.logger": "input.filestream",
    "log.origin":
    {
        "file.line": 168,
        "file.name": "input-logfile/harvester.go"
    },
    "source_file": "filestream::filestream-monitoring-agent::native::17035104-64778"
}

I'm now at 20 minutes post-upgrade, and the agent is still showing as "Updating." If there's anything else I can pull for you, please let me know.


cmacknz commented Feb 27, 2023

The agent can show as Updating if the upgrade fails on the agent. Did the upgrade from 8.6.1 to 8.6.2 actually succeed?

The harvester panic is from elastic/beats#34219, which is usually just log noise on shutdown. That you are seeing this repeatedly suggests that Filebeat here might be starting and stopping quickly, which might be a symptom of another problem.

What we would want in order to investigate this further is the archive output from elastic-agent diagnostics, but it is possible for it to contain credentials and other information you may not want posted in a public GitHub issue.

The complete logs from the agent are a good substitute for this, if you can provide a zip of the logs sub-directory of your installed agent. Those can be found at data/elastic-agent-$hash/logs. If you have multiple data/elastic-agent-$hash directories, providing the logs for each of them is valuable, as these are the logs from the different versions of the agent involved in the upgrades here.


renzedj commented Feb 27, 2023

I assumed that the harvester panic was associated with that; I was just posting what was written to the log.

To answer your question, the update did not succeed. I will take a look at the contents of the complete agent logs and post them, if I'm able. If I'm not able to post them to the issue, is it possible to open a ticket through our Elastic Cloud subscription and get them to you that way?

@michel-laterman
Contributor

@drenze-athene, we just need the agent log output for when it tried to run an upgrade.

I'll also try to recreate with the scheduled upgrades. How likely is it to fail?


renzedj commented Feb 28, 2023

@drenze-athene, we just need the agent log output for when it tried to run an upgrade.

I'll also try to recreate with the scheduled upgrades. How likely is it to fail?

I've had 100% failure when scheduling an upgrade or selecting a maintenance window other than immediately.

@juliaElastic
Contributor

Our QA team has found a similar issue, probably the same root cause: #2508


renzedj commented Feb 28, 2023

@juliaElastic - Thanks. I was wondering whether upgrade failure was a contributing factor, but without getting a scheduled upgrade to succeed, I wasn't able to tell.

@michel-laterman
Contributor

I'm not able to recreate this by scheduling the upgrade.
Also, if I give a bad upgrade instruction to an 8.5.3 agent, it goes unhealthy as expected.

@drenze-athene, can you turn on debug logs for an agent, recreate the failure, and post the agent diagnostics?

@michel-laterman
Contributor

Alright, I was able to recreate and investigate; it looks like this affects scheduled actions for v8.6+.
It seems to be an agent-side issue.
I ran a policy with no integrations/monitoring, with debug logs on, and found:

{"log.level":"debug","@timestamp":"2023-03-02T21:49:54.495Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":153},"message":"FleetGateway calling Checkin API","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-03-02T21:49:54.495Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":183},"message":"Checking started","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-03-02T21:49:54.721Z","log.origin":{"file.name":"remote/client.go","file.line":162},"message":"Request method: POST, path: /api/fleet/agents/0fcc94ac-efa9-43f3-ba7a-dbff775734c4/checkin, reqID: 01GTJ4V6N1MJRSJKY02QZZABVV","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-03-02T21:49:54.721Z","log.origin":{"file.name":"remote/client.go","file.line":177},"message":"Creating new request to request URL https://c1f9d6b79a324cedb74231badfae3f89.fleet.us-central1.gcp.qa.cld.elstc.co:443/api/fleet/agents/0fcc94ac-efa9-43f3-ba7a-dbff775734c4/checkin?","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-03-02T21:54:35.314Z","log.origin":{"file.name":"store/state_store.go","file.line":282},"message":"save state on disk : {action:0xc00045c2d0 ackToken:dd59d7b0-b9cc-4f89-b1a0-608f4a3f5a63 queue:[]}","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-03-02T21:54:35.315Z","log.origin":{"file.name":"dispatcher/dispatcher.go","file.line":195},"message":"Adding action id: 9176f4f0-f27c-41b2-825a-ced6a9d22244 to queue.","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-03-02T21:54:35.315Z","log.origin":{"file.name":"dispatcher/dispatcher.go","file.line":116},"message":"Gathered 0 actions from queue, 0 actions expired","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-03-02T21:54:35.315Z","log.origin":{"file.name":"dispatcher/dispatcher.go","file.line":117},"message":"Expired actions: []","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-03-02T21:54:35.327Z","log.origin":{"file.name":"store/state_store.go","file.line":282},"message":"save state on disk : {action:0xc00045c2d0 ackToken:dd59d7b0-b9cc-4f89-b1a0-608f4a3f5a63 queue:[0xc000296000]}","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-03-02T21:54:35.328Z","log.origin":{"file.name":"dispatcher/dispatcher.go","file.line":125},"message":"No action to dispatch","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-03-02T21:54:36.378Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":153},"message":"FleetGateway calling Checkin API","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-03-02T21:54:36.379Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":183},"message":"Checking started","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-03-02T21:54:36.599Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":314},"message":"using previously saved ack token: dd59d7b0-b9cc-4f89-b1a0-608f4a3f5a63","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-03-02T21:54:36.599Z","log.origin":{"file.name":"remote/client.go","file.line":162},"message":"Request method: POST, path: /api/fleet/agents/0fcc94ac-efa9-43f3-ba7a-dbff775734c4/checkin, reqID: 01GTJ53SXQ9NBMR88T2HJSVQQT","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-03-02T21:54:36.599Z","log.origin":{"file.name":"remote/client.go","file.line":177},"message":"Creating new request to request URL https://c1f9d6b79a324cedb74231badfae3f89.fleet.us-central1.gcp.qa.cld.elstc.co:443/api/fleet/agents/0fcc94ac-efa9-43f3-ba7a-dbff775734c4/checkin?","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}

We can see the scheduled action (9176f4f0-f27c-41b2-825a-ced6a9d22244) is added to the local queue after a checkin, and the next checkin runs as expected.
However, when that checkin returns, the scheduled action is not removed from the action queue in:
https://github.com/elastic/elastic-agent/blob/main/internal/pkg/agent/application/dispatcher/dispatcher.go#L115

The dispatch.Dispatch() call that would trigger this is made by the application passing actions received from the fleet-gateway to the dispatcher:
https://github.com/elastic/elastic-agent/blob/main/internal/pkg/agent/application/managed_mode.go#L211-L220

However, if a checkin returns no actions, nothing is sent:
https://github.com/elastic/elastic-agent/blob/main/internal/pkg/agent/application/gateway/fleet/fleet_gateway.go#L168-L170

I think that if we get rid of the if statement in fleet_gateway.go and send an empty list instead, the dispatcher should be able to check the action queue (see the sketch below).

This should be a simple fix.
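
A minimal sketch of the interaction described above and the proposed change, using hypothetical names (the real code is in the fleet_gateway.go and dispatcher.go files linked above; this is only an illustration, not the actual implementation):

package main

import (
	"fmt"
	"time"
)

// action stands in for a Fleet action that may carry a scheduled start time.
type action struct {
	id      string
	startAt time.Time
}

// dispatcher keeps scheduled actions in a local queue; queued actions are
// only re-examined when Dispatch is called.
type dispatcher struct {
	queue []action
}

// Dispatch runs the actions received from a checkin plus any queued actions
// whose scheduled start time has passed.
func (d *dispatcher) Dispatch(received []action) {
	now := time.Now()
	due := append([]action(nil), received...)
	var remaining []action
	for _, a := range d.queue {
		if a.startAt.After(now) {
			remaining = append(remaining, a) // not due yet, keep it queued
			continue
		}
		due = append(due, a) // scheduled time reached, run it now
	}
	d.queue = remaining
	for _, a := range due {
		fmt.Println("running action", a.id)
	}
}

// onCheckin models the gateway handing checkin results to the dispatcher.
// With skipEmpty=true (the current guard), an empty checkin never reaches
// the dispatcher, so a queued scheduled action is never picked up even after
// its start time passes. With skipEmpty=false (the proposed change), the
// dispatcher runs on every checkin and can drain its queue.
func onCheckin(d *dispatcher, actions []action, skipEmpty bool) {
	if skipEmpty && len(actions) == 0 {
		return
	}
	d.Dispatch(actions)
}

func main() {
	d := &dispatcher{queue: []action{{id: "scheduled-upgrade", startAt: time.Now().Add(-time.Minute)}}}
	onCheckin(d, nil, true)  // guard in place: nothing runs, the action stays stuck in the queue
	onCheckin(d, nil, false) // guard removed: the due "scheduled-upgrade" action runs
}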
I'll move this issue to the agent repo.

@michel-laterman michel-laterman added bug Something isn't working Team:Fleet Label for the Fleet team Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels Mar 2, 2023
@michel-laterman michel-laterman transferred this issue from elastic/fleet-server Mar 2, 2023
@michel-laterman michel-laterman added backport-v8.6.0 Automated backport with mergify backport-v8.7.0 Automated backport with mergify labels Mar 2, 2023
@michel-laterman
Contributor

I've added the backport-8.6.0 label even though there is no further 8.6 release planned at this time.

@cmacknz I think we should add this as a known issue to the 8.6 release notes.


cmacknz commented Mar 6, 2023

To add this as a known issue, open a PR against the 8.6 release notes: https://github.com/elastic/observability-docs/blob/main/docs/en/ingest-management/release-notes/release-notes-8.6.asciidoc

Note that you'll need to copy the known issue to each of 8.6.2, 8.6.1, and 8.6.0 if they are all affected. There are plenty of examples there to use as a reference. You can add me as a reviewer.
