Fleet stuck in "Updating..." status when attempting to upgrade agents to v8.6.1 or v8.6.2 #2343
Comments
I've tried to recreate this. My first attempt was with a newly deployed cluster on v8.6.2, where I enrolled an agent on v8.5.3 and upgraded it to v8.6.2, but I did not encounter the issue after a few repeated attempts. My next attempt was to deploy a cluster on v8.5.3 with an agent on the same version, upgrade the cluster to v8.6.2, then upgrade the agent; on a single run of this test I could not recreate the issue. Can you provide logs from a failing/stuck agent? |
@michel-laterman - What log level do you want, and what specifically are you looking for? |
...also, @michel-laterman, were you doing an "Immediate" upgrade or a scheduled upgrade? The immediate upgrade usually works just fine for me, but it's when I either schedule it for later or give it a maintenance window to complete in that it fails. |
I scheduled an upgrade from 8.6.0->8.6.2 on a single agent. Ten minutes later, all I was getting was repeated harvester panics: {
"log.level": "error",
"@timestamp": "2023-02-27T14:06:53.330-0600",
"message": "Harvester crashed with: harvester panic with: close of closed channel\ngoroutine 404 [running]:\nruntime/debug.Stack()\n\truntime/debug/stack.go:24 +0x65\ngithub.com/elastic/beats/v7/filebeat/input/filestream/internal/input-logfile.startHarvester.func1.1()\n\tgithub.com/elastic/beats/v7/filebeat/input/filestream/internal/input-logfile/harvester.go:167 +0x78\npanic({0x56037e4e0780, 0x56037eaa7ab0})\n\truntime/panic.go:844 +0x258\ngithub.com/elastic/beats/v7/libbeat/processors/add_kubernetes_metadata.(*cache).stop(...)\n\tgithub.com/elastic/beats/v7/libbeat/processors/add_kubernetes_metadata/cache.go:97\ngithub.com/elastic/beats/v7/libbeat/processors/add_kubernetes_metadata.(*kubernetesAnnotator).Close(0xc000c64000?)\n\tgithub.com/elastic/beats/v7/libbeat/processors/add_kubernetes_metadata/kubernetes.go:311 +0x4f\ngithub.com/elastic/beats/v7/libbeat/processors.Close(...)\n\tgithub.com/elastic/beats/v7/libbeat/processors/processor.go:58\ngithub.com/elastic/beats/v7/libbeat/publisher/processing.(*group).Close(0x5?)\n\tgithub.com/elastic/beats/v7/libbeat/publisher/processing/processors.go:95 +0x159\ngithub.com/elastic/beats/v7/libbeat/processors.Close(...)\n\tgithub.com/elastic/beats/v7/libbeat/processors/processor.go:58\ngithub.com/elastic/beats/v7/libbeat/publisher/processing.(*group).Close(0x0?)\n\tgithub.com/elastic/beats/v7/libbeat/publisher/processing/processors.go:95 +0x159\ngithub.com/elastic/beats/v7/libbeat/processors.Close(...)\n\tgithub.com/elastic/beats/v7/libbeat/processors/processor.go:58\ngithub.com/elastic/beats/v7/libbeat/publisher/pipeline.(*client).Close.func1()\n\tgithub.com/elastic/beats/v7/libbeat/publisher/pipeline/client.go:167 +0x2df\nsync.(*Once).doSlow(0x0?, 0x0?)\n\tsync/once.go:68 +0xc2\nsync.(*Once).Do(...)\n\tsync/once.go:59\ngithub.com/elastic/beats/v7/libbeat/publisher/pipeline.(*client).Close(0x56037c4e8346?)\n\tgithub.com/elastic/beats/v7/libbeat/publisher/pipeline/client.go:148 +0x59\ngithub.com/elastic/beats/v7/filebeat/beater.(*countingClient).Close(0x56037c4e82bf?)\n\tgithub.com/elastic/beats/v7/filebeat/beater/channels.go:145 +0x22\ngithub.com/elastic/beats/v7/filebeat/input/filestream/internal/input-logfile.startHarvester.func1({0x56037eaecf68?, 0xc00096e600})\n\tgithub.com/elastic/beats/v7/filebeat/input/filestream/internal/input-logfile/harvester.go:219 +0x929\ngithub.com/elastic/go-concert/unison.(*TaskGroup).Go.func1()\n\tgithub.com/elastic/[email protected]/unison/taskgroup.go:163 +0xc3\ncreated by github.com/elastic/go-concert/unison.(*TaskGroup).Go\n\tgithub.com/elastic/[email protected]/unison/taskgroup.go:159 +0xca\n",
"component":
{
"binary": "filebeat",
"dataset": "elastic_agent.filebeat",
"id": "filestream-monitoring",
"type": "filestream"
},
"service.name": "filebeat",
"id": "filestream-monitoring-agent",
"ecs.version": "1.6.0",
"log.logger": "input.filestream",
"log.origin":
{
"file.line": 168,
"file.name": "input-logfile/harvester.go"
},
"source_file": "filestream::filestream-monitoring-agent::native::17035104-64778"
} I'm now at 20 minutes post-upgrade, and the agent is still showing as "Updating." If there's anything else I can pull for you, please let me know. |
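For readers unfamiliar with the panic message above: "close of closed channel" is Go's standard runtime panic when close() is called on a channel that has already been closed. The sketch below is not the Beats code; it is a minimal, self-contained illustration (with made-up cache types) of that failure mode and of the sync.Once guard that is the usual way to make a stop/Close path idempotent.

```go
// Minimal illustration of the "close of closed channel" failure mode named in
// the panic above. Not the Beats code; types and names are invented.
package main

import (
	"fmt"
	"sync"
)

type cache struct {
	done chan struct{}
}

func (c *cache) stop() {
	close(c.done) // panics with "close of closed channel" if called twice
}

type safeCache struct {
	done     chan struct{}
	stopOnce sync.Once
}

func (c *safeCache) stop() {
	c.stopOnce.Do(func() { close(c.done) }) // second call becomes a no-op
}

func main() {
	s := &safeCache{done: make(chan struct{})}
	s.stop()
	s.stop() // safe: guarded by sync.Once
	fmt.Println("safe close twice: ok")

	u := &cache{done: make(chan struct{})}
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r) // "close of closed channel"
		}
	}()
	u.stop()
	u.stop() // panics, matching the message in the harvester log
}
```

In the trace above, the panic surfaces when add_kubernetes_metadata's cache is stopped after it has already been stopped during shutdown, which is why it is usually only noise.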
The agent can show as "Updating" while an upgrade is still in progress; did the upgrade eventually succeed? The harvester panic is from elastic/beats#34219, which is usually just log noise on shutdown. That you are seeing it repeatedly suggests that Filebeat here might be starting and stopping quickly, which might be a symptom of another problem. What we would want in order to investigate this further is the diagnostics archive from the agent. The complete logs from the agent are a good substitute, if you can provide a zip of the agent's log directory. |
I assumed the harvester panic was associated with that; I was just posting what was written to the log. To answer your question, the update did not succeed. I will take a look at the contents of the complete agent logs and post them, if I'm able. If I'm not able to post them to the issue, is it possible to open a ticket through our Elastic Cloud subscription and get them to you that way? |
@drenze-athene, we just need the agent log output for when it tried to run an upgrade. I'll also try to recreate with the scheduled upgrades. How likely is it to fail? |
I've had 100% failure when scheduling an upgrade or selecting a maintenance window other than immediately. |
Our QA team has found a similar issue, probably the same root cause: #2508 |
@juliaElastic - Thanks. I was wondering if upgrade failure was a contributing factor, but without getting a scheduled upgrade to succeed, I wasn't able to determine whether that was the case. |
I'm not able to recreate by scheduling the upgrade. @drenze-athene, can you turn on debug logs for an agent, recreate the failure, and post the agent diagnostics? |
Alright, I was able to recreate and investigate; it looks like this affects scheduled upgrade actions.
We can see the scheduled action arrive and get queued on the agent. However, queued scheduled actions are only dispatched while handling a checkin response that itself contains actions; if a checkin returns no actions, nothing is sent. I think that if we get rid of the if statement guarding that dispatch, the queue will be checked on every checkin. This should be a simple fix. |
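To make the pattern concrete, here is a hypothetical sketch (the names dispatcher, Action, and due, and the overall structure, are invented for illustration and are not the actual elastic-agent source): when the code that flushes queued scheduled actions sits behind a guard requiring the checkin response to contain actions, an empty checkin never triggers the queued upgrade; removing that guard means the queue is consulted on every checkin.

```go
// Hypothetical sketch of the dispatch gate described above. Names and
// structure are made up for illustration; this is not elastic-agent code.
package main

import (
	"fmt"
	"time"
)

type Action struct {
	ID        string
	StartTime time.Time
}

type dispatcher struct {
	queue []Action // scheduled actions waiting for their start time
}

// due returns queued actions whose start time has already passed.
func (d *dispatcher) due(now time.Time) []Action {
	var ready []Action
	for _, a := range d.queue {
		if !a.StartTime.After(now) {
			ready = append(ready, a)
		}
	}
	return ready
}

// buggy only looks at the queue when the checkin returned actions, so an
// empty checkin response never dispatches the queued (scheduled) upgrade.
func (d *dispatcher) buggy(fromCheckin []Action, now time.Time) []Action {
	if len(fromCheckin) == 0 {
		return nil // <- the "if statement" the comment proposes removing
	}
	return append(fromCheckin, d.due(now)...)
}

// fixed always consults the queue, even on an empty checkin response.
func (d *dispatcher) fixed(fromCheckin []Action, now time.Time) []Action {
	return append(fromCheckin, d.due(now)...)
}

func main() {
	d := &dispatcher{queue: []Action{{ID: "scheduled-upgrade", StartTime: time.Now().Add(-time.Minute)}}}
	fmt.Println("buggy dispatch on empty checkin:", d.buggy(nil, time.Now()))
	fmt.Println("fixed dispatch on empty checkin:", d.fixed(nil, time.Now()))
}
```

Running this prints nothing to dispatch for the buggy path and the queued scheduled-upgrade action for the fixed path, which matches the behavior described above.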
I've added the backport-8.6.0 label even though there is no further 8.6 release planned at this time. @cmacknz, I think we should add this as a known issue to the 8.6 release notes. |
To add this as a known issue, open a PR against the 8.6 release notes: https://github.com/elastic/observability-docs/blob/main/docs/en/ingest-management/release-notes/release-notes-8.6.asciidoc Note that you'll need to copy the known issue to each of 8.6.2, 8.6.1, and 8.6.0 if they are all affected. There are plenty of examples there to use as a reference. You can add me as a reviewer. |
Issue
Since upgrading from Elastic Cloud 8.5.3, I have been unable to manage Elastic Agent integrations through Fleet in Kibana.
I initially upgraded from 8.5.3 to 8.6.1 and scheduled an agent upgrade. A few of the agents upgraded as expected; the rest remained stuck in "Updating..." status in Fleet for more than a week. I later updated to 8.6.2 and continued to experience the same issue.
I cancelled the "Updating..." status per the instructions in this post, which returned the agents to Healthy status, as promised. A second attempt at upgrading yielded the same results. After cancelling the "Updating..." status again, I determined that scheduling agents to upgrade immediately would usually succeed as long as they were all in "Healthy" status, but if an "Offline" or even a "Not Healthy" agent was included, the upgrade would fail. If I attempted to schedule the upgrade for a later time, or with a maintenance window other than "immediately", the upgrade failed.
Additionally, when I checked the "Agent Activity" log in Fleet this morning, I saw multiple activities showing "Reassigning X of Y agents to [new policy]" that I executed approximately a week ago still listed as in progress. However, in this case, when I investigate, all agents have been reassigned to the new policy and are reporting in as expected. I don't know whether the two items are related.
Deployment
My deployment is:
Steps to Reproduce