fix(promtail): Fix bug with Promtail config reloading getting stuck indefinitely #12795

ptodev · 2024-04-25T17:28:43Z

What this PR does / why we need it:

Recently, a memory issue was reported with the Agent in Static mode. The Agent's memory usage crept up steadily until it eventually ran out of memory (OOM). That Agent was having its config reloaded every 30 seconds.

A goroutine dump showed that these calls had been blocked for a long time:

goroutine 152484 [chan receive, 1214 minutes]:
github.com/grafana/loki/clients/pkg/promtail/targets/file.(*FileTarget).Stop(...)
	/go/pkg/mod/github.com/grafana/[email protected]/clients/pkg/promtail/targets/file/filetarget.go:159

goroutine 152424 [chan send, 1220 minutes]:
github.com/grafana/loki/clients/pkg/promtail/targets/file.(*FileTarget).startWatching(0xc0030ab5f0, 0xc002d9fe48?)
	/go/pkg/mod/github.com/grafana/[email protected]/clients/pkg/promtail/targets/file/filetarget.go:314 +0x20a

goroutine 152426 [chan send, 1220 minutes]:
github.com/grafana/loki/clients/pkg/promtail/targets/file.(*FileTarget).startWatching(0xc0030ab6c0, 0xc002e31e48?)
	/go/pkg/mod/github.com/grafana/[email protected]/clients/pkg/promtail/targets/file/filetarget.go:314 +0x20a

goroutine 152428 [chan send, 1210 minutes]:
github.com/grafana/loki/clients/pkg/promtail/targets/file.(*FileTarget).stopWatching(0xc0030ab790, 0xc002da1d88?)
	/go/pkg/mod/github.com/grafana/[email protected]/clients/pkg/promtail/targets/file/filetarget.go:327 +0x20a

What is probably happening is that FileTargetManager begins a Stop() but has not yet closed the targetEventHandler channel. As a result, startWatching and stopWatching appear to be stuck sending on that channel. The sync call therefore never completes, which in turn means that the FileTarget's Stop() function can't complete either.

The memory buildup is probably due to the many calls to the config reload function that never complete.

cc @paul1r who recently committed similar fixes.

Should I add a changelog entry? And do you think there is a way to test this? Also, I haven't yet tested with the customer. If you think the code looks ok, we could merge it and verify later that it does fix the customer issue?

Checklist

Reviewed the CONTRIBUTING.md guide (required)
Documentation added
Tests updated
Title matches the required conventional commits format, see here
Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
For Helm chart changes bump the Helm chart version in production/helm/loki/Chart.yaml and update production/helm/loki/CHANGELOG.md and production/helm/loki/README.md. Example PR
If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR

cstyan · 2024-04-25T20:13:11Z

> Should I add a changelog entry? Also, I haven't yet tested with the customer.

Changelog entries are now auto-generated from the PR title, which must follow the conventional commit format; see the failed check on this PR.

> And do you think there is a way to test this?

We definitely need a test. I imagine we could add a test that calls Stop() or startWatching/stopWatching in a goroutine and uses a timeout to fail if those functions don't return within X seconds. It should fail without the changes in your current commit and pass with them.

> If you think the code looks ok, we could merge it and verify later that it does fix the customer issue?

It looks okay, but why can't we at least verify out of band before we consider merging? I would assume that if the Agent or Alloy codebase still pulls in upstream Promtail code, this change could be hacked in somehow (via go mod, I guess, since I think you guys are not using a vendor directory) and deployed, so that we can trigger a config reload and see whether there's still a deadlock.

ptodev · 2024-04-26T18:24:48Z

@cstyan thank you so much for the quick and thorough feedback!

The problem with the test is that we need to make sure FileTarget has already started a sync but has not yet sent all its data to the channel. If we call Stop() after it has already sent the data on the channel, or before it has kicked off a sync, then the test wouldn't be valid.

I updated the PR with a test which I believe works.

> It looks okay but why can't we at least verify out of band before we consider merging?

I'm just not sure I can replicate the circumstances required for this bug in real life. I think it's most likely to reproduce when there is a long list of directories to watch and a very high config reload frequency. I could try replicating it next week, but if I'm not successful within a few hours I think we should just merge the PR. I do believe that the PR fixes a real bug anyway.

ptodev · 2024-04-26T18:27:43Z

> The problem with the test is that we need to make sure FileTarget has already started a sync, but has not yet sent all its data to the channel.

One way to do this is to call sync directly, just like some other tests do. However, I want to avoid this because I don't want to make assumptions about what sync's internals are.

clients/pkg/promtail/targets/file/filetarget_test.go

clients/pkg/promtail/targets/file/filetarget.go

cstyan

approved 👍 just one last nit to be fixed

👍 thanks for your patience and continued effort with various promtail upstream work @ptodev

ptodev · 2024-05-08T17:13:26Z

@cstyan No worries, sorry for the late reply - I removed the "continue" comment just now.

grafanabot · 2024-05-10T18:11:57Z

Hello @MasslessParticle!
Backport pull requests need to be either:

Pull requests which address bugs,
Urgent fixes which need product approval, in order to get merged,
Docs changes.

Please, if the current pull request addresses a bug fix, label it with the type/bug label.
If it already has the product approval, please add the product-approved label. For docs changes, please add the type/docs label.
If the pull request modifies CI behaviour, please add the type/ci label.
If none of the above applies, please consider removing the backport label and target the next major/minor release.
Thanks!

grafanabot · 2024-05-10T18:18:32Z

The backport to k190 failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new branch
git switch --create backport-12795-to-k190 origin/k190
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x 4d761acd85b90cbdcafdf8d2547f0db14f6ae4dd

When the conflicts are resolved, stage and commit the changes:

git add . && git cherry-pick --continue

If you have the GitHub CLI installed:

# Push the branch to GitHub:
git push --set-upstream origin backport-12795-to-k190
# Create the PR body template
PR_BODY=$(gh pr view 12795 --json body --template 'Backport 4d761acd85b90cbdcafdf8d2547f0db14f6ae4dd from #12795{{ "\n\n---\n\n" }}{{ index . "body" }}')
# Create the PR on GitHub
echo "${PR_BODY}" | gh pr create --title "chore: [k190] fix(promtail): Fix bug with Promtail config reloading getting stuck indefinitely" --body-file - --label "size/L" --label "type/bug" --label "backport" --base k190 --milestone k190 --web

Or, if you don't have the GitHub CLI installed (we recommend you install it!):

# Push the branch to GitHub:
git push --set-upstream origin backport-12795-to-k190

# Create a pull request where the `base` branch is `k190` and the `compare`/`head` branch is `backport-12795-to-k190`.

# Remove the local backport branch
git switch main
git branch -D backport-12795-to-k190

ptodev requested a review from a team as a code owner April 25, 2024 17:28

pull-request-size bot added the size/M label Apr 25, 2024

ptodev changed the title ~~Fix issue with stopping a target during a sync~~ Fix bug with Promtail config reloading getting stuck indefinitely Apr 25, 2024

fix(promtail): Fix bug with Promtail config reloading getting stuck i…

3102cce

…ndefinitely Signed-off-by: Paulin Todev <[email protected]>

ptodev force-pushed the ptodev/fix-target-stop branch from d97954b to 86ad684 Compare April 26, 2024 18:23

pull-request-size bot added size/L and removed size/M labels Apr 26, 2024

ptodev changed the title ~~Fix bug with Promtail config reloading getting stuck indefinitely~~ fix(promtail): Fix bug with Promtail config reloading getting stuck indefinitely Apr 26, 2024

fix(promtail): Add a unit test

0bcaabc

Signed-off-by: Paulin Todev <[email protected]>

ptodev force-pushed the ptodev/fix-target-stop branch from 86ad684 to 0bcaabc Compare April 26, 2024 18:38

cstyan requested changes Apr 26, 2024

clients/pkg/promtail/targets/file/filetarget_test.go (outdated, resolved)

Fail test if FileTarget doesn't stop within a few seconds.

d4e30c2

ptodev requested a review from cstyan April 30, 2024 17:43

cstyan reviewed Apr 30, 2024

clients/pkg/promtail/targets/file/filetarget.go (outdated, resolved)

cstyan reviewed May 1, 2024

Remove unnecessary "continue" statements.

eec3a96

ptodev requested a review from cstyan May 8, 2024 17:12

cstyan approved these changes May 9, 2024

cstyan merged commit 4d761ac into main May 9, 2024
58 checks passed

cstyan deleted the ptodev/fix-target-stop branch May 9, 2024 17:56

MasslessParticle added the backport k190 label May 10, 2024

grafanabot added the missing-labels label May 10, 2024

MasslessParticle added the type/bug (Something is not working as expected) label May 10, 2024

grafanabot removed the missing-labels label May 10, 2024

MasslessParticle removed the backport k190 label May 10, 2024

MasslessParticle added the backport k190 label May 10, 2024

grafanabot added the backport-failed label May 10, 2024

MasslessParticle pushed a commit that referenced this pull request May 10, 2024

fix(promtail): Fix bug with Promtail config reloading getting stuck i…

8c19569

…ndefinitely (#12795) Signed-off-by: Paulin Todev <[email protected]> (cherry picked from commit 4d761ac)

MasslessParticle mentioned this pull request May 10, 2024

fix(promtail): Fix bug with Promtail config reloading getting stuck i… #12939

Merged

7 tasks

loki-gh-app bot mentioned this pull request May 13, 2024

chore(k202): release 3.1.0 #12945

Closed

This was referenced May 13, 2024

Update Loki and sync some of the Promtail code grafana/alloy#836

Merged

Update Loki dependency grafana/agent#6905

Merged

loki-gh-app bot mentioned this pull request May 20, 2024

chore(k203): release 3.1.0 #12988

Closed

This was referenced May 27, 2024

chore(k204): release 3.1.0 #13037

Closed

chore(k205): release 3.1.0 #13102

Closed

This was referenced Jun 10, 2024

chore(k206): release 3.1.0 #13184

Closed

chore(k207): release 3.1.0 #13225

Merged

loki-gh-app bot mentioned this pull request Jun 24, 2024

chore(k208): release 3.1.0 #13291

Closed

loki-gh-app bot mentioned this pull request Jul 1, 2024

chore(k209): release 3.1.0 #13356

Closed

grafanabot mentioned this pull request Jul 2, 2024

chore: [main] chore(k207): release 3.1.0 #13391

Open

This was referenced Jul 3, 2024

chore(release-3.1.x): release 3.0.1 #13402

Closed

chore(k210): release 3.1.0 #13435

Closed

chore(k210): release 3.1.0 #13462

Closed

RodrigoCMoraes mentioned this pull request Jul 12, 2024

incognia inloco/loki#22

Closed

loki-gh-app bot mentioned this pull request Jul 15, 2024

chore(k211): release 3.1.0 #13521

Closed

loki-gh-app bot mentioned this pull request Jul 22, 2024

chore(k212): release 3.1.0 #13595

Closed

loki-gh-app bot mentioned this pull request Oct 18, 2024

chore(release-3.1.x): release 3.1.3 #14531

Open
