
oomparser: update to use kmsg based parser #1544

Merged: 2 commits merged into google:master on Aug 30, 2017

Conversation

@euank (Collaborator) commented Nov 20, 2016

My comment for PR kubernetes/node-problem-detector#41 (comment) applies here too.

This would fix kubernetes/kubernetes#34965, probably fix #1484, and work around or fix #645.

Things that should be considered before merging:

  1. Anywhere the kubelet or cAdvisor runs in a container now needs access to /dev/kmsg. This should be documented as a breaking change, and various examples and docs may need updating.
  2. This slightly reduces how far back we can see (after a system reboot, we don't get events from before it). That means a "kubelet crash -> container OOM -> node reboot" chain of events can lose that container OOM.
  3. This works anywhere the kernel is modern enough to run Docker! Fewer dependencies!

Testing note: I removed the log-file-formatted fixture files. Parsing log files is now the responsibility of the kmsgparser package, which lives elsewhere, so we can trust it to do the right thing; we also get to assume the timestamp, etc.
I have not done manual testing of this change, but I have done decent manual testing of the kmsgparser.
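
For reference, here is a minimal sketch of the new flow. The go-kmsg-parser import path and the entry field names (Message, Timestamp) are assumptions based on what the diff shows, and the regex is illustrative rather than the exact one in the oomparser:

```go
package main

import (
	"fmt"
	"regexp"

	"github.com/euank/go-kmsg-parser/kmsgparser"
)

// Illustrative pattern for the kernel's "Killed process ..." line; the real
// oomparser keeps its own set of regexes.
var killedRegexp = regexp.MustCompile(`Killed process (\d+) \((.+)\) total-vm`)

func main() {
	// NewParser opens /dev/kmsg, which is why a containerized kubelet or
	// cadvisor now needs access to that device node.
	parser, err := kmsgparser.NewParser()
	if err != nil {
		panic(err)
	}
	for entry := range parser.Parse() {
		if m := killedRegexp.FindStringSubmatch(entry.Message); m != nil {
			// kmsg entries carry their own timestamps, so there is no longer a
			// log-file line format to parse them out of.
			fmt.Printf("OOM kill of pid %s (%s) at %v\n", m[1], m[2], entry.Timestamp)
		}
	}
}
```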

@timstclair (Contributor) commented:

Nice! This LGTM, but I'd like to get input from a few more people who understand the OOM pathways better than I do.

@timstclair (Contributor) commented:

@vishh - Could you take a look at this?

}()
// StreamOoms writes to a provided a stream of OomInstance objects representing
// OOM events that are found in the logs.
// It will blocks and should be called from a goroutine.
Review comment (Contributor):

nit: s/blocks/block/

ioreader: bufio.NewReader(tail),
}, nil
// Should not happen
glog.Warningf("exiting analyzeLines. OOM events will not be reported.")
Review comment (Contributor):

nit: If this really should not happen it should be error level

Reply from @euank (Collaborator, Author):

Before it was info and I thought it should be error, so I picked the middle as a compromise 😇

"[ 1532] 1020 1532 410347 398810 788 0 0 badsysprogram",
"Out of memory: Kill process 1532 (badsysprogram) score 919 or sacrifice child",
"Killed process 1532 (badsysprogram) total-vm:1641388kB, anon-rss:1595164kB, file-rss:76kB",
},
Review comment (Contributor):

Add a test case with multiple OOM events (note: I think writeAll needs to run in a goroutine to prevent deadlock)

Reply from @euank (Collaborator, Author):

Done

parser, err = tryLogFile()
if err == nil {
return parser, nil
parser, err := kmsgparser.NewParser()
Review comment (Contributor):

nit: I like the logger pattern, consider setting the logger to use glog for consistency?
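
For what it's worth, a rough sketch of a glog-backed adapter; the Infof/Warningf/Errorf method set and the SetLogger call are assumptions about the go-kmsg-parser logger interface, not something this diff confirms:

```go
package oomparser

import "github.com/golang/glog"

// glogAdapter forwards kmsgparser's internal logging to glog.
type glogAdapter struct{}

func (glogAdapter) Infof(format string, args ...interface{})    { glog.V(4).Infof(format, args...) }
func (glogAdapter) Warningf(format string, args ...interface{}) { glog.Warningf(format, args...) }
func (glogAdapter) Errorf(format string, args ...interface{})   { glog.Errorf(format, args...) }

// Hypothetical wiring, assuming the parser exposes a SetLogger method:
//   parser.SetLogger(glogAdapter{})
```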

// OOM events that are found in the logs.
// It will blocks and should be called from a goroutine.
func (self *OomParser) StreamOoms(outStream chan<- *OomInstance) {
kmsgEntries := self.parser.Parse()
Comment from @euank (Collaborator, Author) on this diff:

Question: Should we call self.parser.SeekEnd() here or not?

I think SeekEnd more closely matches the journalctl .. -f behavior of before (start at the end of kmsg), but I think that looking a little into the past makes the kubelet behave more correctly after a restart.

Will OOM events that happen while the kubelet is down, but are detected on kubelet startup, actually do the right thing?
Are they idempotent, so that power-cycling the kubelet and generating bunches of duplicate events is okay?

Reply (Contributor):

I think reporting recent OOM events is desirable, and it's also consistent with the non-journald behavior.
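
To make the trade-off concrete, here is a sketch of the two start positions under discussion. Only NewParser, SeekEnd, and Parse are mentioned in this thread; the import path and the Message type name are assumptions:

```go
package oomparser

import "github.com/euank/go-kmsg-parser/kmsgparser"

func newKmsgStream(tailOnly bool) (<-chan kmsgparser.Message, error) {
	parser, err := kmsgparser.NewParser()
	if err != nil {
		return nil, err
	}
	if tailOnly {
		// Matches the old `journalctl ... -f` behavior: start at the end of
		// the kmsg ring buffer and ignore anything already in it.
		parser.SeekEnd()
	}
	// Without SeekEnd, Parse replays whatever is still in the buffer, so OOMs
	// from shortly before a kubelet/cadvisor restart are reported again.
	return parser.Parse(), nil
}
```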

@euank (Collaborator, Author) commented Nov 21, 2016

@timstclair addressed your comments and fixed a bug where an OOM event for a process name containing a '-' would break the parser (it wouldn't match the end line for that OOM). Lucky me that my test program has a dash 😄

Happy to break that one-line fix out into its own PR if you'd like
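
For context, a sketch of the kind of one-line pattern change being described here; the regexes are illustrative, not the exact ones in the diff:

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	line := "Killed process 1532 (docker-containerd) total-vm:1641388kB, anon-rss:1595164kB, file-rss:76kB"

	// `\w+` stops at '-' or '.', so process names like "docker-containerd"
	// or "python3.4" never match the end-of-trace line.
	narrow := regexp.MustCompile(`^Killed process (\d+) \((\w+)\)`)
	// Widening the capture group fixes that.
	wide := regexp.MustCompile(`^Killed process (\d+) \((.+)\)`)

	fmt.Println(narrow.MatchString(line)) // false
	fmt.Println(wide.MatchString(line))   // true
}
```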

@euank (Collaborator, Author) commented Dec 9, 2016

Rebased and squashed the commit addressing comments from back whenever.

@goettl79 commented:

This fix would also make the packaging issue of cAdvisor/journalctl inside a Docker container obsolete (#1313).

@euank (Collaborator, Author) commented Jan 23, 2017

Bump on this. I did a rote rebase, and as far as I can tell there are no open concerns (other than waiting on whether @vishh feels like reviewing).

@euank (Collaborator, Author) commented Jan 31, 2017

@k8s-bot test this

@goettl79 left a review comment:

Looks great to me. Most of the essential work is implemented in the dependency https://github.com/euank/go-kmsg-parser.

From my point of view this change should be integrated, although I'm not an official reviewer.

@dchen1107 (Collaborator) commented:

cc/ @timstclair @dashpole @vishh

@dashpole requested a review from vishh on March 15, 2017.
euank added a commit to euank/cadvisor that referenced this pull request Jul 27, 2017
The oomparser logic would end up stuck, unable to detect the end of a
given OOM trace, for any process with a name that didn't match \w+.

This includes processes like 'python3.4' due to the '.', or
'docker-containerd' due to the '-'.

This fix was included in PR google#1544 last year, but since that PR seems
dead it seems like a good idea to break this more important fix out.

I've updated the tests such that they would have caught this issue.
@euank (Collaborator, Author) commented Aug 30, 2017

Rebased for probably the 10th or so time.

Manually tested via the following:

$ sudo ./cadvisor -v 3 -logtostderr
...
# cgroup OOM
$ docker run -m 15M euank/gunpowder-memhog:latest 30M
# cadvisor logs:
I0830 20:24:01.875221    3264 manager.go:1142] Created an OOM event in container "/docker/6da85dc9675db12149e970c6ef6835240bab712e5499963d13e88a446248e82b" at 2017-08-30 20:24:01.583571597 +0000 UTC

# system OOM
$ docker run -d euank/gunpowder-memhog 500M
$ sleep 5
$ echo f | sudo tee /proc/sysrq-trigger
# cadvisor logs
I0830 20:59:27.831699    1611 manager.go:1142] Created an OOM event in container "/" at 2017-08-30 20:59:27.717686391 +0000 UTC

@timstclair if this is going to land, can we land it?
Having to come by and rebase it once a month is ridiculous.

@vishh (Contributor) commented Aug 30, 2017

@euank I can help review this. Is this ready for a review?

This provides much more robust support for kernel logs by accessing the
`/dev/kmsg` interface directly.
@euank (Collaborator, Author) commented Aug 30, 2017

@vishh Yup, it's been ready for review for 9 months now.

@vishh vishh merged commit 03d7288 into google:master Aug 30, 2017
@euank euank deleted the kmsg-parser branch August 30, 2017 22:45
@euank (Collaborator, Author) commented Aug 30, 2017

🎉

corneliusweig referenced this pull request in corneliusweig/kubernetes-lxd Nov 15, 2020
This is required because recent versions of containerd try to read from
this file. Older LXC versions had a dedicated setting to create this
symlink, so we have to rely on systemd here.