Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Engine healthcheck: deal with empty uuid file #2926

Merged
merged 1 commit into from
Dec 21, 2022
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -2,11 +2,29 @@

set -o errexit

. /usr/libexec/os-helpers-logging

BALENAD_SOCKET="/run/balena.sock"
CONTAINERD_SOCKET="/run/balena-engine/containerd/balena-engine-containerd.sock"

# Check if balena-engine-daemon is responding.
curl --fail --unix-socket $BALENAD_SOCKET http:/v1.40/_ping > /dev/null 2>&1

# Due to a non-atomic file creation and writing operation in containerd, we
# sometimes end up with an empty `uuid` file. This causes `ctr version` (and
# hence the health check) to fail. We therefore remove this file if it is
# present and empty. See https://github.com/balena-os/balena-engine/issues/322
UUID_FILE="/mnt/data/docker/containerd/daemon/io.containerd.grpc.v1.introspection/uuid"
if [ -f "$UUID_FILE" -a ! -s "$UUID_FILE" ]; then
warn "removing empty $UUID_FILE"
rm -f "$UUID_FILE"
fi

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lmbarros would logging something in the journal not work (and persisting on request if needed)? I know this is meant to be temporary but the writing logs to disk without persistent logging enabled makes me uneasy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexgg, I expected this would be controversial. :-)

Here's my reasoning:

  • We don't know when these events happen -- and in the case of the the second one (ctr version creating an empty uuid file), we don't even know if they ever happen. Asking a user to enable persistent logging after the fact would almost certainly mean that we'd lose data.
  • I am not logging to persistent storage in any way that will cause those files to get larger over time. I'll, at most, create two files each one a few bytes long (it's just the date of the last event).
    • I considered appending to these files, but decided keep just the last event time to avoid the growing files. (Side note: we can also get the date of the first event, by stating these files to get their creation date.)
  • Based on what we have seen with users and in my manual testing, these events are quite rare and don't happen again (or even more rarely do) after we remove the empty uuid file. So, I don't think we'd be adding significantly to storage media wearing.

The code is already logging to the journal. I can remove the creation of the breadcrumb files, but I think it could make it quite harder for us to get to the root cause.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @lmbarros

asking a user to enable persistent logging after the fact

We ask the user to enable persistent logging, reproduce, and then provide the logs

I am not logging to persistent storage in any way that will cause those files to get larger over time. I'll, at most, create two files each one a few bytes long (it's just the date of the last event).

I fail to see how this is better that having all the events logged in persistent logs

I don't think we'd be adding significantly to storage media wearing.

Let's not speculate, if we don't write there is no media wear. If we do write, we don't have data to tell.

Could you summarize why file logging is better in any way than enabling persistent logging before the event?

Copy link
Contributor

@alexgg alexgg Dec 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But also, I don't think logging these will help us debug the root cause. We will also need extra debug in containerd to understand when the empty file is being created, and we need those events to be in context with other system events to understand for example whether containerd is killed. So enabling persistent logging to debug the root cause is something we will need anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @alexgg,

We ask the user to enable persistent logging, reproduce, and then provide the logs

The problem is that we don't know how to reproduce. My goal here is to collect data about a rare event that would be otherwise lost. If we know how to reproduce I wouldn't bother with the breadcrumbs.

But also, I don't think logging these will help us debug the root cause. We will also need extra debug in containerd to understand when the empty file is being created, and we need those events to be in context with other system events to understand for example whether containerd is killed. So enabling persistent logging to debug the root cause is something we will need anyway.

I largely, but not completely agree here.

First, all evidence we have so far points that this is a rare event. What do we do after asking a user to enable persistent logging? Maybe we remove the uuid or replace it with an empty one to try to force the issue happening? There's a very high probability the result will be a fully working device and we don't learn anything. (Maybe you have some different procedure in mind that what I said.)

Second, I would be particularly interested in learning that ctr version could complete successfully and still generate an empty uuid file. This is a sanity check: I don't think this can possibly happen, but if it ever does it means that there's something important out of our radar. I am not sure we'd be able to detect this from inside containerd.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another way to look at it: what's the worst thing that could happen if we leave the breadcrumbs creation in? Even if all our beliefs about how rare this is are completely wrong, we'll make one or two writes to disk every time the health checks run. This is not good, granted, but for years we did much worse than that (when we used to run a container as part of the health check, this causes even more writes to disk).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @lmbarros there are several things I don't like:

  • We are creating these files for all device types, on all fleets. This is just not good practice. Even if I agree the wear is probably not going to be a problem, it's just not a good example to commit.
  • You are basically saying that persistent logging is not good enough to debug this problem, and I don't see this. It's the mechanism that has been agreed to have in the product, so if it has shortcomings we need to discuss those in arch calls and agree to fix them.

But there might be something that persistent logs are missing that I can't see. I'll open a thread in Zulip to discuss how you view the debugging.

# Check if balena-engine-containerd is responding.
balena-engine-containerd-ctr --address $CONTAINERD_SOCKET version > /dev/null 2>&1

# The uuid file is expected to exist and be non-empty after `ctr version`. If
# this is not the case, log the event.
if [ -f "$UUID_FILE" -a ! -s "$UUID_FILE" ]; then
warn "$UUID_FILE empty after 'ctr version'"
fi