Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make entrypoint script fail if any step fails, remove unnecessary command substitution #839

Merged
merged 2 commits into from
Feb 12, 2020

Conversation

drakedevel
Copy link
Contributor

Issue #, if available: None

Description of changes:

The entrypoint script added in #735 is not run with set -e, so if any of the commands in it fails (such as installing the CNI), the script will silently succeed. This is a regression, since the previous code would catch this and error out.

I ran into this when I noticed some nodes on our cluster running 1.6rc6 were failing to start pods with CNI errors, despite having an apparently-healthy CNI pod running. Upon looking at the pod's logs, I saw the following:

2020-02-11T17:51:20.908196958+00:00 stdout P starting IPAM daemon in background ... 
2020-02-11T17:51:20.908344789+00:00 stdout F ok.
2020-02-11T17:51:20.908357137+00:00 stdout P checking for IPAM connectivity ... 
2020-02-11T17:51:40.933888193+00:00 stdout F ok.
2020-02-11T17:51:40.933888193+00:00 stdout P copying CNI plugin binaries and config files ... 
2020-02-11T17:51:41.056950001+00:00 stderr F cp: error writing '/host/opt/cni/bin/aws-cni': Cannot allocate memory
2020-02-11T17:51:41.056950001+00:00 stderr F cp: failed to extend '/host/opt/cni/bin/aws-cni': Cannot allocate memory
2020-02-11T17:51:41.064360344+00:00 stdout F  ok.
2020-02-11T17:51:41.064424913+00:00 stdout F foregrounding IPAM daemon ... 

The issue turned out to be on our end: an overly-aggressive memory limit (which worked fine on rc4, but the shell script approach increases the peak memory required by quite a bit). But nevertheless, cp failed, and the script blindly charged on, when it should probably have crashed (which would have drawn my attention to it much more quickly).

The main change I made to the script was to add set -e, which will make the script fail if any command it invokes fails (if not in the context of an if or || or similar). While I was at it, I made two additional changes that aren't causing issues but are probably good ideas:

  • set -u, which will make the script abort on undefined variable access (helps with typos)
  • Two places in the code used $(...) (command substitution) when they don't intend to execute the output of the command inside. One instance was completely unnecessary, the other I replaced with { } to group the commands.

I'm happy to revert either or both of these extra changes, but they jumped out at me while I was making the first one so I figured I'd submit them.

I've tested this change on our development cluster and it doesn't seem to have any issues. Additionally, I simulated an issue where it couldn't write to the aws-cni binary (by creating a bogus symlink), and confirmed that the pod failed to start with the following log:

starting IPAM daemon in background ... ok.
checking for IPAM connectivity ... ok.
copying CNI plugin binaries and config files ... cp: not writing through dangling symlink '/host/opt/cni/bin/aws-cni'

which is the expected outcome with this change.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@mogren mogren requested a review from jaypipes February 12, 2020 04:16
@mogren
Copy link
Contributor

mogren commented Feb 12, 2020

@drakedevel Thanks for testing this! I've compared rc4 and rc6, and the main difference is that the aws-cni is 11 MB in rc4, but has ballooned to 26 MB in rc6. I'll take a look at what has caused this increase. Since rc5 is even bigger at 27 MB, I wonder if it was the adding the k8s.io/cri-api that increased that size that much.

Copy link
Contributor

@jaypipes jaypipes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good changes, thanks @drakedevel and thanks for the excellent commit message! :)

@jaypipes jaypipes merged commit 44ee4d6 into aws:release-1.6 Feb 12, 2020
@drakedevel
Copy link
Contributor Author

@jaypipes thanks for the quick review!

Would it be helpful for me to submit a PR for a cherry-pick to master, or is that handled another way?

@drakedevel drakedevel deleted the entrypoint-improvements branch February 12, 2020 21:17
@jaypipes
Copy link
Contributor

Gah! I didn't realize this was proposed for merging directly into a release branch... Generally we try to only accept PRs into the master branch and then cherry-pick into the release branch.

No worries, though. Yes, please propose the same PR to master

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants