-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
remove running containers in case of a sigterm #8306
Conversation
@rhcarvalho @smarterclayton ptal. part of #8266 (comment) |
@@ -156,15 +156,27 @@ func dockerRun(client DockerClient, createOpts docker.CreateContainerOptions, lo | |||
|
|||
containerName := containerNameOrID(c) | |||
|
|||
// Container was created, so we defer its removal. | |||
defer func() { | |||
// Container was created, so we defer its removal, and also remove it if we get a sigterm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use Kube pkg/util/interrupt handlers for this, specifically Run(func() error)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That function appears to leak go routines, it has no handling for the case where the signal never arrives. (calling Close() does not appear to do anything to terminate the go routine that I see), so it doesn't seem well suited for this use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is design exactly fro this use case. Put your remove in your defer inside run(func() {}). No channels or signals needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, not in the defer, in the handler function to Chain()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interrupt.New(nil, removeContainer).Run(func() {
defer removeContainer()
...
})
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not leak go routines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not leak go routines.
IIUC it does leak a goroutine if there is no signal received.
Code from pkg/util/interrupt/interrupt.go, with my own inline comments:
func (h *Handler) Run(fn func() error) error {
ch := make(chan os.Signal, 1)
signal.Notify(ch, childSignals...)
defer signal.Stop(ch)
go func() {
sig, ok := <-ch // will block forever if we go out of the scope of `Run` without having received a signal.
if !ok {
// unreachable code, no code ever closes ch.
return
}
h.Signal(sig)
}()
defer h.Close()
return fn()
}
The reason it can block forever is that we have no code closing ch
to notify the goroutine to terminate.
After calling signal.Stop(ch)
, we know that the runtime won't send anything through the ch
channel, so the receive operation will block forever, leaking the goroutine ∎ (Q.E.D.)
I'll send a patch to Kube.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Patch -> kubernetes/kubernetes#23674
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line got a bit buried in the discussion above. Would be nice to get it to what we have in the source-to-image PR:
Container was created, so we defer its removal, and also remove it if we get a SIGINT/SIGTERM/SIGQUIT.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do.
@smarterclayton reworked, but probably should not merge until your fix for interrupt.go is merged into origin, or else builds are going to leak a lot of goroutines. |
How? The build runs once then exits? On Wed, Mar 30, 2016 at 4:49 PM, Ben Parees [email protected]
|
sorry, you're right, i was thinking about this as running in the main process, but it's in a pod. |
if err := client.Logs(logsOpts); err != nil { | ||
return fmt.Errorf("streaming logs of %q: %v", containerName, err) | ||
} | ||
return interrupt.New(nil, removeContainer).Run(func() error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bparees could you please verify that using Run
doesn't affect the glog messages in the terminal?
I don't expect them to be affected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does not appear to:
I0331 23:00:14.457942 1 dockerutil.go:173] Starting container "openshift_s2i-build_ruby-sample-build-3_p1_post-commit" ...
I0331 23:00:14.720660 1 dockerutil.go:180] Streaming logs of container "openshift_s2i-build_ruby-sample-build-3_p1_post-commit" with options {Container:92fd0861d8531010703f53391ef5850ee1027e33147f44773a347f7d3d410019 OutputStream:0xc20802e008 ErrorStream:0xc20802e010 Follow:true Stdout:true Stderr:true Since:0 Timestamps:false Tail: RawTerminal:false} ...
/opt/rh/rh-ruby22/root/usr/bin/ruby -I"lib" -I"/opt/app-root/src/bundle/ruby/gems/rake-10.3.2/lib" "/opt/app-root/src/bundle/ruby/gems/rake-10.3.2/lib/rake/rake_test_loader.rb" "test/*_test.rb"
Run options: --seed 28810
# Running:
.
Finished in 0.001928s, 518.5598 runs/s, 518.5598 assertions/s.
1 runs, 1 assertions, 0 failures, 0 errors, 0 skips
I0331 23:00:16.057152 1 dockerutil.go:186] Waiting for container "openshift_s2i-build_ruby-sample-build-3_p1_post-commit" to stop ...
I0331 23:00:16.623796 1 dockerutil.go:163] Removing container "openshift_s2i-build_ruby-sample-build-3_p1_post-commit" ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't want to be called for bike-shedding, but dockerRun
would be easier to read if we'd give a name to the function we pass to interrupt...Run
.
func removeContainer(name, id string) { ... }
func startWaitContainer(name, id string, logOpts ...) { ... }
func dockerRun(client DockerClient, createOpts docker.CreateContainerOptions, ...) ... {
// ...
// Start and wait for the container to run, and ensure we call removeContainer even if
// the program catches a SIGINT/SIGTERM/SIGQUIT signal.
return interrupt.New(nil, removeContainer).Run(func() error {
return startWaitContainer(...)
})
// or maybe:
//startWaitContainer := func() error { return startWaitContainer(...) }
//return interrupt.New(nil, removeContainer).Run(startWaitContainer)
}
I have in mind that using interrupt
as a "better defer" is not obvious for readers familiar with conventional Go code, so it's good to make it clear what's going on for our future selves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm happy to make it a named function (i've asked others to do the same before), but i can't make it a first class function w/ args because the InterruptHandler isn't designed to take functions w/ args. so it needs to be a closure.
@bparees for sanity, I think it would be desirable to ensure Right now it can be executed by both the deferred call and the code from the interrupt package, and they are not mutually exclusive. |
My previous comment made me wonder what happens when the process receives two SIGINT signals in a row? The first one will trigger |
The wait package handles run once. On Thu, Mar 31, 2016 at 7:59 AM, Rodolfo Carvalho [email protected]
|
There should be no defer inside of the function, chain guarantees execution On Thu, Mar 31, 2016 at 10:23 AM, Clayton Coleman [email protected]
|
I was going off @smarterclayton's example code:
but i see that Close (not Chain) guarantees execution of all handlers in the event no signal was received. Which frankly, I find weird. If i register a handler for signals, i expect to get called when a signal happens, not under other circumstances. but regardless, i'll remove the defer, so that should resolve @rhcarvalho's concern about removeContainer being called twice. |
Run's contract isn't about signals, just exiting a critical section On Thu, Mar 31, 2016 at 5:36 PM, Ben Parees [email protected]
|
then it's weird to call the arguments to run "handlers" (or "notifies" as i guess the variable is named notify in the constructor). the closest thing they appear to be is finalizers. (or defers that respect sigterm) |
@rhcarvalho @smarterclayton comments addressed, i think. |
@bparees kubernetes/kubernetes#23643 was not merged yet... shouldn't we have a Apart from #8306 (comment) to document the usage of the interrupt package as a defer replacement, and the commentary #8306 (comment), LGTM. |
this won't merge until post-3.2 i don't think (unless @smarterclayton feels otherwise about the importance of having this clean up in place) and regardless we don't actually leak go routines even without the patch because we just run this stuff once within the builder pod and then the builder pod dies, so there's no danger of accumulating go routines (as pointed out to me by clayton). So i think we can just wait for the normal processes to bring it in.
will add a little more verbiage around that. |
(again using a loose definition of leak by which i mean, they won't accumulate) |
@rhcarvalho comments addressed. |
// the interrupt handler acts as a super-defer which will guarantee removeContainer is executed | ||
// either when startWaitContainer finishes, or when a SIGQUIT/SIGINT/SIGTERM is received. | ||
return interrupt.New(nil, removeContainer).Run(func() error { | ||
return startWaitContainer() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
startWaitContainer
is a func() error
, we don't need to wrap it in an anonymous function:
return interrupt.New(nil, removeContainer).Run(startWaitContainer)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
@rhcarvalho @smarterclayton I think this is ready for a final review, but i don't intend to merge it for 3.2. |
LGTM |
Approved |
[merge] risk should be low since this is a mainline path that would fail tests if it were broken. |
[Test]ing while waiting on the merge queue |
Evaluated for origin test up to 11bfe8e |
continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/2788/) |
continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/5539/) (Image: devenv-rhel7_3928) |
router flake. |
Evaluated for origin merge up to 11bfe8e |
No description provided.