
panic starting shared informer #15035

Closed
bparees opened this issue Jul 4, 2017 · 15 comments
Labels
component/restapi, kind/bug, priority/P2

Comments

@bparees
Contributor

bparees commented Jul 4, 2017

I0704 14:25:49.346171   23086 start_master.go:764] Started "disruption"
I0704 14:25:49.346182   23086 start_master.go:754] Starting "certificatesigningrequests"
E0704 14:25:49.346204   23086 runtime.go:66] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:72
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/asm_amd64.s:479
/usr/local/go/src/runtime/panic.go:458
/usr/local/go/src/runtime/panic.go:62
/usr/local/go/src/runtime/sigpanic_unix.go:24
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/disruption/disruption.go:383
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/disruption/disruption.go:311
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/disruption/disruption.go:130
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/client-go/tools/cache/controller.go:192
<autogenerated>:54
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/client-go/tools/cache/shared_informer.go:547
/usr/local/go/src/runtime/asm_amd64.s:2086
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x199f0cb]

goroutine 26867 [running]:
panic(0x4671740, 0xc420010150)
	/usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x126
panic(0x4671740, 0xc420010150)
	/usr/local/go/src/runtime/panic.go:458 +0x243
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/disruption.(*DisruptionController).getPdbForPod(0xc42acf0000, 0xc4235d1200, 0x17)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/disruption/disruption.go:383 +0x5b
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/disruption.(*DisruptionController).addPod(0xc42acf0000, 0x4e195c0, 0xc4235d1200)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/disruption/disruption.go:311 +0x10f
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/disruption.(*DisruptionController).(github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/disruption.addPod)-fm(0x4e195c0, 0xc4235d1200)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/disruption/disruption.go:130 +0x3e
github.com/openshift/origin/vendor/k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd(0xc4396c2d40, 0xc4396c2d60, 0xc4396c2d70, 0x4e195c0, 0xc4235d1200)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/client-go/tools/cache/controller.go:192 +0x49
github.com/openshift/origin/vendor/k8s.io/client-go/tools/cache.(*ResourceEventHandlerFuncs).OnAdd(0xc43b302f80, 0x4e195c0, 0xc4235d1200)
	<autogenerated>:54 +0x78
github.com/openshift/origin/vendor/k8s.io/client-go/tools/cache.(*processorListener).run(0xc42dbfee70, 0xc4200e4d20)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/client-go/tools/cache/shared_informer.go:547 +0x3db
created by github.com/openshift/origin/vendor/k8s.io/client-go/tools/cache.(*sharedIndexInformer).AddEventHandlerWithResyncPeriod
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/client-go/tools/cache/shared_informer.go:327 +0x357

https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin_integration/4149/

@bparees added the kind/bug and priority/P1 labels on Jul 4, 2017
@bparees
Contributor Author

bparees commented Jul 4, 2017

Possibly this is just a test flake (a setup issue), but it should initially be triaged as something more significant, I think.

@smarterclayton
Contributor

@mfojtik this could be a memory race (use before the initializing goroutine completes); not sure how else this could happen.

@mfojtik
Contributor

mfojtik commented Jul 4, 2017

Yeah, it seems like the pdbLister is nil, which is pretty weird... do we wait for the shared informers to be initialized before we start the PDB controllers?

I'm off on vacation tomorrow, so I probably won't have the bandwidth to investigate further.
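
(For reference on the "do we wait" question above: the usual client-go idiom is to start the shared informer factory and block on WaitForCacheSync before running workers. The sketch below is illustrative only, assuming a clientset is already available; note it guards against acting on unsynced caches, not against a lister field being assigned after a handler is registered, which is the shape discussed later in this thread.)

package informersync // illustrative package name

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// runWithSyncedInformers sketches the usual ordering: build the factory,
// wire up listers and handlers, start the factory, then block until the
// caches have synced before letting controller workers run.
func runWithSyncedInformers(client kubernetes.Interface, stopCh <-chan struct{}) {
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	pods := factory.Core().V1().Pods()

	// ... controllers would register event handlers / grab listers here ...

	factory.Start(stopCh)
	if !cache.WaitForCacheSync(stopCh, pods.Informer().HasSynced) {
		return // caches never synced before stopCh closed
	}
	// ... only now start the controller worker goroutines ...
}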

@ironcladlou
Contributor

A quick scan of the controller code shows the controller accepts a shared informer (a PodInformer) and adds an event handler which relies on state that isn't initialized until after the event handler is added. Because the informer lifecycle isn't managed by the controller, it seems the handler could be invoked before the handler's dependent state is initialized.
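
(To make that shape concrete, here is a minimal toy sketch of the suspected race; the informer, lister, and controller types below are stand-ins, not the real client-go or DisruptionController code.)

package main

import (
	"fmt"
	"time"
)

// fakeInformer stands in for a shared informer that is already running:
// it may deliver events to a handler as soon as the handler is registered.
type fakeInformer struct{}

func (i *fakeInformer) AddEventHandler(h func(obj string)) {
	go h("pod-1") // delivery happens on another goroutine, possibly immediately
}

type pdbLister struct{ name string }

// controller mirrors the suspect pattern: the handler closes over a field
// that is only assigned after the handler has been registered.
type controller struct{ lister *pdbLister }

func newController(inf *fakeInformer) *controller {
	c := &controller{}
	inf.AddEventHandler(func(obj string) {
		// If this fires before c.lister is assigned below, it dereferences
		// nil -- the same shape as the getPdbForPod panic in the trace above.
		fmt.Println("handling", obj, "via", c.lister.name)
	})
	c.lister = &pdbLister{name: "pdbLister"} // assigned after registration
	return c
}

func main() {
	newController(&fakeInformer{})
	time.Sleep(100 * time.Millisecond)
}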

I wouldn't be surprised if the race exists for other handlers attached to PodInformer or PodDisruptionBudgetInformer.

Will try to get a simple reproducer going today before I'm out on PTO tomorrow.

@ironcladlou
Contributor

Looks like most/all of the controllers are started and run in goroutines prior to shared informer startup (see the disruption controller specifically).

Since NewDisruptionController is executed in a goroutine, it seems at least possible that sharedInformers.Start could be called between AddEventHandlerWithResyncPeriod and the state assignment. Unlikely, but possible.

@ironcladlou
Contributor

I think there's a subtle but critical flaw in my assessment: the async executions look like go New().Run(), which means New is executed in the calling goroutine and only Run is executed in a new goroutine. This means the struct should be fully initialized and the event handlers added synchronously before the informers can be started.
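
(A tiny illustration of that evaluation order: in a go statement, the function value and its receiver expression are evaluated in the calling goroutine, so New() below runs synchronously and only Run() runs asynchronously. The names are toy stand-ins.)

package main

import (
	"fmt"
	"time"
)

type ctrl struct{ ready bool }

// New plays the role of NewDisruptionController: it does all of its setup
// (registering handlers, assigning state) and returns, in the caller's goroutine.
func New() *ctrl {
	fmt.Println("New: runs in the calling goroutine")
	return &ctrl{ready: true}
}

// Run is the only part that executes on the spawned goroutine.
func (c *ctrl) Run() {
	fmt.Println("Run: ready =", c.ready) // always true here
}

func main() {
	// Per the Go spec, the receiver expression New() in `go New().Run()` is
	// evaluated in the calling goroutine; only the Run call is deferred to
	// the new goroutine.
	go New().Run()
	time.Sleep(100 * time.Millisecond)
}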

Pretty sure that invalidates my theory.

@ironcladlou
Contributor

The next line of inquiry: is one of the controllers itself starting a SharedInformer?

@ironcladlou
Contributor

The error seems to imply that a SharedInformer for Pod was started before the call to NewDisruptionController, but so far I can't reason through how that could be possible in either the upstream or origin controller init code. Shifting focus to reproducing.

@ironcladlou
Contributor

It just dawned on me that the original issue was reported in the context of an integration test, which can do all sorts of potentially hazardous things with informers during custom controller bootstrapping logic. The original linked job is long gone, but it seems entirely possible the issue is some custom code in a test's setup as @bparees originally suggested.

@ironcladlou
Contributor

One more note for anybody else picking this up later: based on the reported stack trace, the controller init occurred via start_master.go, which in the context of an integration test could be invoked via StartConfiguredMasterWithOptions; that path doesn't give the caller any apparent means to influence or inject the Informers created internally within start_master.go. So I'm not even sure yet how any test bootstrapping code could be responsible.

@mfojtik
Contributor

mfojtik commented Jul 12, 2017

@ironcladlou thanks, I can pick this up and review the integration bootstrapping; it seems like we are just lucky that the tests are running... Also, I would not consider this a blocker if it is just an integration-test bootstrapping issue. @smarterclayton agree?

@smarterclayton
Contributor

smarterclayton commented Jul 12, 2017 via email

@ironcladlou
Contributor

Thanks. I wish I had more log context for this one: if anybody notices it again before we solve it, please try to capture the full logs here.

@ironcladlou
Contributor

The fix for kubernetes/kubernetes#51013 will probably solve this.

@ironcladlou
Contributor

Pretty sure this is resolved by:

kubernetes/kubernetes@e63fcf7
kubernetes/kubernetes#51633 (xref #16617)

Please re-open if it happens again.
