
Garbage collector deleted deployer pod prematurely #13995

Closed · ncdc opened this issue May 2, 2017 · 48 comments · Fixed by #14582

Comments


ncdc commented May 2, 2017

Forking from #13943 (comment)

In at least one CI run, the kube garbage collector controller decided to delete the ipfailover-1-deploy pod shortly after it was created and scheduled to a node. It looks like it fell through to here, which resulted in the deletion.

cc @Kargakis @mfojtik @smarterclayton @sttts @deads2k @liggitt @soltysh

@0xmichalis

@mfojtik probably unrelated, but I just realized that you set the DC ref on the deployer pods... it should be the RC.
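
For context, a minimal sketch of what that suggestion implies, assuming current client-go/apimachinery types; the helper name is hypothetical, not the actual origin code:

package deployerref

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// rcOwnerRef is a hypothetical helper: the deployer pod's owner reference
// should point at the ReplicationController backing the deployment, not at
// the DeploymentConfig.
func rcOwnerRef(rc *corev1.ReplicationController) metav1.OwnerReference {
	isController := true
	return metav1.OwnerReference{
		APIVersion:         "v1",
		Kind:               "ReplicationController",
		Name:               rc.Name,
		UID:                rc.UID, // GC matches the owner by UID, not only by name/kind
		Controller:         &isController,
		BlockOwnerDeletion: &isController,
	}
}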


mfojtik commented May 2, 2017

Hrm, I thought the deployer pods already had the owner ref set.


ncdc commented May 2, 2017

They do, but GC couldn't find it?


ncdc commented May 2, 2017

I ran oc delete ...; oadm ipfailover in a loop and was not able to reproduce locally.


mfojtik commented May 29, 2017

@ncdc #13996 should properly set the owner reference for the deployer pod and also for the lifecycle pods.

@Kargakis OK to close this, or is there something left (besides dc->rc)?

@0xmichalis

IIUC, the problem wasn't that we were setting the wrong Kind in the owner reference, since the GC only cares about the UID. If that's correct, then this issue is not fixed?


liggitt commented Jun 2, 2017

> the GC only cares about the UID. If that's correct, then this issue is not fixed?

I don't think that is correct... GC cares about kind as well, right @deads2k ?

@0xmichalis

I am pretty sure we (@mfojtik and I) tested this and it was just the UID.


deads2k commented Jun 2, 2017

> I am pretty sure we (@mfojtik and I) tested this and it was just the UID.

If it's just the UID, that sounds like a bug.


mfojtik commented Jun 6, 2017

@deads2k @Kargakis I guess this was fixed in #13996; can we close this?

@0xmichalis

> @deads2k @Kargakis I guess this was fixed in #13996; can we close this?

IIUC, the problem wasn't that we were setting the wrong Kind in the owner reference, since the GC only cares about the UID. If that's correct, then this issue is not fixed?


mfojtik commented Jun 6, 2017

@Kargakis ah, you're right. Also, I think I can reproduce this when I create the registry/router DC right after the master server starts; will dig deeper into this.


mfojtik commented Jun 6, 2017

@deads2k I did a little investigation, as I can reproduce it when I run a fresh new cluster and create the router right after it starts (so it might be a cache issue).

What happens here is that the DC controller creates the RC "router-1" right after the server starts. Right after that, the "router-1-deploy" pod is created with its ownerRef set to the UID of the "router-1" RC.
Then GC decides to remove the "router-1" RC (for unknown reasons ;-). That means the "router-1-deploy" pod is also automatically deleted.
Then the DC controller recreates the RC "router-1" automatically, but since the deployer pod is now gone, it fails.

Relevant logs:

I0605 21:32:27.404937   31478 wrap.go:42] DELETE /api/v1/namespaces/default/replicationcontrollers/router-1: (4.472203ms) 200 [[openshift/v1.6.1+5115d708d7 (linux/amd64) kubernetes/5115d70/system:serviceaccount:kube-system:generic-garbage-collector] 192.168.64.3:57376]
I0605 21:32:27.417135   31478 wrap.go:42] DELETE /api/v1/namespaces/default/pods/router-1-deploy: (4.681127ms) 200 [[openshift/v1.6.1+5115d708d7 (linux/amd64) kubernetes/5115d70/system:serviceaccount:kube-system:generic-garbage-collector] 192.168.64.3:57376]
I0605 21:32:31.440483   31478 kubelet.go:1829] SyncLoop (DELETE, "api"): "router-1-deploy_default(ae4bcfb5-4a25-11e7-8329-a268e445cf32)"
I0605 21:32:31.450378   31478 kubelet.go:1829] SyncLoop (DELETE, "api"): "router-1-deploy_default(ae4bcfb5-4a25-11e7-8329-a268e445cf32)"
I0605 21:32:31.452215   31478 wrap.go:42] DELETE /api/v1/namespaces/default/pods/router-1-deploy: (5.924044ms) 200 [[openshift/v1.6.1+5115d708d7 (linux/amd64) kubernetes/5115d70] 192.168.64.3:57373]

This is when the RC is gone:

I0605 21:32:27.412241   31478 garbagecollector.go:327] classify references of [v1/Pod, namespace: default, name: router-1-deploy, uid: ae4bcfb5-4a25-11e7-8329-a268e445cf32].
I0605 21:32:27.412268   31478 garbagecollector.go:373] delete object [v1/Pod, namespace: default, name: router-1-deploy, uid: ae4bcfb5-4a25-11e7-8329-a268e445cf32] with Default

The RC now has an ownerRef to the DC, so it should not be removed. Also, this failed before the RC had any ownerRef, so I assume there is a bug in GC and it is somehow related to cache population.
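
For readers following along, a rough reduction of the GC behavior under discussion, assuming the upstream classification logic works roughly like this (names here are illustrative, not the actual garbage collector code): every owner reference is looked up live, owners that cannot be found are classified as dangling, and a dependent whose references are all dangling becomes deletable.

package gcsketch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// classify is an illustrative reduction of the GC's reference
// classification; lookup stands in for a live GET against the API server.
func classify(refs []metav1.OwnerReference, lookup func(metav1.OwnerReference) (bool, error)) (solid, dangling []metav1.OwnerReference, err error) {
	for _, ref := range refs {
		found, lookupErr := lookup(ref)
		if lookupErr != nil {
			return nil, nil, lookupErr
		}
		if found {
			solid = append(solid, ref) // owner exists: keep the dependent
		} else {
			dangling = append(dangling, ref) // owner missing: dependent is deletable
		}
	}
	return solid, dangling, nil
}

The failure mode described above is lookup answering "not found" for an owner that does exist, which turns a healthy ownership chain into a cascade of deletions.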

@mfojtik mfojtik assigned deads2k and unassigned mfojtik Jun 6, 2017
@mfojtik mfojtik self-assigned this Jun 6, 2017

liggitt commented Jun 6, 2017

@mfojtik what ownerRefs does the DC controller create the RC with?


mfojtik commented Jun 8, 2017

@liggitt

    ownerReferences:
    - apiVersion: v1
      blockOwnerDeletion: true
      controller: true
      kind: DeploymentConfig
      name: router
      uid: dab1530b-4c33-11e7-80cd-a268e445cf32

But this was broken even before we set the ownerRef (the ownerRef to the DC was added recently by @tnozicka).


mfojtik commented Jun 8, 2017

(the UUID is correct)


mfojtik commented Jun 8, 2017

@liggitt @deads2k for what it's worth, I also found this in the logs:

I0608 12:27:19.896080    4273 wrap.go:42] GET /api/v1/namespaces/default/deploymentconfigs/docker-registry: (4.714444ms) 404 [[openshift/v1.6.1+5115d708d7 (linux/amd64) kubernetes/010d313/system:serviceaccount:kube-system:generic-garbage-collector] 192.168.64.3:54776]
I0608 12:27:19.896023    4273 wrap.go:42] GET /api/v1/namespaces/default/deploymentconfigs/router: (3.568338ms) 404 [[openshift/v1.6.1+5115d708d7 (linux/amd64) kubernetes/010d313/system:serviceaccount:kube-system:generic-garbage-collector] 192.168.64.3:54776]
....
I0608 12:27:19.896545    4273 garbagecollector.go:219] object 07d0f2ec-4c35-11e7-8bbb-a268e445cf32's owner v1/DeploymentConfig, router is not found
I0608 12:27:19.896555    4273 garbagecollector.go:327] classify references of [v1/ReplicationController, namespace: default, name: router-1, uid: 07d0f2ec-4c35-11e7-8bbb-a268e445cf32].
solid: []v1.OwnerReference(nil)
dangling: []v1.OwnerReference{v1.OwnerReference{APIVersion:"v1", Kind:"DeploymentConfig", Name:"router", UID:"07238217-4c35-11e7-8bbb-a268e445cf32", Controller:(*bool)(0xc425f163b0), BlockOwnerDeletion:(*bool)(0xc425f163b1)}}
waitingForDependentsDeletion: []v1.OwnerReference(nil)

Seems like the GC does not see that UUID (although the RC would never have been created unless that DC existed).
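
Note the request path in those 404 log lines: the lookup for the DC goes to /api/v1, the kube core API. A hedged sketch of why that can happen, assuming the usual kube conventions for mapping an apiVersion to a request path (the function below is illustrative, not the actual dynamic client code): a group-less apiVersion such as "v1" resolves to the core prefix, where deploymentconfigs are not served.

package gcsketch

import "strings"

// ownerLookupPath illustrates how an owner reference's apiVersion becomes a
// URL prefix: a group-less "v1" maps to the core path /api/v1, while a
// "group/version" value maps to /apis/group/version.
func ownerLookupPath(apiVersion, namespace, resource, name string) string {
	prefix := "/api/" + apiVersion // core ("legacy") API
	if strings.Contains(apiVersion, "/") {
		prefix = "/apis/" + apiVersion // group-qualified API
	}
	return prefix + "/namespaces/" + namespace + "/" + resource + "/" + name
}

Here ownerLookupPath("v1", "default", "deploymentconfigs", "router") yields /api/v1/namespaces/default/deploymentconfigs/router, exactly the path that returns 404 in the log above even though the DC exists.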


jupierce commented Jun 9, 2017

DC & RC yaml + events available: http://file.rdu.redhat.com/~jupierce/share/free-int-data.txt


0xmichalis commented Jun 9, 2017 via email

@smarterclayton

Also, this is HA etcd, so it's extremely likely that race conditions happen; it may be that this bug happens on every deployment, not just some.


mfojtik commented Jun 9, 2017

@Kargakis #13995 (comment)

In a nutshell:

  1. controller creates RC-1
  2. controller creates deployer-1
  3. deployer-1 has an ownerRef to RC-1
  4. GC removes RC-1 <- because the GC client returns 404 for the DC
  5. controller creates RC-1 again
  6. deployer-1 now points to the dead RC-1 UUID
  7. GC removes deployer-1

We traced this with @sttts down to the storage layer: when the GC tries to GET the DC (using the dynamic client), it gets a 404. It hits rest.go, so etcd is giving us the 404.


mfojtik commented Jun 9, 2017

@Kargakis @tnozicka does rollback require a running deployer pod? If yes, we will be screwed that way as well. If not, then I wonder why the rollback did not kick in.

@0xmichalis

> We traced this with @sttts down to the storage layer: when the GC tries to GET the DC (using the dynamic client), it gets a 404. It hits rest.go, so etcd is giving us the 404.

So my initial guess (GC+dynamic client) wasn't off? :)


mfojtik commented Jun 9, 2017

@Kargakis close enough, but it isn't the dynamic client; we actually hit the REST endpoint and then storage gives us the 404.

@smarterclayton

Yes, the quorum discussion will have to continue.

@0xmichalis

> @Kargakis @tnozicka does rollback require a running deployer pod? If yes, we will be screwed that way as well. If not, then I wonder why the rollback did not kick in.

Normal rollback, yes; automatic rollback in case of failure, no.


mfojtik commented Jun 9, 2017

@smarterclayton till we have quorum on quorum?


mfojtik commented Jun 9, 2017

@Kargakis why did the automatic failure rollback not kick in? (You are more familiar with that code.)

@0xmichalis

> @Kargakis why did the automatic failure rollback not kick in? (You are more familiar with that code.)

In the example you gave here?


mfojtik commented Jun 9, 2017

@Kargakis yes, that is the problem we are seeing in free-int. Rolling back will at least allow us to proceed with the upgrade (not sure what state that will leave prod in).

@0xmichalis

In order for automatic rollback to work, a complete deployment needs to exist in the history of a DC.


mfojtik commented Jun 9, 2017

@Kargakis router-384 doesn't seem like an initial deployment.

@0xmichalis

@mfojtik not sure that's helpful. Do you have controller logs somewhere and the actual manifests? router-384 doesn't seem to be included in Justin's link.

@smarterclayton

All of the DCs already existed.


mfojtik commented Jun 9, 2017

@Kargakis I was answering the question of whether a previous deployment needs to exist :-)

My question is why the automatic rollback on failure has not kicked in...

I don't have controller/API master logs, but it is easy to reproduce locally: just delete the deployer pod in the middle of the deployment. What @smarterclayton is saying is that in such a case we should roll back automatically and not leave the deployment in a failed state (in this case with 0 pods running).


jupierce commented Jun 9, 2017

@smarterclayton

Let's spawn the "didn't rollback" problem as a separate high-priority issue.

@0xmichalis

That's what I was going to suggest - opened #14561

@0xmichalis

Two out of three controller manager instances failed on leader election for more than 2 hours, from 18:10 up until 20:38 (EOF), and there is an unusual number of dropped watches and TLS handshake errors in the API servers.


mfojtik commented Jun 11, 2017

I wonder if we can enable more verbose logging in free-int, to see at least the REST requests, which would give us a timeline of events (why and by whom the RC/deployer was deleted).

@jupierce

Which RC? I deleted the docker-registry RC in this environment earlier, sometime on June 8th, to force a redeployment. The problem observed in free-int appears to be affecting all DCs, not just docker-registry.


mfojtik commented Jun 12, 2017

Seems like the wrong apiVersion in the ownerRef was used: #14582
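
A sketch of the shape of such a fix, assuming the repair is to stamp owner references from a fully qualified GroupVersionKind rather than a hand-written apiVersion string; the exact coordinates used are in #14582, and the group/version below are illustrative:

package ownerref

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// dcControllerRef builds the owner reference from a full GroupVersionKind,
// keeping apiVersion and kind consistent with where the owner is actually
// served, so the GC's lookup resolves to the right API path. The group and
// version here are illustrative, not necessarily what #14582 uses.
func dcControllerRef(dc metav1.Object) *metav1.OwnerReference {
	gvk := schema.GroupVersionKind{Group: "apps.openshift.io", Version: "v1", Kind: "DeploymentConfig"}
	return metav1.NewControllerRef(dc, gvk)
}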

@stevekuznetsov

Router/registry rollout confirmation for Ansible is here: openshift/openshift-ansible#4402
