
Garbage collector deleted deployer pod prematurely #13995

Closed · ncdc opened this issue May 2, 2017 · 48 comments · Fixed by #14582

Comments


ncdc commented May 2, 2017

Forking from #13943 (comment)

In at least one CI run, the kube garbage collector controller decided to delete the ipfailover-1-deploy pod shortly after it was created and scheduled to a node. It looks like it fell through to here, which resulted in the deletion.

cc @Kargakis @mfojtik @smarterclayton @sttts @deads2k @liggitt @soltysh

@0xmichalis

@mfojtik probably unrelated, but I just realized that you set the DC ref on the deployer pods... it should be the RC.
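
For context, a minimal sketch of what that suggestion implies, assuming current client-go/apimachinery types; the helper name is hypothetical, not the actual origin code:

package deployerref

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// rcOwnerRef is a hypothetical helper: the deployer pod's owner reference
// should point at the ReplicationController backing the deployment, not at
// the DeploymentConfig.
func rcOwnerRef(rc *corev1.ReplicationController) metav1.OwnerReference {
	isController := true
	return metav1.OwnerReference{
		APIVersion:         "v1",
		Kind:               "ReplicationController",
		Name:               rc.Name,
		UID:                rc.UID, // GC matches the owner by UID, not only by name/kind
		Controller:         &isController,
		BlockOwnerDeletion: &isController,
	}
}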


mfojtik commented May 2, 2017

Hrm, I thought the deployer pods already had the owner ref set.


ncdc commented May 2, 2017

They do, but GC couldn't find it?


ncdc commented May 2, 2017

I ran oc delete ...; oadm ipfailover in a loop and was not able to reproduce locally.


mfojtik commented May 29, 2017

@ncdc #13996 should properly set the owner reference for the deployer pod and also for the lifecycle pods.

@Kargakis OK to close this, or is there something left (besides dc->rc)?

@0xmichalis

IIUC, the problem wasn't that we were setting the wrong Kind in the owner reference, since the GC only cares about the UID. If that's correct, then this issue is not fixed?


liggitt commented Jun 2, 2017

> the GC only cares about the UID. If that's correct, then this issue is not fixed?

I don't think that is correct... GC cares about kind as well, right @deads2k ?

@0xmichalis

I am pretty sure we (@mfojtik and I) tested this and it was just the UID.


deads2k commented Jun 2, 2017

> I am pretty sure we (@mfojtik and I) tested this and it was just the UID.

If it's just the UID, that sounds like a bug.


mfojtik commented Jun 6, 2017

@deads2k @Kargakis I guess this was fixed in #13996; can we close this?

@0xmichalis

> @deads2k @Kargakis I guess this was fixed in #13996; can we close this?

IIUC, the problem wasn't that we were setting the wrong Kind in the owner reference, since the GC only cares about the UID. If that's correct, then this issue is not fixed?


mfojtik commented Jun 6, 2017

@Kargakis ah, you're right. Also, I think I can reproduce this when I create the registry/router DC right after the master server starts; will dig deeper into this.


mfojtik commented Jun 6, 2017

@deads2k I did a little investigation, as I can reproduce it when I run a fresh new cluster and create the router right after it starts (so it might be a cache issue).

What happens here is that the DC controller creates the RC "router-1" right after the server starts. Right after that, the "router-1-deploy" pod is created with its ownerRef set to the UID of the "router-1" RC.
Then GC decides to remove the "router-1" RC (for unknown reasons ;-). That means the "router-1-deploy" pod is also automatically deleted.
Then the DC controller recreates the RC "router-1" automatically, but since the deployer pod is now gone, it fails.

Relevant logs:

I0605 21:32:27.404937   31478 wrap.go:42] DELETE /api/v1/namespaces/default/replicationcontrollers/router-1: (4.472203ms) 200 [[openshift/v1.6.1+5115d708d7 (linux/amd64) kubernetes/5115d70/system:serviceaccount:kube-system:generic-garbage-collector] 192.168.64.3:57376]
I0605 21:32:27.417135   31478 wrap.go:42] DELETE /api/v1/namespaces/default/pods/router-1-deploy: (4.681127ms) 200 [[openshift/v1.6.1+5115d708d7 (linux/amd64) kubernetes/5115d70/system:serviceaccount:kube-system:generic-garbage-collector] 192.168.64.3:57376]
I0605 21:32:31.440483   31478 kubelet.go:1829] SyncLoop (DELETE, "api"): "router-1-deploy_default(ae4bcfb5-4a25-11e7-8329-a268e445cf32)"
I0605 21:32:31.450378   31478 kubelet.go:1829] SyncLoop (DELETE, "api"): "router-1-deploy_default(ae4bcfb5-4a25-11e7-8329-a268e445cf32)"
I0605 21:32:31.452215   31478 wrap.go:42] DELETE /api/v1/namespaces/default/pods/router-1-deploy: (5.924044ms) 200 [[openshift/v1.6.1+5115d708d7 (linux/amd64) kubernetes/5115d70] 192.168.64.3:57373]

This is when the RC is gone:

I0605 21:32:27.412241   31478 garbagecollector.go:327] classify references of [v1/Pod, namespace: default, name: router-1-deploy, uid: ae4bcfb5-4a25-11e7-8329-a268e445cf32].
I0605 21:32:27.412268   31478 garbagecollector.go:373] delete object [v1/Pod, namespace: default, name: router-1-deploy, uid: ae4bcfb5-4a25-11e7-8329-a268e445cf32] with Default

The RC now has an ownerRef to the DC, so it should not be removed. Also, this failed before the RC had any ownerRef, so I assume there is a bug in GC and it is somehow related to cache population.
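
For readers following along, a rough reduction of the GC behavior under discussion, assuming the upstream classification logic works roughly like this (names here are illustrative, not the actual garbage collector code): every owner reference is looked up live, owners that cannot be found are classified as dangling, and a dependent whose references are all dangling becomes deletable.

package gcsketch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// classify is an illustrative reduction of the GC's reference
// classification; lookup stands in for a live GET against the API server.
func classify(refs []metav1.OwnerReference, lookup func(metav1.OwnerReference) (bool, error)) (solid, dangling []metav1.OwnerReference, err error) {
	for _, ref := range refs {
		found, lookupErr := lookup(ref)
		if lookupErr != nil {
			return nil, nil, lookupErr
		}
		if found {
			solid = append(solid, ref) // owner exists: keep the dependent
		} else {
			dangling = append(dangling, ref) // owner missing: dependent is deletable
		}
	}
	return solid, dangling, nil
}

The failure mode described above is lookup answering "not found" for an owner that does exist, which turns a healthy ownership chain into a cascade of deletions.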

@mfojtik mfojtik assigned deads2k and unassigned mfojtik Jun 6, 2017
@mfojtik mfojtik self-assigned this Jun 6, 2017

liggitt commented Jun 6, 2017

@mfojtik what ownerRefs does the DC controller create the RC with?


mfojtik commented Jun 8, 2017

@liggitt

    ownerReferences:
    - apiVersion: v1
      blockOwnerDeletion: true
      controller: true
      kind: DeploymentConfig
      name: router
      uid: dab1530b-4c33-11e7-80cd-a268e445cf32

But this was broken even before we set the ownerRef (the ownerRef to the DC was added recently by @tnozicka).


mfojtik commented Jun 8, 2017

(the UUID is correct)


mfojtik commented Jun 8, 2017

@liggitt @deads2k for what it's worth, I also found this in the logs:

I0608 12:27:19.896080    4273 wrap.go:42] GET /api/v1/namespaces/default/deploymentconfigs/docker-registry: (4.714444ms) 404 [[openshift/v1.6.1+5115d708d7 (linux/amd64) kubernetes/010d313/system:serviceaccount:kube-system:generic-garbage-collector] 192.168.64.3:54776]
I0608 12:27:19.896023    4273 wrap.go:42] GET /api/v1/namespaces/default/deploymentconfigs/router: (3.568338ms) 404 [[openshift/v1.6.1+5115d708d7 (linux/amd64) kubernetes/010d313/system:serviceaccount:kube-system:generic-garbage-collector] 192.168.64.3:54776]
....
I0608 12:27:19.896545    4273 garbagecollector.go:219] object 07d0f2ec-4c35-11e7-8bbb-a268e445cf32's owner v1/DeploymentConfig, router is not found
I0608 12:27:19.896555    4273 garbagecollector.go:327] classify references of [v1/ReplicationController, namespace: default, name: router-1, uid: 07d0f2ec-4c35-11e7-8bbb-a268e445cf32].
solid: []v1.OwnerReference(nil)
dangling: []v1.OwnerReference{v1.OwnerReference{APIVersion:"v1", Kind:"DeploymentConfig", Name:"router", UID:"07238217-4c35-11e7-8bbb-a268e445cf32", Controller:(*bool)(0xc425f163b0), BlockOwnerDeletion:(*bool)(0xc425f163b1)}}
waitingForDependentsDeletion: []v1.OwnerReference(nil)

Seems like the GC does not see that UUID (although the RC would never have been created unless that DC existed).
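
Note the request path in those 404 log lines: the lookup for the DC goes to /api/v1, the kube core API. A hedged sketch of why that can happen, assuming the usual kube conventions for mapping an apiVersion to a request path (the function below is illustrative, not the actual dynamic client code): a group-less apiVersion such as "v1" resolves to the core prefix, where deploymentconfigs are not served.

package gcsketch

import "strings"

// ownerLookupPath illustrates how an owner reference's apiVersion becomes a
// URL prefix: a group-less "v1" maps to the core path /api/v1, while a
// "group/version" value maps to /apis/group/version.
func ownerLookupPath(apiVersion, namespace, resource, name string) string {
	prefix := "/api/" + apiVersion // core ("legacy") API
	if strings.Contains(apiVersion, "/") {
		prefix = "/apis/" + apiVersion // group-qualified API
	}
	return prefix + "/namespaces/" + namespace + "/" + resource + "/" + name
}

Here ownerLookupPath("v1", "default", "deploymentconfigs", "router") yields /api/v1/namespaces/default/deploymentconfigs/router, exactly the path that returns 404 in the log above even though the DC exists.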


jupierce commented Jun 9, 2017

DC & RC yaml + events available: http://file.rdu.redhat.com/~jupierce/share/free-int-data.txt


0xmichalis commented Jun 9, 2017 via email

@smarterclayton

Also, this is HA etcd, so it's extremely likely that race conditions happen; it may be that this bug happens on every deployment, not just some.


mfojtik commented Jun 9, 2017

@Kargakis #13995 (comment)

In a nutshell:

  1. controller creates RC-1
  2. controller creates deployer-1
  3. deployer-1 has an ownerRef to RC-1
  4. GC removes RC-1 <- because the GC client returns 404 for the DC
  5. controller creates RC-1 again
  6. deployer-1 now points to the dead RC-1 UUID
  7. GC removes deployer-1

We traced this with @sttts down to the storage layer: when the GC tries to GET the DC (using the dynamic client), it gets a 404. It hits rest.go, so etcd is giving us the 404.


mfojtik commented Jun 9, 2017

@Kargakis @tnozicka does rollback require a running deployer pod? If yes, we will be screwed that way as well. If not, then I wonder why the rollback did not kick in.

@0xmichalis

> We traced this with @sttts down to the storage layer: when the GC tries to GET the DC (using the dynamic client), it gets a 404. It hits rest.go, so etcd is giving us the 404.

So my initial guess (GC+dynamic client) wasn't off? :)


mfojtik commented Jun 9, 2017

@Kargakis close enough, but it isn't the dynamic client; we actually hit the REST endpoint and then storage gives us the 404.

@smarterclayton

Yes, the quorum discussion will have to continue.

@0xmichalis

> @Kargakis @tnozicka does rollback require a running deployer pod? If yes, we will be screwed that way as well. If not, then I wonder why the rollback did not kick in.

Normal rollback, yes; automatic rollback in case of failure, no.


mfojtik commented Jun 9, 2017

@smarterclayton till we have quorum on quorum?


mfojtik commented Jun 9, 2017

@Kargakis why did the automatic failure rollback not kick in? (You are more familiar with that code.)

@0xmichalis

> @Kargakis why did the automatic failure rollback not kick in? (You are more familiar with that code.)

In the example you gave here?


mfojtik commented Jun 9, 2017

@Kargakis yes, that is the problem we are seeing in free-int. Rolling back will at least allow us to proceed with the upgrade (not sure what state that will leave prod in).

@0xmichalis

In order for automatic rollback to work, a complete deployment needs to exist in the history of a DC.


mfojtik commented Jun 9, 2017

@Kargakis router-384 doesn't seem like an initial deployment.

@0xmichalis

@mfojtik not sure that's helpful. Do you have controller logs somewhere and the actual manifests? router-384 doesn't seem to be included in Justin's link.

@smarterclayton

All of the DCs already existed.


mfojtik commented Jun 9, 2017

@Kargakis I was answering the question of whether a previous deployment needs to exist :-)

My question is why the automatic rollback on failure has not kicked in...

I don't have controller/API master logs, but it is easy to reproduce locally: just delete the deployer pod in the middle of the deployment. What @smarterclayton is saying is that in such a case we should roll back automatically and not leave the deployment in a failed state (in this case with 0 pods running).


jupierce commented Jun 9, 2017

@smarterclayton

Let's spawn the "didn't rollback" problem as a separate high-priority issue.

@0xmichalis

That's what I was going to suggest - opened #14561

@0xmichalis

Two out of three controller manager instances failed on leader election for more than 2 hours, from 18:10 up until 20:38 (EOF), and there is an unusual number of dropped watches and TLS handshake errors in the API servers.


mfojtik commented Jun 11, 2017

I wonder if we can enable more verbose logging in free-int, to see at least the REST requests, which would give us a timeline of events (why and by whom the RC/deployer was deleted).

@jupierce

Which RC? I deleted the docker-registry RC in this environment earlier, sometime on June 8th, to force a redeployment. The problem observed in free-int appears to be affecting all DCs, not just docker-registry.


mfojtik commented Jun 12, 2017

Seems like the wrong apiVersion in the ownerRef was used: #14582
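
A sketch of the shape of such a fix, assuming the repair is to stamp owner references from a fully qualified GroupVersionKind rather than a hand-written apiVersion string; the exact coordinates used are in #14582, and the group/version below are illustrative:

package ownerref

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// dcControllerRef builds the owner reference from a full GroupVersionKind,
// keeping apiVersion and kind consistent with where the owner is actually
// served, so the GC's lookup resolves to the right API path. The group and
// version here are illustrative, not necessarily what #14582 uses.
func dcControllerRef(dc metav1.Object) *metav1.OwnerReference {
	gvk := schema.GroupVersionKind{Group: "apps.openshift.io", Version: "v1", Kind: "DeploymentConfig"}
	return metav1.NewControllerRef(dc, gvk)
}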

@stevekuznetsov

Router/registry rollout confirmation for Ansible is here: openshift/openshift-ansible#4402
