Garbage collector deleted deployer pod prematurely #13995
@mfojtik probably unrelated, but I just realized that you set the DC ref on the deployer pods... it should be the RC. |
hrm. i thought the deployer pods already have owner ref set. |
It does, but GC couldn't find it? |
I ran |
IIUC, the problem wasn't that we were setting the wrong Kind in the owner reference since the GC only cares about the UID. If that's correct, then this issue is not fixed? |
I don't think that is correct... GC cares about kind as well, right @deads2k ? |
I am pretty sure we (@mfojtik and I) tested this and it was just the UID. |
If it's just the UID, that sounds like a bug. |
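For context on the UID-vs-kind question above, here is a simplified sketch of the garbage collector's owner check (this is not the actual kube GC code; names and types are illustrative). The key nuance is that apiVersion and kind are used to resolve the REST endpoint for the owner lookup, while the UID is only compared after the live owner has been fetched, so a wrong kind/apiVersion can surface as a 404 before any UID comparison happens:

```go
package main

import "fmt"

// ownerReference holds the fields the GC uses; a simplified stand-in
// for metav1.OwnerReference.
type ownerReference struct {
	APIVersion string // used (with Kind) to resolve the lookup endpoint
	Kind       string
	Name       string
	UID        string // compared against the live object's UID
}

// shouldDeleteDependent sketches the core decision: if the owner lookup
// 404s, the dependent is considered orphaned and deleted; if the owner
// exists but carries a different UID (same name, recreated object), the
// dependent is deleted too.
func shouldDeleteDependent(ownerFound bool, liveUID string, ref ownerReference) bool {
	if !ownerFound {
		// Lookup returned 404: owner treated as gone, delete the dependent.
		return true
	}
	// Owner exists; only now does the UID matter.
	return liveUID != ref.UID
}

func main() {
	ref := ownerReference{APIVersion: "v1", Kind: "ReplicationController", Name: "router-1", UID: "abc-123"}
	fmt.Println(shouldDeleteDependent(false, "", ref))       // owner lookup 404s -> delete
	fmt.Println(shouldDeleteDependent(true, "abc-123", ref)) // live UID matches -> keep
}
```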
@Kargakis ah you're right. also I think I can reproduce this when I create the registry/router DC right after the master server starts, will dig deeper into this. |
@deads2k I did a little investigation, since I can reproduce this on a fresh cluster when I create the router right after it starts (so it might be a cache issue). What happens here is that the DC controller creates the RC "router-1" right after the server starts. Immediately afterwards, the "router-1-deploy" pod is created with its ownerRef set to the UID of the "router-1" RC. Relevant logs:
This is when the RC is gone:
The RC now has an ownerRef to the DC, so it should not be removed; also, this failed before the RC had any ownerRef, so I assume there is a bug in the GC and it is somehow related to cache population. |
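For reference, the ownership chain being described looks roughly like this (names and UID are illustrative, not taken from the actual logs): the deployer pod points at the RC, and the RC in turn points at the DC.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: router-1-deploy
  ownerReferences:
  - apiVersion: v1
    kind: ReplicationController   # the owner the GC resolves and UID-checks
    name: router-1
    uid: 52d8cc3a-0000-0000-0000-000000000000   # illustrative; must match the live RC's UID
```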
@mfojtik what ownerRefs does the DC controller create the RC with? |
(the UUID is correct) |
@liggitt @deads2k for what is worth I also found this in logs:
Seems like the GC does not see that UID (although the RC would never have been created if that DC did not exist). |
DC & RC yaml + events available: http://file.rdu.redhat.com/~jupierce/share/free-int-data.txt |
Justin, can you post the API server logs too? It seems that all the posted
deployments failed due to missing deployers!
|
Also this is an HA etcd so it's extremely likely that race conditions happen - so it may be that this bug happens on every deployment, not just some. |
in a nutshell:
We traced this with @sttts down to the storage layer: when the GC tries to GET the DC (using the dynamic client), this returns 404 and it hits rest.go, so etcd is giving us the 404. |
So my initial guess (GC+dynamic client) wasn't off? :) |
@Kargakis close enough, but it isn't dynamic client, we actually hit the REST endpoint and then storage gives us 404. |
Yes, the quorum discussion is to be continued. |
@smarterclayton till we have quorum on quorum? |
@Kargakis why did the automatic failure rollback not kick in? (you are more familiar with that code) |
@Kargakis yes, that is the problem we are seeing in free-int; rolling back will at least allow us to proceed with the upgrade (not sure what state that will leave prod in) |
In order for automatic rollback to work, a complete deployment needs to exist in the history of a DC. |
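The precondition above can be sketched as a tiny hypothetical helper (not the actual deployment controller code; phase strings only loosely mirror the real deployment phases): automatic rollback needs at least one complete deployment in the DC's history to roll back to.

```go
package main

import "fmt"

// canAutoRollback is a hypothetical helper illustrating the rule:
// automatic rollback is only possible if the DC's history contains at
// least one deployment that completed successfully.
func canAutoRollback(history []string) bool {
	for _, phase := range history {
		if phase == "Complete" {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canAutoRollback([]string{"Complete", "Failed"})) // a prior complete rollout exists
	fmt.Println(canAutoRollback([]string{"Failed"}))             // nothing to roll back to
}
```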
@Kargakis |
@mfojtik not sure that's helpful. Do you have controller logs somewhere and the actual manifests? router-384 doesn't seem to be included in Justin's link. |
All of the DCs already existed. |
@Kargakis I was answering the question of whether a previous deployment needs to exist :-) my question is why the automatic rollback on failure had not kicked in... I don't have controller/API master logs, but it is easy to reproduce locally: just delete the deployer pod in the middle of a deployment. What @smarterclayton is saying is that in such a case we should roll back automatically and not leave the deployment in a failed state (in this case with 0 pods running). |
@Kargakis Here is a snapshot of logs: http://file.rdu.redhat.com/~jupierce/share/free-int-logs.tgz |
Let's spawn the "didn't rollback" as a separate high priority issue. |
That's what I was going to suggest - opened #14561 |
Two out of three controller manager instances fail on leader election for more than 2 hours, from 18:10 up until 20:38 (EOF), and there is an unusual amount of dropped watches and TLS handshake errors in the API servers. |
I wonder if we can enable more verbose logging in free-int, to see at least the REST requests, which would give us a timeline of events (why and by whom the RC/deployer was deleted) |
Which RC? I deleted docker-registry RC in this environment before to force a redeployment. On June 8th sometime. This problem observed in free-int appears to be affecting all DCs, not just docker-registry. |
Seems like the wrong apiVersion in ownerRef was used: #14582 |
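An illustration of how a wrong apiVersion in the ownerRef can cause this (values are hypothetical, not copied from the actual fix): because the GC derives the owner GET endpoint from apiVersion + kind, a mismatched group/version can 404 even though the owner exists, and the UID check never runs.

```yaml
ownerReferences:
- apiVersion: extensions/v1beta1   # wrong group/version for an RC -> owner lookup 404s
  kind: ReplicationController      # correct would be apiVersion: v1
  name: router-1
  uid: 1d0a2b3c-0000-0000-0000-000000000000   # never compared if the GET fails
```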
Router/registry rollout confirmation for Ansible is here: openshift/openshift-ansible#4402 |
Forking from #13943 (comment)
In at least one CI run, the kube garbage collector controller decided to delete the ipfailover-1-deploy pod shortly after it was created and scheduled to a node. It looks like it fell through to here, which resulted in the deletion.
cc @Kargakis @mfojtik @smarterclayton @sttts @deads2k @liggitt @soltysh