Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

deploy: set ownerRef from RC to grouped API version #14582

Merged
merged 1 commit into from
Jun 12, 2017

Conversation

mfojtik
Copy link
Contributor

@mfojtik mfojtik commented Jun 12, 2017

@smarterclayton @sttts this should fix the deployer gone missing problem (I believe).

Some explanation:

When the GC cache is cold, the GC will try to use a live GET using the dynamic client to check if the resource specified in the ownerRef exists. In this case the RC is pointing to a DC. However, the dynamic client use wrong API prefix (/api instead of /oapi) it this will return 404 and the GC thinks that the DC is gone and mark the RC for deletion. The RC is deleted and since the deployer pod has an ownerRef pointing to this RC the pod is also terminated and deleted.
When the GC caches warm up, it will see the DC (UUID) and so it will not do live GET lookup and keep the RC untouched, allowing successfull rollout. I guess we were just lucky in our CI/CD and we warmed the cache fast enough to not hit this (the ipfailover flaked)...

What this patch does is to force API group version for the DC, which the dynamic client should handle (i can confirm it returns 200).

Fixes: #13995

@mfojtik
Copy link
Contributor Author

mfojtik commented Jun 12, 2017

@tnozicka @smarterclayton do we need migration for RC that have already legacy API group set or the refManager will patch them again with the correct ref?

@mfojtik
Copy link
Contributor Author

mfojtik commented Jun 12, 2017

[test]

@mfojtik
Copy link
Contributor Author

mfojtik commented Jun 12, 2017

To confirm this works:

I0612 11:56:06.641729    4880 wrap.go:42] GET /apis/apps.openshift.io/v1/namespaces/default/deploymentconfigs/docker-registry: (2.215821ms) 200 [[openshift/v1.6.1+5115d708d7 (linux/amd64) kubernetes/010d313/system:serviceaccount:kube-system:generic-garbage-collector] 192.168.64.3:47245]
I0612 11:56:06.641792    4880 wrap.go:42] GET /apis/apps.openshift.io/v1/namespaces/default/deploymentconfigs/router: (2.318231ms) 200 [[openshift/v1.6.1+5115d708d7 (linux/amd64) kubernetes/010d313/system:serviceaccount:kube-system:generic-garbage-collector] 192.168.64.3:47245]

Now the GC client gets the DC successfully.

@mfojtik
Copy link
Contributor Author

mfojtik commented Jun 12, 2017

@bparees fyi, i guess the dynamic client will be broken for all ownerRefs that build* creates?

@mfojtik
Copy link
Contributor Author

mfojtik commented Jun 12, 2017

@jwforres @spadgett dunno if this affects web console or not.

Copy link
Contributor

@tnozicka tnozicka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@openshift-bot
Copy link
Contributor

Evaluated for origin test up to 0824d22

@deads2k
Copy link
Contributor

deads2k commented Jun 12, 2017

lgtm [merge]

@deads2k
Copy link
Contributor

deads2k commented Jun 12, 2017

@mfojtik let's try to think of a way to make sure that these cause errors, not disasters. Seems like using a RESTMapper without our /oapi types could be a winner.

@mfojtik
Copy link
Contributor Author

mfojtik commented Jun 12, 2017

@smarterclayton i think this was 85% of the friday's problem, the rest 15% was quorum read which we should address as well.

@openshift-bot
Copy link
Contributor

openshift-bot commented Jun 12, 2017

continuous-integration/openshift-jenkins/merge Waiting: You are in the build queue at position: 5

@openshift-bot
Copy link
Contributor

Evaluated for origin merge up to 0824d22

@pweil-
Copy link
Contributor

pweil- commented Jun 12, 2017

@jupierce FYI

@tnozicka
Copy link
Contributor

I don't think we need migration for this as this was not in any released version, yet. (Also this should kind of migrate itself over time as those old RCs get deleted by mistake new RCs gets created properly, with history loss. This should happen only on start up [because of caches] not causing almost any downtime since containers were not really started yet anyways.)

@openshift-bot
Copy link
Contributor

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/2129/) (Base Commit: c09c601)

@smarterclayton
Copy link
Contributor

Don't we have to do this for builds and everything else?

@mfojtik
Copy link
Contributor Author

mfojtik commented Jun 12, 2017

@smarterclayton @bparees was already warned

@smarterclayton
Copy link
Contributor

Force merging to clear the queue, does not conflict with currently merging PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Garbage collector deleted deployer pod prematurely
6 participants