Router fails to get endpoints during e2e #8403

Closed · smarterclayton opened this issue Apr 7, 2016 · 40 comments

@smarterclayton (Contributor)

https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/5537/

Happening fairly frequently

There was no error output from the command.
[INFO] Back to 'default' project with 'admin' user...
Running test/end-to-end/core.sh:286: executing 'oc project default/172-18-4-219:8443/system:admin' expecting success...
SUCCESS after 0.140s: test/end-to-end/core.sh:286: executing 'oc project default/172-18-4-219:8443/system:admin' expecting success
Standard output from the command:
Now using project "default" from context named "default/172-18-4-219:8443/system:admin" on server "https://172.18.4.219:8443".
There was no error output from the command.
Running test/end-to-end/core.sh:290: executing 'oc get endpoints router --output-version=v1beta3 --template='{{ if .subsets }}{{ len .subsets }}{{ else }}0{{ end }}'' expecting any result and text '[1-9]+'; re-trying every 0.2s until completion or 300.000s...
FAILURE after 299.580s: test/end-to-end/core.sh:290: executing 'oc get endpoints router --output-version=v1beta3 --template='{{ if .subsets }}{{ len .subsets }}{{ else }}0{{ end }}'' expecting any result and text '[1-9]+'; re-trying every 0.2s until completion or 300.000s: the command timed out
There was no output from the command.
There was no error output from the command.
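
For readers following along, the check at core.sh:290 simply prints the number of endpoint subsets (or 0 when there are none) and the harness retries until the output matches [1-9]+. A minimal, standalone Go sketch of what that template evaluates, using fake data rather than the real v1beta3 Endpoints object:

package main

import (
	"fmt"
	"os"
	"text/template"
)

func main() {
	// The exact template from the e2e check.
	tmpl := template.Must(template.New("count").Parse(
		`{{ if .subsets }}{{ len .subsets }}{{ else }}0{{ end }}`))

	// Evaluate it against two fake endpoints payloads: one with no subsets
	// (what this failing run kept returning) and one with a single subset.
	for _, subsets := range [][]string{nil, {"10.1.2.3:80"}} {
		if err := tmpl.Execute(os.Stdout, map[string]interface{}{"subsets": subsets}); err != nil {
			fmt.Fprintln(os.Stderr, err)
		}
		fmt.Println()
	}
	// Prints "0" then "1"; the test only passes once the output is non-zero.
}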
@smarterclayton (Contributor, Author)

Early in the run:

Using project "default".
There was no error output from the command.
Running test/end-to-end/core.sh:108: executing 'oc whoami' expecting success and text 'e2e-user'...
SUCCESS after 0.095s: test/end-to-end/core.sh:108: executing 'oc whoami' expecting success and text 'e2e-user'
Standard output from the command:
e2e-user
There was no error output from the command.
Running test/end-to-end/core.sh:111: executing 'oc status -n default' expecting success...
SUCCESS after 0.252s: test/end-to-end/core.sh:111: executing 'oc status -n default' expecting success
Standard output from the command:
In project default on server https://172.18.4.219:8443

svc/docker-registry - 172.30.48.168:5000
  dc/docker-registry deploys docker.io/openshift/origin-docker-registry:53ac3df 
    deployment #1 deployed 53 seconds ago - 1 pod

svc/kubernetes - 172.30.0.1 ports 443, 53, 53

svc/router - 172.30.18.235 ports 80, 443, 1936
  dc/router deploys docker.io/openshift/origin-haproxy-router:53ac3df 
    deployment #2 deployed 53 seconds ago - 0 pods
    deployment #1 failed 53 seconds ago: cancelled as a newer deployment was found running - 1 pod
I0407 17:38:27.261965    4763 prober.go:104] Liveness probe for "router-1-0nvm4_default(52d2f7a7-fce7-11e5-b5a6-0e870c302687):router" failed (failure): Get http://localhost:1936/healthz: dial tcp [::1]:1936: getsockopt: connection refused
I0407 17:38:27.262359    4763 server.go:606] Event(api.ObjectReference{Kind:"Pod", Namespace:"default", Name:"router-1-0nvm4", UID:"52d2f7a7-fce7-11e5-b5a6-0e870c302687", APIVersion:"v1", ResourceVersion:"607", FieldPath:"spec.containers{router}"}): type: 'Warning' reason: 'Unhealthy' Liveness probe failed: Get http://localhost:1936/healthz: dial tcp [::1]:1936: getsockopt: connection refused
I0407 17:38:27.262463    4763 prober.go:104] Readiness probe for "router-1-0nvm4_default(52d2f7a7-fce7-11e5-b5a6-0e870c302687):router" failed (failure): Get http://localhost:1936/healthz: dial tcp [::1]:1936: getsockopt: connection refused
I0407 17:38:27.262497    4763 server.go:606] Event(api.ObjectReference{Kind:"Pod", Namespace:"default", Name:"router-1-0nvm4", UID:"52d2f7a7-fce7-11e5-b5a6-0e870c302687", APIVersion:"v1", ResourceVersion:"607", FieldPath:"spec.containers{router}"}): type: 'Warning' reason: 'Unhealthy' Readiness probe failed: Get http://localhost:1936/healthz: dial tcp [::1]:1936: getsockopt: connection refused
I0407 17:38:28.005261    4763 manager.go:1400] Container "a97340019c2b7329c7a8ecbe7f65407b7838b4fe72657b3f6e3eefdfa763d707 router default/router-1-0nvm4" exited after 929.656383ms
I0407 17:38:28.005746    4763 server.go:606] Event(api.ObjectReference{Kind:"Pod", Namespace:"default", Name:"router-1-0nvm4", UID:"52d2f7a7-fce7-11e5-b5a6-0e870c302687", APIVersion:"v1", ResourceVersion:"607", FieldPath:"spec.containers{router}"}): type: 'Normal' reason: 'Killing' Killing container with docker id a97340019c2b: Need to kill pod.
I0407 17:38:28.322312    4763 manager.go:1368] Killing container "32fa2c8be13a0c35bf120ba21264667bd75e5dbc29035a55dcb552e656457c3f default/router-1-0nvm4" with 30 second grace period
I0407 17:38:28.540623    4763 helpers.go:101] Unable to get network stats from pid 12216: couldn't read network stats: failure opening /rootfs/proc/12216/net/dev: open /rootfs/proc/12216/net/dev: no such file or directory
I0407 17:38:28.691022    4763 generic.go:138] GenericPLEG: 52d2f7a7-fce7-11e5-b5a6-0e870c302687/a97340019c2b7329c7a8ecbe7f65407b7838b4fe72657b3f6e3eefdfa763d707: running -> exited
I0407 17:38:28.751013    4763 manager.go:855] Destroyed container: "/system.slice/docker-32fa2c8be13a0c35bf120ba21264667bd75e5dbc29035a55dcb552e656457c3f.scope" (aliases: [k8s_POD.fe97dd62_router-1-0nvm4_default_52d2f7a7-fce7-11e5-b5a6-0e870c302687_e8927fd6 32fa2c8be13a0c35bf120ba21264667bd75e5dbc29035a55dcb552e656457c3f], namespace: "docker")
I0407 17:38:28.751104    4763 handler.go:320] Added event &{/system.slice/docker-32fa2c8be13a0c35bf120ba21264667bd75e5dbc29035a55dcb552e656457c3f.scope 2016-04-07 17:38:28.751090997 +0000 UTC containerDeletion {<nil>}}
I0407 17:38:28.931523    4763 manager.go:1400] Container "32fa2c8be13a0c35bf120ba21264667bd75e5dbc29035a55dcb552e656457c3f default/router-1-0nvm4" exited after 609.176107ms

@Kargakis @ironcladlou should this happen in e2e? Getting two deployments of the router?

@smarterclayton (Contributor, Author)

NM, that's because of how the router install works. It's triggering a reload by changing the env for the syn eater.

@smarterclayton (Contributor, Author)

This seems wrong:

I0407 17:36:47.737760    4763 factory.go:272] About to try and schedule pod router-2-kna27
...
I0407 17:36:56.208057    4763 factory.go:272] About to try and schedule pod router-1-0nvm4

@smarterclayton (Contributor, Author)

I0407 17:36:33.449502    4763 factory.go:272] About to try and schedule pod router-1-deploy
I0407 17:36:33.744055    4763 factory.go:272] About to try and schedule pod router-2-deploy
I0407 17:36:34.531094    4763 factory.go:272] About to try and schedule pod router-1-deploy
I0407 17:36:34.748442    4763 factory.go:272] About to try and schedule pod router-2-deploy
I0407 17:36:36.534555    4763 factory.go:272] About to try and schedule pod router-1-deploy
I0407 17:36:36.752137    4763 factory.go:272] About to try and schedule pod router-2-deploy
I0407 17:36:40.537860    4763 factory.go:272] About to try and schedule pod router-1-deploy
I0407 17:36:40.758626    4763 factory.go:272] About to try and schedule pod router-2-deploy
I0407 17:36:47.737760    4763 factory.go:272] About to try and schedule pod router-2-kna27
I0407 17:36:56.208057    4763 factory.go:272] About to try and schedule pod router-1-0nvm4
I0407 17:36:57.211333    4763 factory.go:272] About to try and schedule pod router-1-0nvm4
I0407 17:36:59.214447    4763 factory.go:272] About to try and schedule pod router-1-0nvm4
I0407 17:37:03.217359    4763 factory.go:272] About to try and schedule pod router-1-0nvm4
I0407 17:37:05.825139    4763 kubelet.go:2430] SyncLoop (PLEG): "router-2-deploy_default(456e572f-fce7-11e5-b5a6-0e870c302687)", event: &pleg.PodLifecycleEvent{ID:"456e572f-fce7-11e5-b5a6-0e870c302687", Type:"ContainerStarted", Data:"1f02c778a6f5825547b
I0407 17:37:06.354535    4763 kubelet.go:2430] SyncLoop (PLEG): "router-1-deploy_default(45435414-fce7-11e5-b5a6-0e870c302687)", event: &pleg.PodLifecycleEvent{ID:"45435414-fce7-11e5-b5a6-0e870c302687", Type:"ContainerStarted", Data:"03befa80c633bc4a41b
I0407 17:37:11.225722    4763 factory.go:272] About to try and schedule pod router-1-0nvm4
I0407 17:37:12.443523    4763 kubelet.go:2430] SyncLoop (PLEG): "router-2-deploy_default(456e572f-fce7-11e5-b5a6-0e870c302687)", event: &pleg.PodLifecycleEvent{ID:"456e572f-fce7-11e5-b5a6-0e870c302687", Type:"ContainerDied", Data:"d41e3a850987f70cdf9db6
I0407 17:37:27.229644    4763 factory.go:272] About to try and schedule pod router-1-0nvm4
I0407 17:37:49.132053    4763 kubelet.go:2430] SyncLoop (PLEG): "router-1-deploy_default(45435414-fce7-11e5-b5a6-0e870c302687)", event: &pleg.PodLifecycleEvent{ID:"45435414-fce7-11e5-b5a6-0e870c302687", Type:"ContainerDied",

@smarterclayton (Contributor, Author)

So it looks like we schedule both deployment pods. The second one(?) starts running and creates at least one pod in the second group. It scales up, then for some reason gets scaled down (failure? death?) at:

I0407 17:36:54.011987    4763 replication_controller.go:451] Too many "default"/"router-2" replicas, need 0, deleting 1

It is incredibly worrying to me that we don't see the logs of the failed pods - it may be because they were empty, but I don't see how the deployment logs could be empty at all.

@smarterclayton (Contributor, Author)

Looks like we had two deployment pods running at the same time. That shouldn't be happening?

smarterclayton assigned pweil- and unassigned knobunc Apr 7, 2016
@smarterclayton (Contributor, Author)

Moving this to @pweil-'s team for triage


pweil- assigned 0xmichalis and unassigned pweil- Apr 7, 2016
@pweil- (Contributor) commented Apr 7, 2016

@Kargakis - I haven't been able to reproduce this yet. Of note in the logs, there were some scheduling issues due to the node not being registered. I tried to reproduce that by deleting my node and re-adding it right after a back-off for scheduling deploy-1 once I had triggered a config change, but that didn't seem to work.

Sending it your way to continue looking. I'll check in in the morning.

@smarterclayton (Contributor, Author) commented Apr 7, 2016 via email

@0xmichalis (Contributor)

The second deployment comes from the config generator

17:36:33.557873    4763 controller.go:89] Updated DeploymentConfig default/router:1 from version 1 to 2 for existing deployment default/router-1

https://github.com/openshift/origin/blob/28e2e753a278d5c07191f8aeae19273fa282b7c5/pkg/deploy/controller/configchange/controller.go#L80-L89

What might have changed the router template between 1 and 2? Or is it probably a race? Looking
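
For context, the decision behind that log line boils down to comparing the pod template recorded on the latest deployment with the config's current template and bumping the version when they differ. A rough, self-contained sketch with stand-in types (not the actual configchange controller code):

package main

import (
	"fmt"
	"reflect"
)

// Minimal stand-ins for the real types.
type podTemplate struct {
	Image string
	Env   map[string]string
}

type deploymentConfig struct {
	Name          string
	LatestVersion int
	Template      podTemplate
}

// maybeBumpVersion mimics the trigger: if the config's current template no
// longer matches the template recorded on the latest deployment, bump
// LatestVersion, which causes a new deployment (router-2) to be created.
func maybeBumpVersion(config *deploymentConfig, deployedTemplate podTemplate) {
	if reflect.DeepEqual(config.Template, deployedTemplate) {
		return // nothing changed, no new deployment
	}
	config.LatestVersion++
	fmt.Printf("config change detected for %s: new version %d\n", config.Name, config.LatestVersion)
}

func main() {
	// Template recorded on router-1 at deploy time.
	deployed := podTemplate{Image: "origin-haproxy-router:53ac3df", Env: map[string]string{}}

	config := &deploymentConfig{Name: "router", LatestVersion: 1, Template: deployed}

	// An env change to the template (as seen later in this thread with
	// DROP_SYN_DURING_RESTART) makes the templates differ...
	config.Template = podTemplate{
		Image: "origin-haproxy-router:53ac3df",
		Env:   map[string]string{"DROP_SYN_DURING_RESTART": "true"},
	}

	// ...so the next sync bumps the version and router-2 appears.
	maybeBumpVersion(config, deployed)
}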

@0xmichalis (Contributor)

Updated deployment default/router-1 status from Failed to Complete (scale: 1)

ouch

@0xmichalis (Contributor)

It seems that the first deployment is cancelled by the deploymentconfig controller because it noticed the second version:

deployment.Annotations[deployapi.DeploymentCancelledAnnotation] = deployapi.DeploymentCancelledAnnotationValue

The deployment correctly switches from New to Failed:

controller.go:234] Updated deployment default/router-1 status from New to Failed (scale: 0)

Later, the already existing deployer is being scheduled

About to try and schedule pod router-1-deploy

The deployer controller and the deployer don't care a lot about the deployment phase so eventually the deployer pod will complete successfully and switch the deployment from Failed to Complete.

Still not sure why the second deployment started - we were probably hitting this all along, but after #8163 we started cleaning up the cancel annotation on successful deployers. I think we should enforce the deployment phase direction for now and fail deployers if a deployment is already failed.
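
A minimal sketch of that "phase direction" guard, with illustrative phase names rather than the real Origin status API: terminal phases admit no further transitions, so the Failed-to-Complete update seen above would be rejected:

package main

import "fmt"

type phase string

const (
	phaseNew      phase = "New"
	phasePending  phase = "Pending"
	phaseRunning  phase = "Running"
	phaseComplete phase = "Complete"
	phaseFailed   phase = "Failed"
)

// allowedTransitions encodes the forward-only ordering of phases.
// Complete and Failed are terminal: nothing is allowed out of them.
var allowedTransitions = map[phase][]phase{
	phaseNew:     {phasePending, phaseRunning, phaseComplete, phaseFailed},
	phasePending: {phaseRunning, phaseComplete, phaseFailed},
	phaseRunning: {phaseComplete, phaseFailed},
}

func canTransition(from, to phase) bool {
	for _, next := range allowedTransitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	// The bad update seen in the logs, "Updated deployment default/router-1
	// status from Failed to Complete", would be rejected by this guard.
	fmt.Println(canTransition(phaseFailed, phaseComplete)) // false
	fmt.Println(canTransition(phaseRunning, phaseComplete)) // true
}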

@0xmichalis (Contributor)

The second deployment happens because of

oc set env dc/router -c router DROP_SYN_DURING_RESTART=true

I think #8417 and #8418 should be enough for this.

@smarterclayton (Contributor, Author)

Saw this happen again https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/14441/artifact/origin/artifacts/test-end-to-end-docker/logs/container-router-1-deploy-deployment.log

I0411 23:36:00.351914       1 deployer.go:194] Scaled down prior deployment default/router-2
I0411 23:36:00.353956       1 deployer.go:200] Deploying default/router-1 for the first time (replicas: 1)
I0411 23:36:00.354481       1 recreate.go:126] Scaling default/router-1 to 1 before performing acceptance check
I0411 23:36:02.385391       1 recreate.go:131] Performing acceptance check of default/router-1
I0411 23:36:02.385576       1 lifecycle.go:445] Waiting 600 seconds for pods owned by deployment "default/router-1" to become ready (checking every 1 seconds; 0 pods previously accepted)

@ironcladlou (Contributor)

@Kargakis One thing I don't think we accounted for is the strategy code itself checking to ensure the deployment it's about to execute is still valid. Even though we take steps to prevent their scheduling in the first place, and handle the outcomes more correctly, it's still possible the invalidated deployer pod gets scheduled and then spins up before deadlines or deletions are effective. I'm thinking the strategy code itself needs to inspect the state of things and bail.

This is evidenced by the output @smarterclayton posted: the strategy code knew about the later deployment, which should have invalidated the executing deployment, but instead it just scales it down and keeps going.

@ironcladlou (Contributor)

I think by this point we have more than enough context to exit early if we detect the deployment is superseded.
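
A rough sketch of that early-exit check, using hypothetical helpers rather than the real strategy API: before doing any scaling, look at every deployment for the config and refuse to proceed if one with a higher version already exists:

package main

import (
	"fmt"
	"strconv"
)

type deployment struct {
	Name        string
	Annotations map[string]string // "deploymentVersion" stands in for the real annotation key
}

func version(d deployment) int {
	v, _ := strconv.Atoi(d.Annotations["deploymentVersion"])
	return v
}

// supersededBy returns the name of a newer deployment, if any exists.
func supersededBy(target deployment, all []deployment) (string, bool) {
	for _, d := range all {
		if version(d) > version(target) {
			return d.Name, true
		}
	}
	return "", false
}

func main() {
	target := deployment{Name: "router-1", Annotations: map[string]string{"deploymentVersion": "1"}}
	all := []deployment{
		target,
		{Name: "router-2", Annotations: map[string]string{"deploymentVersion": "2"}},
	}
	if newer, ok := supersededBy(target, all); ok {
		// The deployer for router-1 should log and exit non-zero here
		// instead of scaling router-2 down and "completing" router-1.
		fmt.Printf("deployment %s is superseded by %s; exiting\n", target.Name, newer)
	}
}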

@smarterclayton (Contributor, Author) commented Apr 12, 2016 via email

@ironcladlou (Contributor)

Looked to me like router-2 deployed successfully, and then the router-1-deploy pod got scheduled and executed. The common deployer logic which detects the "from" deployment and "old" deployments has a problem: it considers the latest of all completed deployments not equal to the target (router-1 in this case) to be the "from" and everything else to be "old". So if router-2 is completed before router-1-deploy is scheduled and router-1-deploy gets an opportunity to execute, router-2 is considered "from". I think version checking here will correct it.

I'm not sure it has anything to do with parallelism - it still seems to be related to a pending deployer pod getting hung up in scheduling and getting a chance to run after a newer deployment was already created and deployed.

@Kargakis something else I noticed after a fresh look this morning is that the deploymentConfigController should be waiting for cancelled deployments to have an outcome before creating the new one. If the old deployment has a deployer pod hanging out in pending awaiting scheduling, I'm not sure why a new version would even be created leading to a new version/deployer pod unless the old deployment erroneously got updated to failed/complete before its deployer pod finished.

So in addition to the common deployer code not doing the right thing when we get into this situation, it's not yet clear to me how router-2 was allowed to exist to begin with prior to the completion of router-1.
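
A sketch of the version-aware "from" selection mentioned above (illustrative types only, not the common deployer code): only completed deployments with a version lower than the target's are candidates, so a newer router-2 can never be picked as "from" and scaled down by router-1's deployer:

package main

import "fmt"

type deployment struct {
	Name     string
	Version  int
	Complete bool
}

// fromDeployment picks the newest completed deployment that is strictly
// older than the target; anything newer is never a scale-down candidate.
func fromDeployment(target deployment, all []deployment) *deployment {
	var from *deployment
	for i := range all {
		d := &all[i]
		if !d.Complete || d.Version >= target.Version {
			continue // skip the target itself and anything newer
		}
		if from == nil || d.Version > from.Version {
			from = d
		}
	}
	return from
}

func main() {
	target := deployment{Name: "router-1", Version: 1}
	all := []deployment{target, {Name: "router-2", Version: 2, Complete: true}}
	if from := fromDeployment(target, all); from == nil {
		// With version checking, router-2 is not a valid "from" for router-1.
		fmt.Println("no prior deployment to scale down")
	}
}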

@ironcladlou (Contributor)

Didn't realize @Kargakis is out of the office - I'll be digging into this later in the morning.

@0xmichalis (Contributor)

Instead of waiting for previous cancelled deployments, can't we just delete any older deployers if we find out there is a newer deployment? Or clean up on failed cancelled deployments: #8439 (comment)

Also, the deployer code seems broken.
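
A sketch of that cleanup, with hypothetical pod/client shapes rather than the real kube client: when a newer deployment is observed, delete any deployer pods belonging to older versions before they can be scheduled:

package main

import "fmt"

type pod struct {
	Name              string
	DeploymentVersion int
}

type podDeleter interface {
	Delete(name string) error
}

type fakeDeleter struct{}

func (fakeDeleter) Delete(name string) error {
	fmt.Println("deleted deployer pod", name)
	return nil
}

// cleanupStaleDeployers removes deployer pods for any version older than latest.
func cleanupStaleDeployers(latestVersion int, deployers []pod, client podDeleter) error {
	for _, p := range deployers {
		if p.DeploymentVersion < latestVersion {
			if err := client.Delete(p.Name); err != nil {
				return err
			}
		}
	}
	return nil
}

func main() {
	deployers := []pod{
		{Name: "router-1-deploy", DeploymentVersion: 1}, // still Pending, waiting to be scheduled
		{Name: "router-2-deploy", DeploymentVersion: 2},
	}
	_ = cleanupStaleDeployers(2, deployers, fakeDeleter{})
}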

@ironcladlou (Contributor)

Thinking about this yet again, patching the deployer CLI is only a bandaid. The router-2 RC shouldn't exist since router-1-deploy exists/is non-terminal and router-1 is presumably not complete/failed. That seems to be the root issue from what I can see.

@0xmichalis (Contributor)

Wouldn't it be confusing if a second deployment is expected but we still hang on the first?

@smarterclayton (Contributor, Author)

We should NEVER have more than one deployer pod running at a time. Ever.

@0xmichalis (Contributor)

Agreed. That's why I am proposing deletion of older deployers when a newer deployment is detected.

@ironcladlou (Contributor)

This is a problem: https://github.com/openshift/origin/blob/master/pkg/deploy/controller/deployment/controller.go#L61

If we set the RC phase to failed without waiting for the deployer pods to resolve to a terminal phase, the deployment config controller will see the RC in a terminal phase and decide that there are no outstanding deployments, and make a new one.

I don't think we should ever transition the RC phase to a terminal phase unless:

  1. We observed completion of all deployer pods, or
  2. The deployer pods vanished

Otherwise we can't rely on the RC's terminal phases as reflecting the true state of the deployer pods.
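
A sketch of that invariant with illustrative types (not the real controller code): the RC may only move to a terminal phase once every deployer pod for it has itself finished or disappeared:

package main

import "fmt"

type podPhase string

const (
	podPending   podPhase = "Pending"
	podRunning   podPhase = "Running"
	podSucceeded podPhase = "Succeeded"
	podFailed    podPhase = "Failed"
)

// canMarkDeploymentTerminal reports whether the RC may be moved to
// Failed/Complete given the deployer pods that still exist for it.
func canMarkDeploymentTerminal(deployerPods []podPhase) bool {
	for _, phase := range deployerPods {
		if phase != podSucceeded && phase != podFailed {
			return false // a deployer is still Pending/Running
		}
	}
	return true // all deployers finished, or none exist any more
}

func main() {
	// router-1-deploy is still stuck Pending in scheduler backoff,
	// so router-1 must not be marked Failed yet.
	fmt.Println(canMarkDeploymentTerminal([]podPhase{podPending})) // false
	fmt.Println(canMarkDeploymentTerminal(nil))                    // true (deployer pods vanished)
}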

@0xmichalis (Contributor)

Good catch, I think this is the only place we do such a thing and I never felt good about it.

@ironcladlou (Contributor)

Actually, that failed update is only happening when the RC is in New, in which case how would any deployer pods have been created to cause an issue? Maybe I jumped the gun...

@0xmichalis (Contributor)

We create the deployer when the deployment is New, right?

@ironcladlou (Contributor)

We create the deployer when the deployment is New, right?

I just noticed that we create the pod and then break without transitioning out of New, so if a cancellation arrives before the next sync, we'll enter the scenario I described. So yeah, I think my analysis still stands.
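
A sketch of the fix implied here, again with hypothetical types: create the deployer pod and move the deployment out of New in the same sync, so a cancellation arriving before the next sync can no longer catch the deployment still in New with a live deployer pod:

package main

import "fmt"

type deployment struct {
	Name   string
	Phase  string
	HasPod bool
}

func syncNewDeployment(d *deployment, createPod func(name string) error) error {
	if err := createPod(d.Name + "-deploy"); err != nil {
		return err
	}
	d.HasPod = true
	// Transition immediately instead of breaking out of the sync while the
	// deployment is still New.
	d.Phase = "Pending"
	return nil
}

func main() {
	d := &deployment{Name: "router-1", Phase: "New"}
	_ = syncNewDeployment(d, func(name string) error {
		fmt.Println("created deployer pod", name)
		return nil
	})
	fmt.Println("deployment phase:", d.Phase)
}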

@pweil- (Contributor) commented Apr 14, 2016

@ironcladlou - assigning this to you since you're already working on it and @Kargakis is on PTO until the end of the week.

pweil- assigned ironcladlou and unassigned 0xmichalis Apr 14, 2016
@ironcladlou (Contributor)

@smarterclayton is this now considered fixed as of #8478?

@0xmichalis (Contributor)

@ironcladlou can we have an issue to fix the deployer too? Not a blocker but I think we should fix it for 1.3

@ironcladlou (Contributor)

@Kargakis

can we have an issue to fix the deployer too? Not a blocker but I think we should fix it for 1.3

It'll always be racing with something (cancellation, other deployments it might just miss when looking, etc.)

Seems more valuable to keep on top of enforcing the "only 1 deployer pod should exist at a time" invariant externally.

It's still probably useful to have a "last line of defense" in the deployer code. Not sure about 1.3 - I'll let @pweil- and @smarterclayton chime in there.

@0xmichalis (Contributor)

Seems more valuable to keep on top of enforcing the "only 1 deployer pod should exist at a time" invariant externally.

Related: #8500

@smarterclayton (Contributor, Author)

Deployers should bail if their invariants are violated and they detect it. We can do more in 1.3. Leaving this open to ensure it's well and truly gone.

@pweil- (Contributor) commented Apr 15, 2016

We can do more in 1.3

Removing this from the blocker list for now, then.

@0xmichalis (Contributor)

Opened #8972 for the deployer. Is there anything else TBD here?

@smarterclayton (Contributor, Author)

No.
