Pod shall not transition from terminated phase: "Failed" -> "Succeeded" #17595

Closed
0xmichalis opened this issue Dec 5, 2017 · 19 comments · Fixed by #18791
Comments

0xmichalis (Contributor) commented Dec 5, 2017

/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/deployments/deployments.go:1299
2017-12-05 09:50:41.393147976 +0000 UTC: detected deployer pod transition from terminated phase: "Failed" -> "Succeeded"
Expected
    <bool>: true
to be false
/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/deployments/util.go:727

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/17589/test_pull_request_origin_extended_conformance_install/3497/

/sig master
/kind test-flake
/assign mfojtik tnozicka

xref https://bugzilla.redhat.com/show_bug.cgi?id=1534492

tnozicka (Contributor) commented Dec 5, 2017

@sjenning it seems to already include #17514

I don't think your condition prevents failed -> succeeded:

if oldStatus.State.Terminated != nil && newStatus.State.Terminated == nil {
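
For context, here is a minimal sketch of a guard that compares the pod-level phase directly, which is what catching Failed -> Succeeded requires; it assumes the v1 types from k8s.io/api/core/v1 and is only an illustration, not the actual kubelet patch:

package main

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
)

// isIllegalPhaseTransition reports whether a status update would move a pod
// out of a terminal phase. Failed and Succeeded are terminal, so any change
// away from them violates the pod lifecycle.
func isIllegalPhaseTransition(oldPhase, newPhase v1.PodPhase) bool {
    terminal := oldPhase == v1.PodFailed || oldPhase == v1.PodSucceeded
    return terminal && newPhase != oldPhase
}

func main() {
    fmt.Println(isIllegalPhaseTransition(v1.PodFailed, v1.PodSucceeded)) // true
}

The quoted condition only inspects container State.Terminated, so it cannot notice a pod-level phase flip from Failed to Succeeded while the container states stay terminated.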

There are a few other flakes likely caused by the kubelet not respecting the pod state transition diagram.

/assign @sjenning
/unassign @tnozicka @mfojtik
/priority P0

tnozicka (Contributor) commented Dec 5, 2017

(previous issue #17011)

tnozicka changed the title from "deploymentconfigs keep the deployer pod invariant valid [Conformance] should deal with config change in case the deployment is still running [Suite:openshift/conformance/parallel]" to "Pod shall not transition from terminated phase: "Failed" -> "Succeeded"" on Dec 5, 2017
mfojtik (Contributor) commented Jan 8, 2018

@sjenning any chance the pod team can investigate this? Regardless of flakes, this might affect production deployments, where the deployer pod can go from Failed to Succeeded.

sjenning (Contributor) commented:

I'm looking into this now. Sorry for the delay.

sjenning (Contributor) commented:

@tnozicka just wondering, have you seen this one lately? It doesn't look like we have hit it for a while now. Did it disappear around the same time we bumped the disk size on the CI instances?

tnozicka (Contributor) commented:

I haven't. @ironcladlou @mfojtik have you got any flakes looking like this? (Those are usually different tests, as the detector is asynchronous and runs for every deployment test.)

Any chance that this might be related to #18233?

tnozicka (Contributor) commented Feb 1, 2018

Likely not caused by the informers issue I pointed you to, as @deads2k just saw it and the watch cache is already fixed on master.

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/18387/test_pull_request_origin_extended_conformance_install/6640/

sjenning (Contributor) commented Feb 1, 2018

@tnozicka still trying to get to the root cause on this: kubernetes/kubernetes#58711 (comment)

sjenning (Contributor) commented:

Opened a PR upstream:
kubernetes/kubernetes#59767

k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this issue Feb 12, 2018
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

kubelet: check for illegal phase transition

I have been unable to root cause the transition from `Failed` phase to `Succeeded`, and I have been unable to recreate it. However, our CI in Origin, where we have controllers that look for these transitions and rely on the phase transition rules being respected, clearly indicates that the illegal transition does occur.

This PR will prevent the illegal phase transition from propagating into the kubelet caches or the API server.

Fixes #58711

xref openshift/origin#17595

@dashpole @yujuhong @derekwaynecarr @smarterclayton @tnozicka
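
For reference, a minimal sketch of the kind of invariant detector those controllers rely on (a hypothetical helper, not the Origin test code), flagging any transition out of a terminal phase:

package main

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
)

// isTerminal reports whether a pod phase is terminal.
func isTerminal(p v1.PodPhase) bool {
    return p == v1.PodFailed || p == v1.PodSucceeded
}

// checkPhaseSequence scans the phases observed for a single pod, in order, and
// returns an error on the first transition out of a terminal phase.
func checkPhaseSequence(observed []v1.PodPhase) error {
    for i := 1; i < len(observed); i++ {
        if isTerminal(observed[i-1]) && observed[i] != observed[i-1] {
            return fmt.Errorf("detected pod transition from terminated phase: %q -> %q", observed[i-1], observed[i])
        }
    }
    return nil
}

func main() {
    // Mirrors the failure reported at the top of this issue.
    fmt.Println(checkPhaseSequence([]v1.PodPhase{v1.PodRunning, v1.PodFailed, v1.PodSucceeded}))
}
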
smarterclayton (Contributor) commented:

Did this make it into 3.9? If so, it's still happening (below). If not, we need to get it in before release.

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/18739/test_pull_request_origin_extended_conformance_install/7966/

sjenning (Contributor) commented:

It did go in about 2 weeks ago
#18585

Looking into it. If this is legitimately happening even with the check I added, I'm very confused.

tnozicka (Contributor) commented Mar 1, 2018

I don't think the requested check in the apiserver is present.

To check, if I do

curl -k -v -XPATCH -H "Authorization: Bearer ${TOKEN}" -H "Accept: application/json" -H "Content-Type: application/strategic-merge-patch+json" https://172.16.20.11:8443/api/v1/namespaces/test/pods/busyapp-76589968df-xnthx/status -d '{"status": {"phase": "Failed"}}'

it gets overwritten back to Running.

This is a serious bug and I feel we need that enforcement in the apiserver as well as the fix.

It feels like the whole kubelet status manager ignores optimistic concurrency with resourceVersion: it just does a GET before updating the status, no matter which resourceVersion was used to compute that status, so the precondition might have changed in the meantime (see the sketch after the quoted code below).

pod, err := m.kubeClient.CoreV1().Pods(status.podNamespace).Get(status.podName, metav1.GetOptions{})
if errors.IsNotFound(err) {
    glog.V(3).Infof("Pod %q (%s) does not exist on the server", status.podName, uid)
    // If the Pod is deleted the status will be cleared in
    // RemoveOrphanedStatuses, so we just ignore the update here.
    return
}
if err != nil {
    glog.Warningf("Failed to get status for pod %q: %v", format.PodDesc(status.podName, status.podNamespace, uid), err)
    return
}
translatedUID := m.podManager.TranslatePodUID(pod.UID)
// Type convert original uid just for the purpose of comparison.
if len(translatedUID) > 0 && translatedUID != kubetypes.ResolvedPodUID(uid) {
    glog.V(2).Infof("Pod %q was deleted and then recreated, skipping status update; old UID %q, new UID %q", format.Pod(pod), uid, translatedUID)
    m.deletePodStatus(uid)
    return
}
pod.Status = status.status
// TODO: handle conflict as a retry, make that easier too.
newPod, err := m.kubeClient.CoreV1().Pods(pod.Namespace).UpdateStatus(pod)
if err != nil {
    glog.Warningf("Failed to update status for pod %q: %v", format.Pod(pod), err)
    return
}
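
A hedged sketch of what an update that respects optimistic concurrency could look like, reusing the client-go calls from the quoted snippet (the helper name and surrounding wiring are assumptions, not the kubelet's actual fix): apply the computed status to the pod revision it was computed from, and let the apiserver reject it on conflict instead of letting a stale status win.

package statusutil

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/errors"
    "k8s.io/client-go/kubernetes"
)

// updateStatusPreservingConcurrency applies newStatus to the exact pod
// revision it was computed from. Because cachedPod still carries its original
// resourceVersion, the apiserver answers with a Conflict if the pod changed in
// the meantime, instead of the stale status silently overwriting newer state.
func updateStatusPreservingConcurrency(client kubernetes.Interface, cachedPod *v1.Pod, newStatus v1.PodStatus) error {
    pod := cachedPod.DeepCopy()
    pod.Status = newStatus
    _, err := client.CoreV1().Pods(pod.Namespace).UpdateStatus(pod)
    if errors.IsConflict(err) {
        // The pod changed since this status was computed; recompute against
        // the newer object rather than forcing the stale one through.
        return fmt.Errorf("status for pod %s/%s is stale: %v", pod.Namespace, pod.Name, err)
    }
    return err
}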

sjenning (Contributor) commented Mar 1, 2018

Switching back to this issue.

sjenning (Contributor) commented Mar 1, 2018

I swear, I'm losing it. I totally forgot that this PR got reopened and merged upstream kubernetes/kubernetes#54530. I'll pick it right now.
