
[release-1.12] Don't drop traffic when upgrading a deployment fails #14840

Merged

Conversation

@dprotaso (Member) commented Jan 28, 2024:

Part of #14660

@knative-prow knative-prow bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. area/test-and-release It flags unit/e2e/conformance/perf test issues for product features labels Jan 28, 2024
@knative-prow knative-prow bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 28, 2024
codecov bot commented Jan 28, 2024:

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Comparison is base (0d199d9) 86.02% compared to head (f50114f) 86.02%.
Report is 1 commit behind head on release-1.12.

Files                                       Patch %   Lines
pkg/apis/serving/v1/revision_lifecycle.go   70.00%    2 Missing and 1 partial ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##           release-1.12   #14840   +/-   ##
=============================================
  Coverage         86.02%   86.02%           
=============================================
  Files               197      197           
  Lines             14922    14931    +9     
=============================================
+ Hits              12837    12845    +8     
- Misses             1775     1776    +1     
  Partials            310      310           


@dprotaso (Member Author):

/test upgrade-tests_serving_release-1.12

3 similar comments from @dprotaso followed, each retriggering the same test.

@dprotaso dprotaso force-pushed the test-upgrade-failure-test branch 2 times, most recently from dde064a to a6f145f Compare January 30, 2024 17:28
@knative-prow knative-prow bot added area/API API objects and controllers area/autoscale labels Jan 30, 2024
knative-prow bot commented Jan 30, 2024:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dprotaso

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@dprotaso dprotaso changed the title [wip] [release-1.12] test deployment failures don't drop traffic on upgrade [release-1.12] Don't drop traffic when upgrading a deployment fails Jan 30, 2024
@knative-prow knative-prow bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 30, 2024
When transforming the deployment status to the revision we want to bubble up the more severe condition to Ready.

Since replica failures include a more actionable error message, that condition is preferred.
This isn't accurate when the Revision has failed to roll out an update to its deployment.

1. PA Reachability now depends on the status of the Deployment.

If we have available replicas we don't mark the revision as unreachable. This allows ongoing requests to be handled (see the sketch after this list).

2. Always propagate the K8s Deployment Status to the Revision.

We don't need to gate this depending on whether the Revision required activation, since the only two conditions we propagate from the Deployment are Progressing and ReplicaSetFailure=False.

3. Mark the Revision as Deploying if the PA's service name isn't set.
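As a reading aid, a minimal sketch of the reachability rule in point 1, using simplified stand-in types rather than the actual knative/serving API (all names here are illustrative):

package main

import "fmt"

type Reachability string

const (
	ReachabilityReachable   Reachability = "Reachable"
	ReachabilityUnreachable Reachability = "Unreachable"
	ReachabilityUnknown     Reachability = "Unknown"
)

// revision is a toy stand-in capturing just the fields the rule needs.
type revision struct {
	routed          bool  // a Route still points at this revision
	healthy         bool  // Ready/Active conditions are not False
	desiredReplicas int32 // deployment.Spec.Replicas
	readyReplicas   int32 // deployment.Status.ReadyReplicas
}

func reachability(rev revision) Reachability {
	if !rev.routed {
		// Nothing routes here, so scaling to zero is safe.
		return ReachabilityUnreachable
	}
	if !rev.healthy {
		// Only treat an unhealthy revision as unreachable when it has no
		// ready replicas left (and some were requested); otherwise the
		// existing replicas keep serving ongoing requests.
		if rev.desiredReplicas > 0 && rev.readyReplicas == 0 {
			return ReachabilityUnreachable
		}
		return ReachabilityUnknown
	}
	return ReachabilityReachable
}

func main() {
	// A failed rollout that still has a ready replica keeps serving.
	fmt.Println(reachability(revision{routed: true, healthy: false, desiredReplicas: 1, readyReplicas: 1})) // Unknown
	// The same failure with nothing ready: give up and mark unreachable.
	fmt.Println(reachability(revision{routed: true, healthy: false, desiredReplicas: 1, readyReplicas: 0})) // Unreachable
}
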
@dprotaso (Member Author):

rebased

@knative knative deleted a comment from knative-prow bot Jan 30, 2024
@dprotaso (Member Author):

ambient is flaky - removing from 1.12 branch here - #14848

/override "test (v1.26.x, istio-ambient, runtime)"
/override "test (v1.26.x, istio-ambient, api)"
/override "test (v1.26.x, istio-ambient, e2e)"

@@ -144,9 +143,3 @@ func (rs *RevisionStatus) IsActivationRequired() bool {
 	c := revisionCondSet.Manage(rs).GetCondition(RevisionConditionActive)
 	return c != nil && c.Status != corev1.ConditionTrue
 }
 
-// IsReplicaSetFailure returns true if the deployment replicaset failed to create
-func (rs *RevisionStatus) IsReplicaSetFailure(deploymentStatus *appsv1.DeploymentStatus) bool {
Contributor:

Where do we cover this part?

Member Author:

Here https://github.com/knative/serving/pull/14840/files#diff-f23ba125b15b4c6e491938d04e4d412d0b550197c51078c7de359ba7abc0da17R71

We always propagate the status now - and this is surfaced as a deployment condition

if c := rev.Status.GetCondition(cond); c != nil && c.IsFalse() {
	if infraFailure && deployment != nil && deployment.Spec.Replicas != nil {
		// If we have an infra failure and no ready replicas - then this revision is unreachable
		if *deployment.Spec.Replicas > 0 && deployment.Status.ReadyReplicas == 0 {
@skonto (Contributor) commented Jan 31, 2024:

Sorry for being verbose, just trying to summarize.
In the past we moved from checking whether the rev routing state is active for PA reachability:

if !rev.IsReachable() { return autoscalingv1alpha1.ReachabilityUnreachable }

func (r *Revision) IsReachable() bool {
	return RoutingState(r.Labels[serving.RoutingStateLabelKey]) == RoutingStateActive
}

to checking some rev conditions before checking the routing state, in order to avoid old revision pods being created until the new revision is up.
However, that proved a bit too aggressive in the case of a broken webhook plus a deployment upgrade, thus cutting traffic.
Now with this patch we only set the PA to "unreachable" due to the revision being unhealthy or inactive if there are no ready replicas (when >0 are required), to allow traffic.
Btw, what is the effect on the revision when we mark this unreachable - does it propagate to the revision? I am a bit confused with all these states.
Could we also do a follow-up PR to document this state machine of resources, so we feel more confident about changing stuff?

Member Author:

> Now with this patch we only set the PA to "unreachable" due to the revision being unhealthy or inactive if there are no ready replicas (when >0 are required).

The other condition is if the revision is not being pointed to by a route - then it's unreachable as well.

> Btw, what is the effect on the revision when we mark this unreachable - does it propagate to the revision?

If the revision marks the PA unreachable then the autoscaler will scale the deployment down to zero.

> Could we also do a follow-up PR to document this state machine of resources, so we feel more confident about changing stuff?

Sure - I also included the necessary tests to cover this case.
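As an aside, a toy sketch of that knock-on effect (illustrative only; simplified inputs, not the actual autoscaler code): an unreachable PA drops the scale floor to zero, so the autoscaler is free to remove every replica.

package main

import "fmt"

// desiredScale sketches how unreachability feeds scale-to-zero: minScale
// only protects reachable revisions, so an unreachable PA has a floor of
// zero and the deployment can be scaled down entirely.
func desiredScale(reachable bool, recommended, minScale int32) int32 {
	floor := minScale
	if !reachable {
		floor = 0 // unreachable: minScale no longer applies
	}
	if recommended < floor {
		return floor
	}
	return recommended
}

func main() {
	fmt.Println(desiredScale(true, 0, 1))  // 1: reachable, minScale holds the floor
	fmt.Println(desiredScale(false, 0, 1)) // 0: unreachable, free to scale to zero
}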

if ps.IsScaleTargetInitialized() && !resUnavailable {
	// Precondition for PA being initialized is SKS being active and
	// that implies that |service.endpoints| > 0.
	rs.MarkResourcesAvailableTrue()
	rs.MarkContainerHealthyTrue()
}

// Mark resource unavailable if we don't have a Service Name and the deployment is ready
Member:

Can we somehow combine this with the above statements (from https://github.com/knative/serving/pull/14840/files#diff-831a9383e7db7880978acf31f7dfec777beb08b900b1d0e1c55a5aed42e602cbR173 down)?
It feels like both parts work on RevisionConditionResourcesAvailable and PodAutoscalerConditionReady and set rs.MarkResourcesAvailableUnknown.

Member:

Or in other words, the full function is a bit hard to grasp.

Member Author:

How do you want to combine it? My hope here is to keep the conditionals straightforward; keeping them separate helps with that.

Member:

Hm I'd need more time to fiddle around with the current code. But maybe better to keep it here and do it on main afterwards (if even).

@ReToCode (Member) left a comment:

This is a tough one. I agree with Stavros that it's quite hard to understand the current state machine.

As far as I can tell (also reading your explanations and comments in the previous PR), I think this looks good.

}},
},
}, {
name: "replica failure has priority over progressing",
Contributor:

Could you elaborate on where priority is defined? I see that DeploymentConditionProgressing is the same before and after, so no change there.

Member Author:

'Priority' here means that the replica-failure message is the last one applied, so it is surfaced to the deployment's Ready condition.
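To make the "last one applied wins" behaviour concrete, a toy sketch with a plain map standing in for the knative.dev/pkg ConditionSet (names and mechanics are illustrative, not the library's API):

package main

import "fmt"

type condition struct {
	Status  string
	Message string
}

// markFalse records a failing dependent condition and mirrors its message
// onto Ready, so whichever failure is applied last is the one surfaced.
func markFalse(conds map[string]condition, name, msg string) {
	conds[name] = condition{Status: "False", Message: msg}
	conds["Ready"] = condition{Status: "False", Message: msg}
}

func main() {
	conds := map[string]condition{}
	markFalse(conds, "Progressing", "progress deadline exceeded")
	// Applied last, so its more actionable message wins on Ready.
	markFalse(conds, "ReplicaSetReady", "admission webhook denied the pods")
	fmt.Println(conds["Ready"].Message) // admission webhook denied the pods
}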

},
want: &duckv1.Status{
Conditions: []apis.Condition{{
Type: DeploymentConditionProgressing,
Contributor:

Reading this I was kind of confused seeing that:

DeploymentConditionProgressing apis.ConditionType = "Progressing"
DeploymentProgressing DeploymentConditionType = "Progressing"

while the condition types defined in knative.dev/pkg are just two:

	// ConditionReady specifies that the resource is ready.
	// For long-running resources.
	ConditionReady ConditionType = "Ready"
	// ConditionSucceeded specifies that the resource has finished.
	// For resource which run to completion.
	ConditionSucceeded ConditionType = "Succeeded"

Also, going from deployment conditions to duckv1 conditions and back seems a bit complex; eventually we have:

func TransformDeploymentStatus(ds *appsv1.DeploymentStatus) *duckv1.Status {
	s := &duckv1.Status{}

	depCondSet.Manage(s).InitializeConditions()
	// The absence of this condition means no failure has occurred. If we find it
	// below, we'll overwrite this.
	depCondSet.Manage(s).MarkTrue(DeploymentConditionReplicaSetReady)
	depCondSet.Manage(s).MarkUnknown(DeploymentConditionProgressing, "Deploying", "")
....
func (rs *RevisionStatus) PropagateDeploymentStatus(original *appsv1.DeploymentStatus) {
	ds := serving.TransformDeploymentStatus(original)
	cond := ds.GetCondition(serving.DeploymentConditionReady)
...

I am wondering if mapping deployment conditions directly to revision conditions would be more readable.

Member Author:

> I am wondering if mapping deployment conditions directly to revision conditions would be more readable.

I'm open to folks cleaning this up in a follow-up PR.

The thing with the deployment conditions is their polarity is weird - ReplicaCreateFailure=False is actually good.
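For illustration, a hedged sketch of that polarity flip - translating the deployment's negative-polarity replica-failure condition into a positive-polarity "replica set ready" signal (simplified; the real mapping lives in TransformDeploymentStatus):

package transform

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// replicaSetReady flips the polarity: on the deployment, ReplicaFailure=True
// is bad and ReplicaFailure=False (or absent) is good, so "ready" means the
// failure condition is not True.
func replicaSetReady(ds *appsv1.DeploymentStatus) bool {
	for _, c := range ds.Conditions {
		if c.Type == appsv1.DeploymentReplicaFailure {
			return c.Status != corev1.ConditionTrue
		}
	}
	// Absence of the condition means no failure has occurred.
	return true
}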

Member:

> The thing with the deployment conditions is their polarity is weird - ReplicaCreateFailure=False is actually good.

Yes, I had to read that three times to get it :D

@@ -1303,7 +1303,7 @@ func TestGlobalResyncOnUpdateAutoscalerConfigMap(t *testing.T) {
 	rev := newTestRevision(testNamespace, testRevision)
 	newDeployment(ctx, t, fakedynamicclient.Get(ctx), testRevision+"-deployment", 3)
 
-	kpa := revisionresources.MakePA(rev)
+	kpa := revisionresources.MakePA(rev, nil)
Contributor:

Shouldn't we pass the deployment above instead of nil?

Member Author:

These are really fixtures - and passing in a deployment doesn't change the fixture, so I didn't think it was necessary.

resources, err := v1test.CreateServiceReady(c.T, clients, names, func(s *v1.Service) {
	s.Spec.Template.Annotations = map[string]string{
		autoscaling.MinScaleAnnotation.Key(): "1",
		autoscaling.MaxScaleAnnotation.Key(): "1",
Contributor:

What will happen if maxScale = 10 and we deploy the failing webhook before all replicas are up? Would the new revision be reachable since some replicas are up?

Member Author:

The replica set has technically progressed, so there would be no failure surfaced on the deployment; it's a scaling issue for the replica set.

@skonto (Contributor) commented Feb 1, 2024:

/lgtm
/hold for @ReToCode
Are ambient tests broken?

@knative-prow knative-prow bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged. labels Feb 1, 2024
@ReToCode (Member) commented Feb 2, 2024:

/lgtm
/unhold

> Are ambient tests broken?

They are super flaky --> #14637 (I think we also have other tests that are pretty flaky, in kourier for example). At some point we need to take some time to look into it.

@knative-prow knative-prow bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 2, 2024
@dprotaso (Member Author) commented Feb 2, 2024:

Yeah, I cherry-picked disabling ambient back to the 1.12 branch here - #14848

@dprotaso (Member Author) commented Feb 2, 2024:

/override "test (v1.27.x, istio-ambient, api)"

knative-prow bot commented Feb 2, 2024:

@dprotaso: Overrode contexts on behalf of dprotaso: test (v1.27.x, istio-ambient, api)

In response to this:

/override "test (v1.27.x, istio-ambient, api)"

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dprotaso (Member Author) commented Feb 2, 2024:

/override "test (v1.28.x, istio-ambient, e2e)"

knative-prow bot commented Feb 2, 2024:

@dprotaso: Overrode contexts on behalf of dprotaso: test (v1.28.x, istio-ambient, e2e)

In response to this:

/override "test (v1.28.x, istio-ambient, e2e)"


@dprotaso (Member Author) commented Feb 2, 2024:

/override "test (v1.28.x, istio-ambient, runtime)"
/override "test (v1.28.x, istio-ambient, api)"

/override "test (v1.27.x, istio-ambient, e2e)"
/override "test (v1.27.x, istio-ambient, runtime)"

knative-prow bot commented Feb 2, 2024:

@dprotaso: /override requires failed status contexts, check run or a prowjob name to operate on.
The following unknown contexts/checkruns were given:

  • test (v1.27.x, istio-ambient, e2e)
  • test (v1.27.x, istio-ambient, runtime)
  • test (v1.28.x, istio-ambient, api)

Only the following failed contexts/checkruns were expected:

  • EasyCLA
  • build-tests_serving_release-1.12
  • istio-latest-no-mesh-tls_serving_release-1.12
  • istio-latest-no-mesh_serving_release-1.12
  • style / suggester / github_actions
  • style / suggester / shell
  • style / suggester / yaml
  • test (v1.28.x, istio-ambient, runtime)
  • tide
  • unit-tests_serving_release-1.12
  • upgrade-tests_serving_release-1.12

If you are trying to override a checkrun that has a space in it, you must put a double quote on the context.

In response to this:

/override "test (v1.28.x, istio-ambient, runtime)"
/override "test (v1.28.x, istio-ambient, api)"

/override "test (v1.27.x, istio-ambient, e2e)"
/override "test (v1.27.x, istio-ambient, runtime)"


@dprotaso (Member Author) commented Feb 2, 2024:

/override "test (v1.28.x, istio-ambient, runtime)"

knative-prow bot commented Feb 2, 2024:

@dprotaso: Overrode contexts on behalf of dprotaso: test (v1.28.x, istio-ambient, runtime)

In response to this:

/override "test (v1.28.x, istio-ambient, runtime)"


@knative-prow knative-prow bot merged commit 9d21588 into knative:release-1.12 Feb 2, 2024
79 of 85 checks passed
@dprotaso dprotaso deleted the test-upgrade-failure-test branch February 2, 2024 16:14
@dprotaso (Member Author) commented Feb 2, 2024:

/cherry-pick release-1.13

@knative-prow-robot (Contributor):

@dprotaso: new pull request created: #14864

In response to this:

/cherry-pick release-1.13

