[release-1.12] Don't drop traffic when upgrading a deployment fails #14840
Conversation
Codecov Report

Additional details and impacted files:

@@            Coverage Diff             @@
##           release-1.12   #14840   +/-   ##
=============================================
  Coverage       86.02%     86.02%
=============================================
  Files             197        197
  Lines           14922      14931     +9
=============================================
+ Hits            12837      12845     +8
- Misses           1775       1776     +1
  Partials          310        310

☔ View full report in Codecov by Sentry.
Force-pushed from 8f5284e to 5f22a01
/test upgrade-tests_serving_release-1.12
3 similar comments
Force-pushed from dde064a to a6f145f
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dprotaso

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
When transforming the deployment status to the revision we want to bubble up the more severe condition to Ready. Since replica failures include a more actionable error message, that condition is preferred.

This isn't accurate when the Revision has failed to roll out an update to its deployment.

1. PA reachability now depends on the status of the Deployment. If we have available replicas we don't mark the revision as unreachable, which allows ongoing requests to be handled (see the sketch below).
2. Always propagate the K8s Deployment status to the Revision. We don't need to gate this on whether the Revision required activation, since the only two conditions we propagate from the Deployment are Progressing and ReplicaSetFailure=False.
3. Mark the Revision as Deploying if the PA's service name isn't set.
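To make point 1 concrete, here is a minimal sketch of that reachability decision. The helper name, the two-valued result type, and the parameters are hypothetical placeholders, not the actual knative/serving reconciler code; it only illustrates the rule "only unreachable when replicas are wanted but none are ready".

```go
package reachability

import (
	appsv1 "k8s.io/api/apps/v1"
)

// Reachability is a simplified stand-in for the PodAutoscaler reachability value.
type Reachability string

const (
	ReachabilityReachable   Reachability = "Reachable"
	ReachabilityUnreachable Reachability = "Unreachable"
)

// decideReachability (hypothetical) sketches point 1: a revision with a failing
// infra condition is only treated as unreachable when its deployment wants
// replicas but has none ready, so requests to existing replicas keep flowing.
func decideReachability(routed, infraFailure bool, d *appsv1.Deployment) Reachability {
	if !routed {
		// No route points at this revision, so it is safe to mark it unreachable.
		return ReachabilityUnreachable
	}
	if infraFailure && d != nil && d.Spec.Replicas != nil &&
		*d.Spec.Replicas > 0 && d.Status.ReadyReplicas == 0 {
		// Broken and nothing currently serving traffic: unreachable.
		return ReachabilityUnreachable
	}
	return ReachabilityReachable
}
```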
Force-pushed from e47b232 to f50114f
rebased
ambient is flaky - removing from 1.12 branch here - #14848

/override "test (v1.26.x, istio-ambient, runtime)"
@@ -144,9 +143,3 @@ func (rs *RevisionStatus) IsActivationRequired() bool {
	c := revisionCondSet.Manage(rs).GetCondition(RevisionConditionActive)
	return c != nil && c.Status != corev1.ConditionTrue
}

// IsReplicaSetFailure returns true if the deployment replicaset failed to create
func (rs *RevisionStatus) IsReplicaSetFailure(deploymentStatus *appsv1.DeploymentStatus) bool {
Where do we cover this part?
We always propagate the status now - and this is surfaced as a deployment condition
if c := rev.Status.GetCondition(cond); c != nil && c.IsFalse() {
	if infraFailure && deployment != nil && deployment.Spec.Replicas != nil {
		// If we have an infra failure and no ready replicas - then this revision is unreachable
		if *deployment.Spec.Replicas > 0 && deployment.Status.ReadyReplicas == 0 {
Sorry for being verbose, just trying to summarize.

So in the past we moved from checking whether the rev routing state is active for PA reachability:

if !rev.IsReachable() {
	return autoscalingv1alpha1.ReachabilityUnreachable
}

func (r *Revision) IsReachable() bool {
	return RoutingState(r.Labels[serving.RoutingStateLabelKey]) == RoutingStateActive
}

to checking some rev conditions before checking the routing state, in order to avoid old revision pods being created until the new revision is up.

However, it proved a bit too aggressive in the case of a broken webhook during a deployment upgrade, thus cutting traffic.

Now with this patch we only set the PA to "unreachable" due to the revision being unhealthy or inactive if there are no ready replicas (when >0 are required), so traffic can still be served.

Btw, what is the effect on the revision when we mark this unreachable? Does it propagate to the revision? I am a bit confused with all these states.

Could we also do a follow-up PR to document this state machine of resources, so we feel more confident about changing things?
Now with this patch we only set the PA to "unreachable" due to the revision being unhealthy or inactive if there are no ready replicas (when >0 are required).

The other condition is if the revision is not being pointed to by a route - then it's unreachable as well.

Btw, what is the effect on the revision when we mark this unreachable? Does it propagate to the revision?

If the revision marks the PA unreachable then the autoscaler will scale the deployment down to zero.

Could we also do a follow-up PR to document this state machine of resources, so we feel more confident about changing things?

Sure - I also included the necessary tests to cover this case.
if ps.IsScaleTargetInitialized() && !resUnavailable {
	// Precondition for PA being initialized is SKS being active and
	// that implies that |service.endpoints| > 0.
	rs.MarkResourcesAvailableTrue()
	rs.MarkContainerHealthyTrue()
}

// Mark resource unavailable if we don't have a Service Name and the deployment is ready
Can we somehow combine this with the above statements (from https://github.com/knative/serving/pull/14840/files#diff-831a9383e7db7880978acf31f7dfec777beb08b900b1d0e1c55a5aed42e602cbR173 down)?
It feels like both parts work on RevisionConditionResourcesAvailable and PodAutoscalerConditionReady and set rs.MarkResourcesAvailableUnknown.
Or in other words, the full function is a bit hard to grasp.
How do you want to combine it? My hope here is to keep the conditionals straightforward. Keeping them separate helps with that.
Hm, I'd need more time to fiddle around with the current code. But maybe better to keep it here and do it on main afterwards (if even).
This is a tough one. I agree with Stavros that it's quite hard to understand the current state machine.
As far as I can tell (also reading your explanations and comments in the previous PR) I think this looks good.
}},
},
}, {
	name: "replica failure has priority over progressing",
Could you elaborate where priority is defined? I see that DeploymentConditionProgressing is the same before and after, so no change there.
'priority' here means that the replica failure message is the last one applied, so it is surfaced to the deployment's Ready condition.
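As a rough illustration of that ordering - a toy condition set, not the knative.dev/pkg implementation - every condition marked False also rewrites the top-level Ready condition, so whichever failure is applied last is the one whose reason and message surface on Ready:

```go
package conditions

// Condition is a stripped-down stand-in for an apis/duckv1 Condition.
type Condition struct {
	Type, Status, Reason, Message string
}

// Status holds conditions keyed by type; "Ready" plays the role of the
// top-level happy condition.
type Status struct {
	Conditions map[string]Condition
}

// markFalse records a dependent condition as False and mirrors its reason and
// message onto Ready, mimicking how a living condition set bubbles failures up.
func (s *Status) markFalse(condType, reason, message string) {
	if s.Conditions == nil {
		s.Conditions = map[string]Condition{}
	}
	s.Conditions[condType] = Condition{Type: condType, Status: "False", Reason: reason, Message: message}
	s.Conditions["Ready"] = Condition{Type: "Ready", Status: "False", Reason: reason, Message: message}
}
```

In this toy model, marking Progressing false and then ReplicaSetFailure false leaves Ready carrying the replica failure reason and message, which is the "priority" the test name refers to.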
},
want: &duckv1.Status{
	Conditions: []apis.Condition{{
		Type: DeploymentConditionProgressing,
Reading this I was kind of confused seeing that:
DeploymentConditionProgressing apis.ConditionType = "Progressing"
DeploymentProgressing DeploymentConditionType = "Progressing"
condition types defined in knative.dev/pkg are just two:
// ConditionReady specifies that the resource is ready.
// For long-running resources.
ConditionReady ConditionType = "Ready"
// ConditionSucceeded specifies that the resource has finished.
// For resource which run to completion.
ConditionSucceeded ConditionType = "Succeeded"
Also, going from deployment conditions to duckv1 conditions and back seems a bit complex; eventually we have:
func TransformDeploymentStatus(ds *appsv1.DeploymentStatus) *duckv1.Status {
s := &duckv1.Status{}
depCondSet.Manage(s).InitializeConditions()
// The absence of this condition means no failure has occurred. If we find it
// below, we'll overwrite this.
depCondSet.Manage(s).MarkTrue(DeploymentConditionReplicaSetReady)
depCondSet.Manage(s).MarkUnknown(DeploymentConditionProgressing, "Deploying", "")
....
func (rs *RevisionStatus) PropagateDeploymentStatus(original *appsv1.DeploymentStatus) {
ds := serving.TransformDeploymentStatus(original)
cond := ds.GetCondition(serving.DeploymentConditionReady)
...
I am wondering if mapping deployment conditions directly to revision conditions would be more readable.
I am wondering if mapping deployment conditions directly to revision conditions would be more readable.
I'm open to folks cleaning this up in a follow-up PR.

The thing with the deployment conditions is that their polarity is weird - ReplicaCreateFailure=False is actually good.
The thing with the deployment conditions is that their polarity is weird - ReplicaCreateFailure=False is actually good.

Yes, I had to read that three times to get it :D
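To make the polarity point above concrete, here is a small sketch - a hypothetical helper, not knative's actual TransformDeploymentStatus - that reads the upstream Deployment condition and flips it into a positive "healthy" answer:

```go
package transform

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// replicaSetHealthy flips the inverted polarity of the upstream condition:
// the Deployment reports ReplicaFailure=True when replica creation is failing,
// so False (or an absent condition) is the good case.
func replicaSetHealthy(ds *appsv1.DeploymentStatus) bool {
	for _, c := range ds.Conditions {
		if c.Type == appsv1.DeploymentReplicaFailure && c.Status == corev1.ConditionTrue {
			return false
		}
	}
	return true
}
```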
@@ -1303,7 +1303,7 @@ func TestGlobalResyncOnUpdateAutoscalerConfigMap(t *testing.T) {
	rev := newTestRevision(testNamespace, testRevision)
	newDeployment(ctx, t, fakedynamicclient.Get(ctx), testRevision+"-deployment", 3)

-	kpa := revisionresources.MakePA(rev)
+	kpa := revisionresources.MakePA(rev, nil)
Shouldn't we pass the deployment above instead of nil?
These are really fixtures - and passing in a deployment doesn't change the fixture so I didn't think it was necessary.
resources, err := v1test.CreateServiceReady(c.T, clients, names, func(s *v1.Service) {
	s.Spec.Template.Annotations = map[string]string{
		autoscaling.MinScaleAnnotation.Key(): "1",
		autoscaling.MaxScaleAnnotation.Key(): "1",
What will happen if maxScale = 10 and we deploy the failing webhook before all replicas are up? Would the new revision be reachable since some replicas are up?
The replica set has technically progressed, so there would be no failure surfaced on the deployment because it's a scaling issue for the ReplicaSet.
/lgtm
/lgtm
They are super flaky --> #14637 (I think we also have other tests that are pretty flaky, also in kourier for example). At some point we need to take some time to look into it.
Yeah I cherry-picked disabling ambient back to the 1.12 branch here - #14848
/override "test (v1.27.x, istio-ambient, api)" |
@dprotaso: Overrode contexts on behalf of dprotaso: test (v1.27.x, istio-ambient, api)
/override "test (v1.28.x, istio-ambient, e2e)" |
@dprotaso: Overrode contexts on behalf of dprotaso: test (v1.28.x, istio-ambient, e2e)
/override "test (v1.28.x, istio-ambient, runtime)" /override "test (v1.27.x, istio-ambient, e2e)" |
@dprotaso: /override requires failed status contexts, check run or a prowjob name to operate on.
Only the following failed contexts/checkruns were expected:
If you are trying to override a checkrun that has a space in it, you must put a double quote on the context.
/override "test (v1.28.x, istio-ambient, runtime)" |
@dprotaso: Overrode contexts on behalf of dprotaso: test (v1.28.x, istio-ambient, runtime)
/cherry-pick release-1.13
@dprotaso: new pull request created: #14864
Part of #14660