
Reduce the period and failure threshold for activator readiness #12614

Merged
merged 1 commit into knative:main from dprotaso:probe-test
Feb 11, 2022

Conversation

dprotaso
Member

@dprotaso dprotaso commented Feb 11, 2022

The default drain timeout is 45 seconds, which was much shorter than
the time it took for the activator to be recognized as not ready (2 minutes).

This was resulting in 503s, since the activator was receiving traffic when it
was not expecting it.
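For context, a pod is only marked not-ready after roughly periodSeconds × failureThreshold, so that window has to be shorter than the 45-second drain. A minimal sketch of the kind of readiness settings this implies (the port and header mirror the liveness probe shown later in this PR; the periodSeconds/failureThreshold values are assumptions for illustration, not the verbatim diff):

readinessProbe:
  httpGet:
    port: 8012
    httpHeaders:
      - name: k-kubelet-probe
        value: "activator"
  periodSeconds: 5       # assumed: probe more often than every 10s
  failureThreshold: 1    # assumed: report not-ready after a single failed probe
  # detection window ≈ 5s * 1 = 5s, comfortably under the 45s drain timeout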

Fixes test/upgrade.TestServingUpgrades/VerifyContinualTests/ProbeTest flake

Release Note

NONE

@knative-prow-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dprotaso

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot knative-prow-robot added the size/XS and approved labels Feb 11, 2022
@codecov

codecov bot commented Feb 11, 2022

Codecov Report

Merging #12614 (3cd9277) into main (16c94d1) will increase coverage by 0.09%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##             main   #12614      +/-   ##
==========================================
+ Coverage   87.39%   87.48%   +0.09%     
==========================================
  Files         195      195              
  Lines        9718     9718              
==========================================
+ Hits         8493     8502       +9     
+ Misses        937      931       -6     
+ Partials      288      285       -3     
Impacted Files                               Coverage Δ
pkg/activator/net/revision_backends.go       92.60% <0.00%> (+0.86%) ⬆️
pkg/reconciler/revision/background.go        90.00% <0.00%> (+1.81%) ⬆️
pkg/autoscaler/statforwarder/processor.go    94.44% <0.00%> (+5.55%) ⬆️
pkg/autoscaler/statforwarder/forwarder.go    96.29% <0.00%> (+5.55%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 16c94d1...3cd9277. Read the comment docs.

livenessProbe:
  httpGet:
    port: 8012
    httpHeaders:
      - name: k-kubelet-probe
        value: "activator"
  periodSeconds: 10
Contributor

@psschwei psschwei Feb 11, 2022


Just curious, why the need for the livenessProbe update? (edit: Don't think it's super important, 10s is the default value, so don't believe this is changing its behavior)

@dprotaso
Member Author

The default is 10, so I just made it explicit.

Because when I saw the failureThreshold of 12, it wasn't obvious (without looking up the default) that it would take 120s
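For reference, the 120s figure comes straight from the probe arithmetic; a short illustration using the values mentioned in this thread (layout is illustrative, not a quote of the manifest):

# probe timing as discussed above
periodSeconds: 10       # the Kubernetes default, written out explicitly
failureThreshold: 12    # 12 consecutive failures before the probe is considered failed
# time to detect ≈ 10s * 12 = 120s, i.e. the "2 minutes" in the PR description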

Contributor

@psschwei psschwei left a comment


/lgtm

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 11, 2022
@knative-prow-robot knative-prow-robot merged commit 7ca76bb into knative:main Feb 11, 2022
@dprotaso dprotaso deleted the probe-test branch February 11, 2022 17:38
dprotaso added a commit to dprotaso/serving that referenced this pull request Feb 12, 2022
The activator's readiness depends on the status of the web socket connection
to the autoscaler. When the connection is down, the activator reports
ready=false. This can occur when the autoscaler deployment is updating.

PR knative#12614 made the activator's readiness probe fail aggressively after
a single failure. This didn't seem to impact Istio, but with Contour it
started returning 503s since the activator began reporting ready=false
immediately.

This PR does two things to mitigate 503s:
- Bump the readiness threshold to give the autoscaler more time to
  roll out/start up. This still remains lower than the drain duration.
- Update the autoscaler rollout strategy so we spin up a new instance
  prior to bringing down the old one. This is done using maxUnavailable=0
  (see the sketch below).
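A minimal sketch of what the maxUnavailable=0 strategy looks like on a Kubernetes Deployment (illustrative only; maxSurge: 1 is an assumed value, and this is not the exact autoscaler manifest):

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never take the old autoscaler pod down before the replacement is Ready
      maxSurge: 1         # assumed: allow one extra pod during the rollout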
knative-prow-robot pushed a commit that referenced this pull request Feb 12, 2022
@dprotaso
Member Author

/cherry-pick release-1.2
/cherry-pick release-1.1
/cherry-pick release-1.0

@knative-prow-robot
Contributor

@dprotaso: new pull request created: #12618

In response to this:

/cherry-pick release-1.2
/cherry-pick release-1.1
/cherry-pick release-1.0

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dprotaso
Member Author

/cherry-pick release-1.1

@dprotaso
Member Author

/cherry-pick release-1.0

@knative-prow-robot
Contributor

@dprotaso: new pull request created: #12619

In response to this:

/cherry-pick release-1.1


@knative-prow-robot
Contributor

@dprotaso: new pull request created: #12620

In response to this:

/cherry-pick release-1.0


knative-prow-robot pushed a commit to knative-prow-robot/serving that referenced this pull request Feb 13, 2022
knative-prow-robot pushed a commit to knative-prow-robot/serving that referenced this pull request Feb 13, 2022
knative-prow-robot pushed a commit to knative-prow-robot/serving that referenced this pull request Feb 13, 2022
knative-prow-robot added a commit that referenced this pull request Feb 13, 2022
knative-prow-robot added a commit that referenced this pull request Feb 13, 2022
knative-prow-robot added a commit that referenced this pull request Feb 13, 2022
nak3 pushed a commit to nak3/serving that referenced this pull request May 26, 2022
openshift-merge-robot pushed a commit to openshift/knative-serving that referenced this pull request May 26, 2022
* Pin to 1.23 S-O branch

* Add 0-kourier.yaml and 1-config-network.yaml to kourier.yaml (#1122)

* Rename kourier.yaml with 0-kourier.yaml

* Concat the files

* fix csv logic (#1125)

* Reduce the period and failure threshold for activator readiness (knative#12618)


* Address 503s when the autoscaler is being rolled (knative#12621)


* [release-1.2] Drop MaxDurationSeconds from the RevisionSpec  (knative#12640)

* Drop MaxDurationSeconds from the RevisionSpec (knative#12635)

We added MaxDurationSeconds (knative#12322) because the behaviour of
RevisionSpec.Timeout changed from total duration to time to first byte.

In hindsight changing the behaviour of Timeout was a mistake since
it goes against the original specification.

Thus we're going to create a path for migration and the first part is
to remove MaxDurationSeconds from the RevisionSpec.

* fix conformance test

* [release-1.2] fix ytt package name (knative#12657)

* fix ytt package name

* use correct path

Co-authored-by: dprotaso <[email protected]>

* Remove an unnecessary start delay when resolving tag to digests (knative#12669)

Co-authored-by: dprotaso <[email protected]>

* Drop collecting performance data in release branch (knative#12673)

Co-authored-by: dprotaso <[email protected]>

* bump ggcr which includes auth config lookup fixes for k8s (knative#12656)

Includes the fixes:
- google/go-containerregistry#1299
- google/go-containerregistry#1300

* Fixes an activator panic when the throttle encounters a cache.DeleteFinalStateUnknown (knative#12680)

Co-authored-by: dprotaso <[email protected]>

* upgrade to latest dependencies (knative#12674)

bumping knative.dev/pkg 77555ea...083dd97:
  > 083dd97 Wait for reconciler/controllers to return prior to exiting the process (# 2438)
  > df430fa dizzy: we must use `flags` instead of `pflags`, since this is not working. It seems like pflag.* adds the var to its own flag set, not the one package flag uses, and it doesn't expose the internal flag.Var externally - hence this fix. (# 2415)

Signed-off-by: Knative Automation <[email protected]>

* [release-1.2] fix tag to digest resolution (ggcr bump) (knative#12834)

* pin k8s dep

* Fix tag to digest resolution with K8s secrets

I forgot to bump ggcr's sub package in the prior release

github.com/google/go-containerregistry/pkg/authn/k8schain

* bump ggcr which fixes tag-to-digest resolution for Azure & GitLab (knative#12857)

Co-authored-by: Stavros Kontopoulos <[email protected]>
Co-authored-by: Knative Prow Robot <[email protected]>
Co-authored-by: dprotaso <[email protected]>
Co-authored-by: knative-automation <[email protected]>
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
lgtm: Indicates that a PR is ready to be merged.
size/XS: Denotes a PR that changes 0-9 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

[flaky] test/upgrade.TestServingUpgrades/VerifyContinualTests/ProbeTest
3 participants