
Reduce the period and failure threshold for activator readiness #12614

Merged
merged 1 commit into knative:main from dprotaso:probe-test
Feb 11, 2022

Conversation

dprotaso
Member

@dprotaso dprotaso commented Feb 11, 2022

The default drain timeout is 45 seconds, which was much shorter than
the time it took for the activator to be recognized as not ready (2 minutes).

This was resulting in 503s, since the activator was receiving traffic when it
was not expecting it.
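For context, a pod is only marked not-ready after roughly periodSeconds × failureThreshold, so that window has to be shorter than the 45-second drain. A minimal sketch of the kind of readiness settings this implies (the port and header mirror the liveness probe shown later in this PR; the periodSeconds/failureThreshold values are assumptions for illustration, not the verbatim diff):

readinessProbe:
  httpGet:
    port: 8012
    httpHeaders:
      - name: k-kubelet-probe
        value: "activator"
  periodSeconds: 5       # assumed: probe more often than every 10s
  failureThreshold: 1    # assumed: report not-ready after a single failed probe
  # detection window ≈ 5s * 1 = 5s, comfortably under the 45s drain timeout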

Fixes test/upgrade.TestServingUpgrades/VerifyContinualTests/ProbeTest flake

Release Note

NONE

@knative-prow-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dprotaso

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot knative-prow-robot added the size/XS and approved labels Feb 11, 2022
@codecov

codecov bot commented Feb 11, 2022

Codecov Report

Merging #12614 (3cd9277) into main (16c94d1) will increase coverage by 0.09%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##             main   #12614      +/-   ##
==========================================
+ Coverage   87.39%   87.48%   +0.09%     
==========================================
  Files         195      195              
  Lines        9718     9718              
==========================================
+ Hits         8493     8502       +9     
+ Misses        937      931       -6     
+ Partials      288      285       -3     
Impacted Files                               Coverage Δ
pkg/activator/net/revision_backends.go       92.60% <0.00%> (+0.86%) ⬆️
pkg/reconciler/revision/background.go        90.00% <0.00%> (+1.81%) ⬆️
pkg/autoscaler/statforwarder/processor.go    94.44% <0.00%> (+5.55%) ⬆️
pkg/autoscaler/statforwarder/forwarder.go    96.29% <0.00%> (+5.55%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 16c94d1...3cd9277. Read the comment docs.

livenessProbe:
  httpGet:
    port: 8012
    httpHeaders:
      - name: k-kubelet-probe
        value: "activator"
  periodSeconds: 10
Contributor

@psschwei psschwei Feb 11, 2022


Just curious, why the need for the livenessProbe update? (edit: Don't think it's super important, 10s is the default value, so don't believe this is changing its behavior)

@dprotaso
Member Author

The default is 10, so I just made it explicit.

Because when I saw the failureThreshold of 12, it wasn't obvious (without looking up the default) that it would take 120s
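For reference, the 120s figure comes straight from the probe arithmetic; a short illustration using the values mentioned in this thread (layout is illustrative, not a quote of the manifest):

# probe timing as discussed above
periodSeconds: 10       # the Kubernetes default, written out explicitly
failureThreshold: 12    # 12 consecutive failures before the probe is considered failed
# time to detect ≈ 10s * 12 = 120s, i.e. the "2 minutes" in the PR description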

Contributor

@psschwei psschwei left a comment


/lgtm

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 11, 2022
@knative-prow-robot knative-prow-robot merged commit 7ca76bb into knative:main Feb 11, 2022
@dprotaso dprotaso deleted the probe-test branch February 11, 2022 17:38
dprotaso added a commit to dprotaso/serving that referenced this pull request Feb 12, 2022
The activator's readiness depends on the status of the web socket connection
to the autoscaler. When the connection is down, the activator reports
ready=false. This can occur when the autoscaler deployment is updating.

PR knative#12614 made the activator's readiness probe fail aggressively after
a single failure. This didn't seem to impact Istio, but with Contour it
started returning 503s since the activator began reporting ready=false
immediately.

This PR does two things to mitigate 503s:
- Bump the readiness threshold to give the autoscaler more time to
  roll out/start up. This still remains lower than the drain duration.
- Update the autoscaler rollout strategy so we spin up a new instance
  prior to bringing down the old one. This is done using maxUnavailable=0
  (see the sketch below).
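A minimal sketch of what the maxUnavailable=0 strategy looks like on a Kubernetes Deployment (illustrative only; maxSurge: 1 is an assumed value, and this is not the exact autoscaler manifest):

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never take the old autoscaler pod down before the replacement is Ready
      maxSurge: 1         # assumed: allow one extra pod during the rollout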
knative-prow-robot pushed a commit that referenced this pull request Feb 12, 2022
@dprotaso
Member Author

/cherry-pick release-1.2
/cherry-pick release-1.1
/cherry-pick release-1.0

@knative-prow-robot
Contributor

@dprotaso: new pull request created: #12618

In response to this:

/cherry-pick release-1.2
/cherry-pick release-1.1
/cherry-pick release-1.0

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dprotaso
Member Author

/cherry-pick release-1.1

@dprotaso
Member Author

/cherry-pick release-1.0

@knative-prow-robot
Contributor

@dprotaso: new pull request created: #12619

In response to this:

/cherry-pick release-1.1


@knative-prow-robot
Contributor

@dprotaso: new pull request created: #12620

In response to this:

/cherry-pick release-1.0


knative-prow-robot pushed a commit to knative-prow-robot/serving that referenced this pull request Feb 13, 2022
knative-prow-robot pushed a commit to knative-prow-robot/serving that referenced this pull request Feb 13, 2022
knative-prow-robot pushed a commit to knative-prow-robot/serving that referenced this pull request Feb 13, 2022
knative-prow-robot added a commit that referenced this pull request Feb 13, 2022
knative-prow-robot added a commit that referenced this pull request Feb 13, 2022
knative-prow-robot added a commit that referenced this pull request Feb 13, 2022
nak3 pushed a commit to nak3/serving that referenced this pull request May 26, 2022
openshift-merge-robot pushed a commit to openshift/knative-serving that referenced this pull request May 26, 2022
* Pin to 1.23 S-O branch

* Add 0-kourier.yaml and 1-config-network.yaml to kourier.yaml (#1122)

* Rename kourier.yaml with 0-kourier.yaml

* Concat the files

* fix csv logic (#1125)

* Reduce the period and failure threshold for activator readiness (knative#12618)


* Address 503s when the autoscaler is being rolled (knative#12621)


* [release-1.2] Drop MaxDurationSeconds from the RevisionSpec  (knative#12640)

* Drop MaxDurationSeconds from the RevisionSpec (knative#12635)

We added MaxDurationSeconds (knative#12322) because the behaviour of
RevisionSpec.Timeout changed from total duration to time to first byte.

In hindsight changing the behaviour of Timeout was a mistake since
it goes against the original specification.

Thus we're going to create a path for migration and the first part is
to remove MaxDurationSeconds from the RevisionSpec.

* fix conformance test

* [release-1.2] fix ytt package name (knative#12657)

* fix ytt package name

* use correct path

Co-authored-by: dprotaso <[email protected]>

* Remove an unnecessary start delay when resolving tag to digests (knative#12669)

Co-authored-by: dprotaso <[email protected]>

* Drop collecting performance data in release branch (knative#12673)

Co-authored-by: dprotaso <[email protected]>

* bump ggcr which includes auth config lookup fixes for k8s (knative#12656)

Includes the fixes:
- google/go-containerregistry#1299
- google/go-containerregistry#1300

* Fixes an activator panic when the throttle encounters a cache.DeleteFinalStateUnknown (knative#12680)

Co-authored-by: dprotaso <[email protected]>

* upgrade to latest dependencies (knative#12674)

bumping knative.dev/pkg 77555ea...083dd97:
  > 083dd97 Wait for reconciler/controllers to return prior to exiting the process (# 2438)
  > df430fa dizzy: we must use `flags` instead of `pflags`, since this is not working. It seems like pflag.* adds the var to its own flag set, not the one package flag uses, and it doesn't expose the internal flag.Var externally - hence this fix. (# 2415)

Signed-off-by: Knative Automation <[email protected]>

* [release-1.2] fix tag to digest resolution (ggcr bump) (knative#12834)

* pin k8s dep

* Fix tag to digest resolution with K8s secrets

I forgot to bump ggcr's sub package in the prior release

github.com/google/go-containerregistry/pkg/authn/k8schain

* bump ggcr which fixes tag-to-digest resolution for Azure & GitLab (knative#12857)

Co-authored-by: Stavros Kontopoulos <[email protected]>
Co-authored-by: Knative Prow Robot <[email protected]>
Co-authored-by: dprotaso <[email protected]>
Co-authored-by: knative-automation <[email protected]>
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
lgtm: Indicates that a PR is ready to be merged.
size/XS: Denotes a PR that changes 0-9 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

[flaky] test/upgrade.TestServingUpgrades/VerifyContinualTests/ProbeTest
3 participants