fix(experimental-ec2-pattern): Add buffer to rolling update timeout #2462
Conversation
🦋 Changeset detected. Latest commit: fed2598. The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package.
export const RollingUpdateDurations: AutoScalingRollingUpdateDurations = {
  sleep: Duration.seconds(5),
  buffer: Duration.minutes(1),
};
These are exported mainly for access by the tests.
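For illustration, a minimal sketch of the kind of assertion those tests can now make, assuming a hypothetical stack fixture (stackWithGuEc2App) containing a GuEc2App with the default 2 minute grace period:

import { Template } from 'aws-cdk-lib/assertions';

// stackWithGuEc2App is a hypothetical test fixture.
const template = Template.fromStack(stackWithGuEc2App);

// Pause time = grace period (2m) + RollingUpdateDurations.buffer (1m),
// rendered as an ISO 8601 duration.
template.hasResource('AWS::AutoScaling::AutoScalingGroup', {
  UpdatePolicy: {
    AutoScalingRollingUpdate: {
      PauseTime: 'PT3M',
      WaitOnResourceSignals: true,
    },
  },
});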
Force-pushed from a4237d4 to d18ad38.
if (!construct.healthCheckGracePeriod) {
  throw new Error(`The health check grace period is not set for autoscaling group ${construct.node.id}.`);
}
The healthCheckGracePeriod property is typed number | undefined. When undefined, the CloudFormation default of 0 seconds is used. Our user data will at least download the application artifact from S3, which will take longer than 0 seconds. Furthermore, the GuEc2App pattern defaults it to 2 minutes. That is, we can realistically expect this property to always be defined.
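To make the arithmetic concrete, here is a sketch of how the pause time can be derived once the guard has passed (names other than Duration are assumptions, not the pattern's actual internals):

import { Duration } from 'aws-cdk-lib';

// construct.healthCheckGracePeriod is a number of seconds, e.g. 120 from
// the GuEc2App default of 2 minutes.
const gracePeriod = Duration.seconds(construct.healthCheckGracePeriod);

// 120s + 60s buffer = 180s, rendered as PT3M in the UpdatePolicy.
const pauseTime = gracePeriod.plus(RollingUpdateDurations.buffer);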
Force-pushed from d18ad38 to cfb3000.
Force-pushed from cfb3000 to fed2598.
export const RollingUpdateDurations: AutoScalingRollingUpdateDurations = {
  sleep: Duration.seconds(5),
  buffer: Duration.minutes(1),
};
I wonder if we can be more scientific with this value by making it relative to the target group's health check?
For example HealthCheckIntervalSeconds * HealthCheckTimeoutSeconds * HealthyThresholdCount.
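For what it's worth, a sketch of what that could look like. The comment above multiplies all three values, which mixes units; a per-attempt cost of interval plus timeout, multiplied by the healthy threshold count, seems closer to the worst case, so this hypothetical helper uses that instead:

import { Duration } from 'aws-cdk-lib';

// Hypothetical helper: worst-case time for a newly registered target to be
// marked healthy, given the target group's health check settings.
function targetGroupBuffer(
  intervalSeconds: number,       // HealthCheckIntervalSeconds
  timeoutSeconds: number,        // HealthCheckTimeoutSeconds
  healthyThresholdCount: number, // HealthyThresholdCount
): Duration {
  return Duration.seconds((intervalSeconds + timeoutSeconds) * healthyThresholdCount);
}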
What does this change?
Extends the rolling update's (#2417) pause time to one minute more than the health check grace period.
If we consider the health check grace period to be the time it takes the "normal" user data to run, the rolling update should be configured to be a little longer to cover the additional time spent polling the target group.
A buffer of 1 minute is somewhat arbitrarily chosen. Too high a value and we increase the time it takes to automatically roll back from a failing health check; too low and we risk flaky deploys.
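For context, a sketch of the resulting policy as it might be expressed with aws-cdk-lib/aws-autoscaling (values assume the 2 minute default grace period; this is illustrative, not the pattern's exact source):

import { Duration } from 'aws-cdk-lib';
import { UpdatePolicy } from 'aws-cdk-lib/aws-autoscaling';

// Wait for every replacement instance to signal success, for at most the
// grace period (2m) plus the buffer (1m).
const rollingUpdate = UpdatePolicy.rollingUpdate({
  minSuccessfulInstancesPercent: 100,
  waitOnResourceSignals: true,
  pauseTime: Duration.minutes(3),
});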
This experimental pattern is currently used by guardian/cdk-playground and guardian/security-hq. Since moving to it, a couple of deployments have failed.
We didn't see this behaviour during initial testing as we had the pause time set to 5 minutes, as noted in some of the observations [1].
Timeline of a recent failed deployment
Below is a timeline from one failed deployment. We can see that although the signal was sent within 7 seconds of the 2 minute timeout expiring, CloudFormation hadn't processed it and had started rolling back. Adding a buffer should make deployments more stable.
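The signal here is a cfn-signal call at the end of the instance's user data. A sketch of one way to wire that up in CDK (the start script path and the asg variable are illustrative):

import { UserData } from 'aws-cdk-lib/aws-ec2';

const userData = UserData.forLinux();
userData.addCommands('/opt/deploy/start-app.sh'); // illustrative app startup

// Emits cfn-signal with the script's exit status when the user data
// finishes; the rolling update waits on this for up to PauseTime.
userData.addSignalOnExitCommand(asg);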
How to test
See the updated unit tests. Ultimately, we'll have to use it to see.
How can we measure success?
More stable deployments.
Have we considered potential risks?
N/A.
Checklist
Footnotes
1. See https://github.com/guardian/cdk/pull/2417#discussion_r1762871176 for a related discussion.
2. Consider whether this is something that will mean changes to projects that have already been migrated, or to the CDK CLI tool. If changes are required, consider adding a checklist here and/or linking to related PRs.
3. If you are adding a new construct or pattern, has new documentation been added? If you are amending defaults or changing behaviour, are the existing docs still valid?