
fix(experimental-ec2-pattern): Add buffer to rolling update timeout #2462

Merged 2 commits into main from aa/rolling-update-duration on Sep 19, 2024

Conversation

@akash1810 (Member) commented Sep 18, 2024

What does this change?

Extends the rolling update's (#2417) pause time to one minute more than the health check grace period.

If we consider the health check grace period to be the time it takes the "normal" user data to run, the rolling update should be configured to be a little longer to cover the additional time spent polling the target group.

A buffer of 1 minute is somewhat arbitrarily chosen. If the value is too high, we increase the time it takes to automatically roll back from a failing health check; if it is too low, we risk flaky deploys.
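As a back-of-the-envelope sketch of the arithmetic, assuming the GuEc2App default grace period of 2 minutes (the exact wiring inside the pattern may differ):

```ts
import { Duration } from "aws-cdk-lib";

// Illustrative values only. Previously the signal timeout equalled the grace
// period (PT2M in the timeline below); this change adds a buffer on top.
const healthCheckGracePeriod = Duration.minutes(2); // GuEc2App default
const buffer = Duration.minutes(1); // added by this PR

// The rolling update now waits grace period + buffer (PT3M) before counting
// a missing resource signal as a failure.
const pauseTime = healthCheckGracePeriod.plus(buffer);
```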

This experimental pattern is currently used by guardian/cdk-playground and guardian/security-hq. Since moving to it, a couple of deployments have failed.

We didn't see this behaviour during initial testing as we had the pause time set to 5 minutes, as noted in some of the observations.[^1]

Timeline of a recent failed deployment

Below is a timeline from one failed deployment. We can see that although the signal was sent within 7 seconds of the 2-minute timeout expiring, CloudFormation had not processed it in time and started rolling back. Adding a buffer should make deployments more stable.

| Time | Source | Message |
| --- | --- | --- |
| 2024-09-18 16:31:43 UTC+0100 | cloudformation | UPDATE_IN_PROGRESS |
| 2024-09-18 16:31:47 UTC+0100 | cloudformation | Rolling update initiated. Terminating 1 obsolete instance(s) in batches of 1, while keeping at least 1 instance(s) in service. Waiting on resource signals with a timeout of PT2M when new instances are added to the autoscaling group. |
| 2024-09-18 16:31:47 UTC+0100 | cloudformation | Temporarily setting autoscaling group MinSize and DesiredCapacity to 2. |
| 2024-09-18 16:31:47 UTC+0100 | autoscaling group | Launching a new EC2 instance: i-00bf09262d69d6186. At 2024-09-18T15:31:48Z a user request update of AutoScalingGroup constraints to min: 2, max: 2, desired: 2 changing the desired capacity from 1 to 2. At 2024-09-18T15:31:56Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 1 to 2. |
| 2024-09-18 16:31:57 UTC+0100 | ec2 instance | Instance launched |
| 2024-09-18 16:32:19 UTC+0100 | cloudformation | New instance(s) added to autoscaling group - Waiting on 1 resource signal(s) with a timeout of PT2M. |
| 2024-09-18 16:33:13 UTC+0100 | ec2 instance | Instance not yet healthy within target group. Current state "unhealthy". Sleeping... |
| 2024-09-18 16:34:11 UTC+0100 | ec2 instance | Instance is healthy in target group. |
| 2024-09-18 16:34:12 UTC+0100 | ec2 instance | Cloud-init v. 24.2-0ubuntu1~22.04.1 finished at Wed, 18 Sep 2024 15:34:11 +0000. Datasource DataSourceEc2Local. Up 124.42 seconds |
| 2024-09-18 16:34:19 UTC+0100 | cloudformation | Failed to receive 1 resource signal(s) for the current batch. Each resource signal timeout is counted as a FAILURE. |
| 2024-09-18 16:34:20 UTC+0100 | cloudformation | Received 0 SUCCESS signal(s) out of 1. Unable to satisfy 100% MinSuccessfulInstancesPercent requirement |
| 2024-09-18 16:34:21 UTC+0100 | cloudformation | UPDATE_ROLLBACK_IN_PROGRESS |
| 2024-09-18 16:34:21 UTC+0100 | autoscaling group | Terminating EC2 instance: i-00bf09262d69d6186. At 2024-09-18T15:34:27Z a user request update of AutoScalingGroup constraints to min: 1, max: 2, desired: 1 changing the desired capacity from 2 to 1. At 2024-09-18T15:34:36Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 2 to 1. At 2024-09-18T15:34:36Z instance i-00bf09262d69d6186 was selected for termination. |
| 2024-09-18 16:34:40 UTC+0100 | cloudformation | UPDATE_ROLLBACK_COMPLETE |

How to test

See the updated unit tests. Ultimately, we'll have to use it in real deployments to see.

How can we measure success?

More stable deployments.

Have we considered potential risks?

N/A.

Checklist

  • I have listed any breaking changes, along with a migration path[^2]
  • I have updated the documentation as required for the described changes[^3]

Footnotes

[^1]: See https://github.com/guardian/cdk/pull/2417#discussion_r1762871176 for a related discussion.

[^2]: Consider whether this is something that will mean changes to projects that have already been migrated, or to the CDK CLI tool. If changes are required, consider adding a checklist here and/or linking to related PRs.

[^3]: If you are adding a new construct or pattern, has new documentation been added? If you are amending defaults or changing behaviour, are the existing docs still valid?

@akash1810 requested a review from a team as a code owner on September 18, 2024 at 17:22

@changeset-bot commented Sep 18, 2024

🦋 Changeset detected

Latest commit: fed2598

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package:

| Name | Type |
| --- | --- |
| @guardian/cdk | Patch |


Comment on lines +24 to +27
```ts
export const RollingUpdateDurations: AutoScalingRollingUpdateDurations = {
  sleep: Duration.seconds(5),
  buffer: Duration.minutes(1),
};
```
@akash1810 (Member, Author) commented:
These are exported mainly for access by the tests.
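For example, a test could assert against them directly (the import path and test framework here are assumptions, not the repository's actual layout):

```ts
// Hypothetical test usage of the exported constants.
import { RollingUpdateDurations } from "./experimental-ec2-pattern";

test("the rolling update adds a one-minute buffer to the signal timeout", () => {
  expect(RollingUpdateDurations.buffer.toMinutes()).toBe(1);
  expect(RollingUpdateDurations.sleep.toSeconds()).toBe(5);
});
```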

@akash1810 force-pushed the aa/rolling-update-duration branch 2 times, most recently from a4237d4 to d18ad38, on September 18, 2024 at 17:28
Comment on lines +62 to +64
```ts
if (!construct.healthCheckGracePeriod) {
  throw new Error(`The healthcheck grace period not set for autoscaling group ${construct.node.id}.`);
}
```
@akash1810 (Member, Author) commented:
The `healthCheckGracePeriod` property is typed `number | undefined`. When undefined, the CloudFormation default of 0 seconds is used.

Our user data will at least download the application artifact from S3, which will take longer than 0 seconds. Furthermore, the `GuEc2App` pattern defaults it to 2 minutes. That is, we can realistically expect this property to always be defined.
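A minimal sketch of how the guard and the buffer might fit together; the function shape, the `CfnAutoScalingGroup` type, and the import path are assumptions for illustration, not the pattern's actual code:

```ts
import { Duration } from "aws-cdk-lib";
import type { CfnAutoScalingGroup } from "aws-cdk-lib/aws-autoscaling";
import { RollingUpdateDurations } from "./experimental-ec2-pattern"; // hypothetical path

// Once the grace period is known to be set, the rolling update's signal
// timeout becomes grace period + buffer rather than the grace period alone.
function signalTimeoutFor(construct: CfnAutoScalingGroup): Duration {
  if (!construct.healthCheckGracePeriod) {
    throw new Error(`The healthcheck grace period not set for autoscaling group ${construct.node.id}.`);
  }
  return Duration.seconds(construct.healthCheckGracePeriod).plus(RollingUpdateDurations.buffer);
}
```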

@akash1810 merged commit 2daaea1 into main on Sep 19, 2024. 4 checks passed.
@akash1810 deleted the aa/rolling-update-duration branch on September 19, 2024 at 11:35

```ts
export const RollingUpdateDurations: AutoScalingRollingUpdateDurations = {
  sleep: Duration.seconds(5),
  buffer: Duration.minutes(1),
};
```
@akash1810 (Member, Author) commented Sep 20, 2024:
I wonder if we can be more scientific with this value by making it relative to the target group's health check?

For example `HealthCheckIntervalSeconds * HealthCheckTimeoutSeconds * HealthyThresholdCount`.
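As a hedged sketch of that idea, reading the values from the target group's health check settings. Note it uses (interval + timeout) multiplied by the healthy threshold count as a worst-case estimate of time-to-healthy, rather than the literal product of all three values; the property names mirror the CloudFormation target group schema:

```ts
import { Duration } from "aws-cdk-lib";

// Hypothetical: derive the buffer from the target group's health check
// settings rather than a fixed 1 minute.
interface TargetGroupHealthCheck {
  healthCheckIntervalSeconds: number; // e.g. 30
  healthCheckTimeoutSeconds: number; // e.g. 10
  healthyThresholdCount: number; // e.g. 5
}

function bufferFrom(healthCheck: TargetGroupHealthCheck): Duration {
  const { healthCheckIntervalSeconds, healthCheckTimeoutSeconds, healthyThresholdCount } = healthCheck;
  // Worst case, an instance needs healthyThresholdCount consecutive passing
  // checks, each up to interval + timeout seconds apart, before the target
  // group reports it healthy.
  return Duration.seconds(
    (healthCheckIntervalSeconds + healthCheckTimeoutSeconds) * healthyThresholdCount,
  );
}
```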
