
fix(experimental-ec2-pattern): Add buffer to rolling update timeout #2462

Merged 2 commits into main from aa/rolling-update-duration on Sep 19, 2024

Conversation

@akash1810 (Member) commented Sep 18, 2024

What does this change?

Extends the rolling update's (#2417) pause time to one minute more than the health check grace period.

If we consider the health check grace period to be the time it takes the "normal" user data to run, the rolling update should be configured to be a little longer to cover the additional time spent polling the target group.

A buffer of 1 minute is somewhat arbitrarily chosen. If the value is too high, we increase the time it takes to automatically roll back from a failing health check; if it is too low, we risk flaky deploys.
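As a back-of-the-envelope sketch of the arithmetic, assuming the GuEc2App default grace period of 2 minutes (the exact wiring inside the pattern may differ):

```ts
import { Duration } from "aws-cdk-lib";

// Illustrative values only. Previously the signal timeout equalled the grace
// period (PT2M in the timeline below); this change adds a buffer on top.
const healthCheckGracePeriod = Duration.minutes(2); // GuEc2App default
const buffer = Duration.minutes(1); // added by this PR

// The rolling update now waits grace period + buffer (PT3M) before counting
// a missing resource signal as a failure.
const pauseTime = healthCheckGracePeriod.plus(buffer);
```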

This experimental pattern is currently used by guardian/cdk-playground and guardian/security-hq. Since moving to it, a couple of deployments have failed.

We didn't see this behaviour during initial testing as we had the pause time set to 5 minutes, as noted in some of the observations.[^1]

Timeline of a recent failed deployment

Below is a timeline from one failed deployment. We can see that although the signal was sent within 7 seconds of the 2-minute timeout expiring, CloudFormation had not processed it in time and started rolling back. Adding a buffer should make deployments more stable.

| Time | Source | Message |
| --- | --- | --- |
| 2024-09-18 16:31:43 UTC+0100 | cloudformation | UPDATE_IN_PROGRESS |
| 2024-09-18 16:31:47 UTC+0100 | cloudformation | Rolling update initiated. Terminating 1 obsolete instance(s) in batches of 1, while keeping at least 1 instance(s) in service. Waiting on resource signals with a timeout of PT2M when new instances are added to the autoscaling group. |
| 2024-09-18 16:31:47 UTC+0100 | cloudformation | Temporarily setting autoscaling group MinSize and DesiredCapacity to 2. |
| 2024-09-18 16:31:47 UTC+0100 | autoscaling group | Launching a new EC2 instance: i-00bf09262d69d6186. At 2024-09-18T15:31:48Z a user request update of AutoScalingGroup constraints to min: 2, max: 2, desired: 2 changing the desired capacity from 1 to 2. At 2024-09-18T15:31:56Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 1 to 2. |
| 2024-09-18 16:31:57 UTC+0100 | ec2 instance | Instance launched |
| 2024-09-18 16:32:19 UTC+0100 | cloudformation | New instance(s) added to autoscaling group - Waiting on 1 resource signal(s) with a timeout of PT2M. |
| 2024-09-18 16:33:13 UTC+0100 | ec2 instance | Instance not yet healthy within target group. Current state "unhealthy". Sleeping... |
| 2024-09-18 16:34:11 UTC+0100 | ec2 instance | Instance is healthy in target group. |
| 2024-09-18 16:34:12 UTC+0100 | ec2 instance | Cloud-init v. 24.2-0ubuntu1~22.04.1 finished at Wed, 18 Sep 2024 15:34:11 +0000. Datasource DataSourceEc2Local. Up 124.42 seconds |
| 2024-09-18 16:34:19 UTC+0100 | cloudformation | Failed to receive 1 resource signal(s) for the current batch. Each resource signal timeout is counted as a FAILURE. |
| 2024-09-18 16:34:20 UTC+0100 | cloudformation | Received 0 SUCCESS signal(s) out of 1. Unable to satisfy 100% MinSuccessfulInstancesPercent requirement |
| 2024-09-18 16:34:21 UTC+0100 | cloudformation | UPDATE_ROLLBACK_IN_PROGRESS |
| 2024-09-18 16:34:21 UTC+0100 | autoscaling group | Terminating EC2 instance: i-00bf09262d69d6186. At 2024-09-18T15:34:27Z a user request update of AutoScalingGroup constraints to min: 1, max: 2, desired: 1 changing the desired capacity from 2 to 1. At 2024-09-18T15:34:36Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 2 to 1. At 2024-09-18T15:34:36Z instance i-00bf09262d69d6186 was selected for termination. |
| 2024-09-18 16:34:40 UTC+0100 | cloudformation | UPDATE_ROLLBACK_COMPLETE |

How to test

See the updated unit tests. Ultimately, we'll have to use it in real deployments to see.

How can we measure success?

More stable deployments.

Have we considered potential risks?

N/A.

Checklist

  • I have listed any breaking changes, along with a migration path[^2]
  • I have updated the documentation as required for the described changes[^3]

Footnotes

[^1]: See https://github.com/guardian/cdk/pull/2417#discussion_r1762871176 for a related discussion.

[^2]: Consider whether this is something that will mean changes to projects that have already been migrated, or to the CDK CLI tool. If changes are required, consider adding a checklist here and/or linking to related PRs.

[^3]: If you are adding a new construct or pattern, has new documentation been added? If you are amending defaults or changing behaviour, are the existing docs still valid?

@akash1810 requested a review from a team as a code owner on September 18, 2024 at 17:22

@changeset-bot commented Sep 18, 2024

🦋 Changeset detected

Latest commit: fed2598

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package:

| Name | Type |
| --- | --- |
| @guardian/cdk | Patch |


Comment on lines +24 to +27
```ts
export const RollingUpdateDurations: AutoScalingRollingUpdateDurations = {
  sleep: Duration.seconds(5),
  buffer: Duration.minutes(1),
};
```
@akash1810 (Member, Author) commented:
These are exported mainly for access by the tests.
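For example, a test could assert against them directly (the import path and test framework here are assumptions, not the repository's actual layout):

```ts
// Hypothetical test usage of the exported constants.
import { RollingUpdateDurations } from "./experimental-ec2-pattern";

test("the rolling update adds a one-minute buffer to the signal timeout", () => {
  expect(RollingUpdateDurations.buffer.toMinutes()).toBe(1);
  expect(RollingUpdateDurations.sleep.toSeconds()).toBe(5);
});
```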

@akash1810 force-pushed the aa/rolling-update-duration branch 2 times, most recently from a4237d4 to d18ad38, on September 18, 2024 at 17:28
Comment on lines +62 to +64
```ts
if (!construct.healthCheckGracePeriod) {
  throw new Error(`The healthcheck grace period not set for autoscaling group ${construct.node.id}.`);
}
```
@akash1810 (Member, Author) commented:
The `healthCheckGracePeriod` property is typed `number | undefined`. When undefined, the CloudFormation default of 0 seconds is used.

Our user data will at least download the application artifact from S3, which will take longer than 0 seconds. Furthermore, the `GuEc2App` pattern defaults it to 2 minutes. That is, we can realistically expect this property to always be defined.
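A minimal sketch of how the guard and the buffer might fit together; the function shape, the `CfnAutoScalingGroup` type, and the import path are assumptions for illustration, not the pattern's actual code:

```ts
import { Duration } from "aws-cdk-lib";
import type { CfnAutoScalingGroup } from "aws-cdk-lib/aws-autoscaling";
import { RollingUpdateDurations } from "./experimental-ec2-pattern"; // hypothetical path

// Once the grace period is known to be set, the rolling update's signal
// timeout becomes grace period + buffer rather than the grace period alone.
function signalTimeoutFor(construct: CfnAutoScalingGroup): Duration {
  if (!construct.healthCheckGracePeriod) {
    throw new Error(`The healthcheck grace period not set for autoscaling group ${construct.node.id}.`);
  }
  return Duration.seconds(construct.healthCheckGracePeriod).plus(RollingUpdateDurations.buffer);
}
```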

@akash1810 merged commit 2daaea1 into main on Sep 19, 2024. 4 checks passed.
@akash1810 deleted the aa/rolling-update-duration branch on September 19, 2024 at 11:35

```ts
export const RollingUpdateDurations: AutoScalingRollingUpdateDurations = {
  sleep: Duration.seconds(5),
  buffer: Duration.minutes(1),
};
```
@akash1810 (Member, Author) commented Sep 20, 2024:
I wonder if we can be more scientific with this value by making it relative to the target group's health check?

For example `HealthCheckIntervalSeconds * HealthCheckTimeoutSeconds * HealthyThresholdCount`.
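As a hedged sketch of that idea, reading the values from the target group's health check settings. Note it uses (interval + timeout) multiplied by the healthy threshold count as a worst-case estimate of time-to-healthy, rather than the literal product of all three values; the property names mirror the CloudFormation target group schema:

```ts
import { Duration } from "aws-cdk-lib";

// Hypothetical: derive the buffer from the target group's health check
// settings rather than a fixed 1 minute.
interface TargetGroupHealthCheck {
  healthCheckIntervalSeconds: number; // e.g. 30
  healthCheckTimeoutSeconds: number; // e.g. 10
  healthyThresholdCount: number; // e.g. 5
}

function bufferFrom(healthCheck: TargetGroupHealthCheck): Duration {
  const { healthCheckIntervalSeconds, healthCheckTimeoutSeconds, healthyThresholdCount } = healthCheck;
  // Worst case, an instance needs healthyThresholdCount consecutive passing
  // checks, each up to interval + timeout seconds apart, before the target
  // group reports it healthy.
  return Duration.seconds(
    (healthCheckIntervalSeconds + healthCheckTimeoutSeconds) * healthyThresholdCount,
  );
}
```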
