feat: implement graceful shutdown of GitLab Runner #1117

Merged · 14 commits into cattle-ops:main · May 29, 2024

Conversation

@tmeijn (Contributor) commented Apr 23, 2024

Description

Based on the discussion in #1067:

  1. Move the EventBridge rule that triggers the Lambda from TERMINATING to TERMINATE. The Lambda now functions as an "after-the-fact" cleanup instead of being responsible for cleanup during termination.
  2. Introduces a shell script, managed by systemd, that monitors the target lifecycle state of the instance and initiates a graceful shutdown of GitLab Runner (see the sketch below).
  3. Makes the heartbeat timeout of the ASG terminating hook configurable, with a default of the maximum job timeout + 5 minutes, capped at 7200 seconds (2 hours).
  4. Introduces a launching lifecycle hook, allowing the new instance to provision itself and GitLab Runner to provision its set capacity before the current instance is terminated.

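To make item 2 concrete, here is a minimal sketch of such a lifecycle-monitoring script, assuming IMDSv2 and the AWS CLI are available on the instance. It is illustrative only: the actual script shipped in template/gitlab-runner.tftpl differs in detail, and the hook/ASG names below are placeholders that the real template injects via Terraform.

```bash
#!/usr/bin/env bash
# Illustrative sketch only -- not the exact contents of template/gitlab-runner.tftpl.
set -euo pipefail

# Fetch an IMDSv2 token for querying instance metadata.
imds_token() {
  curl -sS -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 300"
}

instance_id=$(curl -sS -H "X-aws-ec2-metadata-token: $(imds_token)" \
  "http://169.254.169.254/latest/meta-data/instance-id")

while true; do
  state=$(curl -sS -H "X-aws-ec2-metadata-token: $(imds_token)" \
    "http://169.254.169.254/latest/meta-data/autoscaling/target-lifecycle-state")

  if [[ "$state" == "Terminated" ]]; then
    # The ASG wants this instance gone: GitLab Runner shuts down gracefully on
    # SIGQUIT (stops picking up new jobs, waits for running jobs to finish).
    pkill -QUIT -f 'gitlab-runner run' || true
    while pgrep -f 'gitlab-runner run' > /dev/null; do sleep 10; done

    # Tell the ASG it can proceed with termination (placeholder names; assumes
    # the instance profile allows autoscaling:CompleteLifecycleAction and a
    # region is configured).
    aws autoscaling complete-lifecycle-action \
      --lifecycle-action-result CONTINUE \
      --instance-id "$instance_id" \
      --lifecycle-hook-name "terminating-hook" \
      --auto-scaling-group-name "runner-asg"
    break
  fi
  sleep 15
done
```

In the module this script is installed as a systemd service on the Runner instance, so it keeps polling for the lifetime of the instance.
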
Todos

Migrations required

No, except that if the default behavior of immediately terminating all Workers + the Manager is desired, the runner_worker_graceful_terminate_timeout_duration variable should be set to 30 (the minimum allowed).

Verification

Graceful terminate

  1. Deploy this version of the module.
  2. Start a long-running GitLab job.
  3. Manually trigger an instance refresh in the runner ASG (example commands below).
  4. Verify the job keeps running and has output. Verify from the instance logs that the GitLab Runner service is still running.
  5. Once the remaining jobs have completed, observe that the GitLab Runner service is terminated and the instance is put into the Terminating:Proceed state.

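For step 3 of the verification scenarios, the instance refresh can also be triggered and observed from the AWS CLI; the ASG name used here is a placeholder:

```bash
# Trigger an instance refresh on the runner ASG (name is a placeholder).
aws autoscaling start-instance-refresh \
  --auto-scaling-group-name "my-gitlab-runner-asg"

# Watch the lifecycle states; the old instance should move to Terminating:Wait
# and, once all jobs have finished, to Terminating:Proceed.
aws autoscaling describe-auto-scaling-instances \
  --query "AutoScalingInstances[?AutoScalingGroupName=='my-gitlab-runner-asg'].[InstanceId,LifecycleState]" \
  --output table
```
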
Zero Downtime deployment

  1. Deploy this version of the module.
  2. Start multiple long-running GitLab jobs, twice the capacity of the GitLab Runner.
  3. Manually trigger an instance refresh in the runner ASG.
  4. Verify the jobs keep running and have output. Verify from the instance logs that the GitLab Runner service is still running.
  5. Verify a new instance gets spun up while the current instance stays InService.
  6. Verify the new instance is able to provision its set capacity.
  7. Verify the new instance starts picking up GitLab jobs from the queue before the current instance gets terminated.
  8. Observe that there is zero downtime.
  9. Once the remaining jobs have completed, observe that the GitLab Runner service is terminated and the current instance is put into the Terminating:Proceed state.

Closes #1029


Hey @tmeijn! 👋

Thank you for your contribution to the project. Please refer to the contribution rules for a quick overview of the process.

Make sure that this PR clearly explains:

  • the problem being solved
  • the best way a reviewer and you can test your changes

By submitting this PR you confirm that you hold the rights to the code added and agree that it will be published under this LICENSE.

The following ChatOps commands are supported:

  • /help: notifies a maintainer to help you out

Simply add a comment with the command in the first line. If you need to pass more information, separate it with a blank line from the command.

This message was generated automatically. You are welcome to improve it.

@tmeijn (Contributor, Author) left a comment:

Some todos during self-review.

@long-wan-ep (Contributor) commented:

This solution is much simpler than the one from #1099; let's close the other one in favor of this. I've also tested this out and it's working well, this is looking great.

@tmeijn tmeijn marked this pull request as draft April 25, 2024 06:03
@tmeijn (Contributor, Author) commented Apr 25, 2024

Thank you @long-wan-ep, your MR definitely helped in making this MR better, so thank you for proposing your solution 🙏🏾. I'd like to invite you to review this MR and bring up any questions or suggestions you might have!

@tmeijn (Contributor, Author) left a comment:

Some open questions.

@kayman-mk, is there a reason we do not have terraform_docs in pre-commit, nor as a check in CI?

@tmeijn force-pushed the feat/enable-graceful-shutdown branch from 8d5b596 to 679c656 on April 25, 2024 08:32
@long-wan-ep (Contributor) left a comment:

Looks good to me overall, just some minor feedback.

@tmeijn tmeijn marked this pull request as ready for review May 1, 2024 09:31
@tmeijn (Contributor, Author) commented May 1, 2024

Thanks for the review @long-wan-ep.

@long-wan-ep, @kayman-mk with the last commit I added another lifecycle hook that now truly makes this zero downtime! Previously, the ASG would immediately put the current instance in the Terminating:Wait state and no longer accept new jobs, causing GitLab jobs to sit in a pending state unnecessarily. With this commit, the ASG now waits five minutes before putting the new instance InService, allowing GitLab Runner to start and provision its set capacity. Only after the new instance is InService will it put the current instance in the Terminating:Wait state. In theory this allows a smooth cutover and therefore a zero-downtime redeployment of the GitLab Runner.

Let me know what you all think. I can easily revert the commit if this is too much and we can address this in a separate MR, but with my testing this really works great!
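For context, the launching hook described above is conceptually equivalent to the following AWS CLI call (the module creates the hook via Terraform; the hook/ASG names and the 300-second heartbeat shown here are illustrative assumptions):

```bash
# Keep a freshly launched instance in Pending:Wait for up to 5 minutes so
# GitLab Runner can install itself and provision its capacity, then continue.
aws autoscaling put-lifecycle-hook \
  --lifecycle-hook-name "runner-launching-hook" \
  --auto-scaling-group-name "my-gitlab-runner-asg" \
  --lifecycle-transition "autoscaling:EC2_INSTANCE_LAUNCHING" \
  --heartbeat-timeout 300 \
  --default-result "CONTINUE"
```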

@kayman-mk (Collaborator) commented:

Just from the comments here: I love it! I have a nasty workaround in place, updating the Runners in the middle of the night. This makes life easier.

Let me check with my setup here.

@long-wan-ep (Contributor) commented:

> With the last commit I added another lifecycle hook that now truly makes this zero downtime! [...] In theory this allows a smooth cutover and therefore a zero-downtime redeployment of the GitLab Runner.

That looks good to me, tested it out as well.

@kayman-mk (Collaborator) commented:

> @kayman-mk, is there a reason we do not have terraform_docs in pre-commit, nor as a check in CI?

This is done in the release branch, so the documentation is updated with every release. No need to do it in the feature branch.

@tmeijn force-pushed the feat/enable-graceful-shutdown branch from 2bffdb4 to 1e3561f on May 3, 2024 14:15
@tmeijn force-pushed the feat/enable-graceful-shutdown branch from 1e3561f to edea800 on May 3, 2024 14:17
@kayman-mk (Collaborator) commented May 8, 2024

@tmeijn Did a quick check in my test environment, but it was not running as expected.

In case the old Runner ran out of jobs and a new Runner was already there (did a terraform apply in the meantime), the old Runner was not removed. Maybe it's waiting for a timeout? When and where is the shutdown triggered? I see the old Runner reporting in CloudWatch that it is still InService.

Running terraform apply often results in failures as Terraform is unable to modify the autoscaling group in case a Runner is still running. Not sure what we can do here.

│ Error: starting Auto Scaling Group (Gitlab-Agent-TEST-eu-central-1a-2024050806540188720000000c-asg) instance refresh: waiting for Auto Scaling Group (Gitlab-Agent-TEST-eu-central-1a-2024050806540188720000000c-asg) instance refresh cancel: timeout while waiting for state to become 'Cancelled, Failed, Successful' (last state: 'Cancelling', timeout: 15m0s)

@tmeijn (Contributor, Author) commented May 8, 2024

> In case the old Runner ran out of jobs and a new Runner was already there (did a terraform apply in the meantime), the old Runner was not removed. Maybe it's waiting for a timeout? When and where is the shutdown triggered? I see the old Runner reporting in CloudWatch that it is still InService.

Did you wait for more than five minutes? That's how long it takes for the new instance to report InService and subsequently for the ASG to send the Terminate signal to the old instance.

> Running terraform apply often results in failures as Terraform is unable to modify the autoscaling group in case a Runner is still running. Not sure what we can do here.

Maybe because the instance refresh is still progressing?

Would you be able to provide some clear steps on how to reproduce this, and I'll take a look ASAP.

@kayman-mk (Collaborator) commented:

I did a second test. Looked much better, but not as I expected.

My expectations:

  • after module installation I see one active Runner processing the jobs. 👍
  • after terraform apply (with some changes)
    • a new Runner should appear, becoming the only Runner processing jobs
    • the old Runner should deregister from GitLab and no longer process jobs. It shouldn't be visible on the Runner page anymore (at least not as online/idle/running)
    • the old Runner instance should wait until all jobs are finished
    • as soon as all old jobs are done, the Runner terminates immediately (no matter what the timeout is, as all jobs are done)

@kayman-mk (Collaborator) commented:

@tmeijn Does it make sense to add an aws autoscaling complete-lifecycle-action as soon as the Runner installation is done, so we don't have to wait for the 300s timeout?

@tmeijn (Contributor, Author) commented May 14, 2024

> @tmeijn Does it make sense to add an aws autoscaling complete-lifecycle-action as soon as the Runner installation is done, so we don't have to wait for the 300s timeout?

Yeah, I thought about this too, but I do not think there is an easy way to determine that GitLab Runner has provisioned its set capacity, which is why I opted for a 'dumb' 300s timeout limit.

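For reference, the call being discussed is the standard complete-lifecycle-action API. If the launch script could ever detect that the Runner has reached its set capacity, a hypothetical invocation (all names are placeholders) would look like:

```bash
# Hypothetical: release the launching hook early instead of waiting for the
# 300s heartbeat timeout. Assumes instance-profile credentials and a region.
aws autoscaling complete-lifecycle-action \
  --lifecycle-action-result CONTINUE \
  --lifecycle-hook-name "runner-launching-hook" \
  --auto-scaling-group-name "my-gitlab-runner-asg" \
  --instance-id "$INSTANCE_ID"
```
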
> I think it would be good to have a simple diagram showing the whole process on a timeline. So we know when the new Runner is registered, deregistered, ...

Sure, I can work on this! Could you in the meantime do a code review? 😄

@kayman-mk (Collaborator) commented May 14, 2024

Maybe it was the result of killing the instance manually: I checked my autoscaling group by accident and found an instance with lifecycle state Terminating:Wait and health status Unhealthy. The CloudWatch logs have no valuable information (the last log is monitor_runner.sh: Instance target lifecycle state is InService. No action required.). The EC2 instance is indeed dead and shows Terminated in the EC2 console.

Not sure how we can get rid of those instances other than waiting for the heartbeat timeout (2h in my case). Any chance for a shortcut here?

@kayman-mk (Collaborator) commented May 14, 2024

> @tmeijn Does it make sense to add an aws autoscaling complete-lifecycle-action as soon as the Runner installation is done, so we don't have to wait for the 300s timeout?

> Yeah, I thought about this too, but I do not think there is an easy way to determine that GitLab Runner has provisioned its set capacity, which is why I opted for a 'dumb' 300s timeout limit.

That's a valid reason too. Could you please add a comment to the script (where we would usually expect the lifecycle command), so nobody tries to add the command in the future?

> I think it would be good to have a simple diagram showing the whole process on a timeline. So we know when the new Runner is registered, deregistered, ...

> Sure, I can work on this! Could you in the meantime do a code review? 😄

Thanks & done

@long-wan-ep (Contributor) commented:

> @long-wan-ep can you corroborate any of these findings?

I only saw the intended zero-downtime behavior you described when I tested; I was using the instance refresh feature in the ASG.

@kayman-mk (Collaborator) commented:

Just noticed that I was on 7.0.0. Maybe the problems were related to this fact.

I installed the latest version and will give it a new try tomorrow.

@tmeijn (Contributor, Author) commented May 21, 2024

> @long-wan-ep @tmeijn Not sure what happened here. But I now saw the expected behavior. Strange.

> I think it would be good to have a simple diagram showing the whole process on a timeline. So we know when the new Runner is registered, deregistered, ...

Hey @kayman-mk WYDT about something like this? Do you have a suggestion where to add this?

```mermaid
sequenceDiagram
    autonumber
    participant ASG as Autoscaling Group
    participant CI as Current Instance
    participant NI as New Instance
    ASG->>NI: Provision New Instance (status: Pending)
    Note over NI: Install GitLab Runner <br/>and provision capacity<br/>(5m grace period)
    ASG->>NI: Set status to InService
    ASG->>CI: Set status to Terminating:Wait
    CI->>CI: Graceful terminate:<br/>Stop picking up new jobs,<br/>Finish current jobs<br/>assigned to this Runner
    CI->>ASG: Send complete-lifecycle-action
    ASG->>CI: Set status to Terminating:Proceed
    Note over CI: Instance is terminated:<br/>Cleanup Lambda is triggered
```

@kayman-mk (Collaborator) commented:

> Hey @kayman-mk WYDT about something like this? Do you have a suggestion where to add this?

Looks great! Let's add it to usage.md. There is a concept section.

@kayman-mk (Collaborator) commented:

> Just noticed that I was on 7.0.0. Maybe the problems were related to this fact.
>
> I installed the latest version and will give it a new try tomorrow.

Looks good to me. Everything worked as expected. Let's tackle the findings from the last review and get it merged.

@kayman-mk (Collaborator) commented:

How to proceed here:

  • make the diagram available in usage.md: @tmeijn
  • the review findings were resolved a minute ago
  • I kindly ask you to go over the last commits for a final review: @tmeijn @long-wan-ep

It looks good from my side. 🚀

@kayman-mk kayman-mk self-requested a review May 23, 2024 14:03
@long-wan-ep (Contributor) commented:

@kayman-mk Your commits look good to me overall, just one comment.

@tmeijn (Contributor, Author) commented May 29, 2024

Alright, I think that's it, just a couple of non-blocking comments for you @kayman-mk!

@kayman-mk (Collaborator) left a comment:

Guys, good to have you here! This change pushes the project really forward. Thanks @tmeijn @long-wan-ep

@kayman-mk kayman-mk merged commit d2e2224 into cattle-ops:main May 29, 2024
19 checks passed
kayman-mk pushed a commit that referenced this pull request May 30, 2024
🤖 I have created a release *beep* *boop*
---


## [7.7.0](7.6.1...7.7.0) (2024-05-29)

### Features

* implement graceful shutdown of GitLab Runner ([#1117](#1117)) ([d2e2224](d2e2224))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: cattle-ops-releaser-2[bot] <134548870+cattle-ops-releaser-2[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>