feat: implement graceful shutdown of GitLab Runner #1117

Merged · 14 commits into cattle-ops:main · May 29, 2024

Conversation

@tmeijn (Contributor) commented Apr 23, 2024

Description

Based on the discussion in #1067:

  1. Move the EventBridge rule that triggers the Lambda from TERMINATING to TERMINATE. The Lambda now functions as an "after-the-fact" cleanup instead of being responsible for cleanup during termination.
  2. Introduces a shell script, managed by systemd, that monitors the target lifecycle state of the instance and initiates a graceful shutdown of GitLab Runner (see the sketch below).
  3. Makes the heartbeat timeout of the ASG terminating hook configurable, with a default of the maximum job timeout + 5 minutes, capped at 7200 seconds (2 hours).
  4. Introduces a launching lifecycle hook, allowing the new instance to provision itself and GitLab Runner to provision its set capacity before the current instance is terminated.

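To make item 2 concrete, here is a minimal sketch of such a lifecycle-monitoring script, assuming IMDSv2 and the AWS CLI are available on the instance. It is illustrative only: the actual script shipped in template/gitlab-runner.tftpl differs in detail, and the hook/ASG names below are placeholders that the real template injects via Terraform.

```bash
#!/usr/bin/env bash
# Illustrative sketch only -- not the exact contents of template/gitlab-runner.tftpl.
set -euo pipefail

# Fetch an IMDSv2 token for querying instance metadata.
imds_token() {
  curl -sS -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 300"
}

instance_id=$(curl -sS -H "X-aws-ec2-metadata-token: $(imds_token)" \
  "http://169.254.169.254/latest/meta-data/instance-id")

while true; do
  state=$(curl -sS -H "X-aws-ec2-metadata-token: $(imds_token)" \
    "http://169.254.169.254/latest/meta-data/autoscaling/target-lifecycle-state")

  if [[ "$state" == "Terminated" ]]; then
    # The ASG wants this instance gone: GitLab Runner shuts down gracefully on
    # SIGQUIT (stops picking up new jobs, waits for running jobs to finish).
    pkill -QUIT -f 'gitlab-runner run' || true
    while pgrep -f 'gitlab-runner run' > /dev/null; do sleep 10; done

    # Tell the ASG it can proceed with termination (placeholder names; assumes
    # the instance profile allows autoscaling:CompleteLifecycleAction and a
    # region is configured).
    aws autoscaling complete-lifecycle-action \
      --lifecycle-action-result CONTINUE \
      --instance-id "$instance_id" \
      --lifecycle-hook-name "terminating-hook" \
      --auto-scaling-group-name "runner-asg"
    break
  fi
  sleep 15
done
```

In the module this script is installed as a systemd service on the Runner instance, so it keeps polling for the lifetime of the instance.
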
Todos

Migrations required

No, except that if the default behavior of immediately terminating all Workers + the Manager is desired, the runner_worker_graceful_terminate_timeout_duration variable should be set to 30 (the minimum allowed).

Verification

Graceful terminate

  1. Deploy this version of the module.
  2. Start a long-running GitLab job.
  3. Manually trigger an instance refresh in the runner ASG (example commands below).
  4. Verify the job keeps running and has output. Verify from the instance logs that the GitLab Runner service is still running.
  5. Once the remaining jobs have completed, observe that the GitLab Runner service is terminated and the instance is put into the Terminating:Proceed state.

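For step 3 of the verification scenarios, the instance refresh can also be triggered and observed from the AWS CLI; the ASG name used here is a placeholder:

```bash
# Trigger an instance refresh on the runner ASG (name is a placeholder).
aws autoscaling start-instance-refresh \
  --auto-scaling-group-name "my-gitlab-runner-asg"

# Watch the lifecycle states; the old instance should move to Terminating:Wait
# and, once all jobs have finished, to Terminating:Proceed.
aws autoscaling describe-auto-scaling-instances \
  --query "AutoScalingInstances[?AutoScalingGroupName=='my-gitlab-runner-asg'].[InstanceId,LifecycleState]" \
  --output table
```
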
Zero Downtime deployment

  1. Deploy this version of the module.
  2. Start multiple long-running GitLab jobs, twice the capacity of the GitLab Runner.
  3. Manually trigger an instance refresh in the runner ASG.
  4. Verify the jobs keep running and have output. Verify from the instance logs that the GitLab Runner service is still running.
  5. Verify a new instance gets spun up while the current instance stays InService.
  6. Verify the new instance is able to provision its set capacity.
  7. Verify the new instance starts picking up GitLab jobs from the queue before the current instance gets terminated.
  8. Observe that there is zero downtime.
  9. Once the remaining jobs have completed, observe that the GitLab Runner service is terminated and the current instance is put into the Terminating:Proceed state.

Closes #1029


Hey @tmeijn! 👋

Thank you for your contribution to the project. Please refer to the contribution rules for a quick overview of the process.

Make sure that this PR clearly explains:

  • the problem being solved
  • the best way a reviewer and you can test your changes

By submitting this PR you confirm that you hold the rights to the code added and agree that it will be published under this LICENSE.

The following ChatOps commands are supported:

  • /help: notifies a maintainer to help you out

Simply add a comment with the command in the first line. If you need to pass more information, separate it with a blank line from the command.

This message was generated automatically. You are welcome to improve it.

@tmeijn (Contributor, Author) left a comment:

Some todos during self-review.

@long-wan-ep (Contributor) commented:

This solution is much simpler than the one from #1099; let's close the other one in favor of this. I've also tested this out and it's working well, this is looking great.

@tmeijn tmeijn marked this pull request as draft April 25, 2024 06:03
@tmeijn (Contributor, Author) commented Apr 25, 2024

Thank you @long-wan-ep, your MR definitely helped in making this MR better, so thank you for proposing your solution 🙏🏾. I'd like to invite you to review this MR and bring up any questions or suggestions you might have!

@tmeijn (Contributor, Author) left a comment:

Some open questions.

@kayman-mk, is there a reason we do not have terraform_docs in pre-commit, nor as a check in CI?

@tmeijn force-pushed the feat/enable-graceful-shutdown branch from 8d5b596 to 679c656 on April 25, 2024 08:32
@long-wan-ep (Contributor) left a comment:

Looks good to me overall, just some minor feedback.

@tmeijn tmeijn marked this pull request as ready for review May 1, 2024 09:31
@tmeijn (Contributor, Author) commented May 1, 2024

Thanks for the review @long-wan-ep.

@long-wan-ep, @kayman-mk with the last commit I added another lifecycle hook that now truly makes this zero downtime! Previously, the ASG would immediately put the current instance in the Terminating:Wait state and no longer accept new jobs, causing GitLab jobs to sit in a pending state unnecessarily. With this commit, the ASG now waits five minutes before putting the new instance InService, allowing GitLab Runner to start and provision its set capacity. Only after the new instance is InService will it put the current instance in the Terminating:Wait state. In theory this allows a smooth cutover and therefore a zero-downtime redeployment of the GitLab Runner.

Let me know what you all think. I can easily revert the commit if this is too much and we can address this in a separate MR, but with my testing this really works great!
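For context, the launching hook described above is conceptually equivalent to the following AWS CLI call (the module creates the hook via Terraform; the hook/ASG names and the 300-second heartbeat shown here are illustrative assumptions):

```bash
# Keep a freshly launched instance in Pending:Wait for up to 5 minutes so
# GitLab Runner can install itself and provision its capacity, then continue.
aws autoscaling put-lifecycle-hook \
  --lifecycle-hook-name "runner-launching-hook" \
  --auto-scaling-group-name "my-gitlab-runner-asg" \
  --lifecycle-transition "autoscaling:EC2_INSTANCE_LAUNCHING" \
  --heartbeat-timeout 300 \
  --default-result "CONTINUE"
```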

@kayman-mk (Collaborator) commented:

Just from the comments here: I love it! I have a nasty workaround in place, updating the Runners in the middle of the night. This makes life easier.

Let me check with my setup here.

@long-wan-ep (Contributor) commented:

> With the last commit I added another lifecycle hook that now truly makes this zero downtime! [...] In theory this allows a smooth cutover and therefore a zero-downtime redeployment of the GitLab Runner.

That looks good to me, tested it out as well.

@kayman-mk (Collaborator) commented:

> @kayman-mk, is there a reason we do not have terraform_docs in pre-commit, nor as a check in CI?

This is done in the release branch, so the documentation is updated with every release. No need to do it in the feature branch.

@tmeijn force-pushed the feat/enable-graceful-shutdown branch from 2bffdb4 to 1e3561f on May 3, 2024 14:15
@tmeijn force-pushed the feat/enable-graceful-shutdown branch from 1e3561f to edea800 on May 3, 2024 14:17
@kayman-mk (Collaborator) commented May 8, 2024

@tmeijn Did a quick check in my test environment, but it was not running as expected.

In case the old Runner ran out of jobs and a new Runner was already there (did a terraform apply in the meantime), the old Runner was not removed. Maybe it's waiting for a timeout? When and where is the shutdown triggered? I see the old Runner reporting in CloudWatch that it is still InService.

Running terraform apply often results in failures as Terraform is unable to modify the autoscaling group in case a Runner is still running. Not sure what we can do here.

│ Error: starting Auto Scaling Group (Gitlab-Agent-TEST-eu-central-1a-2024050806540188720000000c-asg) instance refresh: waiting for Auto Scaling Group (Gitlab-Agent-TEST-eu-central-1a-2024050806540188720000000c-asg) instance refresh cancel: timeout while waiting for state to become 'Cancelled, Failed, Successful' (last state: 'Cancelling', timeout: 15m0s)

@tmeijn (Contributor, Author) commented May 8, 2024

> In case the old Runner ran out of jobs and a new Runner was already there (did a terraform apply in the meantime), the old Runner was not removed. Maybe it's waiting for a timeout? When and where is the shutdown triggered? I see the old Runner reporting in CloudWatch that it is still InService.

Did you wait for more than five minutes? That's how long it takes for the new instance to report InService and subsequently for the ASG to send the Terminate signal to the old instance.

> Running terraform apply often results in failures as Terraform is unable to modify the autoscaling group in case a Runner is still running. Not sure what we can do here.

Maybe because the instance refresh is still progressing?

Would you be able to provide some clear steps on how to reproduce this, and I'll take a look ASAP.

@kayman-mk (Collaborator) commented:

I did a second test. Looked much better, but not as I expected.

My expectations:

  • after module installation I see one active Runner processing the jobs. 👍
  • after terraform apply (with some changes)
    • a new Runner should appear, becoming the only Runner processing jobs
    • the old Runner should deregister from GitLab and no longer process jobs. It shouldn't be visible on the Runner page anymore (at least not as online/idle/running)
    • the old Runner instance should wait until all jobs are finished
    • as soon as all old jobs are done, the Runner terminates immediately (no matter what the timeout is, as all jobs are done)

@kayman-mk (Collaborator) commented:

@tmeijn Does it make sense to add an aws autoscaling complete-lifecycle-action as soon as the Runner installation is done, so we don't have to wait for the 300s timeout?

@tmeijn (Contributor, Author) commented May 14, 2024

> @tmeijn Does it make sense to add an aws autoscaling complete-lifecycle-action as soon as the Runner installation is done, so we don't have to wait for the 300s timeout?

Yeah, I thought about this too, but I do not think there is an easy way to determine that GitLab Runner has provisioned its set capacity, which is why I opted for a 'dumb' 300s timeout limit.

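For reference, the call being discussed is the standard complete-lifecycle-action API. If the launch script could ever detect that the Runner has reached its set capacity, a hypothetical invocation (all names are placeholders) would look like:

```bash
# Hypothetical: release the launching hook early instead of waiting for the
# 300s heartbeat timeout. Assumes instance-profile credentials and a region.
aws autoscaling complete-lifecycle-action \
  --lifecycle-action-result CONTINUE \
  --lifecycle-hook-name "runner-launching-hook" \
  --auto-scaling-group-name "my-gitlab-runner-asg" \
  --instance-id "$INSTANCE_ID"
```
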
> I think it would be good to have a simple diagram showing the whole process on a timeline. So we know when the new Runner is registered, deregistered, ...

Sure, I can work on this! Could you in the meantime do a code review? 😄

@kayman-mk (Collaborator) commented May 14, 2024

Maybe it was the result of killing the instance manually: I checked my autoscaling group by accident and found an instance with lifecycle state Terminating:Wait and health status Unhealthy. The CloudWatch logs have no valuable information (the last log is monitor_runner.sh: Instance target lifecycle state is InService. No action required.). The EC2 instance is indeed dead and shows Terminated in the EC2 console.

Not sure how we can get rid of those instances other than waiting for the heartbeat timeout (2h in my case). Any chance for a shortcut here?

@kayman-mk (Collaborator) commented May 14, 2024

> @tmeijn Does it make sense to add an aws autoscaling complete-lifecycle-action as soon as the Runner installation is done, so we don't have to wait for the 300s timeout?

> Yeah, I thought about this too, but I do not think there is an easy way to determine that GitLab Runner has provisioned its set capacity, which is why I opted for a 'dumb' 300s timeout limit.

That's a valid reason too. Could you please add a comment to the script (where we would usually expect the lifecycle command), so nobody tries to add the command in the future?

> I think it would be good to have a simple diagram showing the whole process on a timeline. So we know when the new Runner is registered, deregistered, ...

> Sure, I can work on this! Could you in the meantime do a code review? 😄

Thanks & done

@long-wan-ep (Contributor) commented:

> @long-wan-ep can you corroborate any of these findings?

I only saw the intended zero-downtime behavior you described when I tested; I was using the instance refresh feature in the ASG.

@kayman-mk (Collaborator) commented:

Just noticed that I was on 7.0.0. Maybe the problems were related to this fact.

I installed the latest version and will give it a new try tomorrow.

@tmeijn (Contributor, Author) commented May 21, 2024

> @long-wan-ep @tmeijn Not sure what happened here. But I now saw the expected behavior. Strange.

> I think it would be good to have a simple diagram showing the whole process on a timeline. So we know when the new Runner is registered, deregistered, ...

Hey @kayman-mk WYDT about something like this? Do you have a suggestion where to add this?

```mermaid
sequenceDiagram
    autonumber
    participant ASG as Autoscaling Group
    participant CI as Current Instance
    participant NI as New Instance
    ASG->>NI: Provision New Instance (status: Pending)
    Note over NI: Install GitLab Runner <br/>and provision capacity<br/>(5m grace period)
    ASG->>NI: Set status to InService
    ASG->>CI: Set status to Terminating:Wait
    CI->>CI: Graceful terminate:<br/>Stop picking up new jobs,<br/>Finish current jobs<br/>assigned to this Runner
    CI->>ASG: Send complete-lifecycle-action
    ASG->>CI: Set status to Terminating:Proceed
    Note over CI: Instance is terminated:<br/>Cleanup Lambda is triggered
```

@kayman-mk (Collaborator) commented:

> Hey @kayman-mk WYDT about something like this? Do you have a suggestion where to add this?

Looks great! Let's add it to usage.md. There is a concept section.

@kayman-mk (Collaborator) commented:

> Just noticed that I was on 7.0.0. Maybe the problems were related to this fact.
>
> I installed the latest version and will give it a new try tomorrow.

Looks good to me. Everything worked as expected. Let's tackle the findings from the last review and get it merged.

@kayman-mk (Collaborator) commented:

How to proceed here:

  • make the diagram available in usage.md: @tmeijn
  • the review findings were resolved a minute ago
  • I kindly ask you to go over the last commits for a final review: @tmeijn @long-wan-ep

It looks good from my side. 🚀

@kayman-mk kayman-mk self-requested a review May 23, 2024 14:03
@long-wan-ep (Contributor) commented:

@kayman-mk Your commits look good to me overall, just one comment.

@tmeijn (Contributor, Author) commented May 29, 2024

Alright, I think that's it, just a couple of non-blocking comments for you @kayman-mk!

@kayman-mk (Collaborator) left a comment:

Guys, good to have you here! This change pushes the project really forward. Thanks @tmeijn @long-wan-ep

@kayman-mk kayman-mk merged commit d2e2224 into cattle-ops:main May 29, 2024
19 checks passed
kayman-mk pushed a commit that referenced this pull request May 30, 2024
🤖 I have created a release *beep* *boop*
---


## [7.7.0](7.6.1...7.7.0) (2024-05-29)

### Features

* implement graceful shutdown of GitLab Runner ([#1117](#1117)) ([d2e2224](d2e2224))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: cattle-ops-releaser-2[bot] <134548870+cattle-ops-releaser-2[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>