-
-
Notifications
You must be signed in to change notification settings - Fork 331
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: implement graceful shutdown of GitLab Runner #1117
feat: implement graceful shutdown of GitLab Runner #1117
Conversation
Hey @tmeijn! 👋 Thank you for your contribution to the project. Please refer to the contribution rules for a quick overview of the process. Make sure that this PR clearly explains:
With submitting this PR you confirm that you hold the rights of the code added and agree that it will published under this LICENSE. The following ChatOps commands are supported:
Simply add a comment with the command in the first line. If you need to pass more information, separate it with a blank line from the command. This message was generated automatically. You are welcome to improve it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some todo's during self-review.
This solution is much simpler than the one from #1099, let's close the other one in favor of this. I've also tested this out and it's working well, this is looking great. |
Thank you @long-wan-ep, your MR definitely helped in making this MR better so thank you for proposing your solution 🙏🏾 . I'd like to invite you to review this MR and come with any questions or suggestions you might have! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some open questions.
@kayman-mk, is there a reason we do not have terraform_docs
in pre-commit, nor as a check in CI?
8d5b596
to
679c656
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me overall, just some minor feedback.
Thanks for the review @long-wan-ep. @long-wan-ep, @kayman-mk with the last commit I added another lifecycle hook that now truly would make this zero downtime! Previously, the ASG would immediately put the current instance in Let me know what you all think. I can easily revert the commit if this is too much and we can address this in a separate MR, but with my testing this really works great! |
Just from the comments here: I love it! I have a nasty workaround in place, updating the Runners in the middle of the night. This makes life easier. Let me check with my setup here. |
That looks good to me, tested it out as well. |
This is done in the release branch, so the documentation is updated with every release. No need to do it in the feature branch. |
2bffdb4
to
1e3561f
Compare
1e3561f
to
edea800
Compare
@tmeijn Did a quick check in my test environment, but it was not running as expected. In case the old Runner ran out of jobs and a new Runner was already there (did a Running
|
Did you wait for more than five minutes? That's how long it takes for the New instance to report InService and subsequently for the ASG to send the Terminate signal to the old instance.
Maybe because the instance refresh is still progressing? Would you be able to provide some clear steps in how to reproduce this and I'll take a look ASAP. |
I did a second test. Looked much better, but not as I expected. My expectations:
|
@tmeijn Does it make sense to add an |
Yeah thought about this too, but I do not think there is an easy way to determine that GitLab Runner has provisioned the desired capacity so that is why I opted for a 'dumb' 300s timeout limit.
Sure I can work on this! Could you in the mean time do a code review? 😄 |
Maybe it was the result of killing the instance manually: Checked my autoscaling group by accident. I found an instance with lifecycle state Not sure how we can get rid of those instances, but waiting for the heartbeat timeout (2h in my case). Any chance for a shortcut here? |
That's a valid reason too. Could you please add comment to the script (where we would usually expect the lifecycle command), so nobody tries to add the command in the future?
Thanks & done |
I only saw the intended zero downtime behavior you described when I tested, I was using the instance refresh feature in the ASG. |
Just noticed that I was on 7.0.0. May be the problems where releated to this fact. I installed the latest version and will give it a new try tomorrow. |
Hey @kayman-mk WYDT about something like this? Do you have a suggestion where to add this? sequenceDiagram
autonumber
participant ASG as Autoscaling Group
participant CI as Current Instance
participant NI as New Instance
ASG->>NI: Provision New Instance (status: Pending)
Note over NI: Install GitLab Runner <br/>and provision capacity<br/>(5m grace period)
ASG->>NI: Set status to InService
ASG->>CI: Set status to Terminating:Wait
CI->>CI: Graceful terminate:<br/>Stop picking up new jobs,<br/>Finish current jobs<br/>assigned to this Runner
CI->>ASG: Send complete-lifecycle-action
ASG->>CI: Set status to Terminating:Proceed
Note over CI: Instance is terminated:<br/>Cleanup Lambda is triggered
|
Looks great! Let's add it to usage.md. There is a concept section. |
Looks good to me. Everything worked as expected. Let's tackle the findings from the last review and get it merged. |
How to proceed here:
It looks good from my side. 🚀 |
@kayman-mk Your commit look good to me overall, just 1 comment. |
Alright, think that's it, just a couple of non-blocking comments for you @kayman-mk! |
Co-authored-by: Tyrone Meijn <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Guys, good to have you here! This change pushes the project really forward. Thanks @tmeijn @long-wan-ep
🤖 I have created a release *beep* *boop* --- ## [7.7.0](7.6.1...7.7.0) (2024-05-29) ### Features * implement graceful shutdown of GitLab Runner ([#1117](#1117)) ([d2e2224](d2e2224)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). --------- Co-authored-by: cattle-ops-releaser-2[bot] <134548870+cattle-ops-releaser-2[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Description
Based on the discussion #1067:
TERMINATING
toTERMINATE
. The Lambda now functions as an "after-the-fact" cleanup instead of being responsible of cleanup during termination.7200
(2 hours).Todos
Migrations required
No, except that if the default behavior of immediately terminating all Workers + Manager, the
runner_worker_graceful_terminate_timeout_duration
variable should be set to 30 (the minimum allowed).Verification
Graceful terminate
Terminating:Proceed
statusZero Downtime deployment
InService
.Terminating:Proceed
statusCloses #1029