Add max job execution time for runners. #58

maigl · 2023-01-11T20:56:37Z

We want to use garm in a wider scenario with many users.
For fairness, to avoid misuse, and for optimization, we need to be able to see who is using the system and to what extent, and we also need to be able to set limits.

In github.com you also have a number of limits:
https://docs.github.com/en/actions/learn-github-actions/usage-limits-billing-and-administration#usage-limits

One important first step would be a feature to add a max job execution time.
E.g. 6 hours - after that time an active runner will be stopped.

What do you think?

(Happy to provide a PR if that's the right solution.)

gabriel-samfira · 2023-01-16T08:51:59Z

Maintainer

Hi @maigl !

Ideally this should be something that is handled by github itself. Garm should concern itself with making runners available to github and managing the lifecycle of the instances on which the runner is hosted, but not the lifecycle of the runner itself. That should be github's responsibility. After we run the command to register the runner to github, we no longer dictate what that runner does. Jobs are sent by github to the runner, and github removes the runner (if it's ephemeral) from the list of available runners once its job is done. Having garm intervene here would cross an architectural boundary that would probably be difficult to move away from once it's adopted.

The only real way garm could enforce something like this is if we are willing to have garm forcefully cancel jobs. Having garm cancel jobs leads to jobs failing in non-obvious ways and, inevitably, to frustration if the developer is unaware that garm has a specific timeout set. This seems like something that should reside outside of garm, at least at this stage.

There is a discussion here: https://github.com/orgs/community/discussions/25631 on the same matter. There is a timeout-minutes that can be explicitly set in a workflow to limit the number of minutes a job can run. After this timeout is reached, github itself will cancel the job.

Sadly, there is no org/enterprise level setting to enforce this. It seems that people need this, as evidenced by this comment: https://github.com/orgs/community/discussions/25631#discussioncomment-3248533, so it may be worth pinging that thread.

In the absence of an org/enterprise wide default timeout, this could possibly be enforced as a "best practice" and caught through proper vetting/linting in pre-push hooks. For example, a pre-push hook could be created that parses all workflow jobs and ensures that an explicit timeout-minutes is set and has an acceptable value.
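As a rough illustration of the check such a hook could run, here is a minimal Python sketch. It assumes the workflow file has already been parsed into a dictionary (for example with a YAML library, which is not shown here). The function name find_timeout_violations and the 360 minute ceiling are hypothetical choices for this example, not anything garm or GitHub provides. Jobs that call reusable workflows are not special-cased.

```python
from typing import Any, Dict, List


def find_timeout_violations(workflow: Dict[str, Any], max_minutes: int) -> List[str]:
    """Return one message per job whose timeout-minutes is missing or too high.

    Hypothetical helper: `workflow` is the already-parsed workflow mapping.
    """
    problems = []
    jobs = workflow.get("jobs") or {}
    for job_id, job in jobs.items():
        timeout = job.get("timeout-minutes")
        if timeout is None:
            problems.append(f"{job_id}: timeout-minutes is not set")
        elif isinstance(timeout, bool) or not isinstance(timeout, (int, float)):
            # Expressions such as "${{ ... }}" cannot be checked statically.
            problems.append(f"{job_id}: timeout-minutes is not a plain number: {timeout!r}")
        elif timeout > max_minutes:
            problems.append(f"{job_id}: timeout-minutes {timeout} exceeds {max_minutes}")
    return problems


# Example input: a workflow as it might look after YAML parsing.
workflow = {
    "jobs": {
        "build": {"runs-on": "self-hosted", "timeout-minutes": 30},
        "test": {"runs-on": "self-hosted"},
        "soak": {"runs-on": "self-hosted", "timeout-minutes": 50400},
    }
}
violations = find_timeout_violations(workflow, max_minutes=360)
```

A hook built on this would fail the push whenever the returned list is non-empty and print the messages so the developer knows which job to fix.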

Alternatively, monitoring of job run times can be implemented and jobs can be canceled using a cron job or something similar.
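A minimal sketch of the selection step such a cron job might perform, in Python. It assumes the script has already fetched the in-progress workflow runs from the GitHub REST API (the "list workflow runs for a repository" endpoint) and that each run carries the id and run_started_at fields. The helper names and the 6 hour limit are hypothetical. The actual cancellation would be a separate authenticated call to the "cancel a workflow run" endpoint, which is not shown, and note that this works at the workflow run level rather than per job.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List


def parse_github_time(value: str) -> datetime:
    """Parse a GitHub-style UTC timestamp such as 2023-01-11T20:56:37Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def overdue_run_ids(
    runs: List[Dict[str, Any]], now: datetime, max_runtime: timedelta
) -> List[int]:
    """Return the ids of runs that have been running longer than max_runtime."""
    overdue = []
    for run in runs:
        started = parse_github_time(run["run_started_at"])
        if now - started > max_runtime:
            overdue.append(run["id"])
    return overdue


# Example input shaped like a trimmed-down API response (values are made up).
runs = [
    {"id": 101, "run_started_at": "2023-01-19T01:00:00Z"},
    {"id": 102, "run_started_at": "2023-01-19T09:30:00Z"},
]
now = datetime(2023, 1, 19, 10, 0, tzinfo=timezone.utc)
to_cancel = overdue_run_ids(runs, now, timedelta(hours=6))
```

Keeping the selection logic separate from the API calls makes it easy to dry-run the cron job and log what it would cancel before giving it permission to do so.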

4 replies

MoritzKeppler Jan 17, 2023

Valid observations, @gabriel-samfira, thanks! Definitely something that is missing in GitHub, I agree.

Nevertheless, I wonder if it should not also be a part of garm for the reason of protecting itself. At best, configured with timeout values slightly higher than those set in GitHub.
Other values that control the availability of resources like min and max pool size are also configured in garm and not in GH.

gabriel-samfira Jan 19, 2023
Maintainer

Hi @MoritzKeppler !

Preamble

Bear with me while we dive a bit into understanding the need for this, what the rationale was when adding the existing options in garm, and possible ways to achieve what you (and others) may need. If, by the end of this discussion, we decide it makes sense for garm to attempt to kill jobs, we can add it, but I would like to avoid adding features that may have a better/handier/safer solution that fits into your existing infrastructure.

The details

> Nevertheless, I wonder if it should not also be a part of garm for the reason of protecting itself. At best, configured with timeout values slightly higher than those set in GitHub.

The timeouts set by GitHub may be overridden by the timeout-minutes set inside the workflow. For example, the default is 6 hours, but for self-hosted runners, the maximum timeout-minutes is 35 days 😄. A user may specify an explicit timeout-minutes inside the workflow (which garm can't see) that sets the timeout to 35 days.

That is one reason I suggested it might be worth having an org level "best practice" to explicitly include a sane timeout inside workflow jobs. I realize that this is extremely difficult to do when you have hundreds of teams in dozens of departments totaling thousands of people.

> Other values that control the availability of resources like min and max pool size are also configured in garm and not in GH.

The max pool size option was meant to prevent garm from overwhelming a small provider, like a single LXD instance, with hundreds of workers. It was a way to protect the provider more than garm. The job runtime, however, would not bring a provider to its knees. It would simply deny the creation of new workers while existing ones run.

A possible solution

That being said, there is another way to achieve this outside of garm. @maigl has proposed a PR that enables metrics collection from garm. This is great for observability and allows you to see what is happening, while enabling you to act on those metrics through whatever system you may want to implement (Prometheus or otherwise).

If you do use Prometheus and have Alertmanager set up, you could set up an alert that triggers an action using something like https://github.com/imgix/prometheus-am-executor.

That action can be to forcefully cancel a job, which in turn will send a webhook to garm to clean up the instance. Would this fit well into your setup?
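To make the idea a bit more concrete, here is a hedged Python sketch of the first half of such an action: picking out which runs to cancel from an alert notification. It does not use prometheus-am-executor's own interface. Instead it assumes a small custom receiver that is handed the standard Alertmanager webhook JSON body (a top-level "alerts" list whose entries have "status" and "labels"). The label names repository and run_id are purely hypothetical and depend on what your metrics and alert rules actually expose.

```python
from typing import Any, Dict, List, Tuple


def runs_to_cancel(payload: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Extract (repository, run_id) pairs from firing alerts.

    `repository` and `run_id` are hypothetical label names chosen for
    this example; resolved alerts and alerts without both labels are skipped.
    """
    targets = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue
        labels = alert.get("labels", {})
        repo = labels.get("repository")
        run_id = labels.get("run_id")
        if repo and run_id:
            targets.append((repo, run_id))
    return targets


# Example payload in the Alertmanager webhook shape (values are made up).
payload = {
    "status": "firing",
    "alerts": [
        {"status": "firing", "labels": {"repository": "acme/app", "run_id": "101"}},
        {"status": "resolved", "labels": {"repository": "acme/app", "run_id": "99"}},
        {"status": "firing", "labels": {"alertname": "Unrelated"}},
    ],
}
targets = runs_to_cancel(payload)
```

Each returned pair would then be passed to whatever performs the cancellation against GitHub, using credentials that are allowed to cancel jobs.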

MoritzKeppler Jan 19, 2023

Thanks for the detailed answer!
For sure it's an option to keep the logic of deciding when to force delete a runner outside of garm.
In our setup we won't have the right to cancel arbitrary jobs - although I understand that this would be the preferred way.

Do you think it's possible to change garm to allow force deletion of a runner that is in the active state?
Could also be of use for other administrative tasks.

gabriel-samfira Jan 19, 2023
Maintainer

GitHub doesn't allow us to delete runners which are in the active state. The API returns an error. Garm would have to cancel the job itself, then wait for the runner to be reaped by github. It's a different code path than simply deleting a runner and would imply that we would need to track jobs. If we ever add this ability to garm, it would need to be part of something broader.

Tracking jobs is on the todo list for garm, because it would allow us to better schedule runners and would fix #47, but it's something that will take a lot of work to get right. Right now this will have to wait for a bit. I need to clear my current backlog first.
