
Support for autoscaling self-hosted github runners #845

Open · jwcmd opened this issue Dec 3, 2020 · 17 comments
Labels
autoscaling · enhancement (New feature or request) · Runner Feature (Feature scope to the runner)

Comments

jwcmd commented Dec 3, 2020

Describe the enhancement
I'm looking for a way to put a self-hosted GitHub runner into an autoscale group.

I've discussed this with GitHub Support, and they explained that the registration tokens are only valid for one hour. That's problematic for an autoscale group, because it means the group will fail to bring up new runners an hour after I deploy it. They recommended raising my issue here; I apologize if we've both missed an obvious solution for this.

Code Snippet
Not Applicable.

jwcmd added the enhancement (New feature or request) label Dec 3, 2020
j3parker commented Dec 3, 2020

In AWS we do this:

GitHub App tokens:

  • We register a GitHub App which has permission to register runners
  • We store its secrets in AWS Secrets Manager
  • A CloudWatch scheduled event triggers a Lambda which generates a GitHub token for that app and stores it in Secrets Manager (a sketch of this Lambda follows below)
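
For illustration, a minimal sketch of that scheduled refresh Lambda, assuming PyJWT/requests are bundled with the function; the secret names, environment variables, and handler shape are all hypothetical, not j3parker's actual code:

```python
# Hypothetical scheduled Lambda: mints a short-lived GitHub App installation
# token and caches it in Secrets Manager for the registration Lambda to read.
import json
import os
import time

import boto3
import jwt       # PyJWT, assumed bundled with the deployment package
import requests

secrets = boto3.client("secretsmanager")

def handler(event, context):
    app_id = os.environ["GITHUB_APP_ID"]                    # hypothetical env vars
    installation_id = os.environ["GITHUB_INSTALLATION_ID"]

    # The App's private key is the only long-lived credential.
    private_key = secrets.get_secret_value(
        SecretId="github-app/private-key")["SecretString"]

    # Sign a short-lived JWT as the App (RS256, max 10 minutes).
    now = int(time.time())
    app_jwt = jwt.encode(
        {"iat": now - 60, "exp": now + 540, "iss": app_id},
        private_key, algorithm="RS256")

    # Exchange the JWT for an installation access token (valid for one hour).
    resp = requests.post(
        f"https://api.github.com/app/installations/{installation_id}/access_tokens",
        headers={"Authorization": f"Bearer {app_jwt}",
                 "Accept": "application/vnd.github+json"},
        timeout=10)
    resp.raise_for_status()

    # Cache the token; downstream Lambdas read this, never the private key.
    secrets.put_secret_value(
        SecretId="github-app/cached-token",
        SecretString=json.dumps(resp.json()))
```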

Runner registration:

  • A CloudWatch event rule is set up for EC2 instances entering the pending state.
  • Those events trigger a Lambda which looks at the Org and Repo tags of the launched instance to decide where to register the runner (this Lambda has permission to read the GitHub token secret)
  • The Lambda fetches a registration token and puts it into SSM Parameter Store, prefixed by the instance ID (see the sketch after this list)
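
A sketch of that registration Lambda, reusing the cached token and the hypothetical naming from the sketch above:

```python
# Hypothetical Lambda fired when an instance enters "pending": exchanges the
# cached App token for a runner registration token and hands it to the
# instance via SSM Parameter Store.
import json

import boto3
import requests

ec2 = boto3.client("ec2")
ssm = boto3.client("ssm")
secrets = boto3.client("secretsmanager")

def handler(event, context):
    instance_id = event["detail"]["instance-id"]

    # Read the Org/Repo tags to decide where the runner registers.
    tags = {t["Key"]: t["Value"] for t in ec2.describe_tags(
        Filters=[{"Name": "resource-id", "Values": [instance_id]}])["Tags"]}
    org, repo = tags["Org"], tags.get("Repo")

    # This Lambda can read the cached one-hour token, not the App's key.
    token = json.loads(secrets.get_secret_value(
        SecretId="github-app/cached-token")["SecretString"])["token"]

    # Org-level vs repo-level registration endpoint.
    path = f"repos/{org}/{repo}" if repo else f"orgs/{org}"
    resp = requests.post(
        f"https://api.github.com/{path}/actions/runners/registration-token",
        headers={"Authorization": f"token {token}",
                 "Accept": "application/vnd.github+json"},
        timeout=10)
    resp.raise_for_status()

    # Namespaced by instance ID so the instance role can only read its own.
    ssm.put_parameter(
        Name=f"/runners/{instance_id}/registration-token",
        Value=resp.json()["token"],
        Type="SecureString",
        Overwrite=True)
```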

The runner instance:

  • The IAM role for that instance has permission to read Parameter Store values for its own instance ID
  • On boot it polls Parameter Store waiting for a registration token (a boot-time sketch follows below)
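
A boot-time sketch, assuming the parameter naming above and a runner pre-installed at /opt/actions-runner (both assumptions):

```python
# Hypothetical boot script baked into the runner AMI: poll Parameter Store
# until the registration Lambda has published a token, then configure the
# runner. Paths, URLs, and parameter names are illustrative.
import subprocess
import time

import boto3
import requests

ssm = boto3.client("ssm")

# The instance asks the metadata service for its own ID (IMDSv1 for brevity).
instance_id = requests.get(
    "http://169.254.169.254/latest/meta-data/instance-id", timeout=2).text

name = f"/runners/{instance_id}/registration-token"
while True:
    try:
        token = ssm.get_parameter(
            Name=name, WithDecryption=True)["Parameter"]["Value"]
        break
    except ssm.exceptions.ParameterNotFound:
        time.sleep(5)  # usually there on the first poll; retry through blips

subprocess.run(
    ["./config.sh", "--unattended",
     "--url", "https://github.com/my-org",  # derived from the Org/Repo tags in practice
     "--token", token],
    cwd="/opt/actions-runner", check=True)
```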

The ASGs:

  • We run an ASG for each kind of runner we want, and it uses the Org/Repo tags for its instances
  • When running ephemeral runners (we're waiting on GitHub to finish that feature, so we don't do this yet), a runner removes itself from the ASG as soon as it picks up a job, causing a replacement runner to be launched and booted. In this mode, "desired" = how many pre-warmed runners we want on standby to pick up new jobs; our max number of concurrent jobs is limited only by our account's EC2 limits. (See the detach sketch after this list.)
  • When running a fixed pool of re-usable runners we use scheduled scaling events
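
The self-removal step for ephemeral mode could look roughly like this (ASG name illustrative); detaching without decrementing desired capacity is what makes the ASG launch a replacement:

```python
# Hypothetical self-removal, run on the instance when it picks up a job.
import boto3
import requests

instance_id = requests.get(
    "http://169.254.169.254/latest/meta-data/instance-id", timeout=2).text

boto3.client("autoscaling").detach_instances(
    AutoScalingGroupName="runner-pool",    # illustrative ASG name
    InstanceIds=[instance_id],
    ShouldDecrementDesiredCapacity=False,  # False => ASG boots a replacement
)
```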

The way we do registration means that:

  • Runner VMs never have access to GitHub App creds or tokens
  • The registration Lambda only has access to the token, not the long-lived creds
  • If GitHub has a blip (heh), we have the cached token: we try to refresh it every 30 mins, but it's good for an hour, and we'll keep using the old one for that long if new tokens are failing
  • The instance polls for the registration token; it's usually there on the first poll, but if GitHub or AWS were to have a blip it will still work fine
  • It's also easy to manually launch a runner in the console if needed: just specify the Org/Repo tags

rvoitenko commented

@j3parker thank you for the solution!
With the ephemeral runners you describe it's a kind of autoscaling, because the instances which remain in the ASG are just idle runners.
A question about ephemeral runners: how do you catch the event when a job gets picked up by the runner? Do you also terminate such an ephemeral runner after the job completes?
Would be nice to know some details.

j3parker commented Mar 19, 2021

A question about ephemeral runners: how do you catch the event when a job gets picked up by the runner?

Fantastic question -- I opened an issue about that over here: #699

We've prototyped a few hacks to detect when a job is started (to remove the runner from the ASG, triggering a new one to start booting as a replacement, plus ASG policies to scale up). We're just waiting patiently for ephemeral runners to be supported 😄

Do you also terminate such an ephemeral runner after the job completes?

Our plan is to terminate, yes. Vaguely I'm assuming the runner will exit and we will trigger a shutdown. You can configure an EC2 instance to terminate on shutdown.
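
For reference, a minimal sketch of flipping that EC2 attribute (instance ID illustrative):

```python
# Hypothetical one-off setup: make an OS-level shutdown terminate the
# instance instead of stopping it, so an exiting ephemeral runner cleans
# itself up.
import boto3

boto3.client("ec2").modify_instance_attribute(
    InstanceId="i-0123456789abcdef0",  # illustrative instance ID
    InstanceInitiatedShutdownBehavior={"Value": "terminate"},
)
```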


Spinning up VMs for builds might be expensive. We do a fair clip of builds during the day, so one option I'm mulling is to use Firecracker rather than VMs, but you need to buy a whole (metal) instance for that. We haven't costed out whether that would make sense for us yet.

Hopefully in the long term someone will develop a turn-key AWS solution that can do a mix of spot-based instances for small load and bulk Firecracker-based ones for better latency at scale.

rvoitenko commented Mar 19, 2021

We've prototyped a few hacks to detect when a job is started (to remove the runner from the ASG, triggering a new one to start booting as a replacement, plus ASG policies to scale up). We're just waiting patiently for ephemeral runners to be supported 😄

I guess GitHub is going to present something new in Q3, but we need some working solution until that happens.

Our plan is to terminate, yes. Vaguely I'm assuming the runner will exit and we will trigger a shutdown. You can configure an EC2 instance to terminate on shutdown.

@j3parker good tip, thank you.
In my prototype I'm checking whether a runner is busy via the GitHub API (/actions/runners). If it's busy, a script removes it from the ASG (sketched below).
On the next run, the script checks whether the runner is not busy and is no longer part of an ASG (no "aws:autoscaling:groupName" tag), then deregisters the runner from GitHub and shuts down/terminates the instance. The only problem is that this checking script is run by cron every minute, so if a job runs for less than a minute there's a chance this logic won't work.
My goal is to detect that the runner is busy immediately, not once a minute. Maybe a filewatcher service that detects new Worker* files in the runner's "_diag" folder would help.
But this looks promising for my setup.
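
A rough sketch of that per-minute check, assuming an org-level runner whose registered name matches the hostname (token, org, and ASG names are placeholders):

```python
# Hypothetical cron job: ask GitHub whether this runner is busy and, if so,
# detach it from its ASG so a replacement starts booting.
import socket

import boto3
import requests

GITHUB_TOKEN = "..."   # placeholder: a token allowed to list the org's runners
ORG = "my-org"         # placeholder org name

runners = requests.get(
    f"https://api.github.com/orgs/{ORG}/actions/runners",
    headers={"Authorization": f"token {GITHUB_TOKEN}",
             "Accept": "application/vnd.github+json"},
    timeout=10).json()["runners"]

me = next((r for r in runners if r["name"] == socket.gethostname()), None)

if me and me["busy"]:
    instance_id = requests.get(
        "http://169.254.169.254/latest/meta-data/instance-id", timeout=2).text
    boto3.client("autoscaling").detach_instances(
        AutoScalingGroupName="runner-pool",  # placeholder ASG name
        InstanceIds=[instance_id],
        ShouldDecrementDesiredCapacity=False,
    )
```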

Spinning up VMs for builds might be expensive. We do a fair clip of builds during the day, so one option I'm mulling is to use Firecracker rather than VMs, but you need to buy a whole (metal) instance for that. We haven't costed out whether that would make sense for us yet.

Why expensive, as EC2 instances are currently billed per-second?
Also, spin-up time for ephemeral runners can be improved by baking your own images with the runner pre-installed. Registering the runner is also a job of just a few seconds.

j3parker commented

In my prototype I'm checking whether a runner is busy via the GitHub API (/actions/runners). If it's busy, a script removes it from the ASG.

Nice! That is simple.

Why expensive, as EC2 instances are currently billed per-second?

Oh sorry, that was unclear. I meant in terms of time (there is latency in spinning up a machine). Spinning up hot capacity in the background can hide that from users, but of course you're also paying for that. With enough concurrent builds it could be worth it (both in terms of money and managing perceived latency) to rent an entire machine from AWS and use Firecracker (which boots things faster than EC2; e.g. it's what powers AWS Lambda).

An i3.metal (required if you want to use Firecracker) is 72 vCPUs, so if you're doing 2-vCPU agents that's probably only going to make sense if you're doing >36 or so concurrent builds. You pay for these by the second too, though, and can theoretically buy them on the spot market (I'm not sure if availability is good).

Also, spin-up time for ephemeral runners can be improved by baking your own images

😄 We do that by taking actions/virtual-environments (which defines the GitHub-hosted runners) and patching the packer files with jsonnet to tweak things for our purposes (and install the runner exe). I definitely recommend it. You need to keep up with runner versions, so that when your VM connects to GitHub it doesn't accept a job and then have to download a newer version of the runner first (we have a scheduled GitHub Action that polls for new releases of the runner).

vietanhduong commented

I'm doing a project like this using GCP preemptible VMs, but there are some issues:

  • Instance startup: I need 65s to create a new instance and register the runner. I use a startup script to register the runner with GitHub.
  • Cache: I have no idea how to resolve this issue; maybe Google Cloud Storage.

I'm switching to Google Cloud Build. I think it's easier.

abdidarmawan007 commented Jun 22, 2021

GitLab has already supported this feature for a long time via the GitLab Runner Manager. It looks like GitHub has no intention of supporting autoscaling of self-hosted GitHub runners (AWS, GCP), because they're trying to build something like Azure DevOps.

ghost commented Jul 20, 2021

Waiting for this feature to be running on AWS ECS Fargate.

abhinav-khanna-1001 commented Aug 5, 2021

dgteixeira commented

@vietanhduong, how did you implement that in GCP? I'm trying to use a MIG with runners on it.
Do you have any details on how you made it work?

manpreet-agoro commented

GitLab has already supported this feature for a long time via the GitLab Runner Manager. It looks like GitHub has no intention of supporting autoscaling of self-hosted GitHub runners (AWS, GCP), because they're trying to build something like Azure DevOps.

How will they make you pay if runners are easy to autoscale? It's similar to "planned obsolescence"; this would be an "authentication nightmare".

giorgiocerruti commented Oct 7, 2021

You can create a simple cronjob to regenerate the token, let's say every 30 minutes. I created a scalable environment in an ECS cluster, and sometimes the containers die after more than 1h; before deregistering the runner, a function refreshes the token.

ringods commented Nov 4, 2021

Strange that no one is pointing to the docs on this:

https://docs.github.com/en/actions/hosting-your-own-runners/autoscaling-with-self-hosted-runners

lorengordon commented

I would suppose that's because the features that doc is written around are fairly new, released 20 Sept. :D

https://github.blog/changelog/2021-09-20-github-actions-ephemeral-self-hosted-runners-new-webhooks-for-auto-scaling/

Strange that no one is pointing to the docs on this:

docs.github.com/en/actions/hosting-your-own-runners/autoscaling-with-self-hosted-runners

ashb commented Nov 10, 2021

I've just noticed this warning in the logs of my runner:

Nov 10 10:11:54 ip-172-31-28-50 run.sh[12322]: Warning: '--once' is going to be deprecated in the future, please consider using '--ephemeral' during runner registration.
Nov 10 10:11:54 ip-172-31-28-50 run.sh[12322]: https://docs.github.com/en/actions/hosting-your-own-runners/autoscaling-with-self-hosted-runners#using-ephemeral-runners-for-autoscaling

However, this won't work for us as a project in the Apache org unless something has changed about the permissions around registering runners. In order to register a new runner, a token needs to be created, and creating a runner in an org-wide group requires admin permissions, which we as members of the project don't have (only the central members of the ASF Infra team have that).

It has not changed as per https://docs.github.com/en/rest/reference/actions#self-hosted-runners

In order to create a registration token for an org group (i.e. one not belonging to a single repo), I'll need an access token with admin rights on the org:

GitHub Apps must have the administration permission for repositories or the organization_self_hosted_runners permission for organizations. Authenticated users must have admin access to the repository or organization to use this API.

If this goes ahead, then all Apache projects won't be able to have single-shot runners anymore.

nikola-jokic added the Runner Feature (Feature scope to the runner) label Mar 16, 2022
hasinireddy24 commented

Hi,

In the CloudWatch logs I see that the Lambda triggers the scale-up function, but it is not creating the EC2 instance, and the job builds are not queued up in SQS. If my understanding is right, whenever a job is queued it should be posted to the SQS queue, and from there the scale-up Lambda picks up the job. But that is not happening: I'm not seeing any messages arrive in SQS; the available-messages count is always "0".

CloudWatch logs for the scale-up function:

2022-07-22 17:55:35.045 INFO [scale-up:b0c371ee-c099-xxxxxxxx index.js:1142xx scaleUp] Received workflow_job from xxxxxxx
{}
2022-07-22 17:55:35.060 INFO [scale-up:b0c371ee-c099-5a8c-ba85-2aba264b3b98 index.js:114235 scaleUp] Received event
{
"runnerType": "Org",
"runnerOwner": "xxxxxxx",
"event": "workflow_job",
"id": "xxxxxx"
}

aktech commented Aug 16, 2022

Disclaimer: this doesn't answer the actual question, but suggests an alternative:

You can achieve this easily with https://cirun.io/. It creates on-demand runners for GitHub Actions on your cloud and manages the complete lifecycle. You simply connect your cloud provider and define the runners you need in a simple YAML file, and that's it.

See https://docs.cirun.io/reference/examples.html#aws for an example.
