Support configure runner as ephemeral. #660

TingluoHuang · 2020-08-14T15:37:03Z

The service will make sure to only ever send one job to this runner.
The service will remove the runner registration from the service after the job finishes.

TingluoHuang · 2020-08-14T15:40:13Z

#510

src/Runner.Listener/Configuration/ConfigurationManager.cs

hross · 2020-09-02T10:32:03Z

If someone uses this on a beta version of GHES how will we handle error messages?

Shegox · 2020-09-11T13:09:34Z

Hi, this looks like a great change and I'm really looking forward to it.
I have a question and hope it's fine to ask it directly here.
We're currently running the GHES 2.22.0 beta and are looking into providing runners at large scale for the enterprise account.
If I understand ephemeral runners correctly, they are registered just like normal runners, the big difference being that they only ever receive a single job. Does that mean we could execute untrusted code inside the self-hosted runner, and that code wouldn't be able to extract credentials to pick up another job and potentially steal its GitHub secrets?

It would perhaps also be good to extend docs/design/auth.md with this information.

Many thanks in advance, and if you need someone to test this on GHES, I would be happy to help.

Temtaime · 2020-09-18T11:38:44Z

This is a very limited solution for creating fresh environments.
It doesn't provide a way to create a VM with a specific configuration.
A more elegant solution would be to add the ability to create a custom executor, as is done in GitLab.
#689

dakale · 2020-10-27T17:16:05Z

@TingluoHuang I tried this out, and one thing I found is that the process doesn't seem to exit if the runner was auto-updated prior to running its one job. Is that something you are aware of?

shwuhk · 2020-11-09T05:39:41Z

May I know how to use ephemeral with run.sh/runsvc.sh now?

Dids · 2020-12-03T07:44:48Z

What's the status on this? I'm assuming this is still waiting for server-side changes, if so, is that publicly being tracked anywhere?

I've been working around the "single use" self-hosted runner issues by creating an orchestrator of sorts, which keeps N amount of runners running (all running inside a Docker container) with the --once flag, then destroys and (de-)registers them when the jobs are done.

This has been fairly unreliable for several reasons:

- If the number of queued jobs exceeds the number of runners, jobs end up timing out because they were assigned to a non-existent runner (it sometimes takes 24 hours for them to time out).
- Keeping a --once runner running for long periods of time, especially when there are connection issues during that time, leaves the runner in some kind of inconsistent state, where it is unable to accept new jobs and sometimes even shows up as offline on GH's side.

The upside is that this has provided a very nice way to provide semi-isolated environments for runners/jobs, as each runner would run in a fresh Docker container, but with the downside of additional action containers running on the same host.

FloThinksPi · 2021-03-17T15:41:48Z

I'm also interested in an answer to @Shegox's question.

Suppose one is running unknown code on a runner. To run unknown code safely, we'd like to reset a VM to a snapshot after every run. To do so, one can run on an ephemeral runner. As the unknown code has root permissions there (to access Docker or install packages), the assumption is that this code could also alter or access the runner process itself.

Is it guaranteed that, with the access token of the ephemeral runner, malicious code on this runner cannot pull another workflow onto its instance? E.g. a malicious workflow could extract API tokens from the runner and start a second runner process to pick up another workflow, and then be able to extract secrets from that workflow.

Based on https://github.com/actions/runner/blob/main/docs/design/auth.md I would expect that, as long as the initial workflow on the ephemeral runner has not finished, its token is valid and malicious code would be able to use that token to fetch more workflow jobs and extract secrets from them. Or does the API actually ensure that only a single workflow job can be pulled with the token of the ephemeral runner, and no others, in GitHub Enterprise? (It's done this way on github.com already.)

Summary:
Pull Request resolved: #56929

Artifacts were failing to unzip since they already existed in the current tree so this just forces the zip to go through no matter what

Was observing that test phases will fail if attempting to zip over an already existing directory, https://github.com/pytorch/pytorch/runs/2424525136?check_suite_focus=true

In the long run however it'd be good to have these binaries built out as part of the regular cmake process instead of being one off builds like they are now

**NOTE**: This wouldn't be an issue if `--ephemeral` workers was a thing, see: actions/runner#660

Signed-off-by: Eli Uriegas <[email protected]>

Test Plan: Imported from OSS
Reviewed By: janeyx99
Differential Revision: D28004271
Pulled By: seemethere
fbshipit-source-id: c138bc85caac5d411a0126d27cc42c60fe88de60

haines · 2021-05-20T12:54:23Z

Hi, just wanted to check if this is still on the roadmap? We have an autoscaling group of self-hosted runners but it's very unreliable - we often just get "this check failed" with no log output after jobs time out, which I assume is because the service is allocating jobs to the runners that are scaling down. We really need to be able to configure the runners as ephemeral, but if that's not going to ship any time soon we will have to look at another approach.

sethvargo · 2021-06-03T01:15:31Z

@bryanmacfarlane @TingluoHuang 👋 could you please provide an update on whether this will be merged (or on support for ephemeral runners in general)?

lokesh755 · 2021-06-03T01:18:21Z

@sethvargo We're actively working on this. We should merge this sometime this month, probably sooner :)

thinkafterbefore · 2021-06-29T13:38:24Z

@lokesh755 @TingluoHuang Any update on when this will hit production?

src/Runner.Listener/Runner.cs

tbando · 2021-09-14T23:34:08Z

Can I already use the --ephemeral feature on self-hosted GHES, or do I need to wait for a newer release of GHES?

joeyparrish · 2021-09-15T00:08:55Z

This shipped in the latest release, v2.282.0.

tbando · 2021-09-15T00:15:45Z

Oh, I would like to know whether I need to update GHES on the server side, as I got an Internal Error when I ran ./config.sh --ephemeral.

MichaelJJ · 2021-09-17T22:04:18Z

With the new --ephemeral flag, is there a way to have the config.sh wait until the runner has de-registered? As an example, if I make a docker image that runs a shell script on startup to register an ephemeral runner, what is the best way to have the script wait until the runner is done so the container doesn't exit?

joeyparrish · 2021-09-17T22:24:28Z

I use a shell script as a docker entrypoint, which calls config.sh --ephemeral followed by run.sh, then terminates. I wrap that in a systemd service that removes and restarts the docker container automatically.

An older version of this (based on --once) is currently available at https://github.com/myoung34/docker-github-actions-runner#ephemeral-mode

I'm working on a PR to update that to use --ephemeral instead.
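
For readers asking how to keep the container alive until the job is done, the entrypoint pattern described above can be sketched roughly as follows. This is a minimal illustration, not the script from the linked repository: RUNNER_DIR, RUNNER_URL and RUNNER_TOKEN are hypothetical environment variables, and --url, --token and --unattended are the runner's usual configuration flags as I understand them (check ./config.sh --help for your version).

```shell
#!/usr/bin/env bash
# Sketch of a container entrypoint for a single-use runner.
# Assumptions: RUNNER_DIR, RUNNER_URL and RUNNER_TOKEN are hypothetical
# environment variables supplied by whatever starts the container.

# The function body is a subshell, so the cd and any exit stay local to it.
run_ephemeral_runner() (
  runner_dir="${RUNNER_DIR:-/actions-runner}"

  cd "$runner_dir" || exit 1

  # Register as ephemeral: the service sends at most one job to this
  # runner and removes the registration after that job finishes.
  ./config.sh --unattended --ephemeral \
    --url "$RUNNER_URL" --token "$RUNNER_TOKEN" || exit 1

  # run.sh runs in the foreground, so the entrypoint (and with it the
  # container) only ends once the runner process itself has exited.
  ./run.sh
)
```

Calling `run_ephemeral_runner` as the last line of the entrypoint makes the container's exit status follow the runner's, which an outer supervisor (systemd, an orchestrator) can use to decide when to recreate the container.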

zetaab · 2021-09-20T20:12:49Z

This PR currently breaks GitHub Actions on GHES. --once does not work anymore and --ephemeral is not supported. And the GitHub Actions runner force-updates itself to the newest version.

TingluoHuang · 2021-09-20T20:22:58Z

@zetaab I don't think we changed any behavior for --once in this PR. What exact error/issue did you run into on GHES?

rofafor · 2021-09-20T20:28:27Z

once was removed from the valid flags, resulting in: Unrecognized command-line input arguments: 'once'.

TingluoHuang · 2021-09-20T20:30:15Z

I think that should only give you a warning but not actually fail anything.

rofafor · 2021-09-20T20:44:56Z

I'm getting An Internal Error Occurred. Activity Id: ... errors when enabling --ephemeral against our GHES 3.0 / 3.1.

TingluoHuang · 2021-09-20T21:58:25Z

--ephemeral is not supported on GHES.

aidan-mundy · 2021-09-20T22:33:26Z

@TingluoHuang As far as I can tell, --once is no longer usable with this update. As @rofafor said, it is no longer accepted as a valid flag. See https://github.com/actions/runner/pull/660/files#diff-b1f59ae3d34d9d3811ce43ed0214576cb4d9f3373a6734adf1318b5ab7e535eeL35

zetaab · 2021-09-21T11:52:10Z

As @rofafor said: I think your idea was to deprecate the flag, but you actually removed it as well. So now the problem is that --once does not work anymore, and on GHES --ephemeral does not work either.

thboop · 2021-09-21T12:18:53Z

@aidan-mundy, @zetaab can you confirm that you are unable to use the --once flag when configuring the newest runner? You may see an error saying the flag is not available (which is intended, we want people to eventually move off of it), but the flag still works.

If it doesn't work, please file an issue and provide your runner version and OS.

TingluoHuang · 2021-09-21T13:23:51Z

Here is what I just tried.

ting@htl-mac _layout % ./run.sh --once
Unrecognized command-line input arguments: 'once'. For usage refer to: .\config.cmd --help or ./config.sh --help

√ Connected to GitHub

2021-09-21 13:19:26Z: Listening for Jobs
2021-09-21 13:20:15Z: Running job: build
2021-09-21 13:20:18Z: Job build completed with result: Succeeded
ting@htl-mac _layout % $?
zsh: command not found: 0

We do print out an error saying the flag is no longer recognized, but the runner is still able to connect to the server, run a single job, and exit.

Am I missing something here?

rofafor · 2021-09-21T13:26:29Z

The --once seems to be working despite the warning message. According to code comments, you've scheduled to remove the once switch completely in 10/2021 - what happens to GHES after that? No more ephemeral runners?

TingluoHuang · 2021-09-21T13:30:29Z

@rofafor
We have not decided when to really remove --once, given that so many customers depend on it today and, for various reasons, can't use --ephemeral, e.g. on GHES.

We will keep --once around for a long time, until everyone is able to move off it.

TingluoHuang · 2021-09-21T13:34:08Z

I created a PR to update the comment to make it less confusing. #1360

@thboop ☝️

hross · 2021-09-21T13:46:58Z

I want to add a summary here so it's obvious if you land here wondering about --once:

- We don't plan to deprecate this command any time soon and will give notification before we do so. We realize customers still use it (you will receive the warning, though).
- We recommend that you stop using --once and start using --ephemeral (except on current versions of GHES). The reason is that this is a server-side change to ensure there are no race conditions with job assignment. --once was not "officially supported" and is client side, which exposes you to the risk of multiple job assignment.
- --ephemeral will ship in the next version of GHES (but as I said above, it requires server changes to fix the race condition with client-side-only assignment).

If you have any issues with ephemeral/once please feel free to reach out (this issue works but you can also use the community support forms which might have better support for customer questions and let us file support tickets to help you).

More information can be found in this runner issue.

aidan-mundy · 2021-10-07T20:17:01Z

@hross When you say "next version" do you mean in a quarter (with V3.3.0) or in a couple weeks (with V3.2.1)?

Shegox · 2021-10-07T23:40:27Z

Disclaimer: not a GitHub employee

GitHub normally releases features only in minor (3.x) releases and not in patch releases (3.2.x). So I wouldn't actually expect it before 3.3.x (and maybe even later, but that's up to GitHub to confirm).

The GitHub roadmap currently doesn't specify any concrete date for it.

hross · 2021-10-08T11:34:56Z

@Shegox is right. We will land it in 3.3.x (next version meaning "next major release").

aidan-mundy · 2021-11-09T18:51:50Z

For those of you that are enterprise server users and are waiting for this functionality, 3.3.0.rc1 is now available for preview. It includes the --ephemeral flag and a number of other neat features/changes.

(looks like my estimate of "in a quarter" was slightly pessimistic, happy to see the prompt update from the GHES team!)

Manouchehri · 2022-01-31T19:23:28Z

Is there an easy way to run a command after --ephemeral has finished one job?

sethvargo · 2022-01-31T19:53:17Z

@Manouchehri I've achieved this by running the ephemeral runner under systemd and then using an ExecStop or ExecStopPost.
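
A rough sketch of that systemd approach, written as a small shell helper that prints a unit file. It assumes the runner in the given directory was already configured with --ephemeral, so run.sh exits after its one job. The function name and both paths in the usage note are hypothetical placeholders, and the unit is deliberately minimal rather than a recommended production setup.

```shell
#!/usr/bin/env bash
# Prints a systemd unit that runs a command after the runner has exited.
# Assumptions: the runner directory and post-job command are supplied by
# the caller; nothing here is taken from the runner's own service files.

print_runner_unit() {
  local runner_dir="$1"   # directory containing run.sh
  local post_cmd="$2"     # absolute path of the command to run afterwards

  cat <<EOF
[Unit]
Description=Ephemeral GitHub Actions runner
After=network-online.target
Wants=network-online.target

[Service]
WorkingDirectory=${runner_dir}
ExecStart=${runner_dir}/run.sh
# ExecStopPost= runs after the service has stopped, including when the
# main process exits on its own after finishing its job.
ExecStopPost=${post_cmd}

[Install]
WantedBy=multi-user.target
EOF
}
```

For example, `print_runner_unit /opt/actions-runner /usr/local/bin/after-job.sh > /etc/systemd/system/ephemeral-runner.service` followed by `systemctl daemon-reload` would install it. ExecStopPost is used here rather than ExecStop because it also covers the case where the runner process exits by itself.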

ericsciple reviewed Aug 24, 2020

src/Runner.Listener/Configuration/ConfigurationManager.cs

ericsciple previously approved these changes Aug 24, 2020

lokesh755 previously approved these changes Aug 25, 2020

j3parker mentioned this pull request Sep 16, 2020

Ability to exit runner svc after completing workflow #559

Closed

onelapahead mentioned this pull request Sep 21, 2020

Use self update ready entrypoint actions/actions-runner-controller#99

Merged

j3parker mentioned this pull request Sep 25, 2020

Question: Is it possible to run the runner for only one job in queue so I can stop virtual machine and setup new one a fresh one. #720

Closed

andreabenfatto mentioned this pull request Oct 27, 2020

Ephemeral (single use) runner registrations #510

Closed

This was referenced Dec 24, 2020

Scale runners to 0? evryfs/github-actions-runner-operator#81

Closed

Support ephemeral runners to minimize committed resource usage actions/actions-runner-controller#17

Closed

seemethere mentioned this pull request Apr 26, 2021

.github: Add options to force unzip artifacts pytorch/pytorch#56929

Closed

hross mentioned this pull request May 14, 2021

enterprise autoscaling issues, indefinitely queued jobs within workflows actions/actions-runner-controller#470

Closed

seemethere mentioned this pull request Jun 24, 2021

Spurious failure reported by a GHA workflow pytorch/pytorch#60506

Closed

martin389 reviewed Jul 12, 2021

src/Runner.Listener/Runner.cs (outdated)

pje dismissed lokesh755’s stale review via 2b36a52 July 12, 2021 19:49

t0rr3sp3dr0 mentioned this pull request Sep 14, 2021

Don't Remove Support to --once #1339

Closed

uilton-oliveira mentioned this pull request Sep 15, 2021

Ephemeral Runners? philips-labs/terraform-aws-github-runner#182

Closed

0x2b3bfa0 mentioned this pull request Sep 21, 2021

Github actions ephemeral registration option. iterative/cml#724

Closed

skyzyx mentioned this pull request Oct 20, 2021

GHES Runners at Enterprise Level support philips-labs/terraform-aws-github-runner#1303

Open

dvviktordelev mentioned this pull request May 8, 2024

How to handle runner upgrade with ephemeral datavisyn/github-workflows#65

Open
