Support configure runner as ephemeral. #660

Merged 2 commits on Sep 13, 2021

TingluoHuang
Member

The service will make sure to only ever send one job to this runner.
The service will remove the runner's registration after the job finishes.

@TingluoHuang
Member Author

#510

ericsciple previously approved these changes Aug 24, 2020
lokesh755 previously approved these changes Aug 25, 2020
@hross
Contributor

hross commented Sep 2, 2020

If someone uses this on a beta version of GHES how will we handle error messages?

@Shegox

Shegox commented Sep 11, 2020

Hi, this looks like a great change and I'm really looking forward to it.
I have a question and hope it is fine to ask it directly here.
We're currently running the GHES 2.22.0 beta and are also looking into providing runners on a large scale for the enterprise account.
If I understand ephemeral runners correctly, they will be registered just like normal runners, the big difference being that they only receive a single job. Does that mean we could execute untrusted code inside the self-hosted runner, and the untrusted code wouldn't be able to extract credentials to get another job and potentially steal that job's GitHub secrets?

It would perhaps also be good to extend docs/design/auth.md with such information.

Many thanks in advance, and if you need someone to test this on GHES, I would be happy to help.

@Temtaime
Contributor

Temtaime commented Sep 18, 2020

It is a very limited solution for creating fresh environments.
It doesn't provide a way to create a VM with a specific configuration.
A more elegant solution would be to add the ability to create a custom executor, as is done in GitLab.
#689

@dakale
Contributor

dakale commented Oct 27, 2020

@TingluoHuang I tried this out, and one thing I found is that the process doesn't seem to exit if the runner was auto-updated prior to running its one job. Is that something you are aware of?

@shwuhk

shwuhk commented Nov 9, 2020

May I know how to use ephemeral with run.sh/runsvc.sh now?

@Dids
Copy link

Dids commented Dec 3, 2020

What's the status on this? I'm assuming this is still waiting for server-side changes, if so, is that publicly being tracked anywhere?

I've been working around the "single use" self-hosted runner issues by creating an orchestrator of sorts, which keeps N amount of runners running (all running inside a Docker container) with the --once flag, then destroys and (de-)registers them when the jobs are done.

This has been fairly unreliable for several reasons:

  • If the number of queued jobs exceeds the number of runners, jobs will end up timing out because they were assigned to a non-existent runner (it sometimes takes 24 hours for them to time out)
  • Having a --once runner running for long periods of time, especially when there are connection issues during that time, will leave the runner in some kind of inconsistent state, where it is unable to accept new jobs and sometimes even shows up as offline on GH's side

The upside is that this has provided a very nice way to provide semi-isolated environments for runners/jobs, as each runner would run in a fresh Docker container, but with the downside of additional action containers running on the same host.
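
The orchestrator pattern described above can be sketched as a small respawn loop. This is only a minimal sketch under assumptions: `respawn_runner` is a hypothetical helper (not part of the runner), and the command passed to it is assumed to register a runner, run exactly one job, and then exit.

```shell
# Hypothetical helper (not part of actions/runner): start a single-use
# runner command again each time the previous instance exits.
# Usage: respawn_runner MAX_RUNS COMMAND [ARGS...]
# MAX_RUNS=0 means "loop forever".
respawn_runner() {
  local max_runs="$1"; shift
  local runs=0
  while [ "$max_runs" -eq 0 ] || [ "$runs" -lt "$max_runs" ]; do
    # Assumption: COMMAND registers a runner, runs one job, then exits.
    # A failing instance is reported but does not stop the loop.
    "$@" || echo "runner exited with status $?" >&2
    runs=$((runs + 1))
  done
}
```

For example, `respawn_runner 0 docker run --rm my-runner-image` would keep replacing one container indefinitely (the image name is made up), and starting N such loops in the background approximates keeping N runners available. The loop does nothing about the job-assignment problems listed above; it only covers the "destroy and recreate" part.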

@FloThinksPi

FloThinksPi commented Mar 17, 2021

Also interested in an answer to @Shegox's question.

Suppose one is running unknown code on a runner. To run it safely, we'd like to reset a VM to a snapshot after every run, which an ephemeral runner makes possible. Since the unknown code has root permissions there (to access Docker or install packages), the assumption is that it could also alter or access the runner process itself.

Is it guaranteed that, with the access token of the ephemeral runner, malicious code on this runner cannot pull another workflow job onto its instance? E.g., a malicious workflow could extract API tokens from the runner and start a second runner process to pick up another workflow, and then extract secrets from that workflow.

Based on https://github.com/actions/runner/blob/main/docs/design/auth.md, I would expect that as long as the initial workflow on the ephemeral runner has not finished, its token is valid and malicious code would be able to use that token to fetch more workflow jobs and extract secrets from them. Or does the API actually ensure that only a single workflow job can be pulled with the token of the ephemeral runner, and no others, in GitHub Enterprise? (It's done this way on github.com already.)

facebook-github-bot pushed a commit to pytorch/pytorch that referenced this pull request Apr 26, 2021
Summary:
Pull Request resolved: #56929

Artifacts were failing to unzip since they already existed in the
current tree so this just forces the zip to go through no matter what

Was observing that test phases will fail if attempting to zip over an already existing directory, https://github.com/pytorch/pytorch/runs/2424525136?check_suite_focus=true

In the long run however it'd be good to have these binaries built out as part of the regular cmake process instead of being one off builds like they are now

**NOTE**: This wouldn't be an issue if `--ephemeral` workers was a thing, see: actions/runner#660

Signed-off-by: Eli Uriegas <[email protected]>

Test Plan: Imported from OSS

Reviewed By: janeyx99

Differential Revision: D28004271

Pulled By: seemethere

fbshipit-source-id: c138bc85caac5d411a0126d27cc42c60fe88de60
crcrpar pushed a commit to crcrpar/pytorch that referenced this pull request May 7, 2021
krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021
@haines

haines commented May 20, 2021

Hi, just wanted to check if this is still on the roadmap? We have an autoscaling group of self-hosted runners but it's very unreliable - we often just get "this check failed" with no log output after jobs time out, which I assume is because the service is allocating jobs to the runners that are scaling down. We really need to be able to configure the runners as ephemeral, but if that's not going to ship any time soon we will have to look at another approach.

@sethvargo

@bryanmacfarlane @TingluoHuang 👋 could you please provide an update on whether this will be merged (or on support for ephemeral runners in general)?

@lokesh755
Contributor

@sethvargo We're actively working on this. We should merge this sometime this month, probably sooner :)

@thinkafterbefore

@lokesh755 @TingluoHuang Any updates on when this will hit production?

@tbando

tbando commented Sep 14, 2021

Can I already use the --ephemeral feature on self-hosted GHES, or do I need to wait for a newer release of GHES?

@joeyparrish

This shipped in the latest release, v2.282.0.

@tbando

tbando commented Sep 15, 2021

Oh, I would like to know if I need to update GHES on the server side, as I got an Internal Error when I ran ./config.sh --ephemeral.

@MichaelJJ

With the new --ephemeral flag, is there a way to have the config.sh wait until the runner has de-registered? As an example, if I make a docker image that runs a shell script on startup to register an ephemeral runner, what is the best way to have the script wait until the runner is done so the container doesn't exit?

@joeyparrish

I use a shell script as a docker entrypoint, which calls config.sh --ephemeral followed by run.sh, then terminates. I wrap that in a systemd service that removes and restarts the docker container automatically.

An older version of this (based on --once) is currently available at https://github.com/myoung34/docker-github-actions-runner#ephemeral-mode

I'm working on a PR to update that to use --ephemeral instead.
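
The entrypoint described above might look roughly like the following. This is a sketch under assumptions rather than the script actually in use: `run_ephemeral_once` and the `RUNNER_DIR`, `RUNNER_URL`, and `RUNNER_TOKEN` variables are hypothetical names, while `--url`, `--token`, `--unattended`, and `--ephemeral` are options of the runner's `config.sh`.

```shell
#!/usr/bin/env bash
# Hypothetical container entrypoint: register an ephemeral runner, run one
# job, then return so that the container exits and can be recreated.

run_ephemeral_once() {
  local runner_dir="$1" url="$2" token="$3"

  # Register non-interactively; with --ephemeral the service sends this
  # runner a single job and removes its registration afterwards.
  "$runner_dir/config.sh" --url "$url" --token "$token" \
    --ephemeral --unattended || return 1

  # run.sh blocks while the runner listens for and executes its one job.
  "$runner_dir/run.sh"
}

# Only start when the (assumed) environment variables are provided.
if [ -n "${RUNNER_URL:-}" ] && [ -n "${RUNNER_TOKEN:-}" ]; then
  run_ephemeral_once "${RUNNER_DIR:-/actions-runner}" "$RUNNER_URL" "$RUNNER_TOKEN"
fi
```

Because `run.sh` stays in the foreground until the job is done, the container keeps running for exactly as long as the runner does, so no separate wait on de-registration is needed in this arrangement.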

@zetaab

zetaab commented Sep 20, 2021

This PR currently breaks GitHub Actions on GHES. --once does not work anymore, and --ephemeral is not supported either. In addition, the GitHub Actions runner force-updates itself to the newest version.

@TingluoHuang
Member Author

@zetaab I don't think we changed any behavior for --once in this PR. What exact error/issue did you run into on GHES?

@rofafor

rofafor commented Sep 20, 2021

once was removed from the valid flags, resulting in: Unrecognized command-line input arguments: 'once'.

@TingluoHuang
Member Author

I think that should only give you a warning but not actually fail anything.

@rofafor

rofafor commented Sep 20, 2021

I'm getting An Internal Error Occurred. Activity Id: ... errors when enabling --ephemeral against our GHES 3.0 / 3.1.

@TingluoHuang
Member Author

--ephemeral does not support GHES

@aidan-mundy

@TingluoHuang As far as I can tell, --once is no longer usable with this update. As @rofafor said, it is no longer accepted as a valid flag. See https://github.com/actions/runner/pull/660/files#diff-b1f59ae3d34d9d3811ce43ed0214576cb4d9f3373a6734adf1318b5ab7e535eeL35

@zetaab

zetaab commented Sep 21, 2021

Like @rofafor said: I think your idea was to deprecate the flag, but you actually removed it as well. So now the problem is that --once does not work anymore, and on GHES --ephemeral does not work either.

@thboop
Collaborator

thboop commented Sep 21, 2021

@aidan-mundy, @zetaab can you confirm that you are unable to use the --once flag when configuring the newest runner? You may see an error saying the flag is not available (which is intended, as we want people to eventually move off of it), but the flag still works.

If it doesn't work, please file an issue and provide your runner version and OS.

@TingluoHuang
Member Author

Here is what I just tried.

ting@htl-mac _layout % ./run.sh --once
Unrecognized command-line input arguments: 'once'. For usage refer to: .\config.cmd --help or ./config.sh --help

√ Connected to GitHub

2021-09-21 13:19:26Z: Listening for Jobs
2021-09-21 13:20:15Z: Running job: build
2021-09-21 13:20:18Z: Job build completed with result: Succeeded
ting@htl-mac _layout % $?
zsh: command not found: 0

We do print out an error saying that the flag is no longer recognized, but the runner is still able to connect to the server, run a single job, and exit.

Am I missing something here?

@rofafor

rofafor commented Sep 21, 2021

The --once seems to be working despite the warning message. According to code comments, you've scheduled to remove the once switch completely in 10/2021 - what happens to GHES after that? No more ephemeral runners?

@TingluoHuang
Member Author

@rofafor
We have not decided when to actually remove --once, given that so many customers depend on it today and, for various reasons (e.g., GHES), can't use --ephemeral.

We will keep --once around for a long time, until everyone is ready to move off it.

@TingluoHuang
Member Author

I created a PR to update the comment to make it less confusing. #1360

@thboop ☝️

@hross
Contributor

hross commented Sep 21, 2021

I want to add a summary here so it's obvious if you land here wondering about --once:

  • We don't plan to deprecate this command any time soon and will give notification before we do so. We realize customers still use it (you will receive the warning, though)
  • We recommend that you stop using --once and start using --ephemeral (except on current versions of GHES). The reason is that --ephemeral is a server-side change that ensures there are no race conditions with job assignment. --once was not "officially supported" and is client-side only, which exposes you to the risk of multiple job assignment.
  • --ephemeral will ship in the next version of GHES (but as I said above it requires server changes to fix the race condition with client side only assignment)

If you have any issues with ephemeral/once please feel free to reach out (this issue works but you can also use the community support forms which might have better support for customer questions and let us file support tickets to help you).

More information can be found in this runner issue.

@aidan-mundy

@hross When you say "next version" do you mean in a quarter (with V3.3.0) or in a couple weeks (with V3.2.1)?

@Shegox

Shegox commented Oct 7, 2021

Disclaimer: not a GitHub employee

GitHub normally releases features only in minor (3.x) releases and not in patch releases (3.2.x). So I wouldn't actually expect it before 3.3.x (and maybe even later, but that's up to GitHub to confirm).

The GitHub roadmap currently doesn't specify any concrete date for it.

@hross
Contributor

hross commented Oct 8, 2021

@Shegox is right. We will land it in 3.3.x (next version meaning "next major release").

@aidan-mundy

aidan-mundy commented Nov 9, 2021

For those of you that are enterprise server users and are waiting for this functionality, 3.3.0.rc1 is now available for preview. It includes the --ephemeral flag and a number of other neat features/changes.

(looks like my estimate of "in a quarter" was slightly pessimistic, happy to see the prompt update from the GHES team!)

@Manouchehri

Is there an easy way to run a command after --ephemeral has finished one job?

@sethvargo

@Manouchehri I've achieved this by running the ephemeral runner under systemd and then using an ExecStop or ExecStopPost.
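
Where systemd is not available, a plain shell wrapper is one possible alternative to the hooks mentioned above. The sketch below assumes the runner command returns once its single job has finished (as described for `run.sh` earlier in the thread); `run_then` is a hypothetical helper, not something the runner ships.

```shell
# Hypothetical helper: run the runner command, then a follow-up command,
# and return the runner's own exit status.
# Usage: run_then RUNNER_CMD AFTER_CMD [ARGS...]
run_then() {
  local runner="$1"; shift
  local status=0

  # Assumption: this returns after the ephemeral runner's one job is done.
  "$runner" || status=$?

  # The follow-up command runs whether or not the runner succeeded.
  "$@"

  return "$status"
}
```

For instance, `run_then ./run.sh sudo poweroff` would shut the machine down after the job, and any other cleanup command can be substituted.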
