Feature request: provide a signal to other programs when the runner has started a job #699

Closed
j3parker opened this issue Sep 10, 2020 · 5 comments
Labels
enhancement New feature or request

Comments


j3parker commented Sep 10, 2020

It would be helpful if the runner could signal to other programs that it has picked up a job.

We intend to use ephemeral runners (#510). We want an Auto Scaling group (ASG) in AWS that holds only spare/idle capacity. When a job starts, we want to remove the runner from that ASG so that its replacement is launched in parallel with the job. The alternative is to scale up only after the job ends (and the machine terminates), but this adds latency. If a burst of jobs used up all of our capacity we would get excessive queueing, and if the jobs are long-running it could get really bad.

(You probably don't care about these details.) When we detect that a job has started, we trigger a Lambda function that looks at the caller identity and removes the caller from its ASG. We do it this way so that the runner VM needs only very narrow AWS permissions.
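To make the Lambda step concrete, here is a minimal sketch of the "remove the caller from its ASG" logic. It is not the author's actual function. It assumes the runner VM calls with its EC2 instance-role credentials, so the caller ARN ends in the instance ID, and it takes a boto3 Auto Scaling client as a parameter. The helper names `instance_id_from_arn` and `detach_caller` are hypothetical, and how the caller ARN reaches the function is left out.

```python
from typing import Optional


def instance_id_from_arn(caller_arn: str) -> str:
    """Extract the EC2 instance ID from an assumed-role ARN.

    Assumption: the ARN has the form
    arn:aws:sts::<account>:assumed-role/<role-name>/<instance-id>,
    which is what instance-profile credentials produce.
    """
    fields = caller_arn.split(":", 5)
    resource = fields[5] if len(fields) == 6 else ""
    parts = resource.split("/")
    if len(parts) != 3 or parts[0] != "assumed-role" or not parts[2].startswith("i-"):
        raise ValueError(f"not an EC2 instance-role ARN: {caller_arn}")
    return parts[2]


def detach_caller(autoscaling, caller_arn: str) -> Optional[str]:
    """Detach the calling instance from whatever ASG it belongs to.

    `autoscaling` is expected to behave like boto3.client("autoscaling").
    Returns the ASG name, or None if the instance is in no ASG
    (for example because it was already detached).
    """
    instance_id = instance_id_from_arn(caller_arn)
    found = autoscaling.describe_auto_scaling_instances(InstanceIds=[instance_id])
    instances = found["AutoScalingInstances"]
    if not instances:
        return None
    group = instances[0]["AutoScalingGroupName"]
    autoscaling.detach_instances(
        InstanceIds=[instance_id],
        AutoScalingGroupName=group,
        # Keep the desired capacity unchanged so the ASG launches a replacement.
        ShouldDecrementDesiredCapacity=False,
    )
    return group
```

Because the function looks the instance up before detaching it, a second call for the same instance returns `None` instead of failing, so the runner can safely trigger it more than once.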

How we've worked around this: with --once (which we tried using before noticing #510) we were looking at the runner stdout for the "Job started" message. This isn't the classiest thing to do (but it at least gracefully degrades -- if the message were to change then we would just have more build queueing). Any design you come up with would be fine for us.
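The stdout workaround described above could look roughly like the sketch below. It is illustrative only: the marker text is passed in by the caller (the issue quotes it as "Job started", and the exact wording the runner prints is not verified here), and `on_start` is a hypothetical callback standing in for whatever triggers the ASG removal.

```python
import subprocess
import sys
from typing import Callable, Iterable, List


def watch_for_marker(lines: Iterable[str], marker: str,
                     on_start: Callable[[], None]) -> bool:
    """Forward every line to stdout and call on_start the first time marker appears.

    Returns True if the marker was seen. If it never appears (for example
    because the message text changed), nothing else breaks.
    """
    fired = False
    for line in lines:
        sys.stdout.write(line)  # keep the runner's log visible
        if not fired and marker in line:
            fired = True
            try:
                on_start()
            except Exception as exc:  # a failed notification must not kill the runner
                sys.stderr.write(f"job-start notification failed: {exc}\n")
    return fired


def run_and_watch(command: List[str], marker: str,
                  on_start: Callable[[], None]) -> int:
    """Start the runner as a child process and watch its combined output."""
    proc = subprocess.Popen(command, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    assert proc.stdout is not None
    watch_for_marker(proc.stdout, marker, on_start)
    return proc.wait()
```

A wrapper might call `run_and_watch(["./run.sh", "--once"], "Job started", notify)`, where `notify` is the hypothetical function that invokes the Lambda.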

j3parker added the enhancement label Sep 10, 2020
Author

j3parker commented Sep 10, 2020

This would also make scaling-in easier: if our ASG is only holding spare/idle capacity then it is safe to stop those VMs. If our ASG holds both idle and active runners then scaling-in is more complicated (we can't just let the ASG pick a random machine -- but we don't know which ones it should pick).

There is a small race condition here: the runner may have picked up a job before we have detached it from the ASG... but this is something we could probably live with easily (we're talking about a delay of < 1 second here). And there are ways we could solve that too.

Member

bryanmacfarlane commented Sep 10, 2020

In the meantime, one easy way to get a signal: the runner has two processes. One listens to the queue (long-running) and spawns the other, a worker process (one per job). Ideally you could hook into process start/stop, and at worst poll (ps aux | grep ...).
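As a rough illustration of the polling fallback, the sketch below checks a process listing for the worker. The process name `Runner.Worker` is an assumption about how the per-job worker shows up in `ps`, so check it against your own runner install. The function names are hypothetical.

```python
import subprocess
import time
from typing import Callable

WORKER_NAME = "Runner.Worker"  # assumed name of the per-job worker process


def worker_running(ps_output: str, name: str = WORKER_NAME) -> bool:
    """Return True if any command line in the ps output mentions the worker."""
    return any(name in line for line in ps_output.splitlines())


def list_processes() -> str:
    """Return one command line per running process (POSIX ps, no header)."""
    result = subprocess.run(["ps", "-A", "-o", "args="],
                            capture_output=True, text=True, check=True)
    return result.stdout


def wait_for_worker(poll_seconds: float = 1.0,
                    snapshot: Callable[[], str] = list_processes,
                    sleep: Callable[[float], None] = time.sleep) -> None:
    """Block until a worker process appears, polling every poll_seconds."""
    while not worker_running(snapshot()):
        sleep(poll_seconds)
```

Unlike `ps aux | grep ...`, this does not match its own command line, as long as the worker name is not passed to the script as an argument. Polling still adds up to one interval of delay before the job start is noticed.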

Author

j3parker commented Sep 10, 2020

Ok, thanks! Maybe we'll combine that with our stdout spying for extra assurance (the "remove from ASG" operation is idempotent -- so it'd be safe for us to do both.)

I'm ok with temporary solutions like that, and I'm ok with any permanent solution you would come up with :) Having a documented/supported approach would probably be good because this might be a somewhat common scenario (we're also considering open-sourcing our AWS solution, but don't want to if we're doing weird hacks.)

Collaborator

thboop commented Mar 14, 2022

We recently published an ADR for Job Started / Job Completed hooks for self-hosted runners. Feel free to provide your feedback.

In particular, we would love to hear what else (if anything) you would need to support your use case, and whether the interface makes sense for you.

Collaborator

thboop commented Mar 30, 2022

We've shipped a beta of this functionality in 2.289.1. Please try it out and provide any feedback you have on the ADR!
