Delay in allocating a runner for a job #1387
Comments
ARC isn't involved in jobs actually getting allocated to runners; all ARC is responsible for is scaling runners up and down. Once the runner container is up, outside of the entrypoint, GitHub's runner software takes over. Is this a regular problem or something that only happens now and again? github.com is not always reliable, so perhaps you are just suffering from their API having issues? |
Yeah, this is unlikely to be triggered by ARC. GitHub should assign jobs to idle runners with matching labels. As you already see idle runners, ARC seems to be working correctly. Have you already contacted GitHub support? |
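For context on the label matching mentioned above, here is a minimal workflow sketch: GitHub routes a queued job to any idle self-hosted runner whose labels cover the job's `runs-on` list. The `aks` and `nonprod-region-east` labels come from the RunnerDeployment shared later in this thread; the workflow name and steps are purely illustrative.

```yaml
name: example-build
on: push
jobs:
  build:
    # must match the labels advertised by the ARC-managed runners
    runs-on: [self-hosted, aks, nonprod-region-east]
    steps:
      - uses: actions/checkout@v3
      - run: echo "picked up by an idle runner with matching labels"
```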
Thanks @toast-gear @mumoshu - that's what my suspicion was too. Although I assumed, since ARC communicates with GitHub to determine scaling needs, that ARC is also responsible for notifying GitHub about the availability of the runners. |
@mumoshu another question - how does ARC handle pods that are still running because the |
Afraid not, ARC only talks to GitHub to determine the scaling need or to get the runner's state; the https://github.com/actions/runner software is responsible for actually telling GitHub what its state is (i.e. idle, busy, offline).
No, when a pod finishes a job it is destroyed and created again if required, as set by the constraints (e.g. minReplicas); runner pods are single-use so nothing is persisted between jobs unless you have actively turned off the ephemeral design https://github.com/actions-runner-controller/actions-runner-controller#persistent-runners |
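For clarity, the linked persistent-runners behaviour is an opt-out of the ephemeral default via the runner spec. A minimal sketch, assuming the `ephemeral` field of the RunnerDeployment spec; the name and organization here are illustrative:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-persistent-runners   # illustrative name
spec:
  template:
    spec:
      organization: example-org       # illustrative
      # runners are single-use (ephemeral) by default; setting this to
      # false is the persistent-runners opt-out described in the link
      ephemeral: false
```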
Hmm... I didn't turn off the ephemeral design. However, I see that a pod that's been "running" gets a job, completes the job, and continues to be in the running state. Unless the new pod that got created uses the same name. |
Mmmm, that doesn't sound right. Could you: |
This is the image version 0.21.1; here's our deployment and HRA YAML:
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: krogertechnology-runner-deployment
spec:
  template:
    spec:
      image: docker-prod.registry.kroger.com/tools/actions-runner-dind:v2.289.2-ubuntu-20.04-no-proxy
      imagePullPolicy: IfNotPresent
      organization: krogertechnology
      labels:
        - nonprod-region-east
        - aks
      group: AKS Non-prod
      dockerdWithinRunnerContainer: true
      dockerRegistryMirror: https://docker.registry.kroger.com/
      resources:
        limits:
          cpu: "8"
          memory: "8Gi"
        requests:
          cpu: "1"
          memory: "8Gi"
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: krogertechnology-runner-deployment-autoscaler
spec:
  scaleTargetRef:
    name: krogertechnology-runner-deployment
  minReplicas: 10
  maxReplicas: 200
  metrics:
    - type: PercentageRunnersBusy
      scaleUpThreshold: '0.5'
      scaleDownThreshold: '0.2'
      scaleUpAdjustment: 10
      scaleDownAdjustment: 1 |
We had bugs around runners scaling down prior to v0.22.3; whilst I'm not sure if it's related, I'd definitely upgrade to v0.22.3 before spending much time debugging. After upgrading, the logs from the runner after it finishes a job, the controller logs, and the k8s events at that time would be helpful. |
Okay, I'll work on upgrading the ARC and get back. |
I updated the ARC and the CRDs. I ran a job and saw the pod deleted after the job; however, I don't see a new pod created. I see this message |
Update: my |
It should keep the runner pod count at 10; if a workflow with 1 job in it runs I would expect:
Can you confirm what behaviour you saw? |
@shankarRaman Can you also share the full log from the controller, and runner pod logs from
This can happen when your runner got stuck without exiting after job completion, which doesn't usually happen. |
@mumoshu sorry for the delay. We had a chat with MS and found that it could be coming from the automatic software update that the runners do when spinning up. We set:
  name: DISABLE_RUNNER_UPDATE
  value: "true"
That helped with runners being available and picking up the jobs. |
@shankarRaman That makes sense. Thanks for reporting back! |
Although I don't know how auto-update results in a long delay like you described in:
We recently revised our bug report form so that everyone is encouraged to provide full context about the problem. Perhaps you can review it and provide the missing information so that we can investigate further, if necessary. |
@shankarRaman are you able to link a raised issue against actions/runner? I guess the problem is probably in their backend rather than the runner software, but I think it would be a good place to raise it initially. I don't think you are the first person to report this, so it would be great if there was a public issue raised which we can link to, to cut down on duplicate issues being raised against ARC. |
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days. |
@shankarRaman where did you set this flag, exactly?
In the controller-manager's values.yaml or the scale-set's values.yaml? |
@sjm-ho this was added to the |
Describe the bug
We noticed an unusual amount of delay (almost 1 hour) for the runners to pick up jobs in GitHub. Usually this happens instantly, and there were a bunch of runners in the "idle" state.
How can one troubleshoot this to find out where the root cause is?
Checks
Expected behavior
I expect the runners to pick up jobs as soon as possible.