Delay in allocating a runner for a job #1387
Comments
ARC isn't involved in jobs actually getting allocated to runners; all ARC is responsible for is scaling runners up and down. Once the runner container is up, outside of the entrypoint, GitHub's runner software takes over. Is this a regular problem or something that only happens now and again? github.com is not always reliable, so perhaps you are just suffering from their API having issues? |
Yeah, this is unlikely to be triggered by ARC. GitHub should assign jobs to idle runners with matching labels. As you already see idle runners, ARC seems to be working correctly. Have you already contacted GitHub support? |
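For context on the label matching mentioned above, here is a minimal workflow sketch: GitHub routes a queued job to any idle self-hosted runner whose labels cover the job's `runs-on` list. The `aks` and `nonprod-region-east` labels come from the RunnerDeployment shared later in this thread; the workflow name and steps are purely illustrative.

```yaml
name: example-build
on: push
jobs:
  build:
    # must match the labels advertised by the ARC-managed runners
    runs-on: [self-hosted, aks, nonprod-region-east]
    steps:
      - uses: actions/checkout@v3
      - run: echo "picked up by an idle runner with matching labels"
```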
Thanks @toast-gear @mumoshu - that's what my suspicion was too. Although I assumed, since ARC communicates with GitHub to determine scaling needs, that ARC is also responsible for notifying GitHub about the availability of the runners. |
@mumoshu another question - how does ARC handle pods that are still running because the |
Afraid not, ARC only talks to GitHub to determine the scaling need or to get the runner's state; the https://github.com/actions/runner software is responsible for actually telling GitHub what its state is (i.e. idle, busy, offline).
No, when a pod finishes a job it is destroyed and created again if required, as set by the constraints (e.g. minReplicas); runner pods are single-use so nothing is persisted between jobs unless you have actively turned off the ephemeral design https://github.com/actions-runner-controller/actions-runner-controller#persistent-runners |
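For clarity, the linked persistent-runners behaviour is an opt-out of the ephemeral default via the runner spec. A minimal sketch, assuming the `ephemeral` field of the RunnerDeployment spec; the name and organization here are illustrative:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-persistent-runners   # illustrative name
spec:
  template:
    spec:
      organization: example-org       # illustrative
      # runners are single-use (ephemeral) by default; setting this to
      # false is the persistent-runners opt-out described in the link
      ephemeral: false
```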
Hmm... I didn't turn off the ephemeral design. However, I see that a pod that's been "running" gets a job, completes the job, and continues to be in the running state. Unless the new pod that got created uses the same name. |
Mmmm, that doesn't sound right. Could you: |
This is the image version 0.21.1; here's our deployment and HRA YAML:
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: krogertechnology-runner-deployment
spec:
  template:
    spec:
      image: docker-prod.registry.kroger.com/tools/actions-runner-dind:v2.289.2-ubuntu-20.04-no-proxy
      imagePullPolicy: IfNotPresent
      organization: krogertechnology
      labels:
        - nonprod-region-east
        - aks
      group: AKS Non-prod
      dockerdWithinRunnerContainer: true
      dockerRegistryMirror: https://docker.registry.kroger.com/
      resources:
        limits:
          cpu: "8"
          memory: "8Gi"
        requests:
          cpu: "1"
          memory: "8Gi"
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: krogertechnology-runner-deployment-autoscaler
spec:
  scaleTargetRef:
    name: krogertechnology-runner-deployment
  minReplicas: 10
  maxReplicas: 200
  metrics:
    - type: PercentageRunnersBusy
      scaleUpThreshold: '0.5'
      scaleDownThreshold: '0.2'
      scaleUpAdjustment: 10
      scaleDownAdjustment: 1 |
We had bugs around runners scaling down prior to v0.22.3; whilst I'm not sure if it's related, I'd definitely upgrade to v0.22.3 before spending much time debugging. After upgrading, the logs from the runner after it finishes a job, the controller logs, and the k8s events at that time would be helpful. |
Okay, I'll work on upgrading the ARC and get back. |
I updated the ARC and the CRDs. I ran a job and saw the pod deleted after the job; however, I don't see a new pod created. I see this message |
Update: my |
It should keep the runner pod count at 10; if a workflow with 1 job in it runs I would expect:
Can you confirm what behaviour you saw? |
@shankarRaman Can you also share the full log from the controller, and runner pod logs from
This can happen when your runner got stuck without exiting after job completion, which doesn't usually happen. |
@mumoshu sorry for the delay. We had a chat with MS and found that it could be coming from the automatic software update that the runners do when spinning up. We set:
  name: DISABLE_RUNNER_UPDATE
  value: "true"
That helped with runners being available and picking up the jobs. |
@shankarRaman That makes sense. Thanks for reporting back! |
Although I don't know how auto-update results in a long delay like you described in:
We recently revised our bug report form so that everyone is encouraged to provide full context about the problem. Perhaps you can review it and provide the missing information so that we can investigate further, if necessary. |
@shankarRaman are you able to link a raised issue against actions/runner? I guess the problem is probably in their backend rather than the runner software, but I think it would be a good place to raise it initially. I don't think you are the first person to report this, so it would be great if there was a public issue raised which we can link to, to cut down on duplicate issues being raised against ARC. |
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days. |
@shankarRaman where did you set this flag, exactly?
In the controller-manager's values.yaml or the scale-set's values.yaml? |
@sjm-ho this was added to the |
Describe the bug
We noticed an unusual amount of delay (almost 1 hour) for the runners to pick up jobs in GitHub. Usually this happens instantly, and there were a bunch of runners in the "idle" state.
How can one troubleshoot this to find out where the root cause is?
Checks
Expected behavior
I expect the runners to pick up jobs as soon as possible.