Delay in allocating a runner for a job #1387

Closed · 1 of 2 tasks

shankarRaman opened this issue Apr 25, 2022 · 21 comments

@shankarRaman

Describe the bug
We noticed an unusual amount of delay (almost 1 hour) before the runners picked up jobs in GitHub. Usually this happens instantly, and there were a bunch of runners in the "idle" state at the time.
How can one troubleshoot this to find the root cause?

Checks

  • My actions-runner-controller version (v0.22.2) does support the feature
  • I'm using an unreleased version of the controller I built from HEAD of the default branch

Expected behavior
I expect the runners to pick up the jobs ASAP.

@toast-gear
Collaborator

ARC isn't involved in jobs actually getting allocated to runners; all ARC is responsible for is scaling runners up and down. Once the runner container is up, outside of the entrypoint, GitHub's runner software takes over. Is this a regular problem or something that only happens now and again? github.com isn't always reliable, so perhaps you are just suffering from their API having issues?

@mumoshu
Collaborator

mumoshu commented Apr 26, 2022

Yeah, this is unlikely to be triggered by ARC. GitHub should assign jobs to idle runners with matching labels. As you already see idle runners, ARC seems to be working correctly. Have you already contacted GitHub support?

@shankarRaman
Author

shankarRaman commented Apr 26, 2022

Thanks @toast-gear @mumoshu - that was my suspicion too. Although I assumed that, since ARC communicates with GitHub to determine scaling needs, ARC is also responsible for notifying GitHub about the availability of the runners.

@shankarRaman
Author

@mumoshu another question - how does ARC handle pods that are still running because the minReplicas count has to be maintained? Wouldn't those hanging pods create a persistent-runner scenario causing unwanted behavior (we see that with "permission denied" messages when trying to clean up the working directory as part of actions/checkout)?

@toast-gear
Collaborator

toast-gear commented Apr 26, 2022

Although I assumed that, since ARC communicates with GitHub to determine scaling needs, ARC is also responsible for notifying GitHub about the availability of the runners.

Afraid not, ARC only talks to GitHub to determine the scaling need or to get a runner's state; the https://github.com/actions/runner software is responsible for actually telling GitHub what its state is (i.e. idle, busy, offline).

Wouldn't those hanging pods create a persistent-runner scenario causing unwanted behavior (we see that with "permission denied" messages when trying to clean up the working directory as part of actions/checkout)?

No, when a pod finishes a job it is destroyed and recreated if required by your constraints (e.g. minReplicas). Runner pods are single use, so nothing is persisted between jobs unless you have actively turned off the ephemeral design: https://github.com/actions-runner-controller/actions-runner-controller#persistent-runners
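
For illustration, a minimal sketch of what actively turning off the ephemeral design would look like on a RunnerDeployment (the name and organization below are placeholders, not taken from this thread):

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runner-deployment
spec:
  template:
    spec:
      organization: example-org
      # Runner pods are ephemeral (single use) by default; setting this to
      # false opts into persistent runners that keep state between jobs.
      ephemeral: false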

@shankarRaman
Author

No, when a pod finishes a job it is destroyed and recreated if required by your constraints (e.g. minReplicas). Runner pods are single use, so nothing is persisted between jobs unless you have actively turned off the ephemeral design: https://github.com/actions-runner-controller/actions-runner-controller#persistent-runners

Hmm... I didn't turn off the ephemeral design. However, I see that a pod that's been "Running" gets a job, completes the job, and continues to be in the Running state, unless the new pod that got created uses the same name.
What, then, would cause the following error when doing a checkout?
rm: cannot remove '/runner/_work/...': Permission denied

@toast-gear
Collaborator

toast-gear commented Apr 26, 2022

Mmmm, that doesn't sound right. Could you confirm which ARC version you're running and share your RunnerDeployment and HorizontalRunnerAutoscaler YAML?

@shankarRaman
Author

This is image version 0.21.1; here's our deployment and HRA YAML:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: krogertechnology-runner-deployment
spec:
  template:
    spec:
      image: docker-prod.registry.kroger.com/tools/actions-runner-dind:v2.289.2-ubuntu-20.04-no-proxy
      imagePullPolicy: IfNotPresent
      organization: krogertechnology
      labels:
      - nonprod-region-east
      - aks
      group: AKS Non-prod
      dockerdWithinRunnerContainer: true
      dockerRegistryMirror: https://docker.registry.kroger.com/
      resources:
        limits:
          cpu: "8"
          memory: "8Gi"
        requests:
          cpu: "1"
          memory: "8Gi"
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: krogertechnology-runner-deployment-autoscaler
spec:
  scaleTargetRef:
    name: krogertechnology-runner-deployment
  minReplicas: 10
  maxReplicas: 200
  metrics:
  - type: PercentageRunnersBusy
    scaleUpThreshold: '0.5'    
    scaleDownThreshold: '0.2'  
    scaleUpAdjustment: 10      
    scaleDownAdjustment: 1   

@toast-gear
Collaborator

toast-gear commented Apr 26, 2022

We had bugs around runners scaling down prior to v0.22.3; whilst I'm not sure if it's related, I'd definitely upgrade to v0.22.3 before spending much time debugging. After upgrading, the logs from the runner after it finishes a job, the controller logs, and the k8s events at that time would be helpful.

@shankarRaman
Author

Okay, I'll work on upgrading the ARC and get back.

@shankarRaman
Author

I updated ARC and the CRDs. I ran a job and saw the pod deleted after the job; however, I don't see a new pod created. I see this message:

Runner failed to register itself to GitHub in timely manner. Recreating the pod to see if it resolves the issue. CAUTION: If you see this a lot, you should investigate the root cause. See https://github.com/actions-runner-controller/actions-runner-controller/issues/288

@shankarRaman
Author

Update: my minReplicas is set to 10, so I assume it won't add pods if there's no demand. The above message caught me off guard since I thought I had messed up the controller.

@toast-gear
Collaborator

It should keep the runner pod count at 10. If a workflow with 1 job in it runs, I would expect:

  • 10 pods to be up and idle
  • 1 pod to take on the job (assuming the labels match)
  • at the end of the job, the pod to be terminated and a new pod to be created, as you have a constraint of 10 runner pods as a minimum

Can you confirm what behaviour you saw?

@mumoshu
Collaborator

mumoshu commented Apr 26, 2022

@shankarRaman Can you also share the full log from the controller, and runner pod logs from kubectl logs -c runner $YOUR_RUNNER_POD, plus the Dockerfile you used to build your custom runner image, if any?

I don't see the new pod created

This can happen when your runner got stuck without exiting after job completion, which doesn't usually happen.

@shankarRaman
Author

@mumoshu sorry for the delay. We had a chat with MS and found that it could be coming from the automatic software update that the runners do when spinning up.
https://docs.github.com/en/actions/hosting-your-own-runners/autoscaling-with-self-hosted-runners#controlling-runner-software-updates-on-self-hosted-runners
We set this flag

name: DISABLE_RUNNER_UPDATE
value: "true"

That helped with runners being available and picking up the jobs.

@mumoshu
Collaborator

mumoshu commented May 10, 2022

@shankarRaman That makes sense. Thanks for reporting back!

@mumoshu
Collaborator

mumoshu commented May 10, 2022

Although I don't know how auto-update results in a long delay like you described in:

We noticed an unusual amount of delay (almost 1 hour) for the runners to pick the jobs in Github

We recently revised our bug report form so that everyone is encouraged to provide full context about the problem

https://github.com/actions-runner-controller/actions-runner-controller/issues/new?assignees=&labels=bug&template=bug_report.yml&title=Bug

Perhaps you can review it and provide missing information so that we can investigate further, if necessary.

@toast-gear
Collaborator

toast-gear commented May 11, 2022

@shankarRaman are you able to link a raised issue against actions/runner? I guess the problem is probably in their backend rather than the runner software, but I think that would be a good place to raise it initially. I don't think you are the first person to report this, so it would be great if there was a public issue we can link to, to cut down on duplicate issues being raised in ARC.

@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@sjm-ho

sjm-ho commented Jun 22, 2024

@shankarRaman where did you set this flag, exactly?

name: DISABLE_RUNNER_UPDATE
value: "true"

in controller-manager's values.yaml or the scale-set's values.yaml?

@shankarRaman
Author

@shankarRaman where did you set this flag, exactly?

name: DISABLE_RUNNER_UPDATE
value: "true"

in controller-manager's values.yaml or the scale-set's values.yaml?

@sjm-ho this was added to the RunnerDeployment object (the legacy ARC/runner solution) and the values were injected as environment variables.
HTH
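
For anyone reading this later, a minimal sketch of how that environment variable can be set on a legacy RunnerDeployment (the name and organization below are placeholders; the env entries are passed through to the runner container):

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runner-deployment
spec:
  template:
    spec:
      organization: example-org
      env:
      # Stops the runner software from auto-updating itself when the pod spins up
      - name: DISABLE_RUNNER_UPDATE
        value: "true"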
