Self-hosted Actions Runners #178369
Hey 👋, this kind of delay usually points to how the controller receives and processes scale events rather than to node capacity itself. Below are a few things to check that normally reveal where the bottleneck is.

🔍 1. Inspect the controller logs

Run:

```bash
kubectl logs -n actions-runner-system deployment/arc-runner-controller
```

Look for scale-up and webhook-related entries. If these appear long before a pod gets created, the controller might be missing or delaying webhook events.

⚙️ 2. Validate your RunnerDeployment / HRA configuration

Make sure the HorizontalRunnerAutoscaler targets the intended RunnerDeployment, its minReplicas/maxReplicas leave room to scale up, and the runner labels match what your workflows request in runs-on. A sketch of what that can look like is below.
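If you're running the summerwind-based ARC (which the arc-runner-controller deployment name suggests), scale-up is driven by a HorizontalRunnerAutoscaler pointing at your RunnerDeployment. Here's a minimal sketch of a webhook-driven HRA; the names, namespace, limits, and duration are placeholders, not your actual configuration:

```yaml
# Hypothetical example: adjust names, namespace, and limits to your setup.
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-runners-autoscaler
  namespace: actions-runner-system
spec:
  scaleTargetRef:
    kind: RunnerDeployment
    name: example-runners        # must match your RunnerDeployment exactly
  minReplicas: 1                 # a small idle pool avoids cold-start waits
  maxReplicas: 10
  scaleUpTriggers:
    - githubEvent:
        workflowJob: {}          # scale up on workflow_job webhook events
      duration: "30m"            # keep the extra capacity around this long
```

A scaleTargetRef that doesn't match the RunnerDeployment name, or a maxReplicas ceiling that's already been reached, can look exactly like a single job stuck waiting for a runner while everything else succeeds.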
🧩 3. Check pending pods directly

```bash
kubectl get pods -A | grep runner
kubectl describe pod <pod-name>
```

If pods are stuck in Pending, the Events section of kubectl describe usually tells you why (scheduling constraints, node readiness, image pulls); the snippet below pulls the relevant events in one pass.
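If you'd rather not describe pods one at a time, recent cluster events filtered for runner pods give the same picture. A small sketch, assuming your runner pod names contain "runner" (adjust the grep pattern if they don't):

```bash
# List recent events across all namespaces, newest last,
# and keep only the lines that mention runner pods.
kubectl get events --all-namespaces --sort-by=.lastTimestamp \
  | grep -i runner \
  | tail -n 40
```

FailedScheduling and image-pull events show up here, and the timestamps make it easy to tell whether the gap is before pod creation (controller side) or after it (scheduler/node side).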
Sometimes cluster autoscaling is fast, but the kube-scheduler still waits for nodes to register as Ready before binding pods.

🔧 4. Turn on debug logging temporarily

Add to the controller deployment:

```yaml
env:
  - name: LOG_LEVEL
    value: debug
```

Re-apply the deployment so the controller restarts with verbose logging.

🚀 5. Mitigations that usually help

- Keep a tiny idle pool of runners (minReplicas of at least 1) so there is always a warm runner for the occasional straggler job; see the sketch after this list.
- Leave debug logging on until the delay reproduces, so the next occurrence is captured in the controller logs.
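If the HRA already exists and you just want to try the idle-pool mitigation without editing manifests, a one-off patch is enough. This is a sketch; the resource name and namespace are the placeholders from the earlier example, not real objects in your cluster:

```bash
# Raise the autoscaler's floor to 1 so one warm runner is always available.
# Resource name and namespace are hypothetical; substitute your own.
kubectl patch horizontalrunnerautoscaler example-runners-autoscaler \
  -n actions-runner-system \
  --type merge \
  -p '{"spec":{"minReplicas":1}}'
```

You can revert it once you've found the root cause if you don't want an always-on runner.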
In most environments, enabling debug logs and maintaining a tiny idle pool fixes the 30-minute delay completely.
🕒 Discussion Activity Reminder 🕒

This Discussion has been labeled as dormant by an automated system for having no activity in the last 60 days. Please consider one of the following actions:

1️⃣ Close as Out of Date: If the topic is no longer relevant, close the Discussion as out of date.

2️⃣ Provide More Information: Share additional details or context, or let the community know if you've found a solution on your own.

3️⃣ Mark a Reply as Answer: If your question has been answered by a reply, mark the most helpful reply as the solution.

Note: This dormant notification will only apply to Discussions with the dormant label.

Thank you for helping bring this Discussion to a resolution! 💬
Why are you starting this discussion?
Question
What GitHub Actions topic or product is this about?
ARC (Actions Runner Controller)
Discussion Details
We self-host arc-runner-controller in our Kubernetes cluster, and there's something I can't reproduce but see once in a while: a PR triggers 5 workflows, 4 succeed, and the last one sits waiting for a runner.
Normally that would be fine, but it takes 30+ minutes for the job to be picked up, and we have no problem scaling nodes. Is there any way to get more insight into why this happens and where the holdup is?