
Building a Self-Healing GitLab Runner Fleet on Kubernetes


Your CI/CD pipeline just failed at 3 AM because a runner crashed mid-deployment. By the time you wake up, your team has lost hours of productivity, your staging environment is in an inconsistent state, and your morning standup opens with “Why is the build still broken?”

This scenario plays out across engineering teams every week. A runner process consumes all available memory. A node unexpectedly reboots. A network partition isolates half your build fleet. The traditional response—spinning up more static runners and hoping for better luck—merely masks the underlying fragility of your infrastructure.

The core problem isn’t the failures themselves. Infrastructure fails. What separates resilient systems from brittle ones is how they respond to failure. When a runner crashes, can your infrastructure detect the failure, route jobs away from the unhealthy node, provision a replacement, and rebalance the workload—all before the next developer pushes code? Most runner deployments can’t. They rely on manual intervention, alert fatigue, and heroic debugging sessions to maintain availability.

Static runner fleets compound this problem as teams scale. You overprovision to handle peak load, burning budget on idle resources. Or you underprovision to control costs, creating queues that slow down every deployment. Manual scaling becomes a daily task. Health checks become someone’s monitoring dashboard. Runner updates become scheduled maintenance windows that block productivity.

The solution isn’t more runners. It’s self-healing infrastructure that treats runners as ephemeral, replaceable resources. Before architecting that system, we need to understand exactly what makes traditional runner infrastructure so expensive to maintain.

The Hidden Cost of Static Runner Infrastructure

Most organizations start their GitLab CI/CD journey by provisioning a handful of dedicated VMs as runners. A few m5.xlarge instances handle the initial workload just fine. But as development velocity increases and the number of pipelines multiplies, this approach reveals crippling inefficiencies that drain both budget and engineering time.

Visual: Cost analysis showing queue times and resource waste in static runner infrastructure

The Bottleneck Pattern

Static runner infrastructure operates on fixed capacity. When your team provisions five runners with 8 CPU cores each, you have exactly 40 cores available—regardless of whether you’re running 2 jobs at midnight or 60 during peak deployment hours. During off-hours, you’re burning money on idle compute. During peak times, jobs queue for minutes while developers wait, context-switch, and lose flow state.

The math becomes brutal at scale. A team running 200 daily deployments with an average queue time of 3 minutes wastes 10 developer-hours per day waiting. At a loaded cost of $150 per engineering hour, that’s $1,500 daily—or $390,000 annually—lost to infrastructure bottlenecks. Meanwhile, your runner fleet sits at 15% utilization overnight and on weekends, burning another $30,000 per year in wasted cloud spend.
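The arithmetic is simple enough to sanity-check in a shell; all inputs below are the scenario's assumptions, not measurements:

```shell
# Back-of-envelope cost of CI queue time, using the figures above
DEPLOYS_PER_DAY=200
QUEUE_MINUTES=3
HOURLY_RATE=150     # loaded cost per engineering hour
WORK_DAYS=260       # working days per year

DAILY_HOURS=$(( DEPLOYS_PER_DAY * QUEUE_MINUTES / 60 ))   # 10 developer-hours/day
DAILY_COST=$(( DAILY_HOURS * HOURLY_RATE ))               # $1,500/day
ANNUAL_COST=$(( DAILY_COST * WORK_DAYS ))                 # $390,000/year

echo "wasted: ${DAILY_HOURS}h/day, \$${DAILY_COST}/day, \$${ANNUAL_COST}/year"
```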

The Overprovisioning Trap

Faced with queue times, teams overcorrect by massively overprovisioning runners. You might deploy 20 m5.2xlarge instances to handle peak load, giving you 160 cores. This eliminates queuing but creates a different problem: you’re now paying for 160 cores 24/7 to serve workloads that peak at 80 cores for 3 hours per day. Your effective utilization drops to 8%, transforming a $4,000 monthly runner bill into $32,000.

The real cost extends beyond compute. Overprovisioned runners require security patching, OS updates, GitLab Runner version management, and monitoring. Each runner needs SSH access controls, log rotation configuration, and integration with your observability stack. A platform engineer spending 5 hours monthly maintaining 20 runners could instead be building developer productivity tools.

The Maintenance Burden

Static runners accumulate technical debt. Disk space fills with Docker layers and cached dependencies. Runner registration tokens expire. OS packages drift out of compliance. A runner crashes during a critical deployment, and suddenly you’re SSH-ing into a VM at 2 AM, manually restarting services while your deployment window closes.

Multiply this across 20 runners, and maintenance becomes a part-time job. Teams implement elaborate Ansible playbooks, custom health checks, and runbooks for common failure modes. The infrastructure meant to accelerate delivery becomes the thing slowing you down.

Dynamic, Kubernetes-based runner infrastructure solves these problems by treating runners as ephemeral workloads that scale with demand and self-heal when failures occur. The next section explores how GitLab’s Kubernetes executor fundamentally changes the runner paradigm.

Kubernetes Executor Architecture: Beyond Basic Deployments

The Kubernetes executor fundamentally redefines how GitLab Runners operate. Unlike Docker or Shell executors that run jobs within persistent runner containers, the Kubernetes executor spawns ephemeral pods for each job. This architectural shift eliminates state accumulation, prevents resource leakage between jobs, and provides genuine multi-tenancy—but it requires rethinking how you design your runner infrastructure.

Visual: Architecture diagram showing pod-per-job execution model with build, helper, and service containers

Pod-per-Job Execution Model

When a Kubernetes-based runner picks up a job, it doesn’t execute the job itself. Instead, it acts as an orchestrator that creates a new pod containing multiple containers: a build container running your CI/CD commands, service containers for dependencies like databases or Redis, and a helper container that handles Git operations, cache transfers, and artifact uploads. The runner manager pod remains lightweight, monitoring job pods and reporting status back to GitLab. This delegation model means a single runner manager can coordinate dozens of concurrent jobs without performance degradation, because the actual workload runs in separate pods with dedicated resources.

This approach delivers automatic cleanup—when the job completes, Kubernetes garbage-collects the entire pod, including any filesystem artifacts, network configurations, and secret mounts. There’s no need for cleanup scripts or manual intervention to prevent disk space exhaustion. Failed jobs leave no trace beyond logs, which can be shipped to centralized storage before pod termination.

Resource Isolation and Security Boundaries

In multi-tenant environments where different teams or projects share the same runner fleet, resource isolation becomes critical. The Kubernetes executor leverages native pod resource requests and limits to enforce CPU and memory boundaries. You can configure runner pools with different resource profiles—lightweight runners for unit tests with 500m CPU and 1GB RAM, heavyweight runners for integration tests with 4 CPUs and 16GB RAM—all scheduled by Kubernetes based on node availability.

Security boundaries extend beyond compute resources. Each job pod runs with its own service account, enabling fine-grained RBAC policies. A job building a frontend application doesn’t need the same permissions as one deploying infrastructure changes. Pod Security Standards (replacing deprecated Pod Security Policies) enforce restrictions on privileged containers, host namespace access, and volume types. For regulated industries, you can configure runners to schedule jobs only on nodes with specific taints or labels, ensuring compliance workloads run on hardened, isolated infrastructure.

💡 Pro Tip: Use separate Kubernetes namespaces for different runner pools. This creates hard resource quotas and prevents a single team’s misconfigured job from consuming the entire cluster’s capacity.

Network Topology for Private Registries and Internal Services

The pod-per-job model introduces network complexity that static runners avoid. Every job pod needs connectivity to your GitLab instance to clone repositories, to your container registry to pull images, and potentially to internal services for integration testing. In air-gapped or private cloud environments, this requires deliberate network design.

GitLab communicates with runners over HTTPS on port 443, but runners initiate all connections—there’s no inbound traffic to runner pods. However, job pods need outbound access to package registries, artifact storage, and deployment targets. Network policies should allow job pods to reach these destinations while blocking lateral movement between different teams’ job pods. For private container registries, configure imagePullSecrets in your runner’s Helm values to inject credentials into every job pod automatically.
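As a sketch, a namespace-scoped NetworkPolicy along these lines enforces that egress-only posture; the namespace name, CIDR ranges, and port list are placeholders you would tighten to your environment:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ci-jobs-egress-only
  namespace: gitlab-runner-jobs
spec:
  podSelector: {}            # applies to every job pod in the namespace
  policyTypes:
    - Ingress
    - Egress
  ingress: []                # no inbound traffic; runners initiate all connections
  egress:
    - to:                    # DNS resolution
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
    - to:                    # GitLab, registry, package mirrors over HTTPS
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 10.0.0.0/8   # placeholder: block lateral movement into internal ranges
      ports:
        - protocol: TCP
          port: 443
```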

DNS resolution deserves particular attention. Job pods must resolve your GitLab hostname and any internal service dependencies. If your cluster uses a private DNS server, ensure dnsPolicy and dnsConfig in runner pod templates propagate the correct nameservers. Misconfigured DNS is among the most common causes of intermittent job failures in Kubernetes-based CI/CD.
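The Kubernetes executor exposes this through dns_policy and dns_config in the runner configuration. A sketch in Helm values, assuming a private resolver at 10.0.0.2 and an internal search domain (both placeholders):

```yaml
runners:
  config: |
    [[runners]]
      [runners.kubernetes]
        dns_policy = "none"                      # ignore the node's resolv.conf entirely
        [runners.kubernetes.dns_config]
          nameservers = ["10.0.0.2"]             # placeholder: your private resolver
          searches = ["internal.company.com"]    # placeholder: internal search domain
```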

With this architectural foundation established, you’re ready to translate these concepts into a production deployment using Helm charts that encode these patterns as reusable configuration.

Deploying Runners with Helm: Production-Ready Configuration

The GitLab Runner Helm chart provides a solid foundation, but the default configuration leaves critical production concerns unaddressed. A runner that works in testing often fails under load due to resource contention, cache misses, or inefficient pod scheduling. This section walks through a production-ready configuration that handles these failure modes explicitly.

Registration and Namespace Isolation

Start by isolating runner workloads in a dedicated namespace with resource quotas. This prevents runaway jobs from starving other cluster workloads:

values.yaml
gitlabUrl: "https://gitlab.company.com"
runnerRegistrationToken: "glrt-a8f3k9m2p5w7x1z4"
concurrent: 20
runners:
  locked: false
  tags: "kubernetes,amd64,standard"
  config: |
    [[runners]]
      [runners.kubernetes]
        namespace = "gitlab-runner-jobs"
        privileged = false
        cpu_limit = "2000m"
        cpu_request = "500m"
        memory_limit = "4Gi"
        memory_request = "1Gi"
        service_cpu_limit = "500m"
        service_cpu_request = "100m"
        service_memory_limit = "1Gi"
        service_memory_request = "256Mi"
        helper_cpu_limit = "200m"
        helper_cpu_request = "100m"
        helper_memory_limit = "256Mi"
        helper_memory_request = "128Mi"
        poll_timeout = 360

The cpu_request value determines bin-packing efficiency. Setting it too low causes node oversubscription and OOM kills. Setting it too high wastes resources. Start with 25% of the limit and adjust based on actual usage patterns observed in your metrics.

Namespace isolation provides a blast radius boundary. Create the namespace before deploying the chart and apply a ResourceQuota to cap total consumption. Without this, a batch of concurrent jobs can exhaust cluster resources, impacting unrelated workloads. The quota acts as a circuit breaker—jobs queue when the namespace hits its limit rather than degrading cluster-wide performance.
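A minimal sketch of that namespace plus quota; the capacity numbers are placeholders to size against your cluster:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: gitlab-runner-jobs
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ci-jobs-quota
  namespace: gitlab-runner-jobs
spec:
  hard:
    requests.cpu: "40"      # cap aggregate CPU requested by job pods
    requests.memory: 80Gi
    limits.cpu: "80"
    limits.memory: 160Gi
    pods: "60"              # hard ceiling on concurrent job pods
```

Once the quota is hit, new job pods stay Pending and jobs queue, rather than starving unrelated workloads.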

Resource Management for Job Pods

Job pods consist of three containers: the build container, helper, and any services (like databases for integration tests). Resource limits prevent a single job from consuming excessive cluster capacity:

values.yaml
runners:
  config: |
    [[runners]]
      [runners.kubernetes]
        namespace = "gitlab-runner-jobs"
        [runners.kubernetes.pod_labels]
          "app.kubernetes.io/managed-by" = "gitlab-runner"
          "cost-center" = "ci-cd"
        [runners.kubernetes.node_selector]
          "workload-type" = "ci-jobs"
        [runners.kubernetes.affinity]
          [runners.kubernetes.affinity.node_affinity]
            [runners.kubernetes.affinity.node_affinity.required_during_scheduling_ignored_during_execution]
              [[runners.kubernetes.affinity.node_affinity.required_during_scheduling_ignored_during_execution.node_selector_terms]]
                [[runners.kubernetes.affinity.node_affinity.required_during_scheduling_ignored_during_execution.node_selector_terms.match_expressions]]
                  key = "node.kubernetes.io/instance-type"
                  operator = "In"
                  values = ["c5.2xlarge", "c5.4xlarge"]

Node selectors and affinity rules ensure jobs land on appropriate hardware. This configuration targets compute-optimized nodes, avoiding placement on nodes running stateful workloads or system services. The hard node affinity requirement prevents jobs from landing on general-purpose nodes that lack the CPU headroom for build workloads. If your cluster runs mixed workloads, label dedicated node pools accordingly and enforce placement through these selectors.

Pod labels enable cost tracking and policy enforcement. The cost-center label feeds into chargeback reports, while the managed-by label supports automated cleanup of orphaned resources. Add project-specific labels in your .gitlab-ci.yml using the KUBERNETES_POD_LABELS_* variables to enable per-project resource tracking.
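Overriding labels from a pipeline requires the runner’s pod_labels_overwrite_allowed option to whitelist the keys; with that in place, a job can tag its own pod. The team and label values below are illustrative:

```yaml
# .gitlab-ci.yml — label this job's pod for cost tracking.
# Requires pod_labels_overwrite_allowed in [runners.kubernetes]
# to match the keys being set.
build:
  stage: build
  variables:
    KUBERNETES_POD_LABELS_1: "team=frontend"
    KUBERNETES_POD_LABELS_2: "pipeline-id=$CI_PIPELINE_ID"
  script:
    - npm ci && npm run build
```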

Cache Locality with Pod Affinity

Cache hit rates determine build speed. Without affinity rules, consecutive builds from the same project land on different nodes, missing local caches:

values.yaml
runners:
  config: |
    [[runners]]
      [runners.kubernetes]
        [runners.kubernetes.pod_labels]
          "project-id" = "$CI_PROJECT_ID"
        [runners.kubernetes.affinity]
          [runners.kubernetes.affinity.pod_affinity]
            [[runners.kubernetes.affinity.pod_affinity.preferred_during_scheduling_ignored_during_execution]]
              weight = 100
              [runners.kubernetes.affinity.pod_affinity.preferred_during_scheduling_ignored_during_execution.pod_affinity_term]
                topology_key = "kubernetes.io/hostname"
                [runners.kubernetes.affinity.pod_affinity.preferred_during_scheduling_ignored_during_execution.pod_affinity_term.label_selector]
                  [[runners.kubernetes.affinity.pod_affinity.preferred_during_scheduling_ignored_during_execution.pod_affinity_term.label_selector.match_expressions]]
                    key = "project-id"
                    operator = "In"
                    values = ["PROJECT_ID"]

This uses a soft affinity rule—it prefers to schedule jobs from the same project on nodes where previous jobs ran, but doesn’t fail if no such node exists. Hard affinity rules (requiredDuringScheduling) cause scheduling failures when the preferred node lacks capacity.

The topology key kubernetes.io/hostname groups jobs by individual nodes. For multi-zone clusters, consider topology.kubernetes.io/zone to balance jobs across availability zones while maintaining zone-local cache affinity. The weight of 100 prioritizes cache locality over other soft constraints. Test different weight values based on your cache hit rate metrics—if you’re consistently hitting distributed caches (S3, GCS), lower the weight to allow more flexible scheduling.

Concurrent Limits and Job Distribution

The concurrent setting controls how many jobs execute simultaneously across all runner pods. Set this based on cluster capacity and job resource profiles:

values.yaml
replicas: 3
concurrent: 10
checkInterval: 3
runners:
  config: |
    [[runners]]
      limit = 30
      request_concurrency = 10
      [runners.kubernetes]
        poll_interval = 3
      [runners.cache]
        Type = "s3"
        Shared = true
        [runners.cache.s3]
          ServerAddress = "s3.amazonaws.com"
          BucketName = "gitlab-runner-cache"
          BucketLocation = "us-east-1"

Running three replicas with concurrent: 10 per pod supports 30 simultaneous jobs with headroom for rolling updates. The request_concurrency parameter limits how many jobs each runner pod requests from GitLab at once, preventing a single pod from hogging the queue.

Two intervals govern responsiveness. The check interval controls how often each runner polls GitLab for new jobs, while the Kubernetes executor’s poll_interval controls how often it polls the cluster for pod status. Both default to 3 seconds, balancing responsiveness with API load. Lower values (1-2s) reduce job start latency but increase load on the GitLab instance. If you notice jobs sitting in “pending” state despite available runner capacity, decrease the check interval. If GitLab’s runner API endpoints show elevated latency, increase it.

Configure checkout strategies to optimize repository cloning behavior. The default GIT_STRATEGY=fetch reuses an existing working copy when one exists, but ephemeral job pods start empty, so every job effectively performs a fresh clone. For large repositories, set a shallow GIT_DEPTH and enable Git LFS caching to reduce network transfer time. Set GIT_STRATEGY=none in jobs that don’t need source code (deployment-only jobs, for instance) to skip cloning entirely.
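For example, in .gitlab-ci.yml (the deploy script name is a placeholder):

```yaml
# .gitlab-ci.yml — per-job checkout tuning
test:
  variables:
    GIT_DEPTH: "20"        # shallow clone: fetch only recent history
  script:
    - make test

deploy:
  variables:
    GIT_STRATEGY: none     # this job consumes artifacts, not source
  script:
    - ./deploy.sh          # placeholder deployment command
```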

With these configurations in place, the runner fleet handles variable load and node failures gracefully. The next critical piece is ensuring individual runner pods restart automatically when they enter unhealthy states.

Implementing Self-Healing with Readiness and Liveness Probes

Kubernetes’ self-healing capabilities become powerful when you configure probes that detect the specific failure modes of GitLab Runners. Default health checks monitor process liveness but miss critical failure states: runners stuck in cleanup, corrupted job caches, or network partitions to the GitLab instance. Implementing custom probes that catch these conditions transforms your fleet from “technically running” to “actually functioning.”

Detecting Stuck Runner States

The most insidious runner failures don’t crash the process—they freeze it. A runner waiting indefinitely for a Docker socket, blocked on I/O to a failed NFS mount, or deadlocked in job cleanup will pass basic liveness checks while consuming resources and rejecting new jobs. The solution requires a custom health endpoint that verifies operational state, not just process existence.

Extend your runner deployment with probes that verify operational state, not just process existence. The readiness probe below scrapes the runner’s Prometheus endpoint (/metrics), while the liveness probe executes inside the container to confirm the runner can still verify its GitLab registration:

runner-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gitlab-runner
spec:
  selector:
    matchLabels:
      app: gitlab-runner
  template:
    metadata:
      labels:
        app: gitlab-runner
    spec:
      containers:
        - name: gitlab-runner
          image: gitlab/gitlab-runner:latest
          livenessProbe:
            exec:
              command:
                - /bin/bash
                - -c
                - |
                  # Fail when the runner can no longer verify its registration
                  # with GitLab (catches frozen processes that still "run")
                  ALIVE=$(gitlab-runner verify 2>&1 | grep -c "is alive")
                  if [ "$ALIVE" -eq 0 ]; then exit 1; fi
            initialDelaySeconds: 60
            periodSeconds: 120
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /metrics
              port: 9252
            periodSeconds: 10
            failureThreshold: 2

The liveness probe’s 3-failure threshold with 120-second intervals means Kubernetes waits 6 minutes before restarting a stuck runner—long enough to avoid false positives during long job initialization, short enough to prevent extended downtime. The readiness probe operates more aggressively, removing unresponsive runners from the service endpoint within 20 seconds to prevent new job assignments to degraded pods.

For runners executing Docker-in-Docker workloads, add a probe that verifies the Docker daemon’s responsiveness. Runners often survive while their inner Docker daemon becomes unresponsive, creating a scenario where jobs are accepted but immediately fail:

docker-daemon-probe.yaml
livenessProbe:
  exec:
    command:
      - /bin/bash
      - -c
      - |
        # Verify the Docker daemon responds to basic commands
        timeout 10 docker ps >/dev/null 2>&1 || exit 1
  periodSeconds: 30
  timeoutSeconds: 15
  failureThreshold: 2

This probe catches Docker daemon hangs that manifest as docker ps commands blocking indefinitely, forcing a pod restart before multiple jobs queue up behind the stuck daemon.

Building Circuit Breakers for Dependency Failures

When your container registry goes down, runners attempting to pull images enter exponential backoff, accumulating failed jobs and consuming pod resources while waiting. Rather than letting all runners fail simultaneously, implement a circuit breaker pattern that gracefully degrades capacity.

Implement the breaker as a liveness probe that trips when the runner’s failed-job counter climbs past a threshold, paired with a preStop hook that deregisters the runner cleanly before termination (a preStop hook’s exit code cannot itself trigger pod replacement, so the replacement decision belongs in the probe):

runner-with-circuit-breaker.yaml
livenessProbe:
  exec:
    command:
      - /bin/bash
      - -c
      - |
        # Trip the breaker: force pod replacement after repeated job failures
        FAILURES=$(curl -s localhost:9252/metrics | \
          grep '^gitlab_runner_failed_jobs_total' | \
          awk '{sum += $2} END {printf "%d", sum}')
        if [ "${FAILURES:-0}" -gt 10 ]; then exit 1; fi
  periodSeconds: 60
  failureThreshold: 3
lifecycle:
  preStop:
    exec:
      command:
        - /bin/bash
        - -c
        - |
          # Deregister so GitLab stops routing jobs to the terminating pod
          gitlab-runner unregister --all-runners
          sleep 30

Pair this with a startup probe that validates connectivity to critical dependencies before accepting jobs:

startup-probe.yaml
startupProbe:
  exec:
    command:
      - /bin/bash
      - -c
      - |
        # Verify GitLab API reachability
        curl -f -s https://gitlab.example.com/api/v4/version || exit 1
        # Verify container registry reachability
        curl -f -s https://registry.example.com/v2/ || exit 1
  periodSeconds: 15
  failureThreshold: 20

This probe allows 5 minutes for transient network issues to resolve during pod startup while preventing runners from registering when core services are unavailable. The extended startup window accommodates slow DNS resolution in newly created pods and temporary network partitions during cluster maintenance windows.

Configuring Automatic Restarts for Common Failure Modes

Beyond stuck processes and dependency failures, runners experience failure modes specific to CI/CD workloads. Cache corruption, filesystem exhaustion from abandoned build artifacts, and zombie executor processes all degrade runner performance without triggering standard health checks.

Implement a probe that monitors filesystem pressure and forces cleanup through pod restart before disk exhaustion causes job failures:

disk-pressure-probe.yaml
livenessProbe:
  exec:
    command:
      - /bin/bash
      - -c
      - |
        # Fail if the /builds partition exceeds 85% capacity
        USAGE=$(df /builds | tail -1 | awk '{print $5}' | sed 's/%//')
        if [ "$USAGE" -gt 85 ]; then exit 1; fi
  periodSeconds: 60
  failureThreshold: 2

This approach treats pods as ephemeral: rather than implementing complex cleanup logic, let Kubernetes replace degraded runners with fresh instances. The 2-failure threshold with 60-second checks leaves roughly a 2-minute window, enough for short in-flight jobs to complete before the restart.

Metrics-Based Pod Replacement

CPU and memory thresholds miss a critical scaling signal: job queue depth. A runner pool at 30% CPU utilization but with 50 queued jobs needs more capacity. Expose queue depth as an external metric through an adapter such as Prometheus Adapter (the Kubernetes Metrics Server only serves CPU and memory), backed by custom queries against the GitLab API, and let the HorizontalPodAutoscaler act on actual demand:

hpa-custom-metrics.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gitlab-runner-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gitlab-runner
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: External
      external:
        metric:
          name: gitlab_runner_queue_depth
          selector:
            matchLabels:
              runner_pool: "production"
        target:
          type: AverageValue
          averageValue: "5"
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

This configuration maintains an average of 5 queued jobs per runner while capping CPU utilization at 70%, scaling up during peak demand and down during idle periods. The dual-metric approach prevents both queue buildup during compute-intensive jobs and resource waste during I/O-bound workloads. Combined with pod disruption budgets set to maxUnavailable: 1, you ensure continuous capacity while allowing graceful runner replacement.
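A matching PodDisruptionBudget might look like this; the app label is an assumption about how your Deployment labels its pods:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: gitlab-runner-pdb
spec:
  maxUnavailable: 1        # replace runner managers one at a time
  selector:
    matchLabels:
      app: gitlab-runner   # assumed pod label on the runner Deployment
```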

With these self-healing mechanisms in place, your runner fleet handles stuck processes, dependency failures, and demand spikes without manual intervention. The next challenge becomes optimizing build performance through intelligent cache distribution.

Cache Strategy: Distributed Storage vs. Local NVMe

Cache performance directly impacts pipeline execution time. A GitLab Runner executing 100 jobs daily can waste 15-20 hours on redundant dependency downloads without proper caching. The choice between distributed storage (S3, GCS), Persistent Volume Claims, and node-local NVMe storage determines whether your pipelines complete in 3 minutes or 8.

Performance Characteristics by Cache Backend

S3-compatible object storage provides the simplest multi-runner cache sharing but introduces 200-400ms latency per object retrieval. For a Node.js project with 500 npm packages, this translates to 2-3 additional minutes per build. The latency compounds when runners operate across geographic regions—a runner in eu-west-1 accessing an S3 bucket in us-east-1 can see retrieval times exceeding 800ms per object.

PVCs reduce latency to 50-100ms when backed by network-attached storage like AWS EBS or GCP Persistent Disks, but create contention bottlenecks when multiple pods access the same volume simultaneously. A ReadWriteOnce PVC limits cache access to a single pod at a time, forcing other jobs into a queue. ReadWriteMany volumes eliminate this restriction but require network filesystems like NFS or CephFS, which introduce their own performance penalties—particularly for workloads performing thousands of small file operations.

Node-local NVMe delivers sub-10ms cache reads with IOPS exceeding 500,000 for random access patterns. This works well for dedicated runner pools handling specific project types, where Kubernetes affinity rules keep related jobs on the same nodes. The tradeoff is cache isolation: jobs scheduled to different nodes cannot share cached artifacts, leading to redundant downloads when the cluster autoscaler adds capacity or rebalances workloads.

Implementing a Hybrid Cache Architecture

The optimal solution combines S3 as the authoritative cache store with ephemeral local volumes for hot data. Mount a tmpfs-backed emptyDir at the cache path so each job downloads its archive from S3 once, then extracts and rebuilds it at memory speed:

gitlab-runner-values.yaml
runners:
  config: |
    [[runners]]
      [runners.kubernetes]
        [[runners.kubernetes.volumes.empty_dir]]
          name = "cache"
          mount_path = "/cache"
          medium = "Memory"
          size_limit = "2Gi"
      [runners.cache]
        Type = "s3"
        Shared = true
        [runners.cache.s3]
          ServerAddress = "s3.amazonaws.com"
          BucketName = "gitlab-runner-cache"
          BucketLocation = "us-east-1"

This configuration creates a 2GB in-memory cache volume per job pod while maintaining S3 as the persistent layer. The medium = "Memory" setting leverages tmpfs, which resides in RAM and delivers cache read performance comparable to NVMe for datasets under 2GB; S3 remains the only network hop, paid once per cache key rather than once per file.

Adjust the size_limit based on your dependency footprint. Java projects with large Maven repositories may require 4-8GB, while Go projects with vendored dependencies often fit within 512MB. Monitor pod memory pressure metrics; if cache evictions occur due to memory constraints, reduce the limit or switch to medium = "" to use node disk instead of RAM.

For teams with predictable workloads, implement cache warming through scheduled jobs that pre-populate local storage:

cache-warmer-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cache-warmer-frontend
spec:
  schedule: "0 */4 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: warmer
              image: amazon/aws-cli
              command:
                - /bin/sh
                - -c
                - aws s3 sync s3://gitlab-runner-cache/frontend-deps /cache --quiet
              volumeMounts:
                - name: cache-volume
                  mountPath: /cache
          volumes:
            - name: cache-volume
              hostPath:
                path: /mnt/nvme/gitlab-cache
                type: DirectoryOrCreate
          nodeSelector:
            workload: gitlab-runner

This CronJob runs every 4 hours, syncing frequently accessed dependencies to node-local NVMe before peak development hours. The nodeSelector ensures the cache populates on nodes designated for runner workloads. For critical pipelines, trigger cache warming via GitLab webhooks after merges to main branches, ensuring the latest dependencies are immediately available.

Cache Eviction and Cost Management

Without eviction policies, cache storage grows unbounded. S3 lifecycle rules automatically delete objects older than 30 days, balancing availability with cost:

s3-lifecycle-policy.json
{
  "Rules": [
    {
      "ID": "ExpireOldCache",
      "Status": "Enabled",
      "Filter": { "Prefix": "runner/" },
      "Expiration": { "Days": 30 }
    }
  ]
}

For node-local storage, implement a sidecar container that monitors disk usage and evicts least-recently-used caches when utilization exceeds 80%. This prevents cache saturation from impacting pod scheduling. Tools like ncdu or custom scripts can identify stale cache directories based on access time, removing entries untouched for more than 7 days.
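A minimal sketch of that eviction pass, suitable for a sidecar loop or node-level cron; the cache path and thresholds are assumptions, and the atime check requires a filesystem that records access times (no noatime mount):

```shell
#!/bin/bash
# LRU-style cache eviction sketch: when the cache filesystem exceeds
# THRESHOLD percent usage, delete top-level cache directories whose
# access time is older than MAX_AGE_DAYS.
CACHE_ROOT="${CACHE_ROOT:-/mnt/nvme/gitlab-cache}"   # assumed cache location
THRESHOLD="${THRESHOLD:-80}"
MAX_AGE_DAYS="${MAX_AGE_DAYS:-7}"

usage_pct() {
  # -P prevents df from wrapping long device names across lines
  df -P "$CACHE_ROOT" | tail -1 | awk '{print $5}' | sed 's/%//'
}

evict() {
  # Remove cache directories untouched for more than MAX_AGE_DAYS
  find "$CACHE_ROOT" -mindepth 1 -maxdepth 1 -type d \
    -atime +"$MAX_AGE_DAYS" -print0 | xargs -0 -r rm -rf --
}

if [ -d "$CACHE_ROOT" ] && [ "$(usage_pct)" -gt "$THRESHOLD" ]; then
  evict
fi
```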

Projects with infrequent commits benefit most from longer retention periods. Monorepos with hourly builds should reduce retention to 7 days, as cache keys change frequently enough that older entries provide negligible hit rates. Analyze CloudWatch or S3 access logs to identify cache hit rates per project—if a cache key hasn’t been accessed in 14 days, its retention period exceeds its utility.

💡 Pro Tip: Keep cache archives lean in your .gitlab-ci.yml by listing explicit cache:paths and leaving cache:untracked at its default of false. For text-heavy dependencies like node_modules, raising the compression level via the CACHE_COMPRESSION_LEVEL variable can cut S3 transfer sizes by 60-70%, and retrieval latency drops roughly in proportion to the size reduction.
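An illustrative cache stanza (the project layout is assumed):

```yaml
# .gitlab-ci.yml — scope the cache to the lockfile and named paths
build:
  cache:
    key:
      files:
        - package-lock.json   # key rotates only when dependencies change
    paths:
      - node_modules/
    untracked: false          # never sweep in untracked files wholesale
  script:
    - npm ci
```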

With caching optimized, the next critical concern is understanding where bottlenecks occur in your runner fleet through comprehensive observability.

Observability: Tracking Runner Health and Job Distribution

Without proper observability, your runner fleet operates as a black box—jobs queue mysteriously, costs spiral, and failures surface only when developers complain. GitLab Runner exposes Prometheus metrics that transform this opacity into actionable intelligence, enabling data-driven capacity planning and proactive incident response.

Exposing Metrics to Prometheus

GitLab Runner includes a built-in metrics endpoint on port 9252. Enable it in your Helm configuration:

values.yaml
metrics:
  enabled: true
  port: 9252
  serviceMonitor:
    enabled: true
    interval: 30s
    labels:
      prometheus: kube-prometheus
runners:
  config: |
    [[runners]]
      [runners.kubernetes]
        [runners.kubernetes.pod_annotations]
          "prometheus.io/scrape" = "true"
          "prometheus.io/port" = "9252"
          "prometheus.io/path" = "/metrics"

This configuration creates a ServiceMonitor that Prometheus Operator automatically discovers. Key metrics include gitlab_runner_jobs (current job count), gitlab_runner_job_duration_seconds (execution time histogram), and gitlab_runner_errors_total (failure counters). The runner manager process exposes these metrics continuously, while ephemeral job pods report execution-level telemetry through annotations.

Building Operational Dashboards

Queue depth signals capacity problems before they impact developers. Track gitlab_runner_concurrent_limit against gitlab_runner_jobs{state="running"} to calculate saturation percentage. When this exceeds 80%, your autoscaling policy needs adjustment—either increase concurrent limits or provision additional runner replicas. Graph this metric alongside gitlab_runner_jobs{state="pending"} to visualize the backlog of work waiting for available capacity.
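With the Prometheus Operator, the saturation ratio can be captured as a recording rule; a sketch assuming the PrometheusRule CRD and the runner metrics named above:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gitlab-runner-saturation
spec:
  groups:
    - name: gitlab-runner.rules
      rules:
        - record: gitlab_runner:saturation_ratio   # running jobs / concurrent limit
          expr: |
            sum(gitlab_runner_jobs{state="running"})
              / sum(gitlab_runner_concurrent_limit)
```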

Job latency reveals infrastructure bottlenecks. Graph the p95 of gitlab_runner_job_duration_seconds segmented by job stage. Spikes in the build stage often indicate image pull issues, while test stage delays point to resource contention. Cross-reference with Kubernetes node metrics to identify whether CPU, memory, or I/O creates the constraint. Layer in container_image_pull_duration_seconds to distinguish between network bottlenecks and compute limitations.

Cost per job requires combining runner metrics with cluster telemetry. Calculate sum(rate(container_cpu_usage_seconds_total[5m])) * cpu_hourly_cost + sum(container_memory_working_set_bytes) / 1e9 * memory_hourly_cost divided by rate(gitlab_runner_jobs_total[5m]). This metric exposes whether spot instances or reserved capacity delivers better economics for your workload profile. Segment by runner pool to identify which teams consume disproportionate resources—a build-heavy pool running on CPU-optimized nodes might cost $0.03 per job, while a test-intensive pool on memory-optimized instances hits $0.12 per job.

Proactive Alerting

Registration failures indicate network issues or token rotation problems. Alert when rate(gitlab_runner_errors_total{error="registration"}[5m]) > 0 persists for ten minutes—this catches expired tokens before they drain your runner pool. Configure notifications to page on-call engineers, since registration failures halt all new job execution for affected runner instances.

Job timeouts suggest deadlocked processes or resource starvation. Fire alerts when gitlab_runner_jobs{state="running"} > 0 for jobs exceeding twice their historical p95 duration. This catches infinite loops in test suites and misconfigured resource limits. Correlate with pod eviction events—if Kubernetes terminates pods due to memory pressure, your resource requests need tuning rather than your job timeout thresholds.

Runner churn—pods restarting frequently—signals underlying stability issues. Alert on rate(kube_pod_container_status_restarts_total{pod=~"gitlab-runner.*"}[15m]) > 0.1 to catch crash loops from memory leaks or network partition scenarios. Integrate with cluster autoscaler metrics to distinguish between intentional scale-down events and unexpected pod failures.
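The churn threshold translates directly into an alerting rule. Again a sketch with illustrative names; the expression itself matches the one described above.

```yaml
# Illustrative alert: runner pods restarting more than ~1.5 times
# per 15-minute window (rate > 0.1 restarts/sec averaged over 15m
# would be extreme; 0.1 here is restarts-per-second of counter rate,
# tune the threshold to your fleet size).
groups:
  - name: gitlab-runner-stability
    rules:
      - alert: RunnerPodChurn
        expr: rate(kube_pod_container_status_restarts_total{pod=~"gitlab-runner.*"}[15m]) > 0.1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Runner pods restarting frequently in {{ $labels.namespace }}"
```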

💡 Pro Tip: Export Prometheus data to your cost management platform to implement chargeback models. Tag runner pools with team labels and allocate infrastructure costs based on gitlab_runner_jobs_total per team—this drives accountability for expensive test suites and incentivizes optimization efforts.

With comprehensive observability in place, you gain the visibility needed to optimize runner distribution across teams. The next section explores managing multiple specialized runner pools using GitOps principles.

Managing Multiple Runner Pools with GitOps

As your CI/CD infrastructure grows, managing a single homogeneous runner pool becomes a bottleneck. Different workloads demand different resources: frontend tests need minimal CPU, integration tests require large memory allocations, and ML model training demands GPU access. GitOps provides the framework to maintain dozens of specialized runner pools while ensuring configuration consistency and audit trails.

Organizing Runners by Workload Characteristics

Structure your runner pools around resource profiles rather than teams or projects. Create distinct pools for CPU-intensive builds, memory-heavy integration tests, GPU workloads, and ARM-based builds for mobile deployments.

runners/base/runner-cpu-intensive.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: runner-cpu-intensive-config
data:
  config.toml: |
    concurrent = 20
    check_interval = 3

    [[runners]]
      name = "cpu-intensive-pool"
      url = "https://gitlab.company.com"
      token = "${RUNNER_TOKEN}"
      executor = "kubernetes"

      [runners.kubernetes]
        cpu_request = "4"
        cpu_limit = "8"
        memory_request = "4Gi"
        memory_limit = "8Gi"
        node_selector = { "workload-type" = "compute-optimized" }

        [runners.kubernetes.pod_labels]
          "runner-pool" = "cpu-intensive"

Define resource limits that match your node specifications. If your compute-optimized nodes provide 16 vCPUs, configure limits that allow optimal pod packing without resource contention. For GPU pools, specify the GPU resource type explicitly (nvidia.com/gpu: "1") and use node selectors to target nodes with the required CUDA versions. Memory-intensive pools should request large allocations upfront to prevent OOM kills during long-running integration test suites.

Consider lifecycle requirements when designing pools. Ephemeral workloads like unit tests benefit from high concurrency settings and aggressive timeout values, while long-running builds require lower concurrency to prevent resource starvation. Document these design decisions in your repository’s README to guide future operators.
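As an example of these lifecycle trade-offs, an ephemeral unit-test pool might be configured like this; the pool name and specific values are illustrative, and a long-running build pool would use the same shape with concurrent lowered (for example, to 6).

```yaml
# Illustrative ConfigMap for an ephemeral unit-test pool: high
# concurrency and a short poll timeout so abandoned pods fail fast.
apiVersion: v1
kind: ConfigMap
metadata:
  name: runner-unit-test-config
data:
  config.toml: |
    concurrent = 40
    check_interval = 3

    [[runners]]
      name = "unit-test-pool"
      executor = "kubernetes"
      [runners.kubernetes]
        poll_timeout = 180
```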

Version Control and Synchronization with ArgoCD

Store runner configurations in a Git repository with environment-specific overlays using Kustomize. ArgoCD monitors this repository and synchronizes changes across clusters.

argocd/runner-pools-application.yaml

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gitlab-runners
  namespace: argocd
spec:
  project: infrastructure
  source:
    repoURL: https://github.com/company/k8s-infrastructure
    targetRevision: main
    path: runners/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: gitlab-runners
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
    retry:
      limit: 3
      backoff:
        duration: 30s
        maxDuration: 3m

This configuration enables automatic synchronization with retry logic. When you update runner concurrency limits or resource requests, ArgoCD rolls out changes across all clusters within minutes. The prune: true setting ensures decommissioned runner pools are automatically removed, preventing configuration drift. Use Kustomize overlays to maintain environment-specific variations—production pools might use larger resource limits than staging, while development environments can share runners across workload types to reduce costs.

Organize your repository structure around stability boundaries. Place base runner configurations in a runners/base directory, then create overlay directories for each environment (runners/overlays/production, runners/overlays/staging). This separation allows you to test runner configuration changes in non-production environments before promoting to production, reducing the risk of fleet-wide outages.
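A production overlay in that layout might look like the following kustomization; the patch filename and target are hypothetical, shown only to illustrate how an environment overrides the shared base.

```yaml
# runners/overlays/production/kustomization.yaml (illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: gitlab-runners
resources:
  - ../../base
patches:
  # Hypothetical patch raising concurrency for production load.
  - path: concurrency-patch.yaml
    target:
      kind: ConfigMap
      name: runner-cpu-intensive-config
```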

Tag-Based Routing for Specialized Workloads

Configure pipeline jobs to target specific runner pools using tags that match workload characteristics:

.gitlab-ci.yml

build-frontend:
  tags:
    - cpu-intensive
    - amd64
  script:
    - npm run build

train-model:
  tags:
    - gpu
    - cuda-12
    - large-memory
  script:
    - python train.py

integration-tests:
  tags:
    - memory-optimized
    - postgres-enabled
  script:
    - pytest tests/integration

Align runner tags with your pool configurations to ensure deterministic job placement. Avoid generic tags like “fast” or “powerful” that create ambiguous routing decisions. Instead, use descriptive tags that specify exact requirements: cuda-12 instead of gpu, 32gb-memory instead of large-memory. This precision prevents jobs from landing on undersized runners where they’ll fail with cryptic resource errors.

Establish tag naming conventions across your organization. Prefix infrastructure tags with infra- and capability tags with cap- to distinguish between resource requirements and feature availability. This convention helps developers understand whether they’re selecting compute resources or specialized tooling.
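Applied to a pipeline job, the convention reads like this; the job name, tag values, and script are hypothetical examples of the infra-/cap- prefixing described above.

```yaml
# Illustrative job using the infra-/cap- tag convention.
load-test:
  tags:
    - infra-32gb-memory   # compute resource requirement
    - cap-postgres-15     # tooling/feature availability
  script:
    - k6 run load/spike.js
```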

Rolling Updates Without Disruption

Implement pod disruption budgets so that voluntary disruptions (node drains during cluster upgrades, autoscaler scale-downs) never terminate runners executing jobs:

runners/base/pdb.yaml

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: runner-cpu-intensive-pdb
spec:
  minAvailable: 75%
  selector:
    matchLabels:
      app: gitlab-runner
      runner-pool: cpu-intensive

Combined with proper readiness probes, this ensures configuration updates propagate gradually, replacing idle runners first while preserving active job execution. Set your minAvailable percentage based on pool size and typical utilization. Small pools with high utilization need higher percentages (80-90%) to avoid blocking updates indefinitely, while large pools with variable load can use lower thresholds (60-75%).

Configure ArgoCD sync waves to orchestrate update ordering across dependent pools. If your GPU runners depend on shared storage provisioners, use sync wave annotations to ensure storage updates complete before runner updates begin. This prevents race conditions where runners start before their dependencies are ready.
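Sync waves are expressed as annotations on the manifests themselves. In this sketch, a storage provisioner syncs in wave 0 and the GPU runner config in wave 1, so the dependency is ready first; the resource names are illustrative.

```yaml
# Illustrative: GPU runner config deferred to wave 1, after the
# storage provisioner it depends on (annotated with sync-wave "0")
# has finished syncing.
apiVersion: v1
kind: ConfigMap
metadata:
  name: runner-gpu-config
  annotations:
    argocd.argoproj.io/sync-wave: "1"
data:
  config.toml: |
    concurrent = 4
```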

With GitOps managing your runner fleet and the observability stack surfacing its health, you have the core of a self-healing CI/CD platform. A few principles are worth carrying forward.

Key Takeaways

  • Start with the Kubernetes executor and Helm charts rather than building custom runner deployments—you’ll avoid months of operational complexity
  • Implement health checks and autoscaling from day one; retrofitting self-healing into a static runner fleet is significantly harder than starting with it
  • Measure cache hit rates and job distribution across runner pools to identify optimization opportunities—most teams discover 40%+ time savings in their first month
  • Use GitOps (ArgoCD) to manage runner configurations as code, enabling safe experimentation with different pool configurations without risking production stability