
Building a Production ML Inference Stack with KServe, vLLM, and Karmada


Your ML models work perfectly in development. The inference latency looks great, the throughput numbers hit your targets, and your team is ready to ship. Then production reality hits: you need to serve this model across three regions, handle failover when a GPU node disappears, and maintain consistent p99 latency for users in Singapore and São Paulo simultaneously. Suddenly you’re writing custom health checks, building bespoke routing logic, and wondering why your “simple” deployment turned into a distributed systems research project.

The fundamental problem is that ML inference doesn’t behave like traditional web services. You can’t just throw a load balancer in front of GPU-bound workloads and call it a day. Models have cold-start penalties measured in seconds, not milliseconds. GPU memory fragmentation creates capacity cliffs that don’t show up in CPU utilization metrics. And when a node fails, you can’t spin up a replacement in the time it takes to serve a single request—model loading alone takes longer than most SLA windows allow.

For the past two years, the CNCF ecosystem has quietly assembled the pieces to solve this problem properly. KServe provides a standardized inference serving layer with built-in model management. vLLM delivers state-of-the-art LLM execution with continuous batching and PagedAttention. Karmada extends Kubernetes federation to orchestrate workloads across clusters without requiring you to rebuild your entire platform. Each project solves one part of the puzzle; together, they form a production-grade stack for multi-cluster ML inference.

The challenge is understanding how these pieces connect—and where the integration points create both opportunities and sharp edges.

The Multi-Cluster ML Serving Challenge

Running ML inference at scale exposes fundamental limitations in single-cluster Kubernetes deployments. What works for stateless web services breaks down when you introduce GPU dependencies, model loading latencies, and the computational demands of large language models.

Visual: Multi-cluster ML inference architecture overview

The Single-Cluster Ceiling

A single Kubernetes cluster constrains your inference capacity in three dimensions:

Availability boundaries. When your cluster experiences an outage—whether from control plane issues, node failures, or cloud provider incidents—your entire inference capability disappears. For production ML systems where downtime translates directly to revenue loss or degraded user experience, this single point of failure is unacceptable.

Latency geography. Users in Singapore hitting a model served from us-east-1 experience latency that makes real-time inference impractical. Deploying models closer to users requires presence in multiple regions, which means multiple clusters.

Resource ceilings. GPU availability varies dramatically across cloud regions and availability zones. A single cluster in one region caps you at whatever GPU quota you can secure there. Spreading across clusters lets you aggregate GPU capacity from multiple pools.

Why Traditional Load Balancing Fails

Standard Kubernetes ingress and service mesh patterns assume workloads are fungible—any pod can handle any request with roughly equivalent performance. ML inference breaks this assumption.

GPU workloads require specific node types with attached accelerators. Models must be loaded into GPU memory before serving, a process that takes seconds to minutes depending on model size. Cold-start latency for a 70B parameter model can exceed 30 seconds, making reactive autoscaling painfully slow.

Traditional load balancers route based on connection counts or round-robin algorithms. They have no awareness of model loading state, GPU memory utilization, or batch queue depth. Sending inference requests to a pod still loading its model results in timeouts. Routing to an overloaded GPU while others sit idle wastes expensive compute.

The CNCF Stack for Distributed Inference

Three CNCF projects address these gaps at different layers of the stack:

KServe provides the model serving abstraction—handling model deployment, autoscaling based on inference-aware metrics, canary rollouts, and A/B testing. It understands the ML serving lifecycle rather than treating models as generic containers.

vLLM delivers the execution runtime for LLM inference, implementing PagedAttention for efficient GPU memory management and continuous batching for throughput optimization. It maximizes GPU utilization in ways generic serving frameworks cannot.

Karmada orchestrates workloads across multiple clusters, propagating deployments, managing cross-cluster networking, and enabling intelligent traffic distribution based on cluster health and capacity.

Together, these projects form an inference stack purpose-built for the constraints of production ML. The following sections examine how each component fits into a cohesive architecture, starting with KServe’s serving primitives.

KServe: Kubernetes-Native Model Serving Foundation

Managing model deployments through raw Kubernetes manifests quickly becomes unwieldy. You’re juggling Deployments, Services, Ingress resources, HorizontalPodAutoscalers, and custom readiness probes—all while trying to implement proper rollout strategies. KServe eliminates this complexity by providing a single abstraction purpose-built for ML inference workloads.

The InferenceService Abstraction

KServe’s InferenceService custom resource encapsulates everything needed for production model serving: model loading, request routing, autoscaling, and observability. Rather than managing a dozen interconnected resources, you declare your desired state in a single manifest.

inferenceservice.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detection
  namespace: ml-inference
  annotations:
    serving.kserve.io/enable-prometheus-scraping: "true"
spec:
  predictor:
    minReplicas: 2
    maxReplicas: 20
    scaleTarget: 10
    scaleMetric: concurrency
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://ml-models-prod/fraud-detection/v2.1.0
      resources:
        limits:
          cpu: "4"
          memory: 8Gi
        requests:
          cpu: "2"
          memory: 4Gi

This single resource creates the serving infrastructure, configures model loading from S3, and establishes autoscaling policies. KServe handles container image selection based on the model format, probe configuration, and service mesh integration automatically.

The abstraction extends beyond simplicity. KServe supports multiple model frameworks out of the box—scikit-learn, XGBoost, TensorFlow, PyTorch, and ONNX—each with optimized serving runtimes. When you specify modelFormat: sklearn, KServe selects the appropriate container image, configures memory-mapped model loading, and sets up health checks that verify model initialization rather than just container readiness.

Inference-Aware Autoscaling

Traditional CPU and memory-based autoscaling fails for inference workloads. A model processing complex requests at 30% CPU utilization is already saturated from a latency perspective, while another at 80% CPU handles simple requests with sub-100ms p99 latency. KServe integrates with Knative to provide metrics that actually matter for inference.

The scaleMetric: concurrency configuration in the example above scales based on in-flight requests rather than resource utilization. When concurrent requests per pod exceed the scaleTarget of 10, KServe provisions additional replicas. This approach maintains consistent latency as traffic increases, responding to actual request pressure rather than indirect resource signals.

For GPU workloads, you can scale on custom Prometheus metrics like GPU memory utilization or inference queue depth:

gpu-autoscaling.yaml
spec:
  predictor:
    scaleMetric: gpu_memory_utilization
    scaleTarget: 70

KServe also supports requests-per-second (RPS) as a scaling metric, which works well for workloads with predictable per-request latency. Choose concurrency for variable-latency models where queue depth matters, and RPS for consistent-latency models where throughput is the primary concern.
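
As a minimal sketch, switching to RPS-based scaling changes only the metric and target fields (the target of 50 requests per second below is illustrative, not a recommendation):

rps-autoscaling.yaml
spec:
  predictor:
    scaleMetric: rps
    scaleTarget: 50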

💡 Pro Tip: Set minReplicas: 2 for production services to ensure availability during pod evictions and node failures. The cold-start latency for model loading—especially large models—makes scale-from-zero impractical for latency-sensitive applications.

Canary Deployments for Model Versions

Deploying a new model version to production without validation is reckless. KServe’s native canary support lets you gradually shift traffic while monitoring performance metrics.

canary-rollout.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detection
  namespace: ml-inference
spec:
  predictor:
    canaryTrafficPercent: 10
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://ml-models-prod/fraud-detection/v2.2.0
      resources:
        limits:
          cpu: "4"
          memory: 8Gi
        requests:
          cpu: "2"
          memory: 4Gi

With canaryTrafficPercent: 10, the new model version receives 10% of incoming requests while v2.1.0 continues handling the remainder. Monitor prediction accuracy, latency distributions, and error rates before incrementing traffic. If the canary underperforms, set canaryTrafficPercent: 0 to instantly roll back without redeploying.

KServe maintains both model versions simultaneously, enabling rapid rollback without cold-start delays. This is particularly valuable when model quality regressions only surface under production traffic patterns that synthetic tests don’t capture. The dual-deployment approach also allows A/B testing scenarios where you deliberately maintain traffic splits to compare model performance over extended periods.

For automated progressive delivery, integrate KServe with Flagger or Argo Rollouts. These tools can automatically increment canary traffic based on success rate thresholds and latency SLOs, reducing manual intervention while maintaining safety guarantees.

The InferenceService abstraction provides the foundation for production model serving, but standard serving runtimes struggle with the memory and throughput demands of large language models. This is where vLLM’s architecture becomes essential.

Integrating vLLM for High-Throughput LLM Inference

Large language models demand fundamentally different serving strategies than traditional ML models. A 7B parameter model consumes 14GB of GPU memory just for weights, leaving limited headroom for the key-value cache that grows with each token generated. vLLM solves this constraint through PagedAttention, a memory management technique that transforms LLM serving economics.

Understanding PagedAttention

Traditional LLM serving pre-allocates contiguous memory blocks for each request’s KV cache, sized for the maximum possible sequence length. This approach wastes 60-80% of GPU memory on fragmentation and over-provisioning. PagedAttention borrows concepts from operating system virtual memory, allocating KV cache in non-contiguous blocks on demand. Just as OS virtual memory maps logical addresses to physical pages scattered across RAM, PagedAttention maps attention computations to memory blocks allocated dynamically as sequences grow.
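
As a rough illustration of that waste: for a Llama-style 7B model (32 transformer layers, hidden size 4096) served in fp16, each token's keys and values occupy roughly 2 × 32 × 4096 × 2 bytes, or about 0.5 MB, so pre-allocating a single 4,096-token slot reserves around 2 GB of KV cache whether the request ends up using 40 tokens or 4,000. With 14 GB of weights already resident on a 24 GB GPU, that strategy caps you at roughly four or five concurrent requests; PagedAttention's on-demand blocks reclaim the unused remainder.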

The performance impact is substantial. By eliminating memory fragmentation, vLLM serves 2-4x more concurrent requests on identical hardware. Combined with continuous batching—which dynamically adds new requests to in-flight batches rather than waiting for completion—throughput improvements reach 23x compared to naive implementations. For production deployments processing thousands of requests per minute, this translates directly to reduced GPU costs. The efficiency gains compound at scale: what previously required eight A100 GPUs can often be served with two or three.

Configuring vLLM as a KServe Runtime

KServe’s ClusterServingRuntime abstraction lets you define vLLM as a first-class serving backend. The following configuration registers vLLM with optimized defaults for production LLM workloads:

vllm-runtime.yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: vllm-runtime
spec:
  annotations:
    prometheus.kserve.io/scrape: "true"
    prometheus.kserve.io/port: "8000"
  supportedModelFormats:
    - name: vllm
      version: "1"
      autoSelect: true
  containers:
    - name: kserve-container
      image: vllm/vllm-openai:v0.4.2
      args:
        - --model=/mnt/models
        - --served-model-name={{.Name}}
        - --tensor-parallel-size=1
        - --max-model-len=4096
        - --gpu-memory-utilization=0.90
      resources:
        limits:
          nvidia.com/gpu: "1"
        requests:
          memory: "24Gi"
          cpu: "4"

The --served-model-name template variable dynamically substitutes the InferenceService name, ensuring consistent naming across your deployment pipeline. This simplifies client configuration and observability correlation.

With the runtime registered, deploy models using standard InferenceService manifests:

llama-inference-service.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-7b
  namespace: ml-models
spec:
  predictor:
    model:
      modelFormat:
        name: vllm
      storageUri: s3://model-artifacts/llama-7b-chat
      runtime: vllm-runtime
    minReplicas: 2
    maxReplicas: 8

Memory Management and Batch Sizing

GPU memory utilization (--gpu-memory-utilization) controls the percentage of available VRAM allocated to vLLM. Setting 0.90 reserves 10% for CUDA kernels and temporary allocations, preventing out-of-memory errors during request spikes. For multi-tenant clusters sharing GPUs, reduce this to 0.85. In dedicated inference environments where stability is paramount, 0.88 provides a reasonable balance between utilization and headroom.

💡 Pro Tip: Monitor the vllm:gpu_cache_usage_perc metric. Sustained values above 95% indicate memory pressure—either reduce max-model-len or add replicas. Pair this with vllm:num_requests_waiting to distinguish between memory constraints and genuine demand spikes.
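
If you run the Prometheus Operator, a rule along these lines captures that pro tip; the alert name, namespace, and 15-minute window are illustrative, and the expression assumes the metric is exported as a 0-1 fraction and that your scrape configuration attaches a pod label:

vllm-cache-pressure-alert.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vllm-cache-pressure
  namespace: monitoring
spec:
  groups:
    - name: vllm-memory
      rules:
        - alert: VLLMKVCachePressure
          # Fires when KV cache usage stays above 95% for 15 minutes
          expr: avg by (pod) (vllm:gpu_cache_usage_perc) > 0.95
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Sustained KV cache pressure on {{ $labels.pod }}: reduce max-model-len or add replicas"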

Maximum model length (--max-model-len) caps the combined input and output token count per request. Lower values increase concurrent request capacity but limit use cases. For conversational applications, 4096 tokens handles most interactions. RAG pipelines with large context windows require 8192 or higher. Consider your actual usage patterns: if 95% of requests complete within 2048 tokens, setting a 4096 limit doubles your effective concurrency compared to 8192.

Tensor parallelism (--tensor-parallel-size) shards model weights across multiple GPUs. A 70B parameter model requires at least 4x A100-40GB GPUs with tensor parallelism enabled. When using tensor parallelism, ensure your InferenceService requests the corresponding GPU count and that GPUs are connected via NVLink for acceptable inter-GPU communication latency. Without NVLink, PCIe bandwidth becomes the bottleneck, negating much of the parallelism benefit.
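
A sketch of what the multi-GPU variant looks like, showing only the fields that change relative to vllm-runtime.yaml (the memory and CPU requests are illustrative and should be sized to your node type):

vllm-runtime-tp4.yaml
spec:
  containers:
    - name: kserve-container
      image: vllm/vllm-openai:v0.4.2
      args:
        - --model=/mnt/models
        - --served-model-name={{.Name}}
        # Shard weights across four NVLink-connected GPUs
        - --tensor-parallel-size=4
        - --max-model-len=4096
        - --gpu-memory-utilization=0.90
      resources:
        limits:
          nvidia.com/gpu: "4"
        requests:
          memory: "96Gi"
          cpu: "16"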

For batch sizing, vLLM’s continuous batching removes the need for manual configuration. The engine automatically maximizes GPU utilization by filling available memory with concurrent requests. Focus tuning efforts on memory parameters rather than explicit batch sizes. If you observe inconsistent latencies, examine the queue depth metrics rather than adjusting batch parameters.

With vLLM handling efficient execution on individual clusters, the next challenge becomes distributing these workloads across multiple Kubernetes clusters for resilience and geographic locality—exactly what Karmada provides.

Multi-Cluster Orchestration with Karmada

Running ML inference at scale demands more than a single Kubernetes cluster. GPU availability fluctuates across regions, latency requirements vary by geography, and resilience requires workload distribution. Karmada provides the orchestration layer that transforms isolated clusters into a unified inference platform.

Visual: Karmada multi-cluster orchestration architecture

Karmada’s Control Plane Architecture

Karmada separates the control plane from member clusters, creating a federation layer that manages workloads without modifying existing cluster configurations. The Karmada API server accepts standard Kubernetes resources—including KServe InferenceServices—and propagates them to member clusters based on policies you define.

This architecture means your KServe deployments remain unchanged. You define an InferenceService once in the Karmada control plane, and propagation policies determine where instances run. The separation also provides operational benefits: you can upgrade member clusters independently, test new configurations on specific clusters before rolling out broadly, and maintain different Kubernetes versions across your fleet when vendor constraints require it.

The control plane components include:

  • karmada-apiserver: Accepts resource definitions and policy configurations
  • karmada-controller-manager: Reconciles desired state across member clusters
  • karmada-scheduler: Places workloads based on cluster capacity and policy constraints
  • karmada-webhook: Validates and mutates resources before they enter the system

Member clusters run a lightweight karmada-agent that reports cluster status and executes propagation decisions. This agent-based model allows clusters to operate independently if connectivity to the control plane is interrupted—workloads continue serving traffic even during network partitions between the control plane and member clusters.

Propagating InferenceServices with PropagationPolicy

PropagationPolicy defines which clusters receive your workloads and how replicas distribute across them. For GPU-intensive inference workloads, cluster selection based on resource availability is critical.

propagation-policy.yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: llm-inference-propagation
spec:
  resourceSelectors:
    - apiVersion: serving.kserve.io/v1beta1
      kind: InferenceService
      name: llama-inference
  placement:
    clusterAffinity:
      clusterNames:
        - gpu-cluster-us-east
        - gpu-cluster-eu-west
        - gpu-cluster-ap-south
    spreadConstraints:
      - maxGroups: 3
        minGroups: 2
        spreadByField: cluster
    replicaScheduling:
      replicaDivisionPreference: Weighted
      replicaSchedulingType: Divided
      weightPreference:
        staticWeightList:
          - targetCluster:
              clusterNames:
                - gpu-cluster-us-east
            weight: 3
          - targetCluster:
              clusterNames:
                - gpu-cluster-eu-west
            weight: 2
          - targetCluster:
              clusterNames:
                - gpu-cluster-ap-south
            weight: 1

This policy distributes the LLaMA inference workload across three GPU clusters with weighted replica allocation. The US East cluster receives the highest proportion of replicas, reflecting primary traffic patterns. The spreadConstraints ensure that at least two clusters always run the workload, providing resilience against single-cluster failures while allowing flexibility in how the remaining capacity distributes.

Cluster-Specific Configuration with OverridePolicy

Different clusters require different configurations. GPU types vary between cloud providers, memory limits differ based on available hardware, and environment-specific settings need injection. OverridePolicy handles these variations without duplicating InferenceService definitions, keeping your resource manifests DRY while accommodating infrastructure heterogeneity.

override-policy.yaml
apiVersion: policy.karmada.io/v1alpha1
kind: OverridePolicy
metadata:
  name: llm-inference-overrides
spec:
  resourceSelectors:
    - apiVersion: serving.kserve.io/v1beta1
      kind: InferenceService
      name: llama-inference
  overrideRules:
    - targetCluster:
        clusterNames:
          - gpu-cluster-us-east
      overriders:
        plaintext:
          - path: /spec/predictor/containers/0/resources/limits/nvidia.com~1gpu
            operator: replace
            value: "4"
          - path: /spec/predictor/containers/0/env/-
            operator: add
            value:
              name: VLLM_TENSOR_PARALLEL_SIZE
              value: "4"
    - targetCluster:
        clusterNames:
          - gpu-cluster-eu-west
      overriders:
        plaintext:
          - path: /spec/predictor/containers/0/resources/limits/nvidia.com~1gpu
            operator: replace
            value: "2"
          - path: /spec/predictor/containers/0/env/-
            operator: add
            value:
              name: VLLM_TENSOR_PARALLEL_SIZE
              value: "2"

The override mechanism uses JSON patch semantics, giving you precise control over which fields to modify. Note the ~1 encoding for the forward slash in nvidia.com/gpu—this follows RFC 6901 JSON Pointer specification.

💡 Pro Tip: Use Karmada’s cluster labels to create dynamic cluster affinity rules. Label clusters with GPU types (gpu-type: a100, gpu-type: h100) and reference these labels in PropagationPolicy to automatically route workloads to appropriate hardware as your infrastructure evolves.
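
A sketch of that pattern, assuming member clusters were registered with a gpu-type label; the policy selects clusters by label instead of hard-coding names, so newly joined A100 clusters are picked up automatically:

gpu-affinity-policy.yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: a100-only-propagation
spec:
  resourceSelectors:
    - apiVersion: serving.kserve.io/v1beta1
      kind: InferenceService
      name: llama-inference
  placement:
    clusterAffinity:
      # Any registered cluster carrying this label is a placement candidate
      labelSelector:
        matchLabels:
          gpu-type: a100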

GPU-Aware Scheduling

Karmada’s scheduler integrates with cluster resource reporting to make placement decisions based on actual GPU availability. The karmada-agent reports extended resources including nvidia.com/gpu counts, enabling the scheduler to avoid clusters with insufficient capacity. This prevents scheduling failures that would otherwise occur when placing workloads on clusters lacking available GPUs.

Configure cluster resource models to expose GPU metrics:

cluster-resource-model.yaml
apiVersion: cluster.karmada.io/v1alpha1
kind: Cluster
metadata:
  name: gpu-cluster-us-east
spec:
  resourceModels:
    - grade: 0
      ranges:
        - name: nvidia.com/gpu
          min: 0
          max: 8

Resource models define capacity grades that the scheduler uses when evaluating placement decisions. A cluster with available GPUs in the 0-8 range falls into grade 0. You can define multiple grades to represent different capacity tiers—clusters with more available GPUs can receive higher priority for resource-intensive workloads.
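
A sketch of a multi-grade model extending the manifest above; the grade boundaries are arbitrary examples and should be tuned to your actual GPU pool sizes:

cluster-resource-grades.yaml
spec:
  resourceModels:
    # Grade 0: small GPU pools
    - grade: 0
      ranges:
        - name: nvidia.com/gpu
          min: 0
          max: 8
    # Grade 1: mid-sized pools
    - grade: 1
      ranges:
        - name: nvidia.com/gpu
          min: 8
          max: 32
    # Grade 2: large pools, preferred for resource-intensive workloads
    - grade: 2
      ranges:
        - name: nvidia.com/gpu
          min: 32
          max: 128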

The scheduler also respects taints and tolerations propagated from member clusters. If a cluster marks its GPU nodes as unavailable during maintenance, the scheduler automatically redirects new placements to other clusters without manual intervention.

With propagation policies, override policies, and GPU-aware scheduling in place, your inference workloads distribute intelligently across clusters. The next challenge is handling failures gracefully and managing traffic across this distributed infrastructure.

Implementing Failover and Traffic Management

A multi-cluster inference deployment provides limited value if a single cluster failure takes down your entire serving capacity. This section covers the patterns and configurations that transform a distributed KServe deployment into a resilient system capable of automatic recovery. The architecture combines Karmada’s cluster-level orchestration with Istio’s traffic management to create defense in depth against failures at multiple layers of the stack.

Karmada Failover Configuration

Karmada’s failover capabilities rely on cluster health monitoring and automated replica redistribution. The controller continuously evaluates cluster status through heartbeat signals and resource availability metrics, triggering corrective actions when thresholds are breached. Configure the propagation policy with explicit failover behavior:

failover-policy.yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: llm-inference-failover
  namespace: ml-serving
spec:
  resourceSelectors:
    - apiVersion: serving.kserve.io/v1beta1
      kind: InferenceService
      name: llama-70b-chat
  placement:
    clusterAffinity:
      clusterNames:
        - us-east-1-gpu
        - us-west-2-gpu
        - eu-west-1-gpu
    spreadConstraints:
      - maxGroups: 3
        minGroups: 2
        spreadByField: cluster
    replicaScheduling:
      replicaSchedulingType: Divided
      replicaDivisionPreference: Weighted
      weightPreference:
        staticWeightList:
          - targetCluster:
              clusterNames:
                - us-east-1-gpu
            weight: 40
          - targetCluster:
              clusterNames:
                - us-west-2-gpu
            weight: 35
          - targetCluster:
              clusterNames:
                - eu-west-1-gpu
            weight: 25
  failover:
    application:
      decisionConditions:
        tolerationSeconds: 60
      purgeMode: Graciously
      gracePeriodSeconds: 120

The tolerationSeconds parameter controls how long Karmada waits before declaring a cluster unhealthy. For inference workloads, 60 seconds balances between premature failovers from transient network issues and prolonged service degradation. Setting this value too low triggers unnecessary replica migrations during routine network hiccups, while values exceeding two minutes leave users experiencing errors for an unacceptable duration. The purgeMode: Graciously setting ensures in-flight requests complete before replica migration, preventing abrupt connection terminations that would otherwise return errors to clients mid-inference.

The spreadConstraints configuration with minGroups: 2 ensures that replicas always span at least two clusters, preventing a single point of failure. If a cluster becomes unavailable, the remaining clusters maintain service continuity while Karmada redistributes the affected replicas.

Health-Based Scheduling

Karmada evaluates cluster health through multiple signals including node availability, resource capacity, and network connectivity. Extend the default health checks with inference-specific criteria using a cluster override policy:

health-override.yaml
apiVersion: policy.karmada.io/v1alpha1
kind: ClusterOverridePolicy
metadata:
  name: inference-health-requirements
spec:
  resourceSelectors:
    - apiVersion: serving.kserve.io/v1beta1
      kind: InferenceService
  overrideRules:
    - targetCluster:
        clusterNames:
          - us-east-1-gpu
          - us-west-2-gpu
          - eu-west-1-gpu
      overriders:
        annotationsOverrider:
          - operator: addIfAbsent
            value:
              serving.kserve.io/health-probe-timeout: "30"
              serving.kserve.io/min-ready-replicas: "1"

When a cluster’s GPU nodes experience memory pressure or the inference pods fail health checks, Karmada automatically redistributes replicas to healthy clusters according to the weighted preferences. The min-ready-replicas annotation prevents Karmada from considering a cluster healthy until at least one inference pod passes its readiness probe, avoiding premature traffic routing to clusters still initializing their model weights.

Service Mesh Integration for Cross-Cluster Routing

Istio’s multi-cluster service mesh provides the traffic management layer that connects clients to the nearest healthy inference endpoint. This layer operates independently from Karmada’s orchestration, enabling rapid traffic shifting without waiting for pod rescheduling. Configure a destination rule that prioritizes local cluster routing with automatic failover:

cross-cluster-routing.yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: llm-inference-routing
  namespace: ml-serving
spec:
  host: llama-70b-chat-predictor.ml-serving.svc.cluster.local
  trafficPolicy:
    connectionPool:
      http:
        h2UpgradePolicy: UPGRADE
        maxRequestsPerConnection: 100
    loadBalancer:
      localityLbSetting:
        enabled: true
        failover:
          - from: us-east-1
            to: us-west-2
          - from: us-west-2
            to: us-east-1
          - from: eu-west-1
            to: us-east-1
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50

The outlierDetection configuration ejects unhealthy endpoints from the load balancing pool after three consecutive errors, preventing request routing to degraded inference pods while Karmada handles cluster-level failover. The baseEjectionTime of 30 seconds provides sufficient recovery window for transient GPU memory issues without permanently removing endpoints from rotation.

Locality-aware load balancing through localityLbSetting minimizes cross-region latency by routing requests to the nearest healthy cluster. The explicit failover chain ensures deterministic behavior when the local cluster becomes unavailable, routing US East traffic to US West before considering the higher-latency European cluster.

💡 Pro Tip: Set maxEjectionPercent to 50 rather than 100 for inference workloads. This prevents complete endpoint ejection during GPU memory spikes that cause temporary 5xx responses, maintaining some serving capacity while pods recover.

This failover architecture handles both pod-level failures through Istio’s outlier detection and cluster-level failures through Karmada’s replica redistribution. The combination provides sub-minute recovery times for most failure scenarios, with Istio responding to individual pod failures within seconds while Karmada orchestrates broader cluster remediation over the configured toleration period.

With resilient traffic management in place, visibility into system behavior becomes critical. The next section addresses observability patterns that surface health metrics, latency distributions, and failure events across your distributed inference stack.

Observability Across the Stack

Distributed ML inference creates blind spots that traditional monitoring approaches miss. When a request traverses multiple clusters, touches GPU-accelerated pods, and passes through KServe’s routing layer, you need unified visibility into every component. Without it, debugging latency spikes becomes a multi-hour investigation across disconnected dashboards.

Unified Metrics Collection

The three-layer stack exposes metrics through Prometheus endpoints, but aggregating them requires intentional design. KServe exports inference-specific metrics on port 9091, vLLM provides engine statistics on its metrics endpoint, and Karmada exposes cluster health and scheduling decisions. Deploy a federated Prometheus setup with Thanos or Cortex to collect metrics from all member clusters into a single query layer.

Configure your scrape targets to capture:

  • KServe: revision_request_latencies, revision_request_count, queue_depth
  • vLLM: vllm:num_requests_running, vllm:gpu_cache_usage_perc, vllm:avg_generation_throughput_toks_per_s
  • Karmada: karmada_cluster_ready_condition, karmada_resource_binding_synced
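
As one simplified example, a per-cluster Prometheus scrape job can pick up pods annotated for scraping; the annotation names match the vLLM runtime configuration earlier, while the job name and cluster label are illustrative:

prometheus-scrape-config.yaml
global:
  external_labels:
    cluster: gpu-cluster-us-east   # lets the federated query layer distinguish clusters
scrape_configs:
  - job_name: kserve-vllm-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods that opted in via the prometheus.kserve.io/scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_kserve_io_scrape]
        action: keep
        regex: "true"
      # Point the scrape address at the annotated metrics port (8000 for vLLM)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_kserve_io_port]
        action: replace
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
        target_label: __address__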

Key Inference Metrics

Focus your dashboards on the metrics that directly impact user experience and resource efficiency. Track p50, p95, and p99 latency percentiles separately—a healthy p50 with a degraded p99 indicates request queuing or cold start issues. Monitor GPU memory utilization alongside cache hit rates; vLLM’s PagedAttention performs best when KV cache utilization stays in the 70-85% range.

Queue depth deserves particular attention in multi-cluster deployments. Rising queue depth in one cluster while others remain idle signals a Karmada scheduling misconfiguration or stale health checks.

Cross-Cluster Alerting

Build alerts around SLO violations rather than absolute thresholds. Define your target—for example, 95% of inference requests complete within 200ms—and alert when the burn rate threatens that objective. Use multi-window alerting to distinguish between brief spikes and sustained degradation.

Critical alerts for production inference stacks include: inference latency SLO burn rate exceeding budget, GPU memory pressure above 90% for more than 5 minutes, Karmada cluster connectivity loss, and vLLM engine restarts exceeding threshold.
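
A sketch of a fast-burn rule for the 200ms objective, assuming the KServe latency histogram is exposed in milliseconds as revision_request_latencies_bucket with a 200ms bucket; the 5m/1h window pair and the 14.4x factor follow the common SRE convention for fast-burn alerts against a 95% SLO, and the alert name and namespace are illustrative:

latency-slo-burnrate.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: inference-latency-slo
  namespace: monitoring
spec:
  groups:
    - name: inference-slo
      rules:
        - alert: InferenceLatencySLOFastBurn
          # Fraction of requests slower than 200ms over a short and a long window;
          # requiring both to exceed 14.4x the 5% error budget filters brief spikes
          # while still catching sustained degradation within minutes.
          expr: |
            (
              1 - sum(rate(revision_request_latencies_bucket{le="200"}[5m]))
                  / sum(rate(revision_request_latencies_count[5m]))
            ) > (14.4 * 0.05)
            and
            (
              1 - sum(rate(revision_request_latencies_bucket{le="200"}[1h]))
                  / sum(rate(revision_request_latencies_count[1h]))
            ) > (14.4 * 0.05)
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Inference latency SLO is burning error budget too fast"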

💡 Pro Tip: Create a synthetic inference probe that runs continuously across all clusters. Comparing synthetic latency against real traffic latency reveals whether issues stem from your infrastructure or from specific model inputs.
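
One lightweight way to implement the probe is a CronJob per cluster that sends a fixed request to the local predictor and reports the wall-clock latency; everything here (image, endpoint, payload, schedule) is a placeholder to adapt to your own models and metrics pipeline:

synthetic-probe.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: inference-synthetic-probe
  namespace: ml-serving
spec:
  schedule: "*/5 * * * *"   # every five minutes
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: probe
              image: curlimages/curl:8.8.0   # placeholder probe image
              command:
                - sh
                - "-c"
                # -w '%{time_total}' prints end-to-end latency; push it to your
                # metrics backend instead of stdout in a real deployment
                - >
                  curl -s -o /dev/null
                  -w 'probe_latency_seconds %{time_total}\n'
                  -H 'Content-Type: application/json'
                  -d '{"model": "llama-70b-chat", "prompt": "healthcheck", "max_tokens": 1}'
                  http://llama-70b-chat-predictor.ml-serving.svc.cluster.local/v1/completions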

With observability in place, you have the visibility needed to operate confidently. The final section consolidates everything into an actionable deployment checklist.

Production Deployment Checklist

Moving from proof-of-concept to production requires systematic attention to security, cost management, and operational sustainability. This checklist distills the critical considerations for running KServe, vLLM, and Karmada at scale.

Security Hardening

RBAC Configuration: Implement least-privilege access across all clusters. Create dedicated service accounts for KServe inference services, Karmada controllers, and vLLM model servers. Avoid cluster-admin bindings—instead, scope permissions to specific namespaces and resource types.

Network Policies: Enforce strict pod-to-pod communication rules. Inference endpoints should only accept traffic from authorized ingress controllers and service meshes. Block egress except for model storage backends and telemetry collectors.
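
A starting-point sketch for the ingress half of that rule, assuming the mesh ingress runs in the istio-system namespace and that predictor pods carry KServe's serving.kserve.io/inferenceservice label:

inference-ingress-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-inference-ingress
  namespace: ml-serving
spec:
  podSelector:
    matchLabels:
      serving.kserve.io/inferenceservice: llama-70b-chat
  policyTypes:
    - Ingress
  ingress:
    # Only workloads in the mesh ingress namespace may reach the predictor pods
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: istio-system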

Secrets Management: Store model registry credentials, API keys, and TLS certificates in external secrets managers like HashiCorp Vault or AWS Secrets Manager. Use the External Secrets Operator to sync secrets across Karmada member clusters, ensuring consistent rotation without manual intervention.

GPU Cost Optimization

Right-size GPU allocations: Profile your models under realistic load before committing to instance types. vLLM’s continuous batching often achieves target latency on smaller GPUs than initial estimates suggest.

Implement cluster autoscaling: Configure Karpenter or Cluster Autoscaler with GPU-aware node pools. Set aggressive scale-down delays (300-600 seconds) to avoid thrashing during traffic fluctuations.

Use spot instances strategically: Route non-critical inference traffic to spot-backed node pools. Karmada’s override policies enable automatic failover to on-demand capacity when spot instances face interruption.

💡 Pro Tip: Schedule batch inference workloads during off-peak hours using Kubernetes CronJobs, capitalizing on lower spot prices and reduced cluster contention.

Version Compatibility Matrix

Maintain explicit version pinning across components. KServe 0.12+ requires Knative Serving 1.12+, while vLLM runtime integration demands KServe’s ModelMesh or raw deployment mode. Karmada 1.9+ provides the PropagationPolicy features essential for GPU workload distribution. Test upgrade paths in staging clusters before production rollout.

With these foundations in place, you’re equipped to operate a resilient, cost-effective multi-cluster inference platform.

Key Takeaways

  • Start with KServe InferenceServices to get autoscaling and canary deployments before adding multi-cluster complexity
  • Use vLLM’s PagedAttention configuration to reduce GPU memory requirements by 2-4x for LLM workloads
  • Configure Karmada PropagationPolicies with cluster taints to ensure GPU workloads only land on capable clusters
  • Implement health-based failover in Karmada before you need it—cluster failures during inference spikes are not the time to learn the API