From Zero to Production AKS: A Battle-Tested Blueprint for Azure Kubernetes
Your development cluster works perfectly. Pods schedule instantly, services resolve without hiccups, and your CI/CD pipeline deploys like clockwork. Then you push to production. Within hours, you’re watching pods get evicted with cryptic OOMKilled messages. Your application’s latency spikes from 50ms to 2 seconds during peak traffic. The networking team asks why your cluster is consuming half the available IPs in the production subnet. And at the end of the month, finance wants to know why your Kubernetes “experiment” costs more than the three VMs it was supposed to replace.
This gap between development and production isn’t a knowledge gap—it’s an experience gap. The Kubernetes documentation tells you how to create a cluster. It doesn’t tell you that Azure CNI will exhaust your subnet if you don’t plan IP allocation upfront. It doesn’t mention that the default node pool configuration will leave you vulnerable to noisy neighbor problems. And it certainly doesn’t warn you that enabling every shiny feature—pod identity, Key Vault integration, Azure Policy—without understanding the performance implications will add 30 seconds to your pod startup time.
I’ve built AKS clusters that handled 10,000 requests per second without breaking a sweat, and I’ve inherited clusters that fell over at 100 RPS. The difference wasn’t the underlying Azure infrastructure or the Kubernetes version. It was the dozens of small decisions made during initial setup—decisions that seem inconsequential until production traffic exposes every shortcut and assumption.
What follows is the blueprint I wish I’d had three years ago: a production-ready AKS configuration built from real failures, unexpected outages, and hard-won operational knowledge.
The Production Gap: Why Dev Clusters Fail at Scale
That development AKS cluster humming along with your sample app? It will betray you in production. Not because Kubernetes is broken, but because production exposes every shortcut, assumption, and default configuration you accepted during development.

I’ve watched teams deploy their first production AKS cluster with confidence, only to face a cascade of failures within weeks. The patterns repeat with painful consistency.
The Failure Modes Nobody Warns You About
Networking becomes your first crisis. Your dev cluster uses kubenet with basic load balancers. Production demands Azure CNI for pod-level network policies, private endpoints for PaaS services, and proper subnet sizing. That /24 subnet you allocated? It exhausts IP addresses the moment you scale past 50 pods. Azure CNI assigns IPs from your VNet directly—run out, and new pods fail to schedule.
Identity sprawls into chaos. Dev clusters run with overprivileged service principals. Production requires workload identity federation, pod-managed identities, and granular RBAC that doesn’t grant cluster-admin to your CI pipeline. When your container registry pull fails at 2 AM because a service principal expired, you’ll understand why identity architecture matters.
Resource limits expose themselves violently. Azure subscription quotas, node pool VM family limits, and API rate limiting don’t surface until you’re scaling under load. That autoscaler configured to add nodes during traffic spikes? It fails silently when you’ve exhausted your regional vCPU quota.
The Cost Cliff
Your $200/month development cluster becomes $2000 in production through accumulation: premium SSDs for persistent volumes, reserved capacity for availability zones, Log Analytics ingestion at scale, multiple node pools for workload isolation, and the load balancers multiplying with each exposed service. Every production-grade feature adds cost that dev clusters deliberately avoid.
Pro Tip: Enable cost analysis tags from day one. Tracking spend by namespace, team, and environment prevents the month-end shock when Azure billing reveals your observability stack costs more than your compute.
Defining Production-Ready
Production readiness isn’t a checklist—it’s achieving specific outcomes: 99.9% availability through proper node pool design and pod disruption budgets, security through network policies and secrets management, observability through metrics that answer “why is this broken” rather than “something is wrong,” and cost predictability through resource governance and autoscaling limits.
These outcomes require architectural decisions made at cluster creation time, not bolted on later.
Let’s start with those foundational decisions: how to design your cluster architecture for resilience from the first az aks create command.
Cluster Architecture: Designing for Resilience from Day One
A production AKS cluster isn’t just a bigger version of your development environment. The architectural decisions you make at cluster creation time—node pool topology, networking model, and cluster accessibility—determine whether you’ll be running smoothly at scale or planning a painful migration six months down the road.

Node Pool Strategy: Separate System from User Workloads
The single most impactful architectural pattern is separating system node pools from user workloads. System pools run critical components: CoreDNS, kube-proxy, metrics-server, and Azure-specific controllers. When user workloads consume all available resources, these components fail—and your entire cluster becomes unresponsive.
Create a dedicated system pool with the CriticalAddonsOnly=true:NoSchedule taint. Size it for the system components it hosts (typically 2-3 nodes of Standard_D4s_v3 or equivalent), and deploy user workloads to separate pools with appropriate taints and tolerations. This isolation means a runaway deployment won't starve DNS resolution or prevent new pods from scheduling.
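For illustration, a deployment targeting a hypothetical user pool tainted with workload=general:NoSchedule might look like this sketch; the taint, pool name, and image are assumptions, while the kubernetes.azure.com/agentpool label is applied by AKS automatically:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      # Tolerate the user-pool taint so pods are allowed to land there
      tolerations:
        - key: workload
          operator: Equal
          value: general
          effect: NoSchedule
      # Pin the workload to the intended pool via its AKS-applied label
      nodeSelector:
        kubernetes.azure.com/agentpool: general
      containers:
        - name: web
          image: myregistry.azurecr.io/web-frontend:v1.0.0
```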
For user pools, span multiple availability zones. Azure charges nothing extra for zone-redundant deployments, and you get genuine failure domain isolation. A single-zone cluster means a datacenter issue takes down everything—I’ve seen teams learn this lesson during Azure’s 2023 regional incidents.
Networking: The Decision You Can’t Easily Undo
Choose between Azure CNI and Kubenet before cluster creation. Migrating between networking models requires cluster recreation.
Azure CNI assigns VNet IP addresses directly to pods, enabling native integration with Azure services, Network Security Groups, and private endpoints. The tradeoff is IP address consumption—plan for three to four times your expected pod count in subnet sizing. For clusters running 200+ pods, you’ll need a /21 or larger subnet.
Kubenet uses NAT to conserve IP addresses but complicates connectivity. Pods aren’t directly addressable from outside the cluster, Azure Firewall rules become more complex, and you lose some Azure Monitor integration capabilities.
Pro Tip: Default to Azure CNI unless you’re genuinely constrained on IP address space. The operational simplicity pays for itself within months.
Private Clusters: Security vs. Operational Overhead
Private clusters disable public API server endpoints entirely. All control plane traffic flows through your VNet, eliminating one attack surface. However, this means your CI/CD pipelines, developer workstations, and monitoring systems all need VNet connectivity.
The middle ground—authorized IP ranges on a public cluster—works well for most organizations. Restrict API server access to your corporate egress IPs and Azure DevOps/GitHub runners. You get meaningful security without the jump box complexity.
Control Plane Sizing
AKS offers Free and Standard tiers. The Free tier provides no SLA and limits API server capacity. Once you exceed 10 nodes or run anything production-critical, Standard tier’s financially-backed SLA and higher API throughput become necessary.
Watch for API server throttling in your metrics. Symptoms include slow kubectl responses, delayed pod scheduling, and webhook timeouts. If you’re hitting limits, the problem is architectural—either too many small clusters or workloads that hammer the API excessively.
With your cluster architecture defined, the next step is encoding these decisions in repeatable infrastructure. Let’s look at Terraform patterns that capture these architectural choices while remaining maintainable as your platform evolves.
Infrastructure as Code: Terraform Patterns for AKS
Manual cluster creation works fine for learning. Production demands repeatability, version control, and the ability to recover from disasters without frantic clicking through the Azure portal at 3 AM. Terraform provides that foundation, but only when structured to handle the complexity AKS actually requires.
Modular Structure That Scales
A flat Terraform configuration becomes unmaintainable after a few hundred lines. Split your AKS infrastructure into composable modules that mirror operational boundaries:
resource "azurerm_kubernetes_cluster" "main" { name = var.cluster_name location = var.location resource_group_name = var.resource_group_name dns_prefix = var.cluster_name kubernetes_version = var.kubernetes_version
default_node_pool { name = "system" vm_size = "Standard_D4s_v5" zones = ["1", "2", "3"] auto_scaling_enabled = true min_count = 2 max_count = 5 vnet_subnet_id = var.subnet_id
upgrade_settings { max_surge = "33%" drain_timeout_in_minutes = 30 node_soak_duration_in_minutes = 0 } }
identity { type = "UserAssigned" identity_ids = [var.cluster_identity_id] }
oidc_issuer_enabled = true workload_identity_enabled = true
maintenance_window { allowed { day = "Sunday" hours = [2, 3, 4] } }}Separate modules for networking, identity, node pools, and the cluster itself let teams work independently and enable composition across environments. Your production cluster references the same node pool module as staging—just with different variables. This separation also simplifies testing: you can validate networking changes in isolation before they touch production clusters.
Structure your module hierarchy to reflect your organization’s operational model. A typical layout includes a modules/ directory containing aks-cluster/, networking/, identity/, and node-pools/ subdirectories. Environment-specific configurations in environments/dev/, environments/staging/, and environments/prod/ then compose these modules with appropriate variable overrides.
Azure AD and Workload Identity
Workload identity eliminates the need for storing credentials in Kubernetes secrets. The configuration requires coordination between Azure AD, the AKS cluster, and your deployments:
resource "azurerm_user_assigned_identity" "workload" { name = "${var.app_name}-identity" location = var.location resource_group_name = var.resource_group_name}
resource "azurerm_federated_identity_credential" "workload" { name = "${var.app_name}-federated" resource_group_name = var.resource_group_name parent_id = azurerm_user_assigned_identity.workload.id audience = ["api://AzureADTokenExchange"] issuer = var.oidc_issuer_url subject = "system:serviceaccount:${var.namespace}:${var.service_account_name}"}
resource "azurerm_role_assignment" "keyvault_access" { scope = var.keyvault_id role_definition_name = "Key Vault Secrets User" principal_id = azurerm_user_assigned_identity.workload.principal_id}The federated credential binds a specific Kubernetes service account to the Azure managed identity. Applications running with that service account automatically receive Azure tokens—no secrets rotation, no credential leakage risk. The subject field follows a strict format: changing the namespace or service account name breaks the federation silently, so validate these values against your Kubernetes manifests.
Beyond Key Vault access, workload identity enables secure connections to Azure SQL, Storage Accounts, Service Bus, and any Azure resource supporting managed identity authentication. Define role assignments per resource, following least-privilege principles. A single managed identity can hold multiple role assignments, but consider creating separate identities for applications with distinct security boundaries.
Node Pool Configuration for Cost and Performance
Production clusters need multiple node pools. System pools run cluster components with guaranteed capacity. Workload pools handle applications with appropriate sizing. Spot pools provide cost optimization for fault-tolerant workloads:
resource "azurerm_kubernetes_cluster_node_pool" "spot" { name = "spot" kubernetes_cluster_id = var.cluster_id vm_size = "Standard_D8s_v5" priority = "Spot" eviction_policy = "Delete" spot_max_price = -1
auto_scaling_enabled = true min_count = 0 max_count = 20 zones = ["1", "2", "3"] vnet_subnet_id = var.subnet_id
node_taints = ["kubernetes.azure.com/scalesetpriority=spot:NoSchedule"]
node_labels = { "workload-type" = "batch" "spot" = "true" }}Pro Tip: Set
spot_max_price = -1to accept the current spot price up to on-demand rates. This maximizes availability while still capturing savings averaging 60-80% for most VM sizes.
Taints on spot pools prevent accidental scheduling of critical workloads. Applications that tolerate interruption explicitly tolerate the taint; everything else lands on reliable on-demand nodes. Combine taints with node affinity rules in your deployments for precise placement control.
Consider creating dedicated node pools for workloads with specific requirements: GPU-enabled pools for machine learning, memory-optimized pools for caching layers, or burstable pools for development environments. Each pool can have independent autoscaling boundaries, ensuring cost control without sacrificing availability for critical services.
Managing Upgrades Through Code
AKS releases new Kubernetes versions monthly. Terraform manages this through explicit version pinning combined with maintenance windows:
module "aks" { source = "../../modules/aks-cluster"
cluster_name = "prod-westus2" kubernetes_version = "1.29.2"
automatic_upgrade_channel = "patch" node_os_upgrade_channel = "NodeImage"}The patch upgrade channel applies security fixes automatically within maintenance windows while keeping you in control of minor version upgrades. Node image upgrades refresh the underlying OS without changing Kubernetes versions—essential for CVE remediation.
Pin explicit versions in your Terraform configuration, test upgrades in lower environments, then update the version string and apply. The maintenance window ensures upgrades happen during acceptable timeframes rather than immediately on apply. For clusters spanning multiple node pools, upgrades proceed sequentially—plan for extended maintenance periods proportional to your cluster size.
Track version deprecation timelines in your infrastructure repository. AKS supports three minor versions simultaneously, and Microsoft announces deprecations months in advance. Integrate version checking into your CI pipeline to catch outdated configurations before they become urgent security issues.
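One lightweight way to wire that check into CI is a scheduled job that asks AKS which upgrades are available for each cluster. A GitHub Actions sketch, assuming a service-principal secret named AZURE_CREDENTIALS and the cluster names used elsewhere in this post:

```yaml
# Hypothetical scheduled workflow, e.g. .github/workflows/aks-version-check.yml
name: aks-version-check
on:
  schedule:
    - cron: "0 6 * * 1"   # every Monday morning
jobs:
  list-available-upgrades:
    runs-on: ubuntu-latest
    steps:
      - uses: azure/login@v2
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}
      - name: Show upgrade paths for the production cluster
        run: |
          az aks get-upgrades \
            --resource-group rg-prod-aks \
            --name aks-prod-westus2 \
            --output table
```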
With infrastructure codified and repeatable, the next challenge becomes hardening that infrastructure against the threats production clusters inevitably face.
Security Hardening: Beyond Default Configurations
A default AKS cluster provides basic isolation, but production environments demand defense-in-depth. The security configurations that follow address the gaps between a functional cluster and one that satisfies SOC 2, HIPAA, or PCI-DSS requirements—without creating friction that drives developers to find workarounds.
Workload Identity: The Pod-Managed Identity Replacement
Azure AD Workload Identity replaces the deprecated pod-managed identity with a more secure, standards-based approach using service account token federation. This eliminates the need for the NMI pods that created scaling bottlenecks and attack surface in larger clusters.
The architecture shift matters: pod-managed identity relied on a node-level daemon intercepting IMDS requests, creating a single point of failure and a tempting target for lateral movement. Workload Identity uses OpenID Connect federation, where Azure AD validates tokens issued by the cluster’s own OIDC provider.
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: order-processor
  namespace: payments
  annotations:
    azure.workload.identity/client-id: "a1b2c3d4-5678-90ab-cdef-1234567890ab"
    azure.workload.identity/tenant-id: "98765432-10fe-dcba-0987-654321fedcba"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-processor
  namespace: payments
spec:
  selector:
    matchLabels:
      app: order-processor
  template:
    metadata:
      labels:
        app: order-processor
        azure.workload.identity/use: "true"
    spec:
      serviceAccountName: order-processor
      containers:
        - name: processor
          image: myregistry.azurecr.io/order-processor:v2.1.0
```

The federated credential binding happens in Azure, not in the cluster, which means compromising a pod grants no persistent identity access. Tokens are short-lived (typically one hour) and automatically rotated, significantly reducing the blast radius of credential theft.
Network Policies: Azure NPM vs Calico
Azure Network Policy Manager works for basic east-west traffic control, but Calico delivers the granular policy enforcement production workloads require. The performance overhead is minimal, and you gain DNS-aware policies, global network sets, and the ability to write policies that reference FQDNs rather than IP ranges that change without notice.
```yaml
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: deny-default-egress
spec:
  selector: projectcalico.org/namespace != "kube-system"
  types:
    - Egress
  egress:
    - action: Allow
      destination:
        selector: k8s-app == "kube-dns"
    - action: Allow
      destination:
        nets:
          - 10.0.0.0/8
    - action: Deny
```

This policy allows internal communication and DNS resolution while blocking unexpected external egress—a pattern that stops data exfiltration attempts and cryptominer callbacks. Start with this deny-by-default posture, then add explicit allow rules for each legitimate external dependency.
Pro Tip: Enable Calico’s flow logs to Azure Log Analytics before tightening policies. Two weeks of baseline traffic data prevents breaking legitimate connections during policy rollout and gives you evidence for compliance audits.
Defender for Containers: Coverage Gaps
Microsoft Defender for Containers catches runtime threats, vulnerable images, and suspicious process execution. The behavioral detection identifies cryptomining, reverse shells, and reconnaissance tools with reasonable accuracy. However, it misses subtle configuration drift, custom admission webhook bypasses, and supply chain attacks through base image layers older than its vulnerability database refresh cycle.
Supplement Defender with admission controllers that enforce your specific requirements:
- Require signed images from your private registry using Notary or Cosign
- Block privileged containers outside designated namespaces (see the policy sketch after this list)
- Enforce resource limits on all workloads to prevent resource exhaustion attacks
- Validate that pods specify explicit security contexts with appropriate restrictions
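For the privileged-container rule above, one option is a validating admission policy. This sketch uses Kyverno (an assumption, not something the Defender add-on provides), and the excluded namespaces are placeholders:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged-containers
spec:
  validationFailureAction: Enforce   # reject violating pods instead of only auditing
  background: true
  rules:
    - name: no-privileged-containers
      match:
        any:
          - resources:
              kinds:
                - Pod
      exclude:
        any:
          - resources:
              namespaces:
                - kube-system        # placeholder allowlist for system components
      validate:
        message: "Privileged containers are only allowed in designated namespaces."
        pattern:
          spec:
            containers:
              - =(securityContext):
                  =(privileged): "false"
```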
Secrets Management with Key Vault CSI Driver
The Secrets Store CSI Driver mounts Key Vault secrets directly into pods, eliminating Kubernetes Secret objects that persist in etcd. This approach satisfies auditors who require secrets never exist unencrypted at rest in the cluster—a common stumbling block for PCI-DSS and HIPAA compliance.
```yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: azure-keyvault-payments
  namespace: payments
spec:
  provider: azure
  parameters:
    usePodIdentity: "false"
    useVMManagedIdentity: "false"
    clientID: "a1b2c3d4-5678-90ab-cdef-1234567890ab"
    keyvaultName: "prod-payments-kv"
    tenantId: "98765432-10fe-dcba-0987-654321fedcba"
    objects: |
      array:
        - |
          objectName: database-connection-string
          objectType: secret
        - |
          objectName: stripe-api-key
          objectType: secret
```

Mount the secrets as files rather than environment variables—file mounts support rotation without pod restarts, and they prevent secrets from appearing in process listings or crash dumps. Configure the rotationPollInterval to check for updated secrets every two minutes, balancing freshness against Key Vault API rate limits.
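Pods consume the provider class through a CSI volume. A minimal sketch that mounts the Key Vault objects as files (mount path is illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-processor
  namespace: payments
spec:
  selector:
    matchLabels:
      app: order-processor
  template:
    metadata:
      labels:
        app: order-processor
        azure.workload.identity/use: "true"
    spec:
      serviceAccountName: order-processor
      containers:
        - name: processor
          image: myregistry.azurecr.io/order-processor:v2.1.0
          volumeMounts:
            # Each Key Vault object appears as a file under this path
            - name: keyvault-secrets
              mountPath: /mnt/secrets-store
              readOnly: true
      volumes:
        - name: keyvault-secrets
          csi:
            driver: secrets-store.csi.k8s.io
            readOnly: true
            volumeAttributes:
              secretProviderClass: azure-keyvault-payments
```

The driver fetches the objects when the pod starts, so a missing Key Vault permission surfaces as a pod stuck in ContainerCreating rather than a runtime error.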
Security hardening protects your cluster, but protection means nothing if you cannot detect incidents when they occur. The observability stack transforms security events into actionable alerts.
Observability Stack: Monitoring That Actually Helps at 2 AM
The difference between a good observability stack and a great one becomes painfully clear during an incident. When your cluster is degrading at 2 AM, you need dashboards that surface the problem in seconds, not force you to hunt through a dozen views hoping something looks wrong. The goal isn’t more data—it’s faster answers.
Azure Monitor vs Prometheus: Strategic Coexistence
Azure Monitor with Container Insights provides immediate value: node health, pod scheduling issues, and integration with Azure’s alerting infrastructure. It requires zero configuration beyond enabling the add-on, and it integrates natively with Azure’s broader ecosystem for incident management and runbook automation. Prometheus excels at custom application metrics and high-cardinality queries that would be cost-prohibitive in Azure Monitor.
The production pattern that works: use Azure Monitor for infrastructure-level visibility and Prometheus for application-specific metrics. This isn’t about choosing sides—it’s about leveraging each tool’s strengths. Azure Monitor handles the “is my cluster healthy” questions while Prometheus answers “is my application behaving correctly” with the granularity your developers need.
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-server-conf
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    scrape_configs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
```

Alerts That Matter: Signal Over Noise
Most teams start with too many alerts and slowly disable them as fatigue sets in. Start with the alerts that indicate actual user impact or imminent failure. Every alert should have a clear owner and a documented response procedure—if you can’t articulate what action to take when an alert fires, it shouldn’t exist.
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cluster-critical-alerts
  namespace: monitoring
spec:
  groups:
    - name: aks-critical
      rules:
        - alert: PodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[15m]) > 0.5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} is crash looping"
        - alert: NodeMemoryPressure
          expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Node {{ $labels.node }} memory below 10%"
        - alert: PersistentVolumeUsageCritical
          expr: (kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) > 0.9
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "PV {{ $labels.persistentvolumeclaim }} over 90% capacity"
```

Pro Tip: Create a dedicated Slack channel or PagerDuty service for critical alerts only. If everything routes to the same place, nothing gets the attention it deserves.
Log Aggregation Without the Cost Spiral
Azure Log Analytics pricing catches teams off guard when verbose application logging meets pay-per-GB ingestion. A single chatty microservice can generate gigabytes of logs daily, quickly overwhelming budgets designed for modest workloads. Configure data collection rules to filter noise at the source rather than ingesting everything and querying selectively.
Focus container log collection on namespaces running your applications. System namespaces generate substantial log volume that rarely helps during application debugging. Set retention policies aggressively—90 days for application logs, 30 days for system logs. If you need longer retention for compliance, export to Azure Blob Storage at a fraction of the cost. Consider implementing log sampling for high-volume, low-value events like health checks and routine status messages.
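Container Insights reads those collection settings from a ConfigMap named container-azm-ms-agentconfig in kube-system. A minimal sketch, assuming the documented v1 schema, that drops stdout/stderr logs from system namespaces at the agent:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: container-azm-ms-agentconfig
  namespace: kube-system
data:
  schema-version: v1
  config-version: ver1
  log-data-collection-settings: |-
    [log_collection_settings]
      [log_collection_settings.stdout]
        enabled = true
        # Drop chatty system namespaces before they reach Log Analytics
        exclude_namespaces = ["kube-system", "gatekeeper-system"]
      [log_collection_settings.stderr]
        enabled = true
        exclude_namespaces = ["kube-system", "gatekeeper-system"]
```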
Distributed Tracing Integration
Application Insights auto-instrumentation captures traces across your services without code changes. Deploy the OpenTelemetry collector as a DaemonSet to standardize telemetry collection and maintain flexibility in your backend choice. This approach decouples your instrumentation from any specific vendor, allowing you to switch backends or add additional destinations without modifying application code.
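A minimal collector configuration for that DaemonSet might look like the following sketch. It assumes the OpenTelemetry Collector contrib distribution, whose azuremonitor exporter forwards traces to Application Insights via a connection string (placeholder shown):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: monitoring
data:
  collector.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    processors:
      batch: {}
    exporters:
      azuremonitor:
        connection_string: "InstrumentationKey=00000000-0000-0000-0000-000000000000"
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [azuremonitor]
```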
The correlation between traces, logs, and metrics is where real debugging power emerges. When an alert fires, you should be able to pivot from the metric that triggered it to the trace that shows the request path to the logs that reveal the root cause—all within the same investigation flow. This three-pillar correlation transforms debugging from an art into a systematic process.
With observability foundations in place, the next challenge is ensuring your production cluster doesn’t generate unexpected costs that undermine the business case for Kubernetes adoption.
Cost Optimization: Running Production Without the Surprise Bills
Azure Kubernetes Service pricing surprises stem from three sources: compute sprawl, inefficient autoscaling, and ignored commitment discounts. Address all three systematically, and you’ll cut costs by 40-60% without sacrificing reliability.
Spot Node Pools: Strategic Placement Matters
Spot instances offer up to 90% savings, but Azure can reclaim them with 30 seconds notice. The key is workload matching—not blanket adoption.
```bash
# Spot pools automatically receive the kubernetes.azure.com/scalesetpriority=spot
# taint and node label, so critical workloads stay off them by default.
az aks nodepool add \
  --resource-group rg-prod-aks \
  --cluster-name aks-prod-westus2 \
  --name spotworkers \
  --priority Spot \
  --eviction-policy Delete \
  --spot-max-price -1   # Pay up to the on-demand price
```

Deploy stateless, fault-tolerant workloads on spot: batch processing, dev/test environments, and horizontally-scaled API replicas behind proper pod disruption budgets. Never run databases, stateful services, or singleton controllers on spot nodes—the eviction risk isn't worth the savings.
One pattern that works well: run your baseline capacity on regular nodes with reservations, then burst into spot pools during peak demand. This gives you predictable costs for steady-state workloads while capturing spot savings for elastic overflow. Configure your deployments with node affinity preferences (not requirements) so pods land on spot when available but gracefully fall back to on-demand capacity during spot shortages.
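A sketch of that fallback pattern, reusing the spot taint and label AKS applies to spot nodes (workload name and image are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: report-worker
spec:
  replicas: 6
  selector:
    matchLabels:
      app: report-worker
  template:
    metadata:
      labels:
        app: report-worker
    spec:
      # Spot nodes carry this taint; tolerating it is an explicit opt-in
      tolerations:
        - key: kubernetes.azure.com/scalesetpriority
          operator: Equal
          value: spot
          effect: NoSchedule
      affinity:
        nodeAffinity:
          # A preference, not a requirement: prefer spot capacity when it
          # exists, fall back to on-demand nodes when it does not
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: kubernetes.azure.com/scalesetpriority
                    operator: In
                    values:
                      - spot
      containers:
        - name: worker
          image: myregistry.azurecr.io/report-worker:v1.4.0
```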
Autoscaler Tuning for Cost Efficiency
Default autoscaler settings optimize for responsiveness, not cost. Production clusters benefit from conservative tuning:
```bash
# The AKS-managed autoscaler is tuned through the cluster autoscaler profile
az aks update \
  --resource-group rg-prod-aks \
  --name aks-prod-westus2 \
  --cluster-autoscaler-profile \
    scale-down-delay-after-add=10m \
    scale-down-unneeded-time=10m \
    scale-down-utilization-threshold=0.5 \
    skip-nodes-with-local-storage=false \
    balance-similar-node-groups=true \
    expander=least-waste
```

The least-waste expander selects node types that minimize unused capacity. Combined with a 50% utilization threshold, you avoid the oscillation pattern where clusters repeatedly scale up and down during variable load. The 10-minute delays prevent thrashing during brief load fluctuations—without them, you'll see nodes spinning up and down every few minutes, each cycle incurring startup costs and potential pod rescheduling disruption.
Reserved Instances and Savings Plans
For baseline capacity that runs 24/7, Azure Reservations deliver 30-72% savings depending on commitment term:
- 1-year reservations: 30-40% discount, lower commitment risk
- 3-year reservations: 60-72% discount, requires accurate capacity forecasting
Calculate your baseline by analyzing 90 days of node utilization. Reserve capacity for the 40th percentile of usage, cover peaks with on-demand or spot. Azure Savings Plans offer more flexibility than traditional reservations—they apply automatically across VM families and regions, making them ideal for clusters where you might change node SKUs during optimization cycles. Start with savings plans for general compute, then layer in specific reservations once your node pool configurations stabilize.
Resource Quotas That Prevent Runaway Costs
Without limits, a single misconfigured deployment can consume your entire cluster:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: production
spec:
  hard:
    requests.cpu: "100"
    requests.memory: 200Gi
    limits.cpu: "200"
    limits.memory: 400Gi
    persistentvolumeclaims: "20"
    services.loadbalancers: "5"
```

Pro Tip: Set LimitRanges alongside quotas to enforce per-pod maximums. A namespace quota of 200Gi memory means nothing if one pod can request 180Gi. LimitRanges let you cap individual pods at reasonable sizes (say, 8Gi) and inject default requests for pods that omit them—preventing both resource hogging and scheduler inefficiency from pods without resource declarations.
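A LimitRange along these lines pairs with the quota above; the specific caps and defaults are illustrative:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: per-pod-limits
  namespace: production
spec:
  limits:
    - type: Container
      max:
        cpu: "4"
        memory: 8Gi        # hard cap per container, matching the Pro Tip above
      default:
        cpu: 500m
        memory: 512Mi      # injected limits for containers that omit them
      defaultRequest:
        cpu: 250m
        memory: 256Mi      # injected requests, keeping the scheduler informed
```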
Pair quotas with Azure Cost Management alerts at 50%, 75%, and 90% thresholds. By the time you hit 75%, you should already be investigating. Consider implementing showback or chargeback by tagging node pools and using Azure’s cost allocation features to attribute spending to specific teams or applications.
Cost optimization is ongoing, not one-time. Which brings us to the operational patterns that keep clusters healthy through upgrades, scaling events, and the inevitable 2 AM incidents.
Day Two Operations: Upgrades, Scaling, and Incident Response
Launching your AKS cluster is the easy part. Keeping it running through version upgrades, traffic spikes, and 2 AM incidents separates production-grade operations from expensive learning experiences.
AKS Version Upgrade Strategies
AKS releases new Kubernetes versions monthly, and you have roughly 12 months before older versions lose support. Two upgrade patterns dominate production environments:
In-place upgrades work well for non-critical workloads. AKS cordons nodes, drains pods, and upgrades sequentially. The risk: a failed upgrade leaves your cluster in a partially upgraded state. Recovery often requires manual intervention, and debugging a half-upgraded cluster under production pressure tests even experienced teams.
Blue-green cluster upgrades eliminate this risk entirely. Spin up a new cluster with the target version, migrate workloads, validate, then decommission the old cluster. More infrastructure cost during transition, but zero downtime and instant rollback capability. This approach also provides an opportunity to test your disaster recovery procedures and validate that your infrastructure-as-code actually produces identical environments.
For in-place upgrades, always upgrade the control plane first, then node pools one at a time:
```bash
# Upgrade control plane
az aks upgrade --resource-group rg-prod-aks --name aks-prod-westus2 \
  --control-plane-only --kubernetes-version 1.29.2

# Upgrade system node pool
az aks nodepool upgrade --resource-group rg-prod-aks --cluster-name aks-prod-westus2 \
  --name systempool --kubernetes-version 1.29.2

# Upgrade workload node pools sequentially
az aks nodepool upgrade --resource-group rg-prod-aks --cluster-name aks-prod-westus2 \
  --name workloadpool --kubernetes-version 1.29.2
```

Before any upgrade, verify your workloads tolerate node disruptions by running chaos engineering experiments in staging. An upgrade that succeeds technically but causes cascading application failures still counts as an outage.
Pod Disruption Budgets That Actually Protect
Without PDBs, node drains during upgrades or scaling events terminate pods indiscriminately. Define minimum availability for every production workload:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-gateway-pdb
  namespace: production
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api-gateway
```

Pro Tip: Use minAvailable with absolute numbers rather than percentages. During a scale-down event, percentage-based PDBs can permit disruptions you didn't anticipate.
PDBs also serve as documentation of your availability requirements. When a new engineer reviews your manifests, explicit minimum replica counts communicate service criticality better than tribal knowledge ever could.
GitOps with ArgoCD for Consistent Deployments
Manual kubectl apply commands don’t scale and create drift between environments. ArgoCD watches your Git repository and reconciles cluster state automatically:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payment-service
  namespace: argocd
spec:
  project: production
  source:
    repoURL: https://github.com/acme-corp/k8s-manifests.git
    targetRevision: main
    path: apps/payment-service/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```

The selfHeal: true setting automatically reverts manual changes made directly to the cluster, preventing configuration drift that accumulates during incident response and never gets cleaned up.
Incident Response Runbooks
Document these scenarios before you need them:
- Node NotReady: Check kubelet logs, verify VM health in Azure portal, cordon and replace if unrecoverable
- Pod CrashLoopBackOff: Examine events with kubectl describe, check resource limits, review recent deployments
- API server unavailable: Verify network connectivity, check Azure Service Health, review firewall rules
- Persistent volume stuck: Identify blocking pods, force detach through Azure CLI if necessary
Store runbooks alongside your infrastructure code in the same repository. When incidents occur, engineers need answers in seconds, not minutes spent searching wikis or asking colleagues who might be asleep. Include specific commands, expected outputs, and escalation paths for each scenario.
With operational patterns established, your AKS cluster handles the inevitable chaos of production environments. These practices transform reactive firefighting into predictable, manageable operations.
Key Takeaways
- Start with private clusters and Azure CNI Overlay to avoid painful networking migrations later—the initial complexity pays dividends in security and scalability
- Implement Workload Identity and network policies before your first production deployment, not after your first security audit
- Configure cluster autoscaler with scale-down delays of at least 10 minutes and use PodDisruptionBudgets on all stateful workloads to prevent cascading failures
- Combine Azure Monitor Container Insights for baseline metrics with Prometheus for custom application metrics—pick one as your alerting source to avoid confusion
- Use spot instances only for fault-tolerant batch workloads and set explicit resource requests/limits on every pod to enable accurate autoscaling and cost attribution