Building Production-Ready Microservices on Azure Kubernetes Service: A Practical Engineering Guide
Your Kubernetes cluster works perfectly in dev. The pods spin up, services discover each other, and your deployment pipeline runs green. Then you push to production, and suddenly you’re drowning in networking issues, mysterious pod evictions, and an Azure bill that makes your CFO schedule an “urgent sync.” Sound familiar?
After migrating three enterprise platforms to AKS—including a financial services system handling 50,000 transactions per minute—I’ve learned that the gap between kubectl apply and production-ready infrastructure is where most teams stumble. And here’s the uncomfortable truth: the failures rarely stem from Kubernetes itself. They come from misunderstanding how Azure’s managed Kubernetes implementation behaves under real production load.
The tutorials get you running. They don’t prepare you for the 3 AM page when your ingress controller starts dropping connections because you sized your node pools based on CPU metrics alone. They don’t mention that Azure’s CNI networking has fundamentally different IP exhaustion patterns than the overlay networks you tested in minikube. They certainly don’t cover the moment when autoscaling triggers a cascade of pod evictions because you forgot that AKS system pods have their own resource reservations.
This guide covers what comes after the tutorial—the production patterns, failure modes, and operational practices that separate a working cluster from a production-grade platform. We’re focusing on the decisions that matter when actual users depend on your infrastructure and actual money flows through your services.
Let’s start with why your AKS deployment is probably already misconfigured, and you don’t know it yet.
Why AKS Fails in Production (And It’s Not Kubernetes’ Fault)
You’ve deployed applications to Kubernetes before. You understand pods, services, and deployments. You’ve even run production workloads on self-managed clusters or other cloud providers. Then you spin up AKS, deploy your workloads, and within weeks you’re debugging mysterious node failures at 3 AM.

The problem isn’t Kubernetes. It’s the gap between AKS-as-a-managed-service and AKS-as-a-production-platform.
The Three Failure Modes Nobody Warns You About
Node pool misconfiguration causes the majority of early production incidents. Teams deploy a single system node pool, mix workload types indiscriminately, and wonder why their API servers compete for resources with background batch jobs. The default Standard_DS2_v2 nodes work fine for demos. In production, you’ll exhaust memory before CPU, leaving nodes in a degraded state that Kubernetes’ scheduler can’t reason about correctly.
Networking bottlenecks surface differently on Azure than on other platforms. The kubenet CNI seems simpler—until you hit the 400-node-per-cluster limit or discover that your pods can’t communicate with Azure PaaS services through private endpoints. Azure CNI solves these problems but requires careful IP address planning. Underestimate your CIDR allocation and you’ll face a cluster migration instead of a simple scale-up.
Resource quota exhaustion strikes without warning. Azure subscriptions come with regional vCPU limits that have nothing to do with your payment method. Your production cluster scales from 10 to 50 nodes during peak traffic—except when it doesn’t, because you hit the 20-vCPU default quota for your chosen VM family. The workloads queue, the pods stay pending, and your customers experience the outage.
The Mental Model Shift
Managed Kubernetes abstracts the control plane. It doesn’t abstract Azure.
Every AKS decision carries Azure-specific implications: subnet sizing affects pod density, availability zone placement determines failure domains, and managed identity configuration gates access to every dependent service. Engineers who treat AKS as generic Kubernetes miss these constraints until production exposes them.
💡 Pro Tip: Before your first production deployment, request quota increases for your target VM families across all regions you plan to use. This single action prevents the most common scaling failures.
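A quick way to see where you stand is the VM usage report, which lists consumed versus allotted vCPUs per family in a region. The DSv3 filter below is only an example; substitute the families you actually plan to run.

```bash
# Current vCPU consumption vs. regional quota
az vm list-usage --location eastus --output table

# Narrow to a specific family before requesting an increase
az vm list-usage --location eastus \
  --query "[?contains(name.value, 'DSv3')]" --output table
```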
The path forward requires intentional architecture decisions made before workloads go live. Starting with cluster topology—how you structure node pools, distribute across zones, and incorporate spot instances—determines whether your platform scales gracefully or fails spectacularly.
Cluster Architecture That Scales: Node Pools, Availability Zones, and Spot Instances
The default AKS cluster configuration creates a single node pool in a single availability zone. This works fine for development, but production traffic exposes its limitations within the first major incident. Strategic node pool design separates concerns, enables cost optimization, and prevents cascading failures. The architecture decisions you make here ripple through every deployment, affecting both operational stability and monthly Azure bills.

System vs User Node Pools
AKS runs critical system components—CoreDNS, metrics-server, kube-proxy—alongside your application workloads by default. When your application experiences memory pressure or runaway CPU usage, these system components compete for the same resources. A misbehaving deployment can starve DNS resolution, causing cascading failures across services that appear completely unrelated. The fix is architectural: dedicated system node pools.
```bash
# Create a system node pool with taints
az aks nodepool add \
  --resource-group rg-production \
  --cluster-name aks-production-eastus \
  --name systempool \
  --node-count 3 \
  --node-vm-size Standard_D4s_v3 \
  --mode System \
  --zones 1 2 3 \
  --node-taints CriticalAddonsOnly=true:NoSchedule
```
```bash
# Create user node pool for application workloads
az aks nodepool add \
  --resource-group rg-production \
  --cluster-name aks-production-eastus \
  --name workloadpool \
  --node-count 5 \
  --node-vm-size Standard_D8s_v3 \
  --mode User \
  --zones 1 2 3 \
  --enable-cluster-autoscaler \
  --min-count 3 \
  --max-count 20
```

The CriticalAddonsOnly taint ensures only system components schedule onto system nodes. Your application pods need explicit tolerations to land there—which they shouldn’t have. This separation guarantees that even during application-level resource exhaustion, cluster-critical services continue operating normally. Production teams commonly allocate Standard_D4s_v3 instances for system pools since these components have predictable, modest resource requirements.
Multi-AZ Deployment Patterns
Spreading nodes across availability zones provides fault tolerance against datacenter-level failures, but it also creates cross-zone traffic. Azure charges for cross-AZ data transfer, and this cost surprises teams running chatty microservices. A service making 10,000 requests per second to another service in a different zone accumulates meaningful egress charges—potentially hundreds of dollars monthly for high-throughput applications.
The mitigation strategy involves topology-aware routing. Configure your services to prefer same-zone communication:
```bash
# Enable topology-aware hints on your cluster
kubectl annotate service api-gateway \
  service.kubernetes.io/topology-mode=Auto
```

This tells Kubernetes to route traffic to endpoints in the same zone when capacity allows, falling back to cross-zone only when necessary. Monitor your cross-zone traffic using Azure Network Watcher to validate the configuration reduces egress as expected. Teams running latency-sensitive workloads see dual benefits: reduced costs and improved response times from eliminating cross-datacenter network hops.
Spot Instances for Batch Workloads
Spot instances cost 60-90% less than regular VMs but Azure can evict them with 30 seconds notice. Batch processing, CI/CD runners, and stateless data pipelines thrive on spot capacity. Production API servers do not. The key is matching workload characteristics to instance type—jobs that checkpoint progress and tolerate restarts belong on spot nodes.
```bash
az aks nodepool add \
  --resource-group rg-production \
  --cluster-name aks-production-eastus \
  --name spotpool \
  --priority Spot \
  --eviction-policy Delete \
  --spot-max-price -1 \
  --node-vm-size Standard_D8s_v3 \
  --enable-cluster-autoscaler \
  --min-count 0 \
  --max-count 50 \
  --node-taints kubernetes.azure.com/scalesetpriority=spot:NoSchedule
```

Setting spot-max-price to -1 means you’ll pay up to the on-demand price, maximizing availability while still benefiting from spot pricing when available. The taint prevents non-spot-tolerant workloads from accidentally scheduling onto evictable nodes. Always pair spot node pools with pod disruption budgets on your workloads to ensure graceful handling when Azure reclaims capacity, as in the sketch below.
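Here is a minimal sketch of that pairing: a spot-tolerant batch deployment plus a pod disruption budget. The workload name, namespace, image, and replica counts are hypothetical; the toleration key matches the taint applied to the spot pool above.

```yaml
# Hypothetical batch workload scheduled onto the spot pool
apiVersion: apps/v1
kind: Deployment
metadata:
  name: report-generator
  namespace: batch
spec:
  replicas: 4
  selector:
    matchLabels:
      app: report-generator
  template:
    metadata:
      labels:
        app: report-generator
    spec:
      tolerations:
        - key: kubernetes.azure.com/scalesetpriority
          operator: Equal
          value: spot
          effect: NoSchedule
      containers:
        - name: worker
          image: myregistry.azurecr.io/report-generator:1.0.0
---
# Keep at least two replicas running during node drains and scale-down
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: report-generator-pdb
  namespace: batch
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: report-generator
```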
Autoscaler Configuration That Works
The default autoscaler settings assume gentle, predictable load patterns. Real traffic spikes faster than default thresholds detect. Flash sales, viral content, and coordinated batch jobs can overwhelm a cluster before the autoscaler even registers the need for more capacity. Adjust the scan interval and scale-down delay:
```bash
az aks update \
  --resource-group rg-production \
  --cluster-name aks-production-eastus \
  --cluster-autoscaler-profile \
    scan-interval=10s \
    scale-down-delay-after-add=5m \
    scale-down-unneeded-time=5m \
    max-graceful-termination-sec=600
```

💡 Pro Tip: Set scale-down-delay-after-add to at least 5 minutes. Shorter values cause thrashing during traffic fluctuations, where the cluster scales up, immediately considers the new nodes underutilized, and scales down before pods fully distribute.
The combination of dedicated node pools, zone-aware routing, and properly-tuned autoscaling creates a cluster that handles traffic spikes without manual intervention. Test your autoscaler configuration under simulated load before relying on it in production—the difference between theory and practice often reveals configuration gaps. But cluster topology is only half the equation—network architecture determines whether your services can actually communicate reliably under load.
Networking Deep Dive: CNI Choices, Ingress, and Service Mesh Decisions
Networking decisions in AKS have long-term consequences. The wrong CNI choice leads to IP exhaustion during scaling events. A misconfigured ingress controller creates latency under load. Understanding these tradeoffs upfront saves weeks of remediation later.
Azure CNI vs Kubenet: Making the Right Call
The default recommendation is Azure CNI, but defaults serve average use cases—not yours. Here’s when to deviate.
Choose Kubenet when:
- Your VNet has limited IP space and you can’t expand it
- You’re running fewer than 400 nodes per cluster
- Pod-to-pod traffic stays within the cluster
Choose Azure CNI when:
- Pods need direct VNet IP addresses for compliance or legacy integration
- You require Windows node pools
- Network policies need Azure-native enforcement
The critical distinction: Kubenet uses NAT for pod networking, assigning IPs from a separate CIDR range. Azure CNI assigns VNet IPs directly to pods. A cluster with 50 nodes and 30 pods per node needs 1,500 IP addresses with Azure CNI—often more than teams anticipate.
💡 Pro Tip: Calculate your IP requirements for 3x your expected scale before choosing Azure CNI. VNet expansion during production incidents is painful.
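As a back-of-the-envelope sketch of that calculation, using the example numbers from above plus an assumed surge allowance for upgrades:

```bash
# Rough Azure CNI sizing: each node consumes one VNet IP plus one per schedulable pod
NODES=50        # expected node count (example from above)
MAX_PODS=30     # --max-pods per node
SURGE=5         # assumed extra nodes during upgrades or scale-out

REQUIRED=$(( (NODES + SURGE) * (MAX_PODS + 1) ))
echo "IPs needed at current scale: ${REQUIRED}"
echo "Subnet to plan for (3x headroom): $(( REQUIRED * 3 ))"
```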
Configuring NGINX Ingress Controller Properly
The NGINX Ingress Controller works well with Azure Load Balancer, but the default Helm installation creates a public-facing Standard Load Balancer. For production workloads, you need explicit control.
```yaml
controller:
  replicaCount: 3
  service:
    annotations:
      service.beta.kubernetes.io/azure-load-balancer-internal: "true"
      service.beta.kubernetes.io/azure-load-balancer-internal-subnet: "ingress-subnet"
    loadBalancerIP: 10.240.0.100
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 512Mi
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: ingress-nginx
          topologyKey: kubernetes.io/hostname
```

The podAntiAffinity rule ensures ingress controller replicas spread across nodes. Without this, a single node failure takes down your entire ingress layer.
Private Clusters and VNet Integration
Private clusters disable the public API server endpoint, routing all control plane traffic through a private endpoint in your VNet. This architecture requires:
- A jump box or VPN connection for kubectl access
- Private DNS zones linked to your VNet
- Firewall rules allowing Azure dependencies
```yaml
apiVersion: containerservice.azure.com/v1api
kind: ManagedCluster
metadata:
  name: prod-cluster-eastus
spec:
  apiServerAccessProfile:
    enablePrivateCluster: true
    privateDNSZone: /subscriptions/a1b2c3d4-e5f6-7890-abcd-ef1234567890/resourceGroups/dns-rg/providers/Microsoft.Network/privateDnsZones/privatelink.eastus.azmk8s.io
  networkProfile:
    networkPlugin: azure
    serviceCidr: 10.0.0.0/16
    dnsServiceIP: 10.0.0.10
```

The private DNS zone integration ensures cluster nodes resolve the API server correctly. Skipping this step creates intermittent connectivity failures that are difficult to diagnose.
When You Actually Need a Service Mesh
Service meshes add operational complexity. Before deploying Istio or Linkerd, verify you need these capabilities:
You need a service mesh when:
- mTLS between services is a compliance requirement
- You need traffic splitting for canary deployments beyond what Ingress provides
- Distributed tracing requires automatic context propagation
You don’t need a service mesh when:
- You have fewer than 20 services
- Network policies provide sufficient traffic control
- Your services already handle retries and timeouts in application code
The sidecar injection pattern adds 50-100MB of memory per pod and introduces latency for every request. For clusters running hundreds of pods, this overhead compounds quickly.
Most teams benefit from starting with Kubernetes-native network policies and NGINX Ingress, adding a service mesh only when specific requirements emerge. Premature adoption creates operational burden without corresponding value.
With networking foundations established, the next critical layer is security hardening—where Azure’s defaults leave significant gaps that attackers actively exploit.
Security Hardening: Beyond the Defaults
AKS ships with reasonable security defaults, but “reasonable” won’t satisfy your security team or pass a SOC 2 audit. The gap between a working cluster and a hardened production environment requires deliberate configuration across identity, network, and secrets management layers. Each layer addresses a specific attack vector, and together they create defense in depth that auditors actually respect.
Workload Identity: The Right Way to Authenticate Pods
Azure AD Workload Identity replaces the deprecated pod-managed identity system with a standards-based approach using OpenID Connect. Each pod gets a Kubernetes service account linked to an Azure managed identity, eliminating the need for stored credentials entirely. The federation happens at the Azure AD level, meaning no secrets exist in your cluster that could be extracted through etcd access or memory dumps.
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: order-service
  namespace: production
  annotations:
    azure.workload.identity/client-id: "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  namespace: production
spec:
  template:
    metadata:
      labels:
        azure.workload.identity/use: "true"
    spec:
      serviceAccountName: order-service
      containers:
        - name: order-service
          image: myregistry.azurecr.io/order-service:v2.1.0
```

The Azure SDK automatically picks up the federated token. Your application code needs zero changes—just ensure you’re using a current SDK version that supports workload identity. Token refresh happens transparently, and you get full Azure AD audit logs for every authentication event.
Network Policies That Won’t Break Everything
Default Kubernetes networking allows all pod-to-pod traffic. This fails any zero-trust audit immediately. An attacker who compromises a single pod gains unrestricted network access to every service in the cluster. Start with a default-deny policy, then explicitly allow required paths.
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-order-to-inventory
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: inventory-service
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: order-service
      ports:
        - protocol: TCP
          port: 8080
```

Building these policies incrementally prevents the classic “everything broke when we enabled network policies” scenario. Map your service dependencies first, create allow rules for each legitimate traffic flow, then enable the default-deny. AKS supports both Azure NPM and Calico for policy enforcement—Calico offers more granular controls if you need them.
💡 Pro Tip: Deploy network policies in a staging environment first with logging enabled. Use kubectl logs -n kube-system -l k8s-app=azure-npm to trace dropped packets before they become production incidents.
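One gap worth calling out: because the default-deny policy above also denies egress, pods lose DNS resolution unless you explicitly allow it. A minimal sketch of an egress rule permitting DNS to kube-system (assuming CoreDNS runs there, as it does on AKS by default):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: production
spec:
  podSelector: {}          # applies to every pod in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```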
Secrets Management with Key Vault CSI Driver
Storing secrets in Kubernetes etcd—even encrypted at rest—creates operational headaches around rotation and access auditing. The Azure Key Vault CSI driver mounts secrets directly from Key Vault into your pods, keeping sensitive data out of Kubernetes entirely.
```yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: order-service-secrets
  namespace: production
spec:
  provider: azure
  parameters:
    usePodIdentity: "false"
    useVMManagedIdentity: "false"
    clientID: "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
    keyvaultName: "prod-aks-keyvault"
    objects: |
      array:
        - |
          objectName: database-connection-string
          objectType: secret
        - |
          objectName: api-signing-key
          objectType: secret
    tenantId: "f8e7d6c5-b4a3-2190-fedc-ba9876543210"
```

Reference this in your deployment’s volume configuration, and secrets appear as files in the mounted path. Rotation happens automatically when Key Vault values update—no pod restart required if you’re reading from the filesystem rather than environment variables. You also inherit Key Vault’s access policies, soft-delete protection, and comprehensive audit logging.
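The volume wiring that paragraph refers to looks roughly like this. It is a sketch of the relevant fragment of the order-service deployment; the mount path is an assumption.

```yaml
spec:
  template:
    spec:
      serviceAccountName: order-service
      containers:
        - name: order-service
          image: myregistry.azurecr.io/order-service:v2.1.0
          volumeMounts:
            - name: secrets-store
              mountPath: /mnt/secrets     # secrets surface here as individual files
              readOnly: true
      volumes:
        - name: secrets-store
          csi:
            driver: secrets-store.csi.k8s.io
            readOnly: true
            volumeAttributes:
              secretProviderClass: order-service-secrets
```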
Pod Security Standards
Kubernetes deprecated PodSecurityPolicies in favor of Pod Security Standards enforced through admission controllers. Configure namespace-level enforcement to prevent containers from acquiring dangerous capabilities:
```bash
kubectl label namespace production \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/warn=restricted \
  pod-security.kubernetes.io/audit=restricted
```

The restricted profile blocks privilege escalation, host namespaces, and privileged containers. Most production workloads run fine under these constraints—if yours doesn’t, audit why before relaxing the policy. Common exceptions include monitoring agents that need host network access; handle these by isolating them in dedicated namespaces with appropriate policy levels.
These security layers compound. Workload identity removes credential theft vectors, network policies contain lateral movement, Key Vault centralizes secret governance, and pod security standards prevent container escapes. Together, they transform your cluster from “technically working” to “audit-ready.”
With security controls in place, you need visibility into what’s actually happening. That brings us to building an observability stack that surfaces real problems instead of drowning you in metrics.
Observability Stack: Metrics, Logs, and Traces That Actually Help
Production AKS clusters generate an overwhelming volume of telemetry data. The challenge isn’t collecting metrics—it’s building an observability stack that surfaces actionable insights without drowning your team in noise. A poorly configured observability layer creates its own problems: alert fatigue, spiraling costs, and dashboards that nobody trusts. Getting this right requires deliberate choices about what to measure, how long to retain it, and when to wake someone up.
Azure Monitor vs Prometheus/Grafana: Choose Both
Azure Monitor Container Insights provides immediate value with minimal setup: node performance, pod health, and container metrics flow automatically to your Log Analytics workspace. The integration with Azure’s alerting infrastructure means you can have meaningful notifications within hours of cluster creation. But senior engineers quickly hit its limitations—custom application metrics, cardinality constraints, and PromQL’s superior querying make Prometheus essential for serious workloads.
The pragmatic approach runs both systems in a complementary configuration. Use Azure Monitor for infrastructure-level visibility and alerting on Azure-native resources like node pools, load balancers, and managed disks. Deploy Prometheus for application metrics, custom dashboards, and workload-specific SLOs where you need fine-grained control over recording rules and alert expressions.
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/monitored: "true"
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
  namespaceSelector:
    matchNames:
      - production
      - staging
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-remote-write
  namespace: monitoring
data:
  remote-write.yaml: |
    remote_write:
      - url: https://my-workspace.eus.prometheus.monitor.azure.com/dataCollectionRules/dcr-abc123/streams/Microsoft-PrometheusMetrics/api/v1/write?api-version=2023-04-24
        azure_ad:
          cloud: AzurePublic
          managed_identity:
            client_id: 12345678-abcd-1234-efgh-567890abcdef
```

This configuration feeds Prometheus metrics into Azure Monitor’s managed Prometheus service, giving you Grafana dashboards backed by Azure’s retention and scaling. The managed identity authentication eliminates credential rotation headaches while maintaining secure access to the ingestion endpoint.
Container Insights Configuration That Prevents Alert Fatigue
Default Container Insights alerts trigger on symptoms, not causes. A pod restarting once at 3 AM isn’t an emergency—five restarts in ten minutes indicates a crash loop requiring attention. The difference between a well-rested on-call engineer and a burned-out one often comes down to alert quality rather than system stability.
Configure alert rules with appropriate thresholds and aggregation windows. The for clause prevents transient spikes from triggering pages, while expression-based grouping ensures related alerts arrive as a single notification rather than a flood:
```yaml
apiVersion: alerts.monitor.azure.com/v1
kind: PrometheusRuleGroup
metadata:
  name: aks-production-alerts
spec:
  interval: PT1M
  rules:
    - alert: PodCrashLooping
      expr: increase(kube_pod_container_status_restarts_total{namespace="production"}[10m]) > 3
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Pod {{ $labels.pod }} crash looping in {{ $labels.namespace }}"
    - alert: HighMemoryPressure
      expr: container_memory_working_set_bytes / container_spec_memory_limit_bytes > 0.85
      for: 15m
      labels:
        severity: warning
```

Consider implementing alert routing that respects business hours for warning-level notifications while ensuring critical alerts always reach the on-call rotation immediately.
Distributed Tracing Integration
OpenTelemetry has emerged as the standard for distributed tracing, replacing the fragmented landscape of vendor-specific SDKs. Configure your applications to export traces to Azure Monitor Application Insights while maintaining vendor portability through the collector pattern:
```yaml
exporters:
  azuremonitor:
    connection_string: InstrumentationKey=9a8b7c6d-1234-5678-9abc-def012345678;IngestionEndpoint=https://eastus-0.in.applicationinsights.azure.com/
  otlp:
    endpoint: jaeger-collector.monitoring:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, memory_limiter]
      exporters: [azuremonitor, otlp]
```

This dual-export pattern sends traces to both Azure Monitor for long-term retention and Jaeger for local development debugging. The memory_limiter processor prevents the collector from consuming excessive resources during traffic spikes.
Cost-Effective Log Retention
Log Analytics costs spiral when you retain everything at hot-tier pricing. Implement tiered retention: 30 days for interactive queries, archive to blob storage for compliance requirements. Understanding your actual query patterns helps identify which log categories need immediate access versus cold storage.
💡 Pro Tip: Use resource-specific tables instead of the legacy ContainerLog table. The ContainerLogV2 schema reduces ingestion costs by 30-50% and enables faster queries through structured JSON parsing.
Configure data collection rules to filter noise at ingestion—dropping verbose health check logs and debug-level traces before they consume your budget. Kubernetes liveness and readiness probe logs alone can account for 20% of total log volume in high-traffic clusters.
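As a sketch of that filtering, assuming the standard Container Insights agent ConfigMap (container-azm-ms-agentconfig in kube-system); the excluded namespaces are illustrative:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: container-azm-ms-agentconfig
  namespace: kube-system
data:
  schema-version: v1
  log-data-collection-settings: |-
    [log_collection_settings]
      [log_collection_settings.stdout]
        enabled = true
        exclude_namespaces = ["kube-system", "gatekeeper-system"]
      [log_collection_settings.stderr]
        enabled = true
        exclude_namespaces = ["kube-system"]
      [log_collection_settings.env_var]
        enabled = false
```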
With observability foundations in place, the next step is establishing reliable deployment pipelines. GitOps with ArgoCD provides the declarative, auditable deployment model that production workloads demand.
Deployment Strategies and GitOps with ArgoCD
Manual kubectl apply commands have no place in production Kubernetes. GitOps treats your Git repository as the single source of truth, enabling auditable deployments, automated drift detection, and one-click rollbacks. ArgoCD has emerged as the industry standard for Kubernetes GitOps, and its integration with Azure-native services makes it particularly effective on AKS. This section covers practical patterns for structuring Helm charts, configuring enterprise authentication, implementing progressive delivery, and managing the complex web of Azure CRD dependencies.
Helm Charts for AKS Workloads
Structure your Helm charts to handle Azure-specific requirements from the start. Rather than retrofitting Azure integrations later, design your chart templates with workload identity, Key Vault references, and ingress annotations as first-class concerns. Create environment-specific value files that configure Azure resources appropriately:
```yaml
replicaCount: 3

image:
  repository: myacr.azurecr.io/api-service
  tag: "1.4.2"

azure:
  workloadIdentity:
    enabled: true
    clientId: "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
  keyVault:
    name: "prod-secrets-kv"
    tenantId: "98765432-abcd-1234-efgh-567890abcdef"

ingress:
  className: azure-application-gateway
  annotations:
    appgw.ingress.kubernetes.io/ssl-redirect: "true"
    appgw.ingress.kubernetes.io/backend-protocol: "https"

resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 2000m
    memory: 2Gi

nodeSelector:
  agentpool: application
```

This separation keeps sensitive production values out of your base chart while enabling consistent templating across environments. Your staging environment might reference a different ACR, use smaller resource limits, and target a separate Key Vault—all without modifying the underlying templates.
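A hypothetical chart template consuming those values might look like the following. The file name and the conditional are illustrative; only the annotation key comes from the workload identity pattern shown earlier.

```yaml
# templates/serviceaccount.yaml (hypothetical)
{{- if .Values.azure.workloadIdentity.enabled }}
apiVersion: v1
kind: ServiceAccount
metadata:
  name: api-service          # often derived from a fullname helper in real charts
  annotations:
    azure.workload.identity/client-id: {{ .Values.azure.workloadIdentity.clientId | quote }}
{{- end }}
```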
ArgoCD with Azure AD Authentication
Deploy ArgoCD using its Helm chart with Azure AD OIDC configured for enterprise authentication. This integration allows your teams to authenticate using existing corporate credentials while maintaining granular RBAC policies based on Azure AD group memberships:
```yaml
configs:
  cm:
    url: https://argocd.mycompany.com
    oidc.config: |
      name: AzureAD
      issuer: https://login.microsoftonline.com/98765432-abcd-1234-efgh-567890abcdef/v2.0
      clientID: b2c3d4e5-f6a7-8901-bcde-f23456789012
      clientSecret: $oidc.azure.clientSecret
      requestedScopes:
        - openid
        - profile
        - email
  rbac:
    policy.csv: |
      g, [email protected], role:admin
      g, [email protected], role:readonly

server:
  ingress:
    enabled: true
    ingressClassName: azure-application-gateway
    hosts:
      - argocd.mycompany.com
    tls:
      - secretName: argocd-tls
        hosts:
          - argocd.mycompany.com
```

💡 Pro Tip: Store the Azure AD client secret in Azure Key Vault and sync it to Kubernetes using the CSI driver. ArgoCD reads secrets from Kubernetes, so this approach maintains security without exposing credentials in Git.
Progressive Delivery with Argo Rollouts
Argo Rollouts extends Kubernetes deployments with blue-green and canary strategies, providing fine-grained control over how new versions reach production traffic. For AKS workloads behind Application Gateway, configure traffic splitting using the Gateway API provider. This approach leverages native Kubernetes networking primitives rather than service mesh sidecars:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-service
spec:
  replicas: 5
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 5m}
        - setWeight: 30
        - pause: {duration: 5m}
        - setWeight: 60
        - pause: {duration: 10m}
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 2
      trafficRouting:
        plugins:
          argoproj-labs/gatewayAPI:
            httpRoute: api-service-route
            namespace: production
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      containers:
        - name: api
          image: myacr.azurecr.io/api-service:1.4.2
```

The analysis template queries Azure Monitor metrics to automatically roll back failed deployments. By integrating with Prometheus (or Azure Monitor’s Prometheus-compatible endpoint), you create a feedback loop that catches regressions before they impact all users:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 60s
      successCondition: result[0] >= 0.95
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus-server.monitoring:9090
          query: |
            sum(rate(http_requests_total{status=~"2..",app="api-service"}[5m]))
            /
            sum(rate(http_requests_total{app="api-service"}[5m]))
```

Managing Azure CRD Dependencies
Azure workload identity, Key Vault CSI driver, and AGIC install custom resource definitions that your applications depend on. Use ArgoCD sync waves to ensure infrastructure components deploy before applications:
```yaml
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "2"
```

Assign wave 0 to CRD installations, wave 1 to cluster-wide configurations like namespaces and SecretProviderClasses, and wave 2+ to application workloads. This ordering prevents race conditions where pods reference SecretProviderClass resources that don’t exist yet. Without proper wave sequencing, you’ll encounter intermittent deployment failures that pass on retry—a frustrating pattern that erodes confidence in your automation.
With GitOps delivering consistent, auditable deployments, the final piece of the production puzzle is ensuring you’re not overpaying for capacity you don’t need.
Cost Optimization: Running Lean Without Breaking Things
Azure bills add up faster than most teams anticipate. A single misconfigured deployment can double your monthly spend overnight. The good news: AKS provides multiple levers for cost control, and pulling them correctly yields 30-50% savings without sacrificing reliability.
Right-Sizing Resource Requests and Limits
The biggest cost leak in most clusters comes from poorly calibrated resource requests. Teams either over-provision out of caution (paying for idle capacity) or under-provision and trigger OOM kills and throttling.
Start by analyzing actual resource consumption over a two-week period using Azure Monitor’s container insights. Look for pods where requests exceed the 95th percentile of actual usage by more than 50%—these are candidates for trimming. Conversely, any pod hitting its limits regularly needs headroom.
A practical baseline: set requests at the 90th percentile of observed usage and limits at 150% of requests. This provides burst capacity while keeping your scheduler honest about actual cluster needs.
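Purely as an illustration of that baseline (the usage numbers are made up): a workload observed at roughly 300m CPU and 400Mi memory at the 90th percentile would be configured like this:

```yaml
resources:
  requests:
    cpu: 300m        # ~90th percentile of observed usage
    memory: 400Mi
  limits:
    cpu: 450m        # ~150% of requests for burst headroom
    memory: 600Mi
```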
💡 Pro Tip: Enable the Vertical Pod Autoscaler in recommendation mode first. It analyzes workload patterns and suggests optimal values without automatically applying changes—giving you data to make informed decisions.
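A minimal VPA manifest in recommendation mode might look like the following, assuming the VPA components are installed on the cluster and targeting the order-service deployment used earlier:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: order-service-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  updatePolicy:
    updateMode: "Off"   # recommendation mode: compute suggestions, never apply them
```

Read the suggested requests with kubectl describe vpa order-service-vpa -n production and fold them into your Helm values deliberately rather than letting the autoscaler apply them.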
Reserved Instances and Savings Plans
For baseline workloads that run 24/7, pay-as-you-go pricing is throwing money away. Azure Reserved Instances offer up to 72% discount for one or three-year commitments on specific VM SKUs.
The strategy: identify your minimum steady-state node count across all node pools. Purchase reservations covering 70-80% of this baseline, leaving the remainder on pay-as-you-go for flexibility. Combine this with Azure Savings Plans for compute, which provide more flexibility across VM families and regions.
Autoscaler Tuning for Cost Efficiency
The default cluster autoscaler configuration optimizes for availability, not cost. Adjust these parameters for leaner operations:
Set scale-down-delay-after-add to 5 minutes instead of the default 10—nodes become eligible for removal faster after scale-up events. Configure scale-down-unneeded-time to 5 minutes as well, so underutilized nodes get reclaimed promptly.
For non-production environments, enable scale-to-zero on node pools. Development and staging clusters sitting idle overnight represent pure waste.
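For example, a non-production user pool that already has the autoscaler enabled can be allowed to drop to zero nodes. The resource group, cluster, and pool names here are hypothetical:

```bash
# Let the dev workload pool scale to zero overnight
az aks nodepool update \
  --resource-group rg-development \
  --cluster-name aks-dev-eastus \
  --name workloadpool \
  --update-cluster-autoscaler \
  --min-count 0 \
  --max-count 5
```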
Monitoring Cost Anomalies
Implement Azure Cost Management alerts at the resource group level. Set thresholds at 80% and 100% of your expected monthly budget, with notifications going to both email and your team’s Slack channel. Weekly cost review meetings catch drift before it becomes a crisis.
Key Takeaways
- Start with separate system and user node pools from day one—retrofitting this architecture causes significant downtime
- Use Azure CNI with dynamic IP allocation for production workloads and plan IP space for 3x your expected scale; Kubenet’s node-count and private-endpoint limitations will eventually bite you
- Implement workload identity instead of pod identity for Azure resource access—it’s the modern approach with better security posture
- Deploy Prometheus alongside Azure Monitor Container Insights; the combination gives you both deep Kubernetes metrics and Azure integration
- Configure the cluster autoscaler with scale-down delays of at least 5 minutes to prevent thrashing during traffic spikes