From Zero to Production: Deploying Your First AKS Cluster with Real-World Defaults
You’ve spun up a dozen Kubernetes clusters following Microsoft’s quickstart guides, only to rebuild them weeks later when security reviews flag missing configurations. The public API server endpoint that seemed fine during development suddenly becomes a blocker. The default CNI that “just worked” can’t support the network policies your compliance team requires. The cluster that took fifteen minutes to create takes three sprints to harden.
This pattern repeats across organizations because Azure’s quickstart defaults optimize for the wrong thing: time to first deployment. Microsoft wants you to see pods running within minutes, which means skipping the configurations that matter for anything beyond a demo. Public endpoints, basic networking, no monitoring integration, permissive RBAC—these defaults create clusters that work immediately and fail predictably.
The real cost isn’t the rebuild itself. It’s the workloads you’ve already deployed, the CI/CD pipelines pointing at the old cluster, the debugging sessions where “it worked in dev” becomes your team’s mantra. Every tutorial cluster that graduates to staging carries technical debt that compounds with each deployment.
Production-ready means something specific: a cluster configuration that passes security review on first submission, supports your networking requirements without migration, and provides observability from day one. It means making decisions upfront about identity, networking, and monitoring that align with where your infrastructure needs to be in six months, not where the tutorial assumes you are today.
The gap between quickstart and production-ready isn’t complexity—it’s intentionality. Let’s look at exactly where the default configurations fall short and what they assume you’ll figure out later.
Why Most AKS Quickstarts Set You Up for Failure
Run az aks create with the defaults, deploy a sample nginx pod, and celebrate—you’ve got Kubernetes running in Azure. The official tutorials walk you through this in under ten minutes. What they don’t mention is that you’ve just deployed a cluster that would fail any reasonable security audit.

The gap between “it works” and “it’s production-ready” in AKS is substantial, and Microsoft’s quickstart documentation optimizes for the former. This isn’t a criticism—tutorials serve a purpose. But treating a tutorial cluster as a production foundation creates technical debt that compounds with every workload you deploy.
What the Defaults Actually Give You
A vanilla az aks create command provisions a cluster with a publicly accessible API server. Anyone on the internet can attempt to authenticate against your Kubernetes control plane. The cluster uses kubenet networking, which lacks the pod-level network policy enforcement that Azure CNI provides. There’s no Azure Monitor integration, meaning you’re flying blind on container metrics and logs. Managed identity isn’t configured, so you’re dealing with service principal secrets that need rotation. And the default node pool runs a single VM size with no separation between system workloads and your applications.
Each of these defaults makes sense for a learning environment. None of them belong in production.
The Hidden Costs
The real problem isn’t the initial deployment—it’s what happens six months later. That tutorial cluster now runs three microservices, handles actual customer traffic, and nobody remembers why the API server is public. Retrofitting network policies onto a cluster designed without them requires careful planning. Migrating from kubenet to Azure CNI often means reprovisioning the entire cluster. The monitoring gap means your first indication of resource exhaustion is a customer complaint, not an alert.
Teams that start with tutorial defaults spend more time remediating than teams that start with production configurations. The “move fast” approach becomes “move fast, then stop everything to fix the foundation.”
What Production-Ready Actually Means
For AKS, production readiness has specific requirements: private API server endpoints, Azure CNI with network policies, integrated monitoring and logging, managed identities for Azure resource access, separated node pools for system and user workloads, and defined autoscaling boundaries. These aren’t aspirational features—they’re baseline expectations for any cluster handling real traffic.
Before diving into implementation, you need to understand where the responsibility boundaries lie. AKS is a managed service, but “managed” doesn’t mean “fully operated.” Let’s examine exactly what Azure handles and what remains your responsibility.
AKS Architecture: What Azure Manages vs What You Own
Understanding the responsibility boundary in AKS determines whether you’ll spend your time fighting infrastructure or shipping features. Unlike self-managed Kubernetes, AKS abstracts the control plane entirely—but that abstraction comes with trade-offs you need to understand before your first production deployment.

The Control Plane: Microsoft’s Domain
Azure fully manages the Kubernetes control plane: the API server, etcd, scheduler, and controller manager. You never SSH into these components, patch them, or worry about their high availability. Microsoft handles upgrades and security patches, and on the Standard tier backs the control plane with a 99.95% uptime SLA for zone-redundant clusters.
This is genuinely hands-off. The control plane scales automatically based on cluster size, and on the Free tier it adds nothing to your node costs; the Standard tier charges a modest per-cluster fee in exchange for the SLA. However, you sacrifice visibility. You can’t tune etcd performance, adjust API server flags, or access control plane logs directly (though diagnostic settings can forward control plane logs and metrics to Azure Monitor).
💡 Pro Tip: The free control plane sounds great until you need custom admission controllers or API server audit policies. AKS supports these through Azure Policy and diagnostic settings, but the configuration paths differ from vanilla Kubernetes documentation.
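Forwarding those control plane logs is a single diagnostic-settings call. Here is a minimal sketch, assuming $AKS_ID and $WORKSPACE_ID hold the cluster and Log Analytics workspace resource IDs (both placeholders):
# Sketch: stream API server and audit logs to Log Analytics ($AKS_ID and $WORKSPACE_ID are placeholders)
az monitor diagnostic-settings create \
  --name aks-control-plane-logs \
  --resource "$AKS_ID" \
  --workspace "$WORKSPACE_ID" \
  --logs '[{"category": "kube-apiserver", "enabled": true}, {"category": "kube-audit", "enabled": true}]'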
Node Pools: Shared Responsibility Territory
Your worker nodes run on Azure VMs that you select, pay for, and partially manage. You choose the VM SKU, disk type, and networking configuration. Azure handles the underlying hypervisor, but you own:
- OS patching: Node images need regular updates. AKS provides node image upgrades, but you trigger them.
- Scaling policies: Cluster autoscaler configuration, including min/max nodes and scale-down behavior.
- Workload scheduling: Taints, tolerations, and node affinity rules that determine pod placement.
- Container runtime security: While Azure manages containerd, you configure pod security standards and runtime policies.
Node pools also define your blast radius. A misconfigured system node pool takes down cluster-critical components like CoreDNS. Separating system and user workloads into dedicated pools isn’t optional for production—it’s a requirement.
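The OS patching point above deserves a concrete command. A minimal sketch of triggering a node image upgrade for one pool, with placeholder resource group, cluster, and pool names:
# Sketch: roll the latest node image onto a single pool (names are placeholders)
az aks nodepool upgrade \
  --resource-group "$RESOURCE_GROUP" \
  --cluster-name "$CLUSTER_NAME" \
  --name workloads \
  --node-image-only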
Networking: The Integration Minefield
AKS networking creates the deepest Azure dependencies. Your cluster requires a virtual network, subnet allocation, and decisions about CNI plugins that affect everything from IP address consumption to network policy enforcement.
The default kubenet plugin simplifies initial setup but limits you to 400 nodes and complicates Azure service integration. Azure CNI assigns pod IPs directly from your subnet, enabling direct communication with other Azure resources but consuming IP addresses rapidly. A /24 subnet that looks generous disappears quickly when every pod needs a routable IP.
Cost Implications
Architectural choices compound financially. Larger VM SKUs reduce node count but increase blast radius. Premium SSDs improve I/O performance for stateful workloads but can triple storage costs. The control plane costs little next to your nodes, but every add-on—Azure Policy, Key Vault integration, Defender for Containers—adds either direct costs or compute overhead on your nodes.
With this mental model established, you’re ready to provision a cluster that reflects production requirements from the start.
Provisioning AKS with Production Defaults Using Azure CLI
The default az aks create command produces a cluster that’s technically functional but operationally deficient. It uses kubenet networking, creates a public API server endpoint, and provisions a service principal that requires manual credential rotation. These defaults optimize for getting-started friction, not production operations.
Here’s a complete provisioning script that establishes security baselines from the start:
#!/bin/bash
set -euo pipefail

# Configuration
RESOURCE_GROUP="rg-myapp-prod-eastus"
CLUSTER_NAME="aks-myapp-prod"
LOCATION="eastus"
VNET_NAME="vnet-myapp-prod"
SUBNET_NAME="snet-aks-nodes"
K8S_VERSION="1.29"

# Create resource group
az group create \
  --name $RESOURCE_GROUP \
  --location $LOCATION

# Create VNet with dedicated AKS subnet
az network vnet create \
  --resource-group $RESOURCE_GROUP \
  --name $VNET_NAME \
  --address-prefixes 10.0.0.0/16 \
  --subnet-name $SUBNET_NAME \
  --subnet-prefixes 10.0.0.0/22

# Get subnet ID for CNI configuration
SUBNET_ID=$(az network vnet subnet show \
  --resource-group $RESOURCE_GROUP \
  --vnet-name $VNET_NAME \
  --name $SUBNET_NAME \
  --query id -o tsv)

# Provision AKS with production defaults
az aks create \
  --resource-group $RESOURCE_GROUP \
  --name $CLUSTER_NAME \
  --location $LOCATION \
  --kubernetes-version $K8S_VERSION \
  --node-count 3 \
  --node-vm-size Standard_D4s_v5 \
  --network-plugin azure \
  --network-policy azure \
  --vnet-subnet-id $SUBNET_ID \
  --service-cidr 10.1.0.0/16 \
  --dns-service-ip 10.1.0.10 \
  --enable-private-cluster \
  --private-dns-zone system \
  --enable-managed-identity \
  --enable-aad \
  --enable-azure-rbac \
  --enable-defender \
  --enable-workload-identity \
  --enable-oidc-issuer \
  --auto-upgrade-channel stable \
  --node-os-upgrade-channel NodeImage \
  --tier standard \
  --zones 1 2 3

Why Azure CNI Over Kubenet
The --network-plugin azure flag configures Azure CNI, which assigns VNet IP addresses directly to pods. This eliminates the NAT layer that kubenet requires and enables direct pod-to-pod communication across VNets through peering. Your pods become first-class VNet citizens with routable IPs, which matters when integrating with Azure services that require VNet connectivity—Azure SQL with private endpoints, Azure Storage with service endpoints, or on-premises resources through ExpressRoute.
The /22 subnet prefix provides 1,024 addresses. Azure CNI reserves IPs per node (30 pods by default plus the node itself), so plan accordingly. A three-node cluster with default settings consumes roughly 93 IPs, leaving room for scaling and pod churn.
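A quick back-of-envelope check makes that sizing concrete. This illustrative snippet assumes the default 30 pods per node; adjust MAX_PODS if you change --max-pods:
# Illustrative IP budgeting for Azure CNI: each node reserves max-pods + 1 addresses
MAX_PODS=30
for NODES in 3 10 20; do
  echo "$NODES nodes -> $(( NODES * (MAX_PODS + 1) )) reserved IPs"
done
# 93, 310, and 620 respectively — comfortable in a /22 (~1,019 usable IPs), tight in a /24 (~251)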
The --network-policy azure flag enables Azure Network Policy Manager, providing native Kubernetes network policies for microsegmentation. This allows you to define ingress and egress rules at the pod level, restricting traffic between namespaces or specific workloads without deploying third-party solutions like Calico.
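To make those pod-level rules concrete, here is a hypothetical policy (namespace, labels, and port are placeholders) that admits ingress to backend API pods only from namespaces labeled team=frontend:
# Hypothetical policy: only namespaces labeled team=frontend may reach pods labeled app=api on port 8080
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-allow-frontend
  namespace: team-backend
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          team: frontend
    ports:
    - protocol: TCP
      port: 8080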
Private Cluster Endpoints
The --enable-private-cluster flag removes the public API server endpoint entirely. The Kubernetes API becomes accessible only through a private endpoint within your VNet. This single flag eliminates an entire attack surface—no amount of RBAC misconfiguration exposes your control plane to the internet.
The --private-dns-zone system setting creates an Azure-managed private DNS zone for cluster name resolution. For multi-cluster environments sharing a VNet, consider using a custom private DNS zone instead to maintain consistent naming across clusters.
💡 Pro Tip: Private clusters require network connectivity for management. Set up Azure Bastion or a jump box VM before provisioning, or use az aks command invoke for emergency access.
Managed Identity Configuration
The combination of --enable-managed-identity, --enable-workload-identity, and --enable-oidc-issuer establishes a credential-free security model. The cluster’s system-assigned managed identity handles Azure resource operations without storing credentials. Workload identity extends this pattern to your applications, allowing pods to authenticate to Azure services using Kubernetes service account tokens federated with Entra ID.
This eliminates service principal credentials that expire after one year and require rotation scripts. Managed identities rotate credentials automatically and never expose secrets to your deployment pipelines. The OIDC issuer publishes discovery documents that Azure can verify, enabling secure token exchange without shared secrets.
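On the cluster side, the wiring is a service account annotation plus a pod label. A minimal sketch follows; the client ID is a placeholder, and the corresponding user-assigned identity still needs a federated credential configured on the Azure side:
# Sketch: service account federated with an Entra ID identity (client ID is a placeholder)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: myapp-sa
  namespace: team-frontend
  annotations:
    azure.workload.identity/client-id: "00000000-0000-0000-0000-000000000000"
---
# Pods opt in with the workload identity label and reference the service account
apiVersion: v1
kind: Pod
metadata:
  name: myapp
  namespace: team-frontend
  labels:
    azure.workload.identity/use: "true"
spec:
  serviceAccountName: myapp-sa
  containers:
  - name: app
    image: mcr.microsoft.com/azuredocs/aks-helloworld:v1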
Azure RBAC Integration
The --enable-aad and --enable-azure-rbac flags replace Kubernetes-native RBAC with Azure role assignments. Cluster access integrates directly with Entra ID—the same groups controlling your Azure subscriptions now control kubectl access. This consolidates identity management and provides audit logs in Azure Activity Log rather than scattered across cluster logs.
Azure RBAC for Kubernetes supports built-in roles like Azure Kubernetes Service RBAC Reader, Writer, Admin, and Cluster Admin, but you can also create custom role definitions scoped to specific namespaces or resource types.
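Assigning one of those roles is an ordinary Azure role assignment. A sketch scoped to a single namespace, with the group object ID left as a placeholder variable:
# Sketch: grant an Entra ID group write access to one namespace via Azure RBAC
AKS_ID=$(az aks show --resource-group rg-myapp-prod-eastus --name aks-myapp-prod --query id -o tsv)
az role assignment create \
  --assignee "$GROUP_OBJECT_ID" \
  --role "Azure Kubernetes Service RBAC Writer" \
  --scope "$AKS_ID/namespaces/team-frontend"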
Security and Reliability Flags
The --enable-defender flag activates Microsoft Defender for Containers, providing runtime threat detection, vulnerability scanning for container images, and security recommendations specific to your cluster configuration. This integrates with Microsoft Defender for Cloud for centralized security management.
The --tier standard flag enables the uptime SLA (99.95% for zone-redundant clusters) and unlocks features like longer cluster support windows. Production workloads should never run on the free tier, which offers no availability guarantees and limited support options.
The --zones 1 2 3 flag distributes nodes across availability zones, ensuring that zone-level failures don’t take down your entire cluster. Combined with the standard tier, this configuration maximizes control plane and worker node resilience.
With the control plane provisioned, the next step is configuring node pools that balance cost, performance, and reliability across your workload types.
Node Pool Strategy: Sizing, Scaling, and Spot Instances
Node pools are where your Kubernetes workloads actually run, and getting them wrong creates problems that ripple through your entire infrastructure. The default AKS setup gives you a single node pool running everything from system components to your applications—a configuration that works for demos but fails in production.
System vs User Node Pools
AKS runs critical system components—CoreDNS, metrics-server, and the Azure CNI plugin—alongside your workloads by default. When your application pods consume excessive resources or trigger node scaling events, these system components get caught in the crossfire.
The fix is straightforward: dedicate a node pool to system workloads and run applications on separate user node pools.
# Create a dedicated system node pool with taints
az aks nodepool add \
  --resource-group rg-aks-prod \
  --cluster-name aks-prod-westus2 \
  --name system \
  --node-count 3 \
  --node-vm-size Standard_D2s_v5 \
  --mode System \
  --node-taints CriticalAddonsOnly=true:NoSchedule \
  --zones 1 2 3

# Create a user node pool for application workloads
az aks nodepool add \
  --resource-group rg-aks-prod \
  --cluster-name aks-prod-westus2 \
  --name workloads \
  --node-count 3 \
  --node-vm-size Standard_D4s_v5 \
  --mode User \
  --enable-cluster-autoscaler \
  --min-count 3 \
  --max-count 10 \
  --zones 1 2 3

The CriticalAddonsOnly taint ensures only system components with matching tolerations schedule onto system nodes. Your application deployments need no changes—they simply won’t schedule on system nodes.
Choosing VM Sizes
VM selection depends on your workload profile, not abstract best practices. General-purpose D-series VMs (D4s_v5, D8s_v5) handle most web applications and APIs well. Memory-intensive workloads like Redis or Elasticsearch benefit from E-series VMs. Compute-heavy batch processing runs better on F-series.
Start with D4s_v5 (4 vCPU, 16 GB RAM) for user node pools. This size provides enough headroom for typical containerized applications while keeping per-node costs reasonable. Smaller VMs increase scheduling fragmentation; larger VMs create blast radius problems when nodes fail.
Configuring the Cluster Autoscaler
The autoscaler responds to pending pods that can’t be scheduled due to resource constraints. Set bounds that prevent both runaway costs and capacity starvation:
az aks update \
  --resource-group rg-aks-prod \
  --cluster-name aks-prod-westus2 \
  --cluster-autoscaler-profile \
    scale-down-delay-after-add=10m \
    scale-down-unneeded-time=10m \
    max-graceful-termination-sec=600 \
    scan-interval=10s

The scale-down-delay-after-add parameter prevents thrashing during deployment rollouts. Without it, the autoscaler might add nodes during a deployment, then immediately try to remove them once pods redistribute.
💡 Pro Tip: Set max-count based on your Azure subscription quotas. The autoscaler fails silently when hitting vCPU limits, leaving pods pending indefinitely.
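A quick way to see how much headroom you actually have before raising max-count (the family name to filter on depends on your chosen VM size):
# Check regional vCPU quota for the VM family backing your node pools
az vm list-usage --location eastus -o table | grep -i "DSv5"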
Spot Instances for Cost Optimization
Spot VMs offer 60-90% cost savings, but Azure can evict them with 30 seconds’ notice. Use them for workloads that tolerate interruption: batch jobs, dev/test environments, or stateless workers behind a queue.
az aks nodepool add \
  --resource-group rg-aks-prod \
  --cluster-name aks-prod-westus2 \
  --name spotworkers \
  --node-count 0 \
  --node-vm-size Standard_D4s_v5 \
  --priority Spot \
  --eviction-policy Delete \
  --spot-max-price -1 \
  --enable-cluster-autoscaler \
  --min-count 0 \
  --max-count 20 \
  --node-taints kubernetes.azure.com/scalesetpriority=spot:NoSchedule

Setting spot-max-price to -1 accepts the current spot price, maximizing availability. The taint requires workloads to explicitly tolerate spot placement, preventing accidental scheduling of critical services onto evictable nodes.
Handle evictions by setting appropriate terminationGracePeriodSeconds on your pods and implementing graceful shutdown logic. Kubernetes receives the eviction notice and begins draining—your application has seconds, not minutes, to clean up.
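Here is a sketch of what opting into the spot pool looks like, assuming the taint from the nodepool command above and the matching label AKS applies to spot nodes; the workload name and image are placeholders:
# Sketch: an interruption-tolerant worker pinned to the spot pool (names and image are placeholders)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker
  namespace: team-frontend
spec:
  replicas: 4
  selector:
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
    spec:
      terminationGracePeriodSeconds: 25       # stay inside the ~30-second eviction notice
      tolerations:
      - key: kubernetes.azure.com/scalesetpriority
        operator: Equal
        value: spot
        effect: NoSchedule
      nodeSelector:
        kubernetes.azure.com/scalesetpriority: spot   # label AKS applies to spot nodes
      containers:
      - name: worker
        image: mcr.microsoft.com/azuredocs/aks-helloworld:v1
        resources:
          requests:
            cpu: 250m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi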
With node pools configured for isolation, scaling, and cost efficiency, you need visibility into what’s actually happening inside the cluster. Azure Monitor and Container Insights provide that observability layer.
Integrating Azure Monitor and Container Insights
Deploying a cluster without observability is deploying blind. Container Insights provides deep visibility into your AKS workloads—CPU, memory, container restarts, pod scheduling failures—all flowing into Log Analytics where you can query, alert, and diagnose. Enable it at cluster creation, not as an afterthought. The cost of retrofitting monitoring onto a production cluster under pressure far exceeds the minimal effort of configuring it upfront.
Enabling Container Insights at Cluster Creation
The cleanest approach creates a Log Analytics workspace and enables monitoring in a single provisioning command:
# Create a dedicated Log Analytics workspace
az monitor log-analytics workspace create \
  --resource-group prod-aks-rg \
  --workspace-name prod-aks-logs \
  --location eastus \
  --retention-in-days 30

# Get the workspace resource ID
WORKSPACE_ID=$(az monitor log-analytics workspace show \
  --resource-group prod-aks-rg \
  --workspace-name prod-aks-logs \
  --query id -o tsv)

# Create cluster with Container Insights enabled
az aks create \
  --resource-group prod-aks-rg \
  --name prod-cluster \
  --enable-addons monitoring \
  --workspace-resource-id $WORKSPACE_ID \
  --generate-ssh-keys

For existing clusters, enable the addon with az aks enable-addons --addons monitoring --workspace-resource-id $WORKSPACE_ID. The monitoring agent deploys as a DaemonSet, consuming approximately 50MB of memory per node—negligible overhead for the visibility it provides.
Key Metrics to Monitor from Day One
Container Insights surfaces metrics through both Azure Monitor and Prometheus. Focus on these from the start:
- Node CPU/Memory utilization: Sustained usage above 80% signals scaling needs. Spikes are normal; sustained pressure is not.
- Pod restart counts: Frequent restarts indicate crash loops or OOM kills. Even a single restart warrants investigation.
- Pending pods: Pods stuck in Pending state point to resource constraints, node affinity conflicts, or scheduling issues.
- Container working set memory: Tracks actual memory consumption against limits. When working set approaches limits, OOM kills follow.
- Network I/O per pod: Unusual spikes can indicate misconfigured services or potential security incidents.
Log Analytics Queries That Surface Real Problems
Raw logs are noise. These KQL queries extract signal:
// Pods in failed states (CrashLoopBackOff, Error, OOMKilled)
KubePodInventory
| where PodStatus in ("Failed", "Unknown")
| summarize count() by Name, Namespace, PodStatus
| order by count_ desc

// Containers hitting memory limits
Perf
| where ObjectName == "K8SContainer" and CounterName == "memoryWorkingSetBytes"
| summarize AvgMemory = avg(CounterValue) by InstanceName
| where AvgMemory > 500000000

// Node resource pressure over time
KubeNodeInventory
| where TimeGenerated > ago(24h)
| summarize avg(ClusterCpuCapacity), avg(ClusterMemoryCapacity) by bin(TimeGenerated, 1h)

Save these queries as functions in your Log Analytics workspace. When an incident occurs at 3 AM, you want answers in seconds, not time spent remembering query syntax.
Alert Rules That Actually Matter
Skip the default Azure alert templates—they generate noise. Configure these three alerts manually:
- Node NotReady for 5+ minutes: Critical. Indicates node failure or kubelet issues. This alert should wake someone up.
- Pod restart count > 5 in 10 minutes: Warning. Catches crash loops before they impact users.
- Persistent volume usage > 85%: Warning. PVC exhaustion causes application failures that are painful to debug retroactively.
az monitor metrics alert create \
  --name "node-not-ready" \
  --resource-group prod-aks-rg \
  --scopes "/subscriptions/a1b2c3d4-5678-90ab-cdef-1234567890ab/resourceGroups/prod-aks-rg/providers/Microsoft.ContainerService/managedClusters/prod-cluster" \
  --condition "avg kube_node_status_condition < 1" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --severity 1

Configure action groups to route critical alerts to PagerDuty or your on-call system. Warnings can go to Slack or Teams. The goal is signal without fatigue—an alert that fires constantly gets ignored.
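Routing those alerts requires an action group. A hedged sketch, with an email receiver standing in for your paging integration (names and address are placeholders):
# Sketch: action group for critical alerts (email receiver as a stand-in for PagerDuty)
az monitor action-group create \
  --resource-group prod-aks-rg \
  --name oncall-critical \
  --short-name oncall \
  --action email oncall-primary oncall@example.com
# Attach it by adding "--action oncall-critical" to the metrics alert create command above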
💡 Pro Tip: Keep Log Analytics retention at 30 days for production. That window balances cost against the need to investigate incidents that surface days after they begin; increase it to 90 days if compliance requires longer audit trails.
With observability in place, you have the visibility needed to deploy workloads confidently. The next section covers deploying your first application with proper RBAC—because running containers as cluster-admin is how breaches happen.
Deploying Your First Workload with Proper RBAC
You have a cluster. Now the real work begins: deploying applications without creating security debt you’ll spend months paying down. Most tutorials skip RBAC entirely or show you kubectl create clusterrolebinding cluster-admin — the Kubernetes equivalent of running everything as root. This section establishes the patterns that separate production-grade deployments from tutorial-quality experiments.
Connecting kubectl with Azure AD Authentication
Start by configuring kubectl to authenticate through Azure AD rather than using the local admin credentials. The local admin kubeconfig bypasses all RBAC policies and audit logging — convenient for break-glass scenarios, dangerous for daily operations:
az aks get-credentials --resource-group rg-aks-prod --name aks-prod-eastus --overwrite-existing
# Verify you're using Azure AD authentication
kubectl get nodes

The first kubectl command triggers an Azure AD login flow. This authentication integrates with your existing identity provider, enabling conditional access policies and MFA enforcement at the cluster level. Every API call now carries user identity through to the audit logs, creating the accountability trail that security teams require.
If you encounter authentication failures, verify that your Azure AD account belongs to a group with cluster access. The AKS-managed Azure AD integration requires explicit group assignments — no default access exists for tenant members.
Creating Namespaces with Resource Quotas
Never let teams deploy to a namespace without resource boundaries. A single misconfigured deployment with no memory limits brings down your entire node. Worse, it affects every other workload scheduled on that node, turning one team’s mistake into a cluster-wide incident:
apiVersion: v1
kind: Namespace
metadata:
  name: team-frontend
  labels:
    team: frontend
    environment: production
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-frontend-quota
  namespace: team-frontend
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "20"
    services: "10"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: team-frontend-limits
  namespace: team-frontend
spec:
  limits:
  - default:
      cpu: 500m
      memory: 512Mi
    defaultRequest:
      cpu: 100m
      memory: 128Mi
    type: Container

The ResourceQuota enforces hard caps on aggregate consumption within the namespace. When teams hit these limits, Kubernetes rejects new pod creation rather than allowing unbounded growth. The LimitRange complements this by automatically injecting resource requests and limits into pods that don’t specify them, preventing accidental resource exhaustion from developers who forget to add constraints.
Size these quotas based on actual team needs. Start conservative — you can always increase limits, but reducing them after teams have deployed workloads creates friction.
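Reviewing actual consumption against the quota is one command, which keeps those resize conversations data-driven:
# Compare current usage against the hard limits before adjusting the quota
kubectl describe resourcequota team-frontend-quota -n team-frontend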
Minimal RBAC Roles for Development Teams
Grant the minimum permissions teams need. This role allows developers to manage their workloads without touching cluster-wide resources or other teams’ namespaces:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: team-frontend
  name: developer
rules:
- apiGroups: ["", "apps", "batch"]
  resources: ["deployments", "pods", "services", "configmaps", "secrets", "jobs", "cronjobs"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: [""]
  resources: ["pods/log", "pods/exec"]
  verbs: ["get", "create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: frontend-developers
  namespace: team-frontend
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: developer
subjects:
- kind: Group
  name: "a1b2c3d4-5678-90ab-cdef-1234567890ab" # Azure AD group ID
  apiGroup: rbac.authorization.k8s.io

Replace the group ID with your Azure AD group’s object ID. Binding to groups rather than individual users simplifies onboarding and offboarding — add someone to the Azure AD group, and they immediately inherit the appropriate cluster permissions. Remove them from the group, and access revokes within minutes as tokens expire.
Notice what this role excludes: no access to persistent volumes, no ability to create ingresses, no permission to modify network policies. Teams escalate through tickets when they need broader access, creating an audit trail and forcing intentional decisions about privilege expansion.
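You can spot-check the boundary from a developer's session with kubectl auth can-i; under the role above, the first command should answer yes and the other two no:
# Verify the role grants what it should and nothing more
kubectl auth can-i create deployments -n team-frontend
kubectl auth can-i create networkpolicies -n team-frontend
kubectl auth can-i list pods --all-namespaces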
Deploying with Proper Security Context
Your first deployment should enforce security constraints that become standard across all workloads. These settings prevent container escapes and limit blast radius if an application gets compromised:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-api
  namespace: team-frontend
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sample-api
  template:
    metadata:
      labels:
        app: sample-api
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: api
        image: mcr.microsoft.com/azuredocs/aks-helloworld:v1
        ports:
        - containerPort: 80
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop: ["ALL"]
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 250m
            memory: 256Mi

The pod-level securityContext establishes baseline constraints for all containers. The container-level settings then tighten restrictions further: dropping all Linux capabilities, preventing privilege escalation, and making the root filesystem read-only. Combined, these settings significantly reduce the attack surface available to compromised workloads.
💡 Pro Tip: The readOnlyRootFilesystem: true setting breaks applications that write to the container filesystem. Mount an emptyDir volume for /tmp or application-specific paths that require write access. This is a feature, not a bug — it forces explicit decisions about what your containers can modify.
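As a sketch, giving the sample-api container above a writable /tmp is a small fragment added to its pod spec; the sizeLimit is an optional, illustrative cap:
# Fragment for the sample-api Deployment: writable /tmp despite readOnlyRootFilesystem
      containers:
      - name: api
        volumeMounts:
        - name: tmp
          mountPath: /tmp
      volumes:
      - name: tmp
        emptyDir:
          sizeLimit: 64Mi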
Apply these manifests in order — namespace first, then RBAC, then the deployment:
kubectl apply -f namespace-team-frontend.yaml
kubectl apply -f rbac-team-frontend.yaml
kubectl apply -f deployment-sample-app.yaml

Verify the deployment succeeded and pods are running with expected constraints:
kubectl get pods -n team-frontend
kubectl describe pod -n team-frontend -l app=sample-api

You now have a workload running with enforced resource boundaries, least-privilege RBAC, and a hardened security context. These patterns scale — every new team gets their own namespace with identical guardrails. Document your namespace provisioning process and consider automating it through GitOps or a self-service portal.
With workloads deployed securely, you’re ready to move beyond manual kubectl operations. The next section covers evolving this foundation toward GitOps workflows and the networking configurations that production traffic demands.
What’s Next: Graduating to GitOps and Advanced Networking
You now have a production-grade AKS cluster with proper monitoring, RBAC, and node pool strategy. But running Kubernetes at scale demands more than solid initial configuration—it requires operational patterns that grow with your team and workloads.
GitOps: When Manual Deploys Become a Liability
The moment you have more than one cluster or more than two people deploying, manual kubectl apply becomes a liability. ArgoCD and Flux both integrate natively with AKS and Azure Container Registry, but they serve different operational models.
ArgoCD shines when you need visibility. Its web UI gives product teams insight into deployment status without cluster access. Flux takes a lighter approach, running as a set of controllers without additional infrastructure. For most teams starting their GitOps journey, Flux’s lower operational overhead makes it the pragmatic first choice—you can always migrate to ArgoCD when you need its multi-tenancy features.
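If you start with Flux, AKS delivers it as a cluster extension. A rough sketch of bootstrapping it against the cluster provisioned earlier; the repository URL, paths, and names are placeholders, and flag spellings may vary across CLI versions:
# Rough sketch: bootstrap the Flux extension against a Git repo (placeholders throughout)
az k8s-configuration flux create \
  --resource-group rg-myapp-prod-eastus \
  --cluster-name aks-myapp-prod \
  --cluster-type managedClusters \
  --name cluster-config \
  --namespace flux-system \
  --scope cluster \
  --url https://github.com/example-org/aks-gitops \
  --branch main \
  --kustomization name=apps path=./clusters/prod prune=true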
Network Policies and Service Mesh
Azure CNI gives you the foundation, but without network policies, every pod can talk to every other pod. Start with deny-all defaults and explicitly allow required traffic paths. Calico network policies (available through AKS) handle 90% of use cases without the complexity of a full service mesh.
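A default-deny policy is short enough to memorize; this hypothetical version locks down the team-frontend namespace from earlier:
# Default-deny: all ingress and egress blocked for every pod in the namespace until explicitly allowed
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-frontend
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
Once this is in place, remember to explicitly allow DNS egress to kube-system, or name resolution inside the namespace breaks.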
Reserve Istio or Linkerd for when you genuinely need mTLS between services, advanced traffic shaping, or observability at the request level. Adding a service mesh to solve problems you don’t have creates operational burden that compounds over time.
Protecting Your Investment
Velero with Azure Blob Storage provides cluster-state backups that can restore entire namespaces. Schedule daily backups of critical namespaces and test restores quarterly—a backup you’ve never restored is a backup that doesn’t exist.
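A hedged sketch of such a schedule with the Velero CLI, assuming Velero is already installed against your Blob Storage location; the namespace and retention are placeholders:
# Sketch: daily 02:00 backup of a critical namespace, retained for 30 days
velero schedule create team-frontend-daily \
  --schedule="0 2 * * *" \
  --include-namespaces team-frontend \
  --ttl 720h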
💡 Pro Tip: The AKS documentation’s “Best practices” section and the Azure Architecture Center’s AKS baseline reference architecture provide battle-tested patterns that align with everything covered in this guide.
Your cluster is production-ready. Now ship something.
Key Takeaways
- Always enable Azure CNI networking and private cluster endpoints for production AKS deployments
- Separate system and user node pools to isolate critical Kubernetes components from application workloads
- Configure Container Insights and basic alert rules before deploying your first application
- Use managed identities and Azure AD integration instead of service principals for authentication