From Zero to Production: Deploying Your First AKS Cluster with Real-World Defaults


You’ve spun up a dozen Kubernetes clusters following Microsoft’s quickstart guides, only to rebuild them weeks later when security reviews flag missing configurations. The public API server endpoint that seemed fine during development suddenly becomes a blocker. The default CNI that “just worked” can’t support the network policies your compliance team requires. The cluster that took fifteen minutes to create takes three sprints to harden.

This pattern repeats across organizations because Azure’s quickstart defaults optimize for the wrong thing: time to first deployment. Microsoft wants you to see pods running within minutes, which means skipping the configurations that matter for anything beyond a demo. Public endpoints, basic networking, no monitoring integration, permissive RBAC—these defaults create clusters that work immediately and fail predictably.

The real cost isn’t the rebuild itself. It’s the workloads you’ve already deployed, the CI/CD pipelines pointing at the old cluster, the debugging sessions where “it worked in dev” becomes your team’s mantra. Every tutorial cluster that graduates to staging carries technical debt that compounds with each deployment.

Production-ready means something specific: a cluster configuration that passes security review on first submission, supports your networking requirements without migration, and provides observability from day one. It means making decisions upfront about identity, networking, and monitoring that align with where your infrastructure needs to be in six months, not where the tutorial assumes you are today.

The gap between quickstart and production-ready isn’t complexity—it’s intentionality. Let’s look at exactly where the default configurations fall short and what they assume you’ll figure out later.

Why Most AKS Quickstarts Set You Up for Failure

Run az aks create with the defaults, deploy a sample nginx pod, and celebrate—you’ve got Kubernetes running in Azure. The official tutorials walk you through this in under ten minutes. What they don’t mention is that you’ve just deployed a cluster that would fail any reasonable security audit.

Visual: AKS default configuration gaps

The gap between “it works” and “it’s production-ready” in AKS is substantial, and Microsoft’s quickstart documentation optimizes for the former. This isn’t a criticism—tutorials serve a purpose. But treating a tutorial cluster as a production foundation creates technical debt that compounds with every workload you deploy.

What the Defaults Actually Give You

A vanilla az aks create command provisions a cluster with a publicly accessible API server. Anyone on the internet can attempt to authenticate against your Kubernetes control plane. The cluster uses kubenet networking, which doesn’t support Azure’s native network policy enforcement; that requires Azure CNI. There’s no Azure Monitor integration, meaning you’re flying blind on container metrics and logs. Workload identity isn’t enabled, so applications that need Azure resources fall back on stored credentials that require rotation. And the default node pool runs a single VM size with no separation between system workloads and your applications.

Each of these defaults makes sense for a learning environment. None of them belong in production.
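
For contrast, the quickstart path is a single command with everything defaulted. A minimal sketch, with placeholder resource names:

terminal
az aks create \
  --resource-group rg-demo \
  --name aks-demo \
  --node-count 1 \
  --generate-ssh-keys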

The Hidden Costs

The real problem isn’t the initial deployment—it’s what happens six months later. That tutorial cluster now runs three microservices, handles actual customer traffic, and nobody remembers why the API server is public. Retrofitting network policies onto a cluster designed without them requires careful planning. Migrating from kubenet to Azure CNI often means reprovisioning the entire cluster. The monitoring gap means your first indication of resource exhaustion is a customer complaint, not an alert.

Teams that start with tutorial defaults spend more time remediating than teams that start with production configurations. The “move fast” approach becomes “move fast, then stop everything to fix the foundation.”

What Production-Ready Actually Means

For AKS, production readiness has specific requirements: private API server endpoints, Azure CNI with network policies, integrated monitoring and logging, managed identities for Azure resource access, separated node pools for system and user workloads, and defined autoscaling boundaries. These aren’t aspirational features—they’re baseline expectations for any cluster handling real traffic.

Before diving into implementation, you need to understand where the responsibility boundaries lie. AKS is a managed service, but “managed” doesn’t mean “fully operated.” Let’s examine exactly what Azure handles and what remains your responsibility.

AKS Architecture: What Azure Manages vs What You Own

Understanding the responsibility boundary in AKS determines whether you’ll spend your time fighting infrastructure or shipping features. Unlike self-managed Kubernetes, AKS abstracts the control plane entirely—but that abstraction comes with trade-offs you need to understand before your first production deployment.

Visual: AKS shared responsibility model

The Control Plane: Microsoft’s Domain

Azure fully manages the Kubernetes control plane: the API server, etcd, scheduler, and controller manager. You never SSH into these components, patch them, or worry about their high availability. Microsoft handles upgrades and security patches and, on the Standard tier, backs the control plane with a 99.95% uptime SLA for zone-redundant clusters.

This is genuinely hands-off. The control plane scales automatically based on cluster size, and on the Free tier it costs nothing beyond your nodes (the Standard tier adds a per-cluster fee in exchange for the uptime SLA). However, you sacrifice visibility. You can’t tune etcd performance, adjust API server flags, or access control plane components directly, though diagnostic settings can forward control plane logs and metrics to Azure Monitor.

💡 Pro Tip: The free control plane sounds great until you need custom admission controllers or API server audit policies. AKS supports these through Azure Policy and diagnostic settings, but the configuration paths differ from vanilla Kubernetes documentation.

Node Pools: Shared Responsibility Territory

Your worker nodes run on Azure VMs that you select, pay for, and partially manage. You choose the VM SKU, disk type, and networking configuration. Azure handles the underlying hypervisor, but you own:

  • OS patching: Node images need regular updates. AKS provides node image upgrades, but you trigger them (see the command after this list).
  • Scaling policies: Cluster autoscaler configuration, including min/max nodes and scale-down behavior.
  • Workload scheduling: Taints, tolerations, and node affinity rules that determine pod placement.
  • Container runtime security: While Azure manages containerd, you configure pod security standards and runtime policies.
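
Triggering a node image upgrade is a single CLI call. A minimal sketch, with placeholder resource and pool names:

terminal
## Upgrade only the node OS image, leaving the Kubernetes version unchanged
az aks nodepool upgrade \
  --resource-group rg-aks-prod \
  --cluster-name aks-prod-westus2 \
  --name workloads \
  --node-image-only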

Node pools also define your blast radius. A misconfigured system node pool takes down cluster-critical components like CoreDNS. Separating system and user workloads into dedicated pools isn’t optional for production—it’s a requirement.

Networking: The Integration Minefield

AKS networking creates the deepest Azure dependencies. Your cluster requires a virtual network, subnet allocation, and decisions about CNI plugins that affect everything from IP address consumption to network policy enforcement.

The default kubenet plugin simplifies initial setup but limits you to 400 nodes and complicates Azure service integration. Azure CNI assigns pod IPs directly from your subnet, enabling direct communication with other Azure resources but consuming IP addresses rapidly. A /24 subnet that looks generous disappears quickly when every pod needs a routable IP.

Cost Implications

Architectural choices compound financially. Larger VM SKUs reduce node count but increase blast radius. Premium SSDs improve performance for I/O-bound stateful workloads but triple storage costs. The control plane is free on the Free tier, but every add-on—Azure Policy, Key Vault integration, Defender for Containers—adds either direct costs or compute overhead on your nodes.

With this mental model established, you’re ready to provision a cluster that reflects production requirements from the start.

Provisioning AKS with Production Defaults Using Azure CLI

The default az aks create command produces a cluster that’s technically functional but operationally deficient. It uses kubenet networking, creates a public API server endpoint, and leaves Microsoft Entra integration, workload identity, and monitoring disabled. These defaults optimize for getting-started friction, not production operations.

Here’s a complete provisioning script that establishes security baselines from the start:

provision-aks.sh
#!/bin/bash
set -euo pipefail

## Configuration
RESOURCE_GROUP="rg-myapp-prod-eastus"
CLUSTER_NAME="aks-myapp-prod"
LOCATION="eastus"
VNET_NAME="vnet-myapp-prod"
SUBNET_NAME="snet-aks-nodes"
K8S_VERSION="1.29"

## Create resource group
az group create \
  --name "$RESOURCE_GROUP" \
  --location "$LOCATION"

## Create VNet with dedicated AKS subnet
az network vnet create \
  --resource-group "$RESOURCE_GROUP" \
  --name "$VNET_NAME" \
  --address-prefixes 10.0.0.0/16 \
  --subnet-name "$SUBNET_NAME" \
  --subnet-prefixes 10.0.0.0/22

## Get subnet ID for CNI configuration
SUBNET_ID=$(az network vnet subnet show \
  --resource-group "$RESOURCE_GROUP" \
  --vnet-name "$VNET_NAME" \
  --name "$SUBNET_NAME" \
  --query id -o tsv)

## Provision AKS with production defaults
az aks create \
  --resource-group "$RESOURCE_GROUP" \
  --name "$CLUSTER_NAME" \
  --location "$LOCATION" \
  --kubernetes-version "$K8S_VERSION" \
  --node-count 3 \
  --node-vm-size Standard_D4s_v5 \
  --network-plugin azure \
  --network-policy azure \
  --vnet-subnet-id "$SUBNET_ID" \
  --service-cidr 10.1.0.0/16 \
  --dns-service-ip 10.1.0.10 \
  --enable-private-cluster \
  --private-dns-zone system \
  --enable-managed-identity \
  --enable-aad \
  --enable-azure-rbac \
  --enable-defender \
  --enable-workload-identity \
  --enable-oidc-issuer \
  --auto-upgrade-channel stable \
  --node-os-upgrade-channel NodeImage \
  --tier standard \
  --zones 1 2 3

Why Azure CNI Over Kubenet

The --network-plugin azure flag configures Azure CNI, which assigns VNet IP addresses directly to pods. This eliminates the NAT layer that kubenet requires and enables direct pod-to-pod communication across VNets through peering. Your pods become first-class VNet citizens with routable IPs, which matters when integrating with Azure services that require VNet connectivity—Azure SQL with private endpoints, Azure Storage with service endpoints, or on-premises resources through ExpressRoute.

The /22 subnet prefix provides 1,024 addresses. Azure CNI reserves IPs per node (30 pods by default plus the node itself), so plan accordingly. A three-node cluster with default settings consumes roughly 93 IPs, leaving room for scaling and pod churn.

The --network-policy azure flag enables Azure Network Policy Manager, providing native Kubernetes network policies for microsegmentation. This allows you to define ingress and egress rules at the pod level, restricting traffic between namespaces or specific workloads without deploying third-party solutions like Calico.
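
A sensible starting point once the cluster is up is a default-deny policy per namespace, with explicit allowances layered on top. A minimal sketch (the namespace name is a placeholder):

deny-all-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: team-frontend
spec:
  podSelector: {}    # selects every pod in the namespace
  policyTypes:
    - Ingress        # no ingress rules defined, so all inbound traffic is denied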

Private Cluster Endpoints

The --enable-private-cluster flag removes the public API server endpoint entirely. The Kubernetes API becomes accessible only through a private endpoint within your VNet. This single flag eliminates an entire attack surface—no amount of RBAC misconfiguration exposes your control plane to the internet.

The --private-dns-zone system setting creates an Azure-managed private DNS zone for cluster name resolution. For multi-cluster environments sharing a VNet, consider using a custom private DNS zone instead to maintain consistent naming across clusters.

💡 Pro Tip: Private clusters require network connectivity for management. Set up Azure Bastion or a jump box VM before provisioning, or use az aks command invoke for emergency access.
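
The command invoke path is worth testing before you need it. It tunnels a kubectl command through the managed control plane, so it works even without VNet connectivity:

terminal
az aks command invoke \
  --resource-group rg-myapp-prod-eastus \
  --name aks-myapp-prod \
  --command "kubectl get nodes -o wide"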

Managed Identity Configuration

The combination of --enable-managed-identity, --enable-workload-identity, and --enable-oidc-issuer establishes a credential-free security model. The cluster’s system-assigned managed identity handles Azure resource operations without storing credentials. Workload identity extends this pattern to your applications, allowing pods to authenticate to Azure services using Kubernetes service account tokens federated with Entra ID.

This eliminates service principal credentials that expire after one year and require rotation scripts. Managed identities rotate credentials automatically and never expose secrets to your deployment pipelines. The OIDC issuer publishes discovery documents that Azure can verify, enabling secure token exchange without shared secrets.
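
Consuming workload identity from an application involves a Kubernetes service account annotated with a user-assigned managed identity’s client ID, plus a label on the pod; the identity’s federated credential is configured separately with az identity federated-credential create. A minimal sketch, with placeholder names and GUID:

workload-identity-sa.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: myapp-sa
  namespace: team-frontend
  annotations:
    azure.workload.identity/client-id: "00000000-0000-0000-0000-000000000000"  # placeholder
---
apiVersion: v1
kind: Pod
metadata:
  name: myapp
  namespace: team-frontend
  labels:
    azure.workload.identity/use: "true"   # injects the federated token into the pod
spec:
  serviceAccountName: myapp-sa
  containers:
    - name: app
      image: mcr.microsoft.com/azuredocs/aks-helloworld:v1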

Azure RBAC Integration

The --enable-aad and --enable-azure-rbac flags bring cluster authorization under Azure role assignments, layered on top of Kubernetes-native RBAC. Cluster access integrates directly with Entra ID—the same groups controlling your Azure subscriptions now control kubectl access. This consolidates identity management and provides audit logs in Azure Activity Log rather than scattered across cluster logs.

Azure RBAC for Kubernetes supports built-in roles like Azure Kubernetes Service RBAC Reader, Writer, Admin, and Cluster Admin, but you can also create custom role definitions scoped to specific namespaces or resource types.
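
An assignment is an ordinary az role assignment create call, and it can be scoped to a single namespace by appending the namespace to the cluster resource ID. A sketch with placeholder names:

terminal
AKS_ID=$(az aks show \
  --resource-group rg-myapp-prod-eastus \
  --name aks-myapp-prod \
  --query id -o tsv)

## Scope the assignment to one namespace by appending it to the cluster resource ID
az role assignment create \
  --role "Azure Kubernetes Service RBAC Writer" \
  --assignee "<entra-group-object-id>" \
  --scope "$AKS_ID/namespaces/team-frontend"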

Security and Reliability Flags

The --enable-defender flag activates Microsoft Defender for Containers, providing runtime threat detection, vulnerability scanning for container images, and security recommendations specific to your cluster configuration. This integrates with Microsoft Defender for Cloud for centralized security management.

The --tier standard flag enables the uptime SLA (99.95% for zone-redundant clusters) and unlocks features like longer cluster support windows. Production workloads should never run on the free tier, which offers no availability guarantees and limited support options.

The --zones 1 2 3 flag distributes nodes across availability zones, ensuring that zone-level failures don’t take down your entire cluster. Combined with the standard tier, this configuration maximizes control plane and worker node resilience.

With the control plane provisioned, the next step is configuring node pools that balance cost, performance, and reliability across your workload types.

Node Pool Strategy: Sizing, Scaling, and Spot Instances

Node pools are where your Kubernetes workloads actually run, and getting them wrong creates problems that ripple through your entire infrastructure. The default AKS setup gives you a single node pool running everything from system components to your applications—a configuration that works for demos but fails in production.

System vs User Node Pools

AKS runs critical system components—CoreDNS, metrics-server, and the Azure CNI plugin—alongside your workloads by default. When your application pods consume excessive resources or trigger node scaling events, these system components get caught in the crossfire.

The fix is straightforward: dedicate a node pool to system workloads and run applications on separate user node pools.

create-node-pools.sh
## Create a dedicated system node pool with taints
az aks nodepool add \
  --resource-group rg-aks-prod \
  --cluster-name aks-prod-westus2 \
  --name system \
  --node-count 3 \
  --node-vm-size Standard_D2s_v5 \
  --mode System \
  --node-taints CriticalAddonsOnly=true:NoSchedule \
  --zones 1 2 3

## Create a user node pool for application workloads
az aks nodepool add \
  --resource-group rg-aks-prod \
  --cluster-name aks-prod-westus2 \
  --name workloads \
  --node-count 3 \
  --node-vm-size Standard_D4s_v5 \
  --mode User \
  --enable-cluster-autoscaler \
  --min-count 3 \
  --max-count 10 \
  --zones 1 2 3

The CriticalAddonsOnly taint ensures only system components with matching tolerations schedule onto system nodes. Your application deployments need no changes—they simply won’t schedule on system nodes.

Choosing VM Sizes

VM selection depends on your workload profile, not abstract best practices. General-purpose D-series VMs (D4s_v5, D8s_v5) handle most web applications and APIs well. Memory-intensive workloads like Redis or Elasticsearch benefit from E-series VMs. Compute-heavy batch processing runs better on F-series.

Start with D4s_v5 (4 vCPU, 16 GB RAM) for user node pools. This size provides enough headroom for typical containerized applications while keeping per-node costs reasonable. Smaller VMs increase scheduling fragmentation; larger VMs create blast radius problems when nodes fail.

Configuring the Cluster Autoscaler

The autoscaler responds to pending pods that can’t be scheduled due to resource constraints. Set bounds that prevent both runaway costs and capacity starvation:

autoscaler-profile.sh
az aks update \
  --resource-group rg-aks-prod \
  --name aks-prod-westus2 \
  --cluster-autoscaler-profile \
    scale-down-delay-after-add=10m \
    scale-down-unneeded-time=10m \
    max-graceful-termination-sec=600 \
    scan-interval=10s

The scale-down-delay-after-add parameter prevents thrashing during deployment rollouts. Without it, the autoscaler might add nodes during a deployment, then immediately try to remove them once pods redistribute.

💡 Pro Tip: Set max-count based on your Azure subscription quotas. The autoscaler fails silently when hitting vCPU limits, leaving pods pending indefinitely.

Spot Instances for Cost Optimization

Spot VMs offer 60-90% cost savings, but Azure can evict them with 30 seconds’ notice. Use them for workloads that tolerate interruption: batch jobs, dev/test environments, or stateless workers behind a queue.

spot-node-pool.sh
az aks nodepool add \
  --resource-group rg-aks-prod \
  --cluster-name aks-prod-westus2 \
  --name spotworkers \
  --node-count 0 \
  --node-vm-size Standard_D4s_v5 \
  --priority Spot \
  --eviction-policy Delete \
  --spot-max-price -1 \
  --enable-cluster-autoscaler \
  --min-count 0 \
  --max-count 20

Setting spot-max-price to -1 caps the price at the standard pay-as-you-go rate, so nodes are evicted for capacity, never for price. AKS automatically taints spot pools with kubernetes.azure.com/scalesetpriority=spot:NoSchedule, so workloads must explicitly tolerate spot placement—preventing accidental scheduling of critical services onto evictable nodes.

Handle evictions by setting appropriate terminationGracePeriodSeconds on your pods and implementing graceful shutdown logic. Kubernetes receives the eviction notice and begins draining—your application has seconds, not minutes, to clean up.
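
Opting a workload into spot placement looks roughly like this. A sketch of the relevant pod spec fields (names and image are placeholders):

spot-tolerant-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker
spec:
  replicas: 4
  selector:
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
    spec:
      terminationGracePeriodSeconds: 25   # stay under the ~30s eviction notice
      tolerations:
        - key: kubernetes.azure.com/scalesetpriority
          operator: Equal
          value: spot
          effect: NoSchedule
      containers:
        - name: worker
          image: myregistry.azurecr.io/batch-worker:latest   # placeholder image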

With node pools configured for isolation, scaling, and cost efficiency, you need visibility into what’s actually happening inside the cluster. Azure Monitor and Container Insights provide that observability layer.

Integrating Azure Monitor and Container Insights

Deploying a cluster without observability is deploying blind. Container Insights provides deep visibility into your AKS workloads—CPU, memory, container restarts, pod scheduling failures—all flowing into Log Analytics where you can query, alert, and diagnose. Enable it at cluster creation, not as an afterthought. The cost of retrofitting monitoring onto a production cluster under pressure far exceeds the minimal effort of configuring it upfront.

Enabling Container Insights at Cluster Creation

The cleanest approach creates a Log Analytics workspace and enables monitoring in a single provisioning command:

enable-monitoring.sh
## Create a dedicated Log Analytics workspace
az monitor log-analytics workspace create \
  --resource-group prod-aks-rg \
  --workspace-name prod-aks-logs \
  --location eastus \
  --retention-time 30

## Get the workspace resource ID
WORKSPACE_ID=$(az monitor log-analytics workspace show \
  --resource-group prod-aks-rg \
  --workspace-name prod-aks-logs \
  --query id -o tsv)

## Create cluster with Container Insights enabled
az aks create \
  --resource-group prod-aks-rg \
  --name prod-cluster \
  --enable-addons monitoring \
  --workspace-resource-id "$WORKSPACE_ID" \
  --generate-ssh-keys

For existing clusters, enable the addon with az aks enable-addons --addons monitoring --workspace-resource-id $WORKSPACE_ID. The monitoring agent deploys as a DaemonSet on every node; its resource overhead is modest relative to the visibility it provides.

Key Metrics to Monitor from Day One

Container Insights surfaces metrics through both Azure Monitor and Prometheus. Focus on these from the start:

  • Node CPU/Memory utilization: Sustained usage above 80% signals scaling needs. Spikes are normal; sustained pressure is not.
  • Pod restart counts: Frequent restarts indicate crash loops or OOM kills. Even a single restart warrants investigation.
  • Pending pods: Pods stuck in Pending state point to resource constraints, node affinity conflicts, or scheduling issues.
  • Container working set memory: Tracks actual memory consumption against limits. When working set approaches limits, OOM kills follow.
  • Network I/O per pod: Unusual spikes can indicate misconfigured services or potential security incidents.

Log Analytics Queries That Surface Real Problems

Raw logs are noise. These KQL queries extract signal:

useful-queries.kql
// Pods in Failed or Unknown states
KubePodInventory
| where PodStatus in ("Failed", "Unknown")
| summarize count() by Name, Namespace, PodStatus
| order by count_ desc

// Containers averaging over ~500 MB of working set memory
Perf
| where ObjectName == "K8SContainer" and CounterName == "memoryWorkingSetBytes"
| summarize AvgMemory = avg(CounterValue) by InstanceName
| where AvgMemory > 500000000

// Node CPU usage over the last 24 hours, bucketed hourly
Perf
| where ObjectName == "K8SNode" and CounterName == "cpuUsageNanoCores"
| where TimeGenerated > ago(24h)
| summarize avg(CounterValue) by bin(TimeGenerated, 1h), Computer

Save these queries as functions in your Log Analytics workspace. When an incident occurs at 3 AM, you want answers in seconds, not time spent remembering query syntax.
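
Saving a query as a reusable function can itself be scripted. A sketch using the Azure CLI (names are placeholders; check your CLI version’s parameters):

terminal
az monitor log-analytics workspace saved-search create \
  --resource-group prod-aks-rg \
  --workspace-name prod-aks-logs \
  --name failed-pods \
  --category "AKS Incidents" \
  --display-name "Pods in failed states" \
  --func-alias FailedPods \
  --saved-query 'KubePodInventory | where PodStatus in ("Failed", "Unknown") | summarize count() by Name, Namespace, PodStatus'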

Alert Rules That Actually Matter

Skip the default Azure alert templates—they generate noise. Configure these three alerts manually:

  1. Node NotReady for 5+ minutes: Critical. Indicates node failure or kubelet issues. This alert should wake someone up.
  2. Pod restart count > 5 in 10 minutes: Warning. Catches crash loops before they impact users.
  3. Persistent volume usage > 85%: Warning. PVC exhaustion causes application failures that are painful to debug retroactively.

create-alert.sh
az monitor metrics alert create \
  --name "node-not-ready" \
  --resource-group prod-aks-rg \
  --scopes "/subscriptions/a1b2c3d4-5678-90ab-cdef-1234567890ab/resourceGroups/prod-aks-rg/providers/Microsoft.ContainerService/managedClusters/prod-cluster" \
  --condition "avg kube_node_status_condition < 1" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --severity 1

Configure action groups to route critical alerts to PagerDuty or your on-call system. Warnings can go to Slack or Teams. The goal is signal without fatigue—an alert that fires constantly gets ignored.
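
Action groups are ordinary Azure resources. A sketch of creating one with an email receiver (names and address are placeholders); pass it to --action on az monitor metrics alert create to wire it to the rule above:

terminal
## Create an action group with an email receiver
az monitor action-group create \
  --name oncall-critical \
  --resource-group prod-aks-rg \
  --short-name oncall \
  --action email primary-oncall oncall@example.com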

💡 Pro Tip: Keep the default 30-day Log Analytics retention for production—it balances cost against the need to investigate incidents that surface days after they begin. Increase to 90 days if compliance requires longer audit trails.

With observability in place, you have the visibility needed to deploy workloads confidently. The next section covers deploying your first application with proper RBAC—because running containers as cluster-admin is how breaches happen.

Deploying Your First Workload with Proper RBAC

You have a cluster. Now the real work begins: deploying applications without creating security debt you’ll spend months paying down. Most tutorials skip RBAC entirely or bind everyone to the cluster-admin ClusterRole — the Kubernetes equivalent of running everything as root. This section establishes the patterns that separate production-grade deployments from tutorial-quality experiments.

Connecting kubectl with Azure AD Authentication

Start by configuring kubectl to authenticate through Azure AD rather than using the local admin credentials. The local admin kubeconfig bypasses all RBAC policies and audit logging — convenient for break-glass scenarios, dangerous for daily operations:

terminal
az aks get-credentials --resource-group rg-aks-prod --name aks-prod-eastus --overwrite-existing
## Verify you're using Azure AD authentication
kubectl get nodes

The first kubectl command triggers an Azure AD login flow. This authentication integrates with your existing identity provider, enabling conditional access policies and MFA enforcement at the cluster level. Every API call now carries user identity through to the audit logs, creating the accountability trail that security teams require.

If you encounter authentication failures, verify that your Azure AD account belongs to a group with cluster access. The AKS-managed Azure AD integration requires explicit group assignments — no default access exists for tenant members.
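
For CI pipelines and other non-interactive contexts, the kubeconfig can be converted to reuse your existing Azure CLI session via kubelogin. A minimal sketch:

terminal
## Convert the kubeconfig to authenticate via the Azure CLI's cached credentials
kubelogin convert-kubeconfig -l azurecli
kubectl get nodes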

Creating Namespaces with Resource Quotas

Never let teams deploy to a namespace without resource boundaries. A single misconfigured deployment with no memory limits brings down your entire node. Worse, it affects every other workload scheduled on that node, turning one team’s mistake into a cluster-wide incident:

namespace-team-frontend.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-frontend
  labels:
    team: frontend
    environment: production
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-frontend-quota
  namespace: team-frontend
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "20"
    services: "10"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: team-frontend-limits
  namespace: team-frontend
spec:
  limits:
    - default:
        cpu: 500m
        memory: 512Mi
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      type: Container

The ResourceQuota enforces hard caps on aggregate consumption within the namespace. When teams hit these limits, Kubernetes rejects new pod creation rather than allowing unbounded growth. The LimitRange complements this by automatically injecting resource requests and limits into pods that don’t specify them, preventing accidental resource exhaustion from developers who forget to add constraints.

Size these quotas based on actual team needs. Start conservative — you can always increase limits, but reducing them after teams have deployed workloads creates friction.
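
To see how much of a quota a team is actually consuming, kubectl reports current usage against the hard caps directly:

terminal
kubectl describe resourcequota team-frontend-quota -n team-frontend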

Minimal RBAC Roles for Development Teams

Grant the minimum permissions teams need. This role allows developers to manage their workloads without touching cluster-wide resources or other teams’ namespaces:

rbac-team-frontend.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: team-frontend
  name: developer
rules:
  - apiGroups: ["", "apps", "batch"]
    resources: ["deployments", "pods", "services", "configmaps", "secrets", "jobs", "cronjobs"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: [""]
    resources: ["pods/log", "pods/exec"]
    verbs: ["get", "create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: frontend-developers
  namespace: team-frontend
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: developer
subjects:
  - kind: Group
    name: "a1b2c3d4-5678-90ab-cdef-1234567890ab" # Azure AD group ID
    apiGroup: rbac.authorization.k8s.io

Replace the group ID with your Azure AD group’s object ID. Binding to groups rather than individual users simplifies onboarding and offboarding — add someone to the Azure AD group, and they immediately inherit the appropriate cluster permissions. Remove them from the group, and access revokes within minutes as tokens expire.

Notice what this role excludes: no access to persistent volumes, no ability to create ingresses, no permission to modify network policies. Teams escalate through tickets when they need broader access, creating an audit trail and forcing intentional decisions about privilege expansion.
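
You can sanity-check the boundary from a developer’s session with kubectl’s built-in authorization probe:

terminal
kubectl auth can-i create deployments -n team-frontend   ## expect: yes
kubectl auth can-i create ingresses -n team-frontend     ## expect: no
kubectl auth can-i list nodes                            ## expect: no (cluster-scoped)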

Deploying with Proper Security Context

Your first deployment should enforce security constraints that become standard across all workloads. These settings prevent container escapes and limit blast radius if an application gets compromised:

deployment-sample-app.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-api
  namespace: team-frontend
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sample-api
  template:
    metadata:
      labels:
        app: sample-api
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: api
          image: mcr.microsoft.com/azuredocs/aks-helloworld:v1
          ports:
            - containerPort: 80
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop: ["ALL"]
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 250m
              memory: 256Mi

The pod-level securityContext establishes baseline constraints for all containers. The container-level settings then tighten restrictions further: dropping all Linux capabilities, preventing privilege escalation, and making the root filesystem read-only. Combined, these settings significantly reduce the attack surface available to compromised workloads.

💡 Pro Tip: The readOnlyRootFilesystem: true setting breaks applications that write to the container filesystem. Mount an emptyDir volume for /tmp or application-specific paths that require write access. This is a feature, not a bug — it forces explicit decisions about what your containers can modify.
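
The fix is a scratch volume. An illustrative fragment (not a complete manifest) showing the fields to add to the pod spec above:

tmp-volume-fragment.yaml
# Additions to the deployment's pod template
spec:
  template:
    spec:
      volumes:
        - name: tmp
          emptyDir: {}        # ephemeral scratch space, wiped on pod restart
      containers:
        - name: api
          volumeMounts:
            - name: tmp
              mountPath: /tmp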

Apply these manifests in order — namespace first, then RBAC, then the deployment:

terminal
kubectl apply -f namespace-team-frontend.yaml
kubectl apply -f rbac-team-frontend.yaml
kubectl apply -f deployment-sample-app.yaml

Verify the deployment succeeded and pods are running with expected constraints:

terminal
kubectl get pods -n team-frontend
kubectl describe pod -n team-frontend -l app=sample-api

You now have a workload running with enforced resource boundaries, least-privilege RBAC, and a hardened security context. These patterns scale — every new team gets their own namespace with identical guardrails. Document your namespace provisioning process and consider automating it through GitOps or a self-service portal.

With workloads deployed securely, you’re ready to move beyond manual kubectl operations. The next section covers evolving this foundation toward GitOps workflows and the networking configurations that production traffic demands.

What’s Next: Graduating to GitOps and Advanced Networking

You now have a production-grade AKS cluster with proper monitoring, RBAC, and node pool strategy. But running Kubernetes at scale demands more than solid initial configuration—it requires operational patterns that grow with your team and workloads.

GitOps: When Manual Deploys Become a Liability

The moment you have more than one cluster or more than two people deploying, manual kubectl apply becomes a liability. ArgoCD and Flux both integrate natively with AKS and Azure Container Registry, but they serve different operational models.

ArgoCD shines when you need visibility. Its web UI gives product teams insight into deployment status without cluster access. Flux takes a lighter approach, running as a set of controllers without additional infrastructure. For most teams starting their GitOps journey, Flux’s lower operational overhead makes it the pragmatic first choice—you can always migrate to ArgoCD when you need its multi-tenancy features.
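
If you go the Flux route, AKS offers it as a managed cluster extension, so you don’t hand-install the controllers. A sketch using the k8s-extension CLI extension (names are placeholders):

terminal
## Install the managed Flux extension on an existing AKS cluster
az k8s-extension create \
  --resource-group rg-aks-prod \
  --cluster-name aks-prod-westus2 \
  --cluster-type managedClusters \
  --name flux \
  --extension-type microsoft.flux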

Network Policies and Service Mesh

Azure CNI gives you the foundation, but without network policies, every pod can talk to every other pod. Start with deny-all defaults, as in the default-deny example earlier, and explicitly allow required traffic paths. The Azure and Calico policy engines (both available through AKS) handle 90% of use cases without the complexity of a full service mesh.

Reserve Istio or Linkerd for when you genuinely need mTLS between services, advanced traffic shaping, or observability at the request level. Adding a service mesh to solve problems you don’t have creates operational burden that compounds over time.

Protecting Your Investment

Velero with Azure Blob Storage provides cluster-state backups that can restore entire namespaces. Schedule daily backups of critical namespaces and test restores quarterly—a backup you’ve never restored is a backup that doesn’t exist.
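
Once Velero is installed with the Azure plugin, scheduling a daily namespace backup is one command. A sketch (namespace, schedule, and retention are placeholders):

terminal
## Daily 2 AM backup of a critical namespace, retained for 30 days
velero schedule create team-frontend-daily \
  --schedule "0 2 * * *" \
  --include-namespaces team-frontend \
  --ttl 720h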

💡 Pro Tip: The AKS documentation’s “Best practices” section and the Azure Architecture Center’s AKS baseline reference architecture provide battle-tested patterns that align with everything covered in this guide.

Your cluster is production-ready. Now ship something.

Key Takeaways

  • Always enable Azure CNI networking and private cluster endpoints for production AKS deployments
  • Separate system and user node pools to isolate critical Kubernetes components from application workloads
  • Configure Container Insights and basic alert rules before deploying your first application
  • Use managed identities and Azure AD integration instead of service principals for authentication