GKE Best Practices: Hardening Security, Networking, and Automation for Production Clusters
Your GKE cluster works in staging, but production is a different beast. A misconfigured network policy leaks traffic between namespaces, your nodes run as root, and a manual kubectl apply is your entire deployment process. These aren’t hypothetical risks—they’re the gaps that separate a cluster that runs from one that’s actually production-ready.
The default GKE configuration is optimized for getting started, not for operating at scale under real security constraints. Legacy authorization modes ship enabled. Node service accounts carry permissions they don’t need. Workload-to-workload traffic flows freely across namespace boundaries. None of this causes problems in a developer sandbox. In production, each of these is an incident waiting for the right trigger.
What makes GKE specifically demanding isn’t Kubernetes complexity—it’s the false confidence that a managed control plane creates. Google handles etcd, API server upgrades, and control plane availability. That’s real value. But it doesn’t touch your node configuration, your network policies, your IAM bindings, or your deployment pipeline. The operational surface that causes production incidents lives entirely in your half of the responsibility model.
GKE also ships primitives that vanilla Kubernetes doesn’t: Workload Identity for pod-level IAM, Binary Authorization for enforcing signed images, Shielded Nodes for protecting the node boot chain. These features exist because Google’s own production experience identified where clusters get compromised. Ignoring them isn’t neutral—it’s leaving a known attack surface unaddressed.
Understanding that gap—between what GKE gives you and what production-hardened actually means—is the foundation everything else builds on.
Why GKE Demands More Than Default Kubernetes Assumptions
Google Kubernetes Engine offloads control plane management—etcd, the API server, scheduler, and controller manager all run on infrastructure Google operates and maintains. This is genuinely valuable. It is also frequently misread as a broader safety guarantee than it actually is.

The managed control plane eliminates an entire class of operational burden. It does not eliminate the attack surface, blast radius, or compliance obligations that come with running workloads at production scale. The responsibility model shifts, but the operator’s responsibility does not disappear. Node security, workload isolation, network policy enforcement, identity federation, and supply chain integrity all remain firmly in the platform team’s domain.
Default Settings Are Optimized for Accessibility, Not Production
A freshly created GKE cluster with default settings is designed to get you running quickly. It is not designed to pass a security audit, withstand a compromised workload, or operate cleanly at the scale where failure has organizational consequences. The defaults ship with legacy metadata API access enabled, no network policies enforced, basic authentication options present, and node service accounts that carry broader IAM permissions than any production workload should possess.
The gap between a functioning cluster and a hardened cluster is where most production incidents originate. Workloads run. Deployments succeed. And then a misconfigured service account, an overpermissioned node pool, or an unencrypted secret surfaces in a post-incident review.
GKE Exposes Primitives That Generic Kubernetes Advice Ignores
Generic Kubernetes hardening guidance treats cloud identity as an external concern. GKE does not. Workload Identity binds Kubernetes service accounts directly to Google Cloud IAM without key files or credential rotation machinery. Binary Authorization enforces attestation policies at deploy time, making it possible to reject images that haven’t cleared your signing pipeline. Shielded Nodes extend Secure Boot and vTPM-backed integrity verification to the node layer itself.
These primitives exist because Google’s threat model for GKE reflects the realities of multi-tenant, internet-connected production infrastructure. Treating them as optional hardening steps rather than baseline configuration is the pattern that separates clusters that hold up under scrutiny from those that don’t.
💡 Pro Tip: Review Google’s GKE hardening guide as a checklist, not a tutorial—it maps directly to CIS Kubernetes Benchmark controls and gives each recommendation an explicit risk rating.
The architecture decisions made before a cluster receives its first workload determine how much remediation work follows. That starts with selecting the right cluster mode and node configuration for the operational profile you’re actually running.
Cluster Architecture: Choosing the Right Mode and Node Configuration
Before writing a single Terraform resource block, the architectural decisions you make about cluster topology determine whether your production environment is resilient, secure, and economical—or fragile by design.

Autopilot vs. Standard Mode
Autopilot trades configurability for operational simplicity. Google manages node provisioning, scaling, and security hardening automatically, enforcing pod-level resource requests and applying security policies by default. For teams without a dedicated platform engineering function, Autopilot eliminates an entire category of node-level operational burden.
Standard mode remains the right choice when you need direct control over node configuration: custom machine families, GPUs, specific kernel parameters, or DaemonSets that require privileged access. Most platform teams running mixed workloads—APIs, batch jobs, ML inference—operate Standard clusters precisely because workload diversity demands that control surface.
The decision is not philosophical. If your organization enforces a strict separation between application teams and infrastructure teams, and node-level customization requests arrive frequently, choose Standard and invest in node pool automation. If your teams are small and workload profiles are uniform, Autopilot eliminates operational overhead that has no business value.
Regional Clusters and Control Plane HA
Run regional clusters in production. A zonal cluster ties the Kubernetes API server to a single zone; a zone outage takes down the control plane entirely, blocking deployments and autoscaling even if your workloads survive on remaining nodes. Regional clusters distribute the control plane across three zones automatically, providing HA with no additional configuration.
Zonal clusters are acceptable for development and staging environments where cost reduction justifies the availability tradeoff. For production, the cost difference between regional and zonal is marginal compared to the blast radius of a control plane outage during an incident.
Node Pool Design
Structure node pools around workload classes, not individual applications. A practical baseline for most production clusters:
- System pool: small, stable machine types (e2-medium or n2-standard-2) reserved for cluster-critical DaemonSets and add-ons
- Application pool: general-purpose nodes (n2-standard-4 or n2-standard-8) with autoscaling enabled for stateless services
- Batch pool: Spot instances (formerly Preemptible) for fault-tolerant workloads—data pipelines, build jobs, ML training runs where interruption is acceptable
Spot nodes on GKE Spot deliver 60–90% cost savings over on-demand pricing. The trade-off is interruption with a 30-second eviction notice. Design batch workloads to checkpoint state and tolerate restarts, and Spot becomes economically rational, not a compromise.
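In Terraform, the batch pool above can be sketched as a Spot-backed node pool with a taint that keeps general workloads off it. Pool name, machine type, and autoscaling bounds are illustrative placeholders:

```hcl
# Sketch of a Spot-backed batch pool; names and sizes are illustrative.
resource "google_container_node_pool" "batch" {
  name     = "batch-pool"
  cluster  = google_container_cluster.primary.name
  location = "us-central1"

  autoscaling {
    min_node_count = 0
    max_node_count = 20
  }

  node_config {
    machine_type = "n2-standard-8"
    spot         = true # Spot pricing; nodes can be reclaimed on short notice

    # Only workloads that explicitly tolerate this taint land on Spot nodes.
    taint {
      key    = "workload-class"
      value  = "batch"
      effect = "NO_SCHEDULE"
    }
  }
}
```

Batch workloads then add a matching toleration, so interruption-sensitive services can never be scheduled onto reclaimable capacity by accident.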
Shielded Nodes and Secure Boot
Enable Shielded Nodes on every Standard cluster. Shielded Nodes provide verifiable node integrity using Secure Boot, vTPM, and integrity monitoring—preventing rootkits and boot-level tampering from compromising the node before the OS fully initializes. This is non-negotiable for clusters handling regulated data or multi-tenant workloads.
Resource Requests and Limits as Cluster Health Policy
Resource requests are not performance tuning—they are the scheduler’s only input for bin-packing decisions. Pods without requests create invisible load that destabilizes nodes under pressure. Enforce requests and limits through admission control (covered in the security section below), and treat any workload missing them as a configuration defect, not a gap to revisit later.
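In a pod spec, that policy is simply explicit requests and limits on every container. A minimal fragment—the numbers are placeholders to right-size per workload:

```yaml
# Every container declares what the scheduler can count on.
resources:
  requests:
    cpu: "250m"      # guaranteed share; the scheduler's bin-packing input
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"  # exceeding the memory limit gets the container OOM-killed
```

Workloads that set requests equal to limits land in the Guaranteed QoS class, which makes their eviction behavior under node pressure predictable.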
With cluster topology settled, the next step is encoding these decisions as reproducible infrastructure—which is where Terraform and Cloud Build turn architectural intent into auditable, version-controlled configuration.
Infrastructure as Code: Provisioning GKE with Terraform and Cloud Build
Manual cluster creation via gcloud or the Cloud Console works once. It fails at team scale: there is no diff, no review, no history. Flags get omitted. Defaults change between SDK versions. One engineer enables Workload Identity; another does not. The clusters diverge silently until a security scan or a production incident surfaces the inconsistency.
Terraform solves this by making the cluster declaration the source of truth—every node pool size, every security flag, every network setting lives in version-controlled HCL that can be reviewed, approved, and applied consistently across environments. The cluster configuration becomes a pull request, not a tribal memory artifact.
Defining the Cluster in Terraform
The google_container_cluster resource exposes every GKE control-plane knob. The arguments below are not optional hygiene—they are the baseline for a cluster you would run in production:
```hcl
resource "google_container_cluster" "primary" {
  name     = "prod-cluster"
  location = "us-central1"
  project  = "my-gke-project-123456"

  # Remove the default node pool immediately; manage nodes separately.
  remove_default_node_pool = true
  initial_node_count       = 1

  network    = google_compute_network.vpc.name
  subnetwork = google_compute_subnetwork.nodes.name

  networking_mode = "VPC_NATIVE"
  ip_allocation_policy {
    cluster_secondary_range_name  = "pods"
    services_secondary_range_name = "services"
  }

  private_cluster_config {
    enable_private_nodes    = true
    enable_private_endpoint = false
    master_ipv4_cidr_block  = "172.16.0.0/28"
  }

  master_authorized_networks_config {
    cidr_blocks {
      cidr_block   = "10.0.0.0/8"
      display_name = "internal-vpn"
    }
  }

  workload_identity_config {
    workload_pool = "my-gke-project-123456.svc.id.goog"
  }

  release_channel {
    channel = "REGULAR"
  }

  deletion_protection = true
}
```
```hcl
resource "google_container_node_pool" "primary_nodes" {
  name       = "primary-pool"
  cluster    = google_container_cluster.primary.name
  location   = "us-central1"
  node_count = 3

  node_config {
    machine_type    = "e2-standard-4"
    service_account = google_service_account.node_sa.email
    oauth_scopes    = ["https://www.googleapis.com/auth/cloud-platform"]

    shielded_instance_config {
      enable_secure_boot          = true
      enable_integrity_monitoring = true
    }

    workload_metadata_config {
      mode = "GKE_METADATA"
    }
  }

  management {
    auto_repair  = true
    auto_upgrade = true
  }
}
```

Three decisions here are worth calling out explicitly. Setting remove_default_node_pool = true prevents GKE from creating an unmanaged pool that sits outside your Terraform state—a pool you cannot modify or delete through IaC without manual intervention. Managing nodes in a separate google_container_node_pool resource allows you to replace node pools independently without destroying the control plane, which matters when you need to change machine types or update node configurations in production. The deletion_protection = true flag is a hard guardrail against accidental terraform destroy; removing it requires an explicit plan-and-apply cycle, giving you one additional checkpoint before an irreversible operation.
Shielded nodes and GKE_METADATA mode on workload_metadata_config are equally non-negotiable. Shielded nodes verify the integrity of the node OS on every boot. GKE_METADATA mode prevents workloads from querying the instance metadata server directly—a common lateral movement path when Workload Identity is not enforced at the node level.
Remote State with GCS and State Locking
Local state is a liability the moment more than one person touches the codebase. Use a GCS bucket as the backend with object versioning enabled so every state file change is recoverable:
```hcl
terraform {
  backend "gcs" {
    bucket = "my-terraform-state-123456"
    prefix = "gke/prod"
  }
}
```

The GCS backend provides state locking natively—no separate DynamoDB table is required as with S3. Enable versioning on the bucket through Terraform itself so you can roll back state corruption without losing the history of what changed and when:
```hcl
resource "google_storage_bucket" "tf_state" {
  name          = "my-terraform-state-123456"
  location      = "US"
  force_destroy = false

  versioning {
    enabled = true
  }

  uniform_bucket_level_access = true
}
```

Restrict IAM on this bucket tightly. Only the Cloud Build service account and a break-glass admin group should have write access. Engineers who need to inspect state can be granted storage.objectViewer on the bucket without being able to corrupt or overwrite it.
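That IAM split can be expressed in the same Terraform module. A minimal sketch—the service account email and group address here are illustrative placeholders, not values from this project:

```hcl
# Hypothetical principals: substitute your Cloud Build SA and admin group.
resource "google_storage_bucket_iam_member" "ci_state_writer" {
  bucket = google_storage_bucket.tf_state.name
  role   = "roles/storage.objectAdmin"
  member = "serviceAccount:terraform-ci@my-gke-project-123456.iam.gserviceaccount.com"
}

resource "google_storage_bucket_iam_member" "engineers_read_only" {
  bucket = google_storage_bucket.tf_state.name
  role   = "roles/storage.objectViewer"
  member = "group:platform-engineers@example.com"
}
```

Granting `objectViewer` rather than project-level storage roles keeps the blast radius confined to this one bucket.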
Cloud Build Pipeline with Approval Gates
A plan that runs locally and an apply that runs from a laptop are both undocumented. The Cloud Build pipeline below makes every cluster change auditable and requires a human approval between plan and apply:
```yaml
steps:
  - name: "hashicorp/terraform:1.9"
    entrypoint: "sh"
    args:
      - "-c"
      - |
        terraform init \
          -backend-config="bucket=my-terraform-state-123456" \
          -backend-config="prefix=gke/prod"
    id: init

  - name: "hashicorp/terraform:1.9"
    entrypoint: "terraform"
    args: ["validate"]
    id: validate

  - name: "hashicorp/terraform:1.9"
    entrypoint: "terraform"
    args: ["plan", "-out=tfplan", "-input=false"]
    id: plan

  - name: "hashicorp/terraform:1.9"
    entrypoint: "terraform"
    args: ["apply", "-input=false", "tfplan"]
    id: apply

options:
  logging: CLOUD_LOGGING_ONLY
```

Connect this pipeline to a protected branch in your repository. Enable Cloud Build’s manual approval feature so a second engineer reviews the plan output before any change reaches the cluster—approvals gate entire builds rather than individual steps, so in practice the apply runs in its own approval-gated trigger. The service account running Cloud Build needs only container.admin and iam.serviceAccountUser on the project—nothing broader. Auditors get a complete record: who approved, what the plan showed, and exactly when the apply ran.
💡 Pro Tip: Keep cluster provisioning and workload deployment in entirely separate pipelines. Mixing them creates a coupling where a broken application deployment can block a critical infrastructure change, or worse, a cluster upgrade triggers an accidental workload rollout. Separation also narrows the blast radius of pipeline failures—a bad Helm chart does not take down your ability to patch a node pool.
With the cluster declared, version-controlled, and deployable through a reproducible pipeline, the foundation is stable. The next section builds on it by locking down who and what can run inside that cluster—covering Workload Identity bindings, RBAC policies, and Pod Security Admission enforcement.
GKE Security Hardening: Workload Identity, RBAC, and Pod Security
Default GKE clusters give you a running Kubernetes environment. Production clusters require something harder to achieve: a security posture where compromised workloads cannot escalate privileges, exfiltrate credentials, or pull unauthorized images. This section builds that posture across four layers—identity, access control, pod security, and image supply chain.
Replace Service Account Keys with Workload Identity
Service account JSON keys are a persistent liability. They can be copied, leaked through logs, or accidentally committed to source control. Workload Identity eliminates them by federating Kubernetes service accounts with GCP IAM service accounts, scoping GCP API access to individual workloads without any key material on disk.
Enable Workload Identity at the cluster level, then bind a Kubernetes service account to a GCP IAM service account:
```bash
# Enable on an existing cluster
gcloud container clusters update my-gke-cluster \
  --region=us-central1 \
  --workload-pool=my-project-123456.svc.id.goog
```

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payments-processor
  namespace: payments
  annotations:
    iam.gke.io/gcp-service-account: payments-sa@my-project-123456.iam.gserviceaccount.com
```

```bash
gcloud iam service-accounts add-iam-policy-binding \
  payments-sa@my-project-123456.iam.gserviceaccount.com \
  --role=roles/iam.workloadIdentityUser \
  --member="serviceAccount:my-project-123456.svc.id.goog[payments/payments-processor]"
```

Pods in the payments namespace using the payments-processor KSA now authenticate to GCP as payments-sa via the metadata server—no keys, no rotation schedules, no secret mounts.
💡 Pro Tip: Block access to the underlying GCE metadata server from workloads that should not reach it. Set `--workload-metadata=GKE_METADATA` on node pools to expose only the Workload Identity endpoint and hide node-level credentials entirely.
RBAC: Namespace-Scoped, Least-Privilege by Default
Cluster-admin bindings are the Kubernetes equivalent of giving every engineer root on every production server. Scope roles to namespaces and grant only what the workload explicitly needs.
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: payments-deployer
  namespace: payments
rules:
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: payments-deployer-binding
  namespace: payments
subjects:
  - kind: ServiceAccount
    name: payments-processor
    namespace: payments
roleRef:
  kind: Role
  name: payments-deployer
  apiGroup: rbac.authorization.k8s.io
```

Audit existing bindings regularly with `kubectl get clusterrolebindings -o json | jq` to surface any cluster-admin grants that crept in through Helm charts or manual kubectl operations.
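When the jq pipeline grows unwieldy, the same audit fits in a few lines of script. A minimal sketch in Python that consumes the JSON emitted by `kubectl get clusterrolebindings -o json`—the sample document below is fabricated for illustration:

```python
import json

def cluster_admin_subjects(bindings_json: str) -> list[str]:
    """Return 'Kind/name' for every subject bound to cluster-admin."""
    doc = json.loads(bindings_json)
    found = []
    for binding in doc.get("items", []):
        if binding.get("roleRef", {}).get("name") != "cluster-admin":
            continue
        # 'subjects' can be absent on a binding, hence the `or []`.
        for subj in binding.get("subjects", []) or []:
            found.append(f'{subj.get("kind")}/{subj.get("name")}')
    return found

# Fabricated sample standing in for real kubectl output.
sample = json.dumps({
    "items": [
        {"roleRef": {"name": "cluster-admin"},
         "subjects": [{"kind": "User", "name": "alice@example.com"}]},
        {"roleRef": {"name": "view"},
         "subjects": [{"kind": "Group", "name": "devs"}]},
    ]
})
print(cluster_admin_subjects(sample))  # ['User/alice@example.com']
```

Run on a schedule, a script like this turns "audit regularly" from an intention into a diffable report.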
Pod Security Standards via Admission Control
Kubernetes Pod Security Standards (PSS) replaced PodSecurityPolicy, which was removed in 1.25. Enforce them at the namespace level using the built-in admission controller:
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```

The restricted profile blocks privilege escalation, requires non-root UIDs, mandates a RuntimeDefault or Localhost seccomp profile, and drops all Linux capabilities (only NET_BIND_SERVICE may be added back). Apply baseline to namespaces running third-party workloads that the restricted profile breaks, and investigate why they need the additional permissions.
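A pod that clears the restricted profile has to declare that posture explicitly in its spec. A minimal sketch—the pod name and image path are illustrative:

```yaml
# Minimal securityContext that satisfies the restricted PSS profile.
apiVersion: v1
kind: Pod
metadata:
  name: example          # illustrative
  namespace: payments
spec:
  containers:
    - name: app
      image: us-docker.pkg.dev/my-project-123456/apps/app:1.0.0  # illustrative path
      securityContext:
        runAsNonRoot: true
        allowPrivilegeEscalation: false
        seccompProfile:
          type: RuntimeDefault
        capabilities:
          drop: ["ALL"]
```

Omitting any of these fields in an enforcing namespace causes the admission controller to reject the pod with a message naming the missing control.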
Binary Authorization: Gate on Attested Images
Binary Authorization enforces a deploy-time policy that containers must be attested before GKE admits them. Pair it with Artifact Registry and Cloud Build to create a verifiable chain from source commit to running pod.
```yaml
defaultAdmissionRule:
  evaluationMode: REQUIRE_ATTESTATION
  enforcementMode: ENFORCED_BLOCK_AND_AUDIT_LOG
  requireAttestationsBy:
    - projects/my-project-123456/attestors/build-verified
clusterAdmissionRules:
  us-central1.my-gke-cluster:
    evaluationMode: REQUIRE_ATTESTATION
    enforcementMode: ENFORCED_BLOCK_AND_AUDIT_LOG
    requireAttestationsBy:
      - projects/my-project-123456/attestors/build-verified
```

Images that were not signed by your Cloud Build pipeline—including images pulled directly from Docker Hub with latest tags—are blocked at admission. This single control eliminates an entire class of supply chain attacks.
With identity, access control, pod security, and image verification in place, the next layer to harden is the network perimeter itself. The following section covers VPC-native cluster configuration, Network Policies that enforce service-to-service communication rules, and load balancer hardening for traffic entering the cluster from outside.
GKE Networking: VPC-Native Clusters, Network Policies, and Load Balancing
GKE networking decisions made at cluster creation time are permanent. VPC-native mode, private endpoints, and network policy enforcement are foundational choices that determine your security posture for the cluster’s lifetime. Get them right upfront.
VPC-Native Clusters Are the Only Production Choice
VPC-native clusters use alias IP ranges, assigning each pod an IP address directly from the VPC subnet rather than routing traffic through node IP masquerading. This eliminates the need for custom routes, enables direct pod-to-pod communication across nodes without overlay overhead, and unlocks native integration with Cloud Load Balancing, Cloud NAT, and VPC firewall rules.
Routes-based clusters require GKE to create one static route per node in your VPC, which collides with the per-network route quota (250 routes by default) and creates operational drag as clusters scale. Alias IP clusters have no such constraint. Enable VPC-native mode at cluster creation with --enable-ip-alias; it cannot be retrofitted.
Private Clusters: Lock Down Node and Control Plane Access
Private clusters assign nodes only internal IP addresses. The control plane is accessible either through a private endpoint only, or through both private and public endpoints with authorized networks. For production, restrict external access to the control plane entirely:
```hcl
resource "google_container_cluster" "prod" {
  name     = "prod-cluster"
  location = "us-central1"

  private_cluster_config {
    enable_private_nodes    = true
    enable_private_endpoint = true
    master_ipv4_cidr_block  = "172.16.0.32/28"
  }

  master_authorized_networks_config {
    cidr_blocks {
      cidr_block   = "10.0.0.0/8"
      display_name = "internal-vpn"
    }
  }

  ip_allocation_policy {
    cluster_secondary_range_name  = "pods"
    services_secondary_range_name = "services"
  }
}
```

Private nodes have no public IPs, so outbound internet access requires Cloud NAT. Attach a Cloud NAT gateway to the node subnet and GKE handles the rest transparently — nodes pull container images, reach external APIs, and download updates without any node holding a routable public address.
Zero-Trust Pod Communication with NetworkPolicy
By default, all pods in a cluster can communicate freely. Enforce explicit allow-lists using Kubernetes NetworkPolicy, backed by the cluster’s policy engine — Dataplane V2 clusters enforce policies in eBPF, while legacy clusters rely on Calico:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: payments
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-to-postgres
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: postgres
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api-server
      ports:
        - protocol: TCP
          port: 5432
```

Start with a default-deny-all policy in every namespace and add explicit allow rules. This forces every team to declare their communication requirements, making blast radius from a compromised pod deterministic.
💡 Pro Tip: Enable Dataplane V2 (`--enable-dataplane-v2`) at cluster creation. It replaces kube-proxy with eBPF, enforces NetworkPolicy in-kernel with lower latency, and provides built-in network observability through the GKE Network Policy Logging feature.
Ingress vs. Gateway API vs. Cloud Load Balancing
For external traffic ingress, the decision tree is straightforward. Use the Gateway API (gateway.networking.k8s.io) for new workloads — it supports multi-cluster routing, traffic splitting, and header-based routing without annotation sprawl. The GKE Gateway controller provisions external Application Load Balancers natively.
Reserve the classic Ingress resource for existing workloads already using the GKE Ingress controller. Use Service with type: LoadBalancer directly only for TCP/UDP passthrough where HTTP semantics are unnecessary — internal gRPC backends and database proxies are typical cases.
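For the Gateway API path, the two-resource split looks like this. A minimal sketch — the gateway name, route, and backend Service are illustrative, and the GatewayClass assumes a GKE-managed external Application Load Balancer:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: external-gateway      # illustrative
  namespace: payments
spec:
  gatewayClassName: gke-l7-global-external-managed  # GKE-managed external ALB
  listeners:
    - name: http
      protocol: HTTP
      port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: api-route             # illustrative
  namespace: payments
spec:
  parentRefs:
    - name: external-gateway
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /api
      backendRefs:
        - name: api-server    # illustrative Service name
          port: 8080
```

The split matters organizationally: a platform team owns the Gateway, while application teams attach HTTPRoutes to it without touching load balancer configuration.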
With networking hardened and traffic routing correctly scoped, the focus shifts to making cluster behavior visible — which requires an observability stack that accounts for GKE-specific telemetry sources.
Observability: Logging, Monitoring, and Alerting That Actually Works in GKE
GKE ships with Cloud Logging and Cloud Monitoring enabled by default, but “enabled” is doing a lot of heavy lifting. What you get out of the box is node-level system logs and basic container stdout/stderr. What you need in production is structured, queryable application logs, workload-level metrics, and alerts that surface degradation before it becomes a page — not after.
What GKE Exports by Default (and What It Doesn’t)
By default, GKE exports system component logs (kubelet, kube-apiserver, kube-scheduler) and unstructured container output to Cloud Logging. Cloud Monitoring receives node-level metrics: CPU, memory, disk, and network. That’s enough to know a node is unhealthy. It’s not enough to know why a specific deployment is degrading.
The two gaps you must close explicitly:
- Application structured logging — your containers must write JSON to stdout, and your log router must be configured to parse it
- Workload metrics — GKE doesn’t scrape your application’s Prometheus endpoints unless you enable Managed Service for Prometheus (GMP)
Structured Logging from Application Containers
Cloud Logging automatically promotes JSON log fields to indexed, queryable fields if your application writes valid JSON to stdout. This means a log line like:
```json
{
  "severity": "ERROR",
  "message": "upstream timeout",
  "traceId": "abc123def456",
  "httpRequest": {
    "requestMethod": "GET",
    "requestUrl": "/api/orders",
    "status": 504,
    "latencyMs": 3012
  }
}
```

becomes filterable in Log Explorer with jsonPayload.httpRequest.status=504. Without this, you’re grepping unstructured text — not viable at scale.
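In application code, emitting these lines is a few lines of plumbing. A minimal sketch in Python — the helper name and field values are illustrative:

```python
import json
import sys

def log(severity: str, message: str, **fields) -> str:
    """Serialize one structured log entry and write it to stdout.

    Cloud Logging promotes recognized top-level keys (severity,
    httpRequest, ...) to indexed fields when the line is valid JSON.
    """
    entry = {"severity": severity, "message": message, **fields}
    line = json.dumps(entry)
    print(line, file=sys.stdout, flush=True)
    return line

# Emit the error from the example above.
line = log(
    "ERROR",
    "upstream timeout",
    traceId="abc123def456",
    httpRequest={"requestMethod": "GET", "requestUrl": "/api/orders",
                 "status": 504, "latencyMs": 3012},
)
```

The only hard requirements are one JSON object per line and valid JSON; everything else is convention you enforce in a shared logging helper.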
GKE Managed Prometheus for Workload Metrics
Enable Managed Service for Prometheus at the cluster level via Terraform:
```hcl
resource "google_container_cluster" "primary" {
  name     = "prod-cluster"
  location = "us-central1"

  monitoring_config {
    enable_components = ["SYSTEM_COMPONENTS", "APISERVER", "SCHEDULER", "CONTROLLER_MANAGER"]
    managed_prometheus {
      enabled = true
    }
  }

  logging_config {
    enable_components = ["SYSTEM_COMPONENTS", "WORKLOADS"]
  }
}
```

Once GMP is active, instrument your workloads with a PodMonitoring resource to scrape application metrics without running a self-managed Prometheus:
```yaml
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: orders-service-monitor
  namespace: production
spec:
  selector:
    matchLabels:
      app: orders-service
  endpoints:
    - port: metrics
      interval: 30s
```

GMP handles collection, storage, and federation — no Thanos, no PVCs, no retention management.
Alerting on the Right Signals
Alerting on CPU at 80% is noise. Alert on conditions that indicate actual production risk:
| Signal | Metric / Log Filter | Threshold |
|---|---|---|
| Node not-ready | kubernetes.io/node/status | >= 1 node for 5m |
| Pod crash loop | kubernetes.io/container/restart_count | delta > 5 in 10m |
| OOM kill | Log filter: "OOMKilled" in jsonPayload.reason | Any occurrence |
| Quota exhaustion | compute.googleapis.com/quota/exceeded | Any occurrence |
Define these as Cloud Monitoring alerting policies in Terraform to ensure they’re version-controlled and deployed consistently across environments — not manually clicked into existence in the console.
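The crash-loop row, for example, can be sketched as a Terraform alerting policy. Display names and thresholds are illustrative, and the filter assumes GKE system metrics are enabled on the cluster:

```hcl
# Sketch of the crash-loop alert as code; names and values are illustrative.
resource "google_monitoring_alert_policy" "pod_crash_loop" {
  display_name = "Pod crash loop"
  combiner     = "OR"

  conditions {
    display_name = "Container restarts spiking"
    condition_threshold {
      filter          = "resource.type = \"k8s_container\" AND metric.type = \"kubernetes.io/container/restart_count\""
      comparison      = "COMPARISON_GT"
      threshold_value = 5
      duration        = "600s"

      aggregations {
        alignment_period   = "600s"
        per_series_aligner = "ALIGN_DELTA" # restart deltas, not the raw counter
      }
    }
  }
}
```

Attach notification channels by reference in the same module so paging destinations are reviewed alongside the thresholds themselves.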
💡 Pro Tip: Add `kubernetes_cluster` and `namespace` as grouping labels in your alert policies. Without them, a single noisy namespace can suppress alerts from the rest of the cluster during an incident.
With structured logs flowing, Prometheus metrics scraped, and alert policies codified, you have a production-grade observability layer that runs entirely on GCP-native infrastructure. The next section addresses the operational side of keeping that cluster healthy long-term: upgrade strategies, autoscaling configuration, and cost governance to prevent runaway spend as the cluster scales.
Operational Readiness: Upgrades, Autoscaling, and Cost Governance
A GKE cluster that is secure and well-networked on day one degrades without a disciplined operational model. Upgrades slip, costs sprawl, and scaling decisions made under load pressure introduce instability. This section defines the framework that keeps a production cluster healthy over its lifetime.
Release Channels: Choosing Your Upgrade Cadence
GKE’s release channels — Rapid, Regular, and Stable — determine when your control plane and node pools receive Kubernetes version updates. For most production workloads, Regular is the right default: it delivers versions that have already been validated in Rapid for several weeks, balancing feature availability against risk. Reserve Stable for regulated environments or clusters running workloads where any unexpected change carries significant operational cost.
Enroll node pools in the same release channel as the control plane, and enable maintenance windows to confine automatic upgrades to low-traffic periods. Letting GKE manage upgrades automatically is almost always preferable to manual version pinning — deferred upgrades accumulate security debt faster than most teams realize.
Autoscaling Without Thrashing
The cluster autoscaler (CA) and Vertical Pod Autoscaler (VPA) are complementary but require coordination. CA scales nodes based on unschedulable pods; VPA adjusts pod resource requests based on observed usage. The failure mode is a feedback loop: VPA raises a pod’s CPU request, CA adds a node, VPA raises requests further, and the cycle continues.
Prevent this by running VPA in Recommend mode initially. Audit the recommendations over one to two weeks before switching to Auto mode. Set explicit minAllowed and maxAllowed bounds in every VPA object, and configure CA’s scale-down-utilization-threshold conservatively (0.6 is a reasonable starting point) to avoid premature scale-in that triggers immediate scale-out again.
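In manifest form, recommend-only operation corresponds to `updateMode: "Off"` with explicit bounds. A minimal sketch — the target Deployment and the bounds are illustrative:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: orders-service-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: orders-service   # illustrative target
  updatePolicy:
    updateMode: "Off"      # recommend-only while you audit the suggestions
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: "100m"
          memory: "128Mi"
        maxAllowed:
          cpu: "2"
          memory: "4Gi"
```

Flipping `updateMode` to `"Auto"` later changes only one line, so the audit period costs nothing in rework.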
PodDisruptionBudgets as a Safety Gate
Node upgrades and scale-down events both trigger pod evictions. Without PodDisruptionBudgets (PDBs), a rolling node drain can briefly take an entire deployment offline. Define a PDB for every stateless service with more than one replica, enforcing minAvailable of at least one. For stateful workloads, set maxUnavailable: 0 to serialize evictions entirely.
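A PDB for such a stateless service is a short manifest — names and labels here are illustrative:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb     # illustrative
  namespace: payments
spec:
  minAvailable: 1          # never drain below one live replica
  selector:
    matchLabels:
      app: api-server      # illustrative label
```

During a node drain, the eviction API consults this budget and blocks evictions that would violate it, forcing the drain to wait for replacement pods to become ready.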
💡 Pro Tip: PDBs protect against voluntary disruptions only. Pair them with appropriate pod anti-affinity rules to ensure replicas are spread across nodes — otherwise a PDB with `minAvailable: 1` on a two-replica deployment where both pods sit on the same node provides no real protection.
Cost Attribution and Namespace Quotas
Resource labels are the foundation of GKE cost attribution. Apply team, env, and cost-center labels consistently to node pools — GKE Cost Allocation surfaces these in the Billing console at the namespace and label level. Namespace-level ResourceQuotas set hard ceilings on CPU and memory consumption, preventing a single team’s runaway workload from starving cluster neighbors.
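A namespace quota of this kind is a single manifest — the ceilings below are illustrative placeholders to size per team:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: payments
spec:
  hard:
    requests.cpu: "20"       # illustrative ceilings; size per team
    requests.memory: 64Gi
    limits.cpu: "40"
    limits.memory: 128Gi
```

Note that once a quota covers CPU or memory, every pod in the namespace must declare requests for those resources or be rejected at admission — which conveniently enforces the requests-and-limits policy from the architecture section.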
Review quota utilization monthly. A namespace consistently hitting 90% of its quota is a signal to right-size the allocation, not an automatic trigger to increase it — investigate actual resource usage via VPA recommendations first.
Operational Health Review Checklist
Run this review weekly against every production cluster:
- Control plane and node pool versions within one minor version of the release channel tip
- No nodes in `NotReady` or `SchedulingDisabled` state for more than 15 minutes
- CA activity log shows no repeated scale-up/scale-down cycles within the same hour
- PDB violations in the past seven days (check via `kubectl get events`)
- Namespace quota utilization above 80% flagged for review
- Committed Use Discounts coverage reviewed against current node pool sizing
Operational readiness is not a one-time checklist — it is a recurring practice embedded in the team’s rhythm.
Together, security hardening, networking, and operational discipline form a coherent production posture rather than three independent workstreams. The clusters that hold up under real conditions are the ones where these layers reinforce each other: IaC ensures security flags are consistently set, network policies constrain the blast radius of a compromised workload, and observability surfaces drift before it becomes an incident.
Key Takeaways
- Enable Workload Identity on every GKE cluster from day one—retrofitting it onto a running cluster with existing service account keys is painful and error-prone
- Provision GKE clusters exclusively through Terraform with Cloud Build CI—manual cluster creation is a one-way door to configuration drift and undocumented state
- Configure NetworkPolicy to default-deny all pod-to-pod traffic per namespace and explicitly allow only what’s required—open-by-default is the root cause of most lateral movement in compromised clusters
- Subscribe to the Regular release channel and test upgrades in a staging cluster before production—never let your production cluster fall more than two minor versions behind
- Set PodDisruptionBudgets on all stateful and latency-sensitive workloads before enabling cluster autoscaler to prevent autoscaler-driven disruptions from cascading into outages