
Building Kubernetes Operators: A Pattern Language for Distributed State Machines


You’ve written your first Kubernetes operator. It works in development, but in production it corrupts data during network partitions and creates duplicate resources during controller restarts. The operator pattern isn’t just about custom resources—it’s about building correct distributed state machines.

The problem is subtle. Your reconciliation loop watches for changes, compares observed state to desired state, and issues API calls to converge them. It passes every test. Then production hits: the API server becomes briefly unreachable, your controller crashes mid-reconciliation, or two replicas of your operator process the same event simultaneously. Suddenly you’re debugging why a single database backup request spawned three PersistentVolumes, or why a scaling operation left your StatefulSet in an inconsistent state that requires manual intervention.

These failures don’t stem from bugs in your business logic. They emerge from treating the operator as a simple control loop rather than what it actually is: a distributed state machine operating in an environment with partial failures, concurrent modifications, and no global synchronization. The reconciliation function isn’t just a callback—it’s a state transition function that must be idempotent, handle arbitrary starting states, and converge correctly even when the system underneath it lies.

The Kubernetes API provides specific guarantees around resource versioning, watch consistency, and optimistic concurrency. Your operator’s correctness depends entirely on how you design state transitions around these guarantees. Miss one assumption about ordering or retries, and you’ve built a distributed system bug that only manifests under load.

The Operator as a Distributed State Machine

When you write a Kubernetes operator, you’re not just building an API controller that watches resources and makes changes. You’re implementing a distributed state machine that must handle partial failures, concurrent modifications, and eventual consistency—all while maintaining correctness guarantees that would make a database engineer proud.

Visual: Operator state machine with reconciliation loops and failure paths

The reconciliation loop is your state transition function. It receives the current observed state (what actually exists in the cluster and external systems) and the desired state (what the custom resource declares), then executes transitions to converge them. But unlike a textbook state machine with clean inputs and deterministic outputs, your operator runs in an environment where reads can be stale, writes can fail halfway through, and the same transition might execute multiple times.

This matters because the failure modes that plague production operators are almost never logic errors in happy-path code. They’re the bugs that emerge from implicit assumptions about state machine properties. An operator that works perfectly in development can exhibit data corruption in production because the developer assumed transitions would execute exactly once, or that observed state would always reflect the most recent write, or that external API calls would either fully succeed or fully fail.

State Guarantees You Don’t Actually Have

Kubernetes provides eventual consistency, not strong consistency. When your reconciliation loop reads a Pod’s status, that status might be seconds or even minutes old due to caching and watch delays. When you update a resource, other controllers might see the old version for a brief window. This means your state transitions must tolerate operating on stale data without corrupting system state.

The watch-based architecture compounds this. Your operator doesn’t poll for changes—it receives events. But events can arrive out of order, get delayed arbitrarily, or in rare cases, get lost entirely. The reconciliation pattern handles this through eventual convergence: as long as you eventually receive an event (or a periodic resync triggers), you’ll eventually reach the desired state. But “eventually” can mask serious bugs if your transition function assumes sequential, ordered execution.

Designing for Distributed Reality

The most reliable operators treat every reconciliation as if it’s the first time they’ve seen a resource, regardless of what they think they know from previous runs. They don’t maintain in-memory state about what actions they’ve taken. Instead, they examine the current observed state, compare it to the desired state, and execute whatever transitions are necessary—even if those transitions were “already done” in a previous reconciliation that partially failed.

This approach only works if your transitions are idempotent, which brings us to the core challenge of operator development: designing reconciliation logic that produces correct results whether it runs once or a thousand times, regardless of where previous attempts failed.

Designing Idempotent Reconciliation Logic

The reconciliation loop is the heart of any Kubernetes operator, and its correctness depends entirely on idempotency. A reconcile function that produces different results when called multiple times with the same input will create unpredictable behavior, resource drift, and cascading failures. The controller-runtime framework doesn’t guarantee exactly-once execution—it guarantees at-least-once. Your reconciler will be called repeatedly, sometimes for the same event, and must handle this gracefully.

Understanding why idempotency matters requires understanding how Kubernetes controllers actually run. Events arrive out of order. Network partitions cause retries. Controller restarts replay the watch stream. A single user change might trigger multiple reconciliations due to status updates, finalizer modifications, or owner reference changes. If your reconciler isn’t idempotent, these normal operating conditions become error conditions. You might create duplicate cloud resources, send duplicate notifications, or corrupt state by applying the same mutation twice.

The Compare-and-Reconcile Pattern

The foundational pattern for idempotent reconciliation is compare-and-reconcile: read the current state, compare it to the desired state, and take action only when they diverge. Never blindly apply changes.

controllers/database_controller.go
func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var db apiv1.Database
	if err := r.Get(ctx, req.NamespacedName, &db); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// Read actual state from external system
	actual, err := r.DatabaseClient.GetInstance(ctx, db.Spec.InstanceID)
	if err != nil {
		return ctrl.Result{}, fmt.Errorf("failed to read instance: %w", err)
	}
	// Compare desired vs actual
	if actual.Storage != db.Spec.StorageGB {
		// Only modify if divergent
		if err := r.DatabaseClient.ResizeStorage(ctx, db.Spec.InstanceID, db.Spec.StorageGB); err != nil {
			return ctrl.Result{}, err
		}
	}
	if actual.Version != db.Spec.Version {
		if err := r.DatabaseClient.UpgradeVersion(ctx, db.Spec.InstanceID, db.Spec.Version); err != nil {
			return ctrl.Result{}, err
		}
	}
	return ctrl.Result{}, nil
}

This pattern ensures that calling Reconcile multiple times with an already-correct state is a no-op. The comparison step is critical—without it, you’d attempt to resize storage or upgrade versions that are already at the target state, potentially triggering errors or unnecessary work in your external systems.

The comparison logic must be precise. Avoid string comparisons for numeric values. Normalize inputs before comparing—trim whitespace, convert to canonical formats, handle default values consistently. If your external system returns "true" as a string but your spec uses a boolean, the comparison will always show divergence even when the states match semantically. Consider using deep equality checks with normalized representations, or implement custom comparison functions that understand your domain’s semantics.
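To make the normalization point concrete, here is a small self-contained sketch. The helper names (`normalizeBool`, `storageDiverged`) are hypothetical, not part of any client library; the point is that comparisons parse external values into canonical types before checking divergence.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// normalizeBool converts the string representations an external API might
// return ("true", "True", " 1 ") into a canonical bool, so comparison against
// the spec's boolean field is semantic rather than textual.
func normalizeBool(raw string) (bool, error) {
	return strconv.ParseBool(strings.ToLower(strings.TrimSpace(raw)))
}

// storageDiverged parses the externally reported size before comparing, so
// "100" and 100 compare equal instead of always appearing divergent.
func storageDiverged(actualRaw string, desiredGB int) (bool, error) {
	actual, err := strconv.Atoi(strings.TrimSpace(actualRaw))
	if err != nil {
		return false, fmt.Errorf("cannot parse actual storage %q: %w", actualRaw, err)
	}
	return actual != desiredGB, nil
}

func main() {
	b, _ := normalizeBool(" True ")
	fmt.Println(b) // true

	diverged, _ := storageDiverged("100", 100)
	fmt.Println(diverged) // false: semantically equal, no action needed
}
```

Without this normalization step, a reconciler can loop forever "fixing" a value that is already correct.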

Handling Partial Failures and Resumability

Real-world reconciliation involves multiple steps, and any step can fail. The reconciler must be resumable: if it fails halfway through, the next invocation should continue from where it left off, not restart from the beginning. This property is essential for operations that take minutes or hours—provisioning cloud infrastructure, waiting for DNS propagation, or running database migrations.

controllers/database_controller.go
func (r *DatabaseReconciler) reconcileDatabase(ctx context.Context, db *apiv1.Database) error {
	// Step 1: Ensure instance exists
	if db.Status.InstanceID == "" {
		instanceID, err := r.DatabaseClient.CreateInstance(ctx, db.Spec)
		if err != nil {
			return fmt.Errorf("failed to create instance: %w", err)
		}
		db.Status.InstanceID = instanceID
		db.Status.Phase = "Provisioning"
		return r.Status().Update(ctx, db) // Persist progress
	}
	// Step 2: Wait for instance to be ready
	ready, err := r.DatabaseClient.IsReady(ctx, db.Status.InstanceID)
	if err != nil {
		return err
	}
	if !ready {
		return fmt.Errorf("instance not ready, requeuing")
	}
	// Step 3: Configure backups
	if !db.Status.BackupsConfigured {
		if err := r.DatabaseClient.EnableBackups(ctx, db.Status.InstanceID, db.Spec.BackupSchedule); err != nil {
			return err
		}
		db.Status.BackupsConfigured = true
		return r.Status().Update(ctx, db)
	}
	db.Status.Phase = "Ready"
	return r.Status().Update(ctx, db)
}

By persisting progress in the status subresource after each successful step, the reconciler can skip already-completed work on retry. If EnableBackups fails, the next reconciliation will start from that step rather than attempting to recreate an instance that already exists. This checkpoint-and-resume pattern transforms a fragile multi-step process into a reliable state machine.

Each checkpoint requires a status update, which triggers another reconciliation. This is intentional and correct. The subsequent reconciliation will see the updated status, skip completed steps, and proceed to the next. The cost of extra reconciliations is negligible compared to the robustness gained.

💡 Pro Tip: Use status conditions to track granular progress. A single Phase field doesn’t capture enough detail for complex multi-step operations. Add typed conditions like InstanceReady, BackupsConfigured, and MonitoringEnabled for better observability and more precise resumption logic.

Observed Generation Tracking

Kubernetes updates the metadata.generation field whenever the spec changes. Your operator should track which generation it has successfully reconciled to avoid reprocessing unchanged resources. Without this mechanism, every reconciliation—triggered by status updates, periodic resyncs, or controller restarts—would recheck and potentially modify external state, even when nothing has changed.

controllers/database_controller.go
func (r *DatabaseReconciler) needsReconciliation(db *apiv1.Database) bool {
	return db.Status.ObservedGeneration != db.Generation
}

func (r *DatabaseReconciler) markReconciled(ctx context.Context, db *apiv1.Database) error {
	db.Status.ObservedGeneration = db.Generation
	return r.Status().Update(ctx, db)
}

When observedGeneration matches generation, the resource is in sync. This prevents wasteful reconciliations triggered by status-only updates or controller restarts. Combined with idempotent operations, observed generation tracking ensures your operator scales efficiently even with thousands of resources.

However, don’t skip reconciliation entirely when generations match. External systems can drift independently—a cloud provider might resize your database instance, a user might manually modify configuration, or another controller might make conflicting changes. Use observed generation to skip expensive spec-driven operations, but still perform periodic drift detection by comparing actual state to desired state. A hybrid approach—fast-path when generations match, full reconciliation on a slower cadence—provides both efficiency and correctness.

The patterns in this section form the foundation for reliable operator behavior. With idempotent reconciliation established, the next challenge is managing state that lives outside the Kubernetes API—external databases, cloud resources, and third-party services that don’t natively support Kubernetes-style declarative management.

Managing External State and Side Effects

Most production operators manage resources that exist outside Kubernetes: cloud databases, DNS records, storage buckets, certificates. This creates a fundamental challenge: Kubernetes controllers operate on eventual consistency, but external systems often require explicit cleanup and may fail partially. Without careful design, you’ll leak resources, orphan cloud infrastructure, or corrupt state during failures.

The External State Problem

When your operator provisions an RDS instance or creates a cloud load balancer, Kubernetes becomes a control plane, not the source of truth. The external resource has its own lifecycle, billing implications, and failure modes. If your pod crashes mid-creation, you need to detect the partially-created resource and either complete or clean it up—not create a duplicate.

The solution is treating external state as part of your state machine. Store enough information in the custom resource’s status to correlate Kubernetes objects with external resources:

api/v1/database_types.go
type DatabaseStatus struct {
	// ProviderID uniquely identifies the external resource
	ProviderID string `json:"providerID,omitempty"`
	// Phase tracks the reconciliation state
	Phase DatabasePhase `json:"phase"`
	// Endpoint is populated once the database is ready
	Endpoint string `json:"endpoint,omitempty"`

	Conditions []metav1.Condition `json:"conditions,omitempty"`
}

Your reconciler checks the ProviderID first. If it exists, query the external system to determine the actual state. If it doesn’t exist, you’re either creating a new resource or recovering from a crash before the ID was persisted. This idempotency token pattern is critical: cloud providers often support client-provided request IDs that prevent duplicate resource creation even when network failures force retries.
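One way to build such a token is to derive it deterministically from the resource UID and generation, so every retry of the same logical create sends the same ID. The function name and hashing scheme below are illustrative, not a provider API.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// clientRequestID derives a deterministic idempotency token from the custom
// resource's UID and generation. Retries of the same logical create send the
// same token, so a provider that honors client request IDs will not create a
// duplicate resource even when a network failure hides the first response.
func clientRequestID(uid string, generation int64) string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s/%d", uid, generation)))
	return fmt.Sprintf("%x", sum[:16]) // 32 hex chars, stable across retries
}

func main() {
	a := clientRequestID("3f2c6d9a-example-uid", 4)
	b := clientRequestID("3f2c6d9a-example-uid", 4)
	c := clientRequestID("3f2c6d9a-example-uid", 5)
	fmt.Println(a == b) // true: same inputs, same token
	fmt.Println(a == c) // false: a new generation gets a new token
}
```

Keying on generation means a spec change legitimately produces a new token, while crash-and-retry of the same spec does not.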

Store additional metadata when dealing with eventually-consistent external APIs. For example, AWS resources might not be immediately queryable after creation. Track the creation timestamp and expected propagation delay to distinguish between “not created yet” and “created but not visible.” This prevents false negatives that trigger duplicate creation attempts.

Finalizers and Deletion Guarantees

Kubernetes deletes objects immediately unless they have finalizers. For operators managing external state, finalizers are mandatory—they’re your guarantee that cleanup code runs before the object disappears:

controllers/database_controller.go
const databaseFinalizerName = "database.example.com/finalizer"

func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	db := &examplev1.Database{}
	if err := r.Get(ctx, req.NamespacedName, db); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// Handle deletion
	if !db.DeletionTimestamp.IsZero() {
		if controllerutil.ContainsFinalizer(db, databaseFinalizerName) {
			if err := r.deleteExternalResources(ctx, db); err != nil {
				return ctrl.Result{}, fmt.Errorf("failed to delete external resources: %w", err)
			}
			controllerutil.RemoveFinalizer(db, databaseFinalizerName)
			if err := r.Update(ctx, db); err != nil {
				return ctrl.Result{}, err
			}
		}
		return ctrl.Result{}, nil
	}
	// Add finalizer if it doesn't exist
	if !controllerutil.ContainsFinalizer(db, databaseFinalizerName) {
		controllerutil.AddFinalizer(db, databaseFinalizerName)
		if err := r.Update(ctx, db); err != nil {
			return ctrl.Result{}, err
		}
	}
	// Normal reconciliation continues here
	return r.reconcileDatabase(ctx, db)
}

The finalizer ensures deleteExternalResources runs even if the operator crashes—Kubernetes will retry the reconciliation until the finalizer is removed. Never skip finalizers for resources with external state; the cost of orphaned infrastructure far exceeds the complexity.

Make your deletion logic idempotent and defensive. External resources may already be deleted (manual cleanup, separate automation, billing lapses). Check existence before attempting deletion and treat “resource not found” as success, not failure. Similarly, handle partial deletion gracefully: if you manage both a database instance and its backup schedule, delete them in dependency order and track which steps completed in status conditions.
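The "not found is success" rule can be sketched without any cloud SDK. Here `errNotFound` and the map-backed `deleteInstance` stand in for a real provider client and its typed errors; real code would check the provider's error type instead.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for the provider SDK's "resource does not exist" error.
var errNotFound = errors.New("instance not found")

// deleteInstance simulates an external delete that may report the resource as
// already gone (manual cleanup, separate automation, an earlier partial delete).
func deleteInstance(id string, existing map[string]bool) error {
	if !existing[id] {
		return errNotFound
	}
	delete(existing, id)
	return nil
}

// ensureDeleted is the defensive wrapper: "already deleted" counts as success,
// so a finalizer retry loop terminates instead of failing forever.
func ensureDeleted(id string, existing map[string]bool) error {
	if err := deleteInstance(id, existing); err != nil {
		if errors.Is(err, errNotFound) {
			return nil // the desired state (resource gone) already holds
		}
		return err
	}
	return nil
}

func main() {
	instances := map[string]bool{"db-1": true}
	fmt.Println(ensureDeleted("db-1", instances)) // <nil>: deleted now
	fmt.Println(ensureDeleted("db-1", instances)) // <nil>: already gone, still success
}
```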

Two-Phase Operations

External APIs are rarely transactional. Creating a database instance might succeed while configuring its security group fails. Implement two-phase operations by tracking intermediate states:

controllers/database_controller.go
func (r *DatabaseReconciler) reconcileDatabase(ctx context.Context, db *examplev1.Database) (ctrl.Result, error) {
	switch db.Status.Phase {
	case "":
		// Phase 1: Create the database instance
		providerID, err := r.provisionInstance(ctx, db)
		if err != nil {
			return ctrl.Result{}, err
		}
		db.Status.ProviderID = providerID
		db.Status.Phase = "Provisioning"
		return ctrl.Result{Requeue: true}, r.Status().Update(ctx, db)
	case "Provisioning":
		// Phase 2: Wait for ready and configure networking
		ready, endpoint, err := r.checkInstanceReady(ctx, db.Status.ProviderID)
		if err != nil {
			return ctrl.Result{}, err
		}
		if !ready {
			return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
		}
		db.Status.Endpoint = endpoint
		db.Status.Phase = "Ready"
		return ctrl.Result{}, r.Status().Update(ctx, db)
	case "Ready":
		// Continuously verify the external resource exists
		return r.verifyExternalState(ctx, db)
	}
	return ctrl.Result{}, nil
}

Each phase persists its progress before moving forward. If the operator restarts, it resumes from the last successful state instead of duplicating work. This state machine approach naturally handles transient failures: a network timeout during phase 2 simply causes a retry without losing the ProviderID from phase 1.

For long-running operations like database restores or certificate issuance, use conditions to communicate progress. Set a condition like type: DatabaseProvisioning, status: True, reason: WaitingForEndpoint so users understand why their resource isn’t ready yet. Update the condition’s message with details like “Instance i-abc123 is 60% through initial backup” when the external API provides progress information.

💡 Pro Tip: Use conditions to track individual steps within phases. This gives users visibility into long-running operations and helps debugging when automation stalls.

Ownership References for Cascading Cleanup

When your operator creates supporting Kubernetes resources (Secrets, ConfigMaps, Services), use owner references to enable automatic garbage collection. This prevents orphaned resources when the parent is deleted:

controllers/database_controller.go
secret := &corev1.Secret{
	ObjectMeta: metav1.ObjectMeta{
		Name:      db.Name + "-credentials",
		Namespace: db.Namespace,
		OwnerReferences: []metav1.OwnerReference{
			*metav1.NewControllerRef(db, examplev1.GroupVersion.WithKind("Database")),
		},
	},
	Data: map[string][]byte{
		"password": generatedPassword,
	},
}

The OwnerReferences field creates a dependency chain. When the Database is deleted, Kubernetes automatically removes owned resources. This pattern only works for resources in the same namespace—for cluster-scoped resources or cross-namespace dependencies, implement explicit cleanup in your finalizer.

Consider the timing of ownership assignment carefully. If you create a Secret before the parent resource’s finalizer is added, there’s a brief window where deleting the parent could orphan the Secret. Either add the finalizer first (in the same reconciliation that creates owned resources) or make owned resource cleanup explicit in the finalizer as a safety net.

With proper finalizers, phase tracking, and ownership references, your operator handles external state reliably. The next challenge is surviving the inevitable failures: network partitions, cloud API rate limits, and transient errors that require sophisticated retry strategies.

Failure Modes and Recovery Strategies

Operators run in inherently unreliable environments. Network partitions occur, pods restart mid-reconciliation, and external APIs fail. The difference between a production-ready operator and a toy example lies in how gracefully it handles these failure modes.

Network Partitions and Split-Brain Prevention

When an operator loses connectivity to the API server or an external system, it must avoid creating inconsistent state. The controller-runtime’s lease-based leader election prevents multiple controller instances from reconciling simultaneously, but this doesn’t protect against stale reads or writes that complete after a partition heals.

Always use resource versions and optimistic concurrency control. Every status update should include the observed generation and use merge patches to avoid clobbering concurrent updates:

controllers/database_controller.go
func (r *DatabaseReconciler) updateStatus(ctx context.Context, db *v1alpha1.Database, status v1alpha1.DatabaseStatus) error {
	patch := client.MergeFrom(db.DeepCopy())
	db.Status = status
	db.Status.ObservedGeneration = db.Generation
	if err := r.Status().Patch(ctx, db, patch); err != nil {
		if apierrors.IsConflict(err) {
			// Another reconciliation updated status; requeue to observe latest state
			return fmt.Errorf("status update conflict (expected generation %d): %w", db.Generation, err)
		}
		return err
	}
	return nil
}

The ObservedGeneration field serves as a fencing token. Clients can detect stale status by comparing it against the resource’s current generation. If your operator makes idempotent external API calls, include unique operation identifiers derived from the resource UID and generation to prevent duplicate execution when retrying after a partition. For non-idempotent operations like sending notifications, store completion markers in the status before executing the action, then check these markers on every reconciliation to avoid repeating side effects.
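The completion-marker idea can be shown in miniature. The `Status` struct and `sendOnce` helper below are hypothetical; in a real operator the marker lives in the custom resource's status and must be persisted via a status update around the side effect.

```go
package main

import "fmt"

// Status mirrors the idea of storing completion markers in the custom
// resource's status; the field name is illustrative.
type Status struct {
	NotificationsSent map[string]bool
}

// sendOnce checks a completion marker before executing a non-idempotent side
// effect, so a reconciliation that repeats after a partition does not re-send
// it. Whether you persist the marker before or after the send determines
// at-most-once vs at-least-once behavior.
func sendOnce(st *Status, key string, send func() error) error {
	if st.NotificationsSent[key] {
		return nil // marker present: the side effect already happened
	}
	if err := send(); err != nil {
		return err
	}
	if st.NotificationsSent == nil {
		st.NotificationsSent = map[string]bool{}
	}
	st.NotificationsSent[key] = true
	return nil
}

func main() {
	st := &Status{}
	sent := 0
	notify := func() error { sent++; return nil }
	_ = sendOnce(st, "backup-complete/gen-4", notify)
	_ = sendOnce(st, "backup-complete/gen-4", notify) // no-op on retry
	fmt.Println(sent) // 1
}
```

Keying the marker on the resource generation (as in the string above) lets a later spec change legitimately trigger a fresh notification.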

Controller Restarts and Operation Recovery

When your operator pod crashes during a long-running operation like a database migration, the new instance must pick up where the previous one left off. Store operation state in custom resource status fields, not in-memory:

controllers/migration_controller.go
func (r *MigrationReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	migration := &v1alpha1.Migration{}
	if err := r.Get(ctx, req.NamespacedName, migration); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// Resume from last checkpoint stored in status
	switch migration.Status.Phase {
	case "":
		// Start new migration
		return r.initiateMigration(ctx, migration)
	case "SchemaValidated":
		// Resume at data copy phase
		return r.copyData(ctx, migration)
	case "DataCopied":
		// Resume at verification phase
		return r.verifyMigration(ctx, migration)
	case "Completed":
		return ctrl.Result{}, nil
	default:
		return ctrl.Result{}, fmt.Errorf("unknown phase: %s", migration.Status.Phase)
	}
}

Each phase transition updates the status before proceeding, ensuring the operator can resume at the correct step after any restart. For operations involving external systems, verify the actual state rather than trusting your recorded phase—a previous reconciliation might have partially completed before the crash. Query the external system to confirm the migration phase matches what you expect, and handle cases where reality has diverged from your status record.

Rate Limiting and Exponential Backoff

Failing reconciliations consume cluster resources and can overwhelm external APIs. The default controller-runtime workqueue implements exponential backoff, but you should customize it for your failure patterns:

main.go
func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		log.Fatal(err)
	}
	if err := ctrl.NewControllerManagedBy(mgr).
		For(&v1alpha1.Database{}).
		WithOptions(controller.Options{
			RateLimiter: workqueue.NewItemExponentialFailureRateLimiter(
				5*time.Second,   // Base delay
				300*time.Second, // Max delay (5 minutes)
			),
			MaxConcurrentReconciles: 3,
		}).
		Complete(&DatabaseReconciler{}); err != nil {
		log.Fatal(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		log.Fatal(err)
	}
}

For transient external API failures, return ctrl.Result{RequeueAfter: 30*time.Second} instead of an error to bypass exponential backoff and retry with a predictable delay. Distinguish between retriable errors (network timeouts, rate limits) and permanent failures (invalid credentials, resource not found). Permanent failures should update status conditions and stop requeueing, while transient failures should trigger controlled retries.

Consider implementing circuit breakers for external dependencies. If an external API returns errors on five consecutive reconciliations, enter a degraded mode where you skip that integration and update status to reflect the unavailable dependency. Resume normal operation only after the external system responds successfully to health checks.

Observability: Metrics, Logs, and Alerts

Instrument reconciliation outcomes to understand failure patterns. Track these metrics at minimum:

  • reconciliation_duration_seconds (histogram): Identifies performance regressions and helps set SLOs
  • reconciliation_errors_total (counter with reason label): Distinguishes retriable vs. permanent failures
  • resource_conditions (gauge): Exposes current health state to Prometheus alerts
  • external_api_requests_total (counter with endpoint and status labels): Detects third-party service degradation

Log structured errors with enough context for debugging but avoid logging on every reconciliation success—status updates already provide an audit trail. Include the resource namespace, name, and generation in every log entry to correlate events across controller restarts. For errors, log the full context needed to reproduce the issue: relevant spec fields, observed external state, and the specific operation that failed.

Alert on sustained error rates, not individual failures:

alerts/operator.yaml
- alert: OperatorHighErrorRate
  expr: |
    rate(reconciliation_errors_total[5m])
      / rate(reconciliation_duration_seconds_count[5m]) > 0.1
  for: 10m
  annotations:
    summary: "Operator {{ $labels.controller }} failing >10% of reconciliations"
- alert: OperatorStuckReconciliation
  expr: histogram_quantile(0.99, rate(reconciliation_duration_seconds_bucket[5m])) > 300
  for: 15m
  annotations:
    summary: "P99 reconciliation latency exceeds 5 minutes"

Surface critical status conditions as metrics so platform teams can alert on degraded resources without querying the Kubernetes API directly. A resource_condition{condition="Ready",status="False"} metric enables alerts like “fire if more than 10% of Database resources report Ready=False for 5 minutes.”
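Assuming the resource_condition gauge above carries kind, condition, and status labels (an assumption about your instrumentation, not a standard metric), that alert could look like:

```yaml
- alert: DatabaseResourcesDegraded
  expr: |
    sum(resource_condition{kind="Database",condition="Ready",status="False"})
      / sum(resource_condition{kind="Database",condition="Ready"}) > 0.1
  for: 5m
  annotations:
    summary: "More than 10% of Database resources report Ready=False"
```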

With deliberate failure handling, your operator becomes self-healing infrastructure rather than another component requiring manual intervention. Next, we’ll explore how to coordinate multiple custom resources within a single reconciliation cycle to implement complex workflows.

Advanced Patterns: Multi-Resource Coordination

Operators that manage a single Custom Resource are straightforward. Real-world complexity emerges when your operator must coordinate multiple Kubernetes resources with ordering dependencies, where a failure in one reconciliation must prevent or roll back changes to related resources. These multi-resource scenarios demand explicit coordination strategies that go beyond simple CRUD operations.

Parent-Child Hierarchies with Owner References

The simplest coordination pattern uses Kubernetes owner references to establish parent-child relationships. When a parent resource is deleted, the garbage collector automatically cleans up children. However, this automatic cleanup is only one dimension of the coordination problem—you must also enforce ordering constraints during creation and updates.

controllers/database_controller.go
func (r *DatabaseReconciler) reconcileStatefulSet(ctx context.Context, db *v1alpha1.Database) error {
	// Enforce ordering: only reconcile the child once the parent is Ready
	if db.Status.Phase != v1alpha1.PhaseReady {
		return fmt.Errorf("parent not ready, skipping StatefulSet reconciliation")
	}
	sts := &appsv1.StatefulSet{
		ObjectMeta: metav1.ObjectMeta{
			Name:      db.Name + "-nodes",
			Namespace: db.Namespace,
		},
	}
	_, err := controllerutil.CreateOrUpdate(ctx, r.Client, sts, func() error {
		// Set the owner reference inside the mutate function so it is applied
		// after CreateOrUpdate fetches the current state of an existing object
		if err := ctrl.SetControllerReference(db, sts, r.Scheme); err != nil {
			return err
		}
		sts.Spec = r.buildStatefulSetSpec(db)
		return nil
	})
	return err
}

This pattern enforces ordering: the controller refuses to reconcile child resources until the parent achieves a specific state. The SetControllerReference call establishes the ownership link, enabling automatic garbage collection. For cascade reconciliation—where parent updates must propagate to children—the parent controller triggers child updates by modifying the parent’s generation or a watched annotation. This creates a reconciliation chain where each level waits for the previous level to stabilize before proceeding.

The limitation of owner references is their namespace boundary: they only work within the same namespace. Cross-namespace dependencies require a different approach.

Cross-Resource Dependencies Without Ownership

Some resources can’t use owner references because they exist in different namespaces or shouldn’t be garbage collected with the parent. Here, you establish dependencies through watches and status conditions, creating a loosely-coupled coordination mechanism.

controllers/backup_controller.go
func (r *BackupReconciler) SetupWithManager(mgr ctrl.Manager) error {
	// Note: the spec.databaseRef field selector used below requires a field
	// index registered via mgr.GetFieldIndexer().IndexField at startup
	return ctrl.NewControllerManagedBy(mgr).
		For(&v1alpha1.Backup{}).
		Watches(&v1alpha1.Database{},
			handler.EnqueueRequestsFromMapFunc(r.findBackupsForDatabase)).
		Complete(r)
}

func (r *BackupReconciler) findBackupsForDatabase(ctx context.Context, db client.Object) []reconcile.Request {
	var backups v1alpha1.BackupList
	if err := r.List(ctx, &backups,
		client.InNamespace(db.GetNamespace()),
		client.MatchingFields{"spec.databaseRef": db.GetName()}); err != nil {
		return nil
	}
	requests := make([]reconcile.Request, len(backups.Items))
	for i, backup := range backups.Items {
		requests[i] = reconcile.Request{
			NamespacedName: types.NamespacedName{
				Name:      backup.Name,
				Namespace: backup.Namespace,
			},
		}
	}
	return requests
}

When the Database resource changes, this watch automatically triggers reconciliation for all dependent Backups. The backup controller checks the database’s status conditions before proceeding with its own operations. This pattern inverts the dependency relationship: instead of the parent managing children, children watch their dependencies and react to changes. The trade-off is complexity—you must implement the indexing and watch logic yourself—but you gain flexibility to coordinate resources across namespace boundaries and avoid tight coupling through ownership.

Sagas for Multi-Step Stateful Workflows

Visual: Multi-resource coordination with sagas and compensating actions

Distributed sagas provide transaction semantics across multiple resources when you need stronger guarantees than eventual consistency. Each step defines both a forward action and a compensating action for rollback. Store saga state in the resource’s status to survive controller restarts, ensuring the workflow can resume from any point.

controllers/migration_controller.go
type MigrationStep string

const (
	StepCreateSnapshot MigrationStep = "CreateSnapshot"
	StepScaleDownOld   MigrationStep = "ScaleDownOld"
	StepProvisionNew   MigrationStep = "ProvisionNew"
	StepMigrateData    MigrationStep = "MigrateData"
	StepSwitchTraffic  MigrationStep = "SwitchTraffic"
)

func (r *MigrationReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	migration := &v1alpha1.Migration{}
	if err := r.Get(ctx, req.NamespacedName, migration); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	if migration.Status.Phase == v1alpha1.PhaseFailed {
		return r.executeRollback(ctx, migration)
	}
	currentStep := MigrationStep(migration.Status.CurrentStep)
	if currentStep == "" {
		currentStep = StepCreateSnapshot
	}
	completed, err := r.executeStep(ctx, migration, currentStep)
	if err != nil {
		migration.Status.Phase = v1alpha1.PhaseFailed
		migration.Status.FailedStep = string(currentStep)
		if uerr := r.Status().Update(ctx, migration); uerr != nil {
			return ctrl.Result{}, uerr
		}
		return ctrl.Result{}, err
	}
	if !completed {
		return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
	}
	nextStep := r.getNextStep(currentStep)
	if nextStep == "" {
		migration.Status.Phase = v1alpha1.PhaseComplete
	} else {
		migration.Status.CurrentStep = string(nextStep)
	}
	return ctrl.Result{Requeue: true}, r.Status().Update(ctx, migration)
}

The saga pattern provides clear failure semantics: if any step fails, the controller executes compensating actions in reverse order to restore the previous consistent state. Each step must be idempotent, as the reconciliation loop may execute the same step multiple times after a controller restart. The key insight is that sagas trade immediate consistency for reliability—you accept that the system may be temporarily inconsistent during execution, but you guarantee eventual consistency through compensating actions.
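The `getNextStep` helper and the reverse-order compensation referenced above can be sketched in plain Go. The step constants match the controller code; `compensationOrder` is an illustrative name for the helper `executeRollback` would use:

```go
package main

import "fmt"

type MigrationStep string

const (
	StepCreateSnapshot MigrationStep = "CreateSnapshot"
	StepScaleDownOld   MigrationStep = "ScaleDownOld"
	StepProvisionNew   MigrationStep = "ProvisionNew"
	StepMigrateData    MigrationStep = "MigrateData"
	StepSwitchTraffic  MigrationStep = "SwitchTraffic"
)

// stepOrder defines forward execution order; read in reverse, it is the
// order in which compensating actions run during rollback.
var stepOrder = []MigrationStep{
	StepCreateSnapshot, StepScaleDownOld, StepProvisionNew, StepMigrateData, StepSwitchTraffic,
}

// getNextStep returns the step after current, or "" when the saga is complete.
func getNextStep(current MigrationStep) MigrationStep {
	for i, s := range stepOrder {
		if s == current && i+1 < len(stepOrder) {
			return stepOrder[i+1]
		}
	}
	return ""
}

// compensationOrder returns the steps to undo, newest first, given the step
// that failed: every step before it completed and needs compensation.
func compensationOrder(failed MigrationStep) []MigrationStep {
	var done []MigrationStep
	for _, s := range stepOrder {
		if s == failed {
			break
		}
		done = append(done, s)
	}
	// Reverse so the most recent work is undone first.
	for i, j := 0, len(done)-1; i < j; i, j = i+1, j-1 {
		done[i], done[j] = done[j], done[i]
	}
	return done
}

func main() {
	fmt.Println(getNextStep(StepCreateSnapshot))       // ScaleDownOld
	fmt.Println(compensationOrder(StepMigrateData))    // [ProvisionNew ScaleDownOld CreateSnapshot]
}
```

Driving both forward progress and rollback from a single ordered slice keeps the two paths from drifting apart as steps are added.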

Controller Architecture Decisions

The choice between a single controller managing multiple resource types versus multiple specialized controllers depends on your coordination requirements. Use a single controller with multiple watches when resources share reconciliation logic and must coordinate tightly—for example, a database controller that manages Deployments, Services, and PersistentVolumeClaims all driven by the same parent Database resource. This approach simplifies coordination because all logic executes in the same reconciliation loop with shared state.

Split into multiple controllers when resources have independent lifecycles, when you need different rate limiting and concurrency settings for different resource types, or when team boundaries suggest separate ownership. Multiple controllers increase operational complexity—you must reason about ordering across separate reconciliation loops—but they provide better isolation and scaling characteristics.

💡 Pro Tip: Use a single controller with multiple watches when resources share reconciliation logic. Split into multiple controllers when resources have independent lifecycles or when you need different rate limiting and concurrency settings for different resource types.
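The single-controller shape looks like this in controller-runtime builder terms; a sketch assuming the Database types from earlier examples:

```go
// One reconcile loop drives all children of a Database. Owns() both watches
// the child type and maps its events back to the owning Database through the
// controller owner reference, so StatefulSet, Service, and PVC changes all
// funnel into the same Reconcile function.
func (r *DatabaseReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&v1alpha1.Database{}).
		Owns(&appsv1.StatefulSet{}).
		Owns(&corev1.Service{}).
		Owns(&corev1.PersistentVolumeClaim{}).
		Complete(r)
}
```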

With these coordination patterns established, the next challenge becomes validating that your operator behaves correctly under failure conditions and concurrent updates—problems that require testing strategies beyond traditional unit tests.

Testing Operators Beyond Unit Tests

Unit tests validate individual functions, but operators are distributed systems—they interact with the Kubernetes API, manage external resources, and handle asynchronous events. Comprehensive testing requires simulating the runtime environment and validating reconciliation behavior under failure conditions, race scenarios, and real-world chaos.

Testing Idempotency with envtest

The controller-runtime’s envtest package provides a real etcd and API server for integration testing. This enables testing reconciliation loops against actual Kubernetes resources without a full cluster, making it the foundation for verifying operator correctness.

controller_test.go
package controller

import (
	"context"
	"path/filepath"
	"testing"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes/scheme"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/envtest"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
	// plus your project's v1alpha1 API package
)

var k8sClient client.Client
var testEnv *envtest.Environment

func TestReconcileIdempotency(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "Controller Suite")
}

var _ = BeforeSuite(func() {
	testEnv = &envtest.Environment{
		CRDDirectoryPaths: []string{filepath.Join("..", "config", "crd", "bases")},
	}
	cfg, err := testEnv.Start()
	Expect(err).NotTo(HaveOccurred())
	k8sClient, err = client.New(cfg, client.Options{Scheme: scheme.Scheme})
	Expect(err).NotTo(HaveOccurred())
})

var _ = AfterSuite(func() {
	Expect(testEnv.Stop()).To(Succeed())
})

var _ = Describe("Database Reconciliation", func() {
	It("produces identical results on repeated reconciliation", func() {
		ctx := context.Background()
		database := &v1alpha1.Database{
			ObjectMeta: metav1.ObjectMeta{
				Name:      "test-db",
				Namespace: "default",
			},
			Spec: v1alpha1.DatabaseSpec{
				Replicas: 3,
				Version:  "14.5",
			},
		}
		Expect(k8sClient.Create(ctx, database)).To(Succeed())

		// First reconciliation
		_, err := reconciler.Reconcile(ctx, reconcile.Request{
			NamespacedName: types.NamespacedName{
				Name:      "test-db",
				Namespace: "default",
			},
		})
		Expect(err).NotTo(HaveOccurred())

		var statefulSet appsv1.StatefulSet
		Expect(k8sClient.Get(ctx, client.ObjectKey{
			Name:      "test-db",
			Namespace: "default",
		}, &statefulSet)).To(Succeed())
		firstGeneration := statefulSet.Generation

		// Second reconciliation without changes
		_, err = reconciler.Reconcile(ctx, reconcile.Request{
			NamespacedName: types.NamespacedName{
				Name:      "test-db",
				Namespace: "default",
			},
		})
		Expect(err).NotTo(HaveOccurred())
		Expect(k8sClient.Get(ctx, client.ObjectKey{
			Name:      "test-db",
			Namespace: "default",
		}, &statefulSet)).To(Succeed())
		Expect(statefulSet.Generation).To(Equal(firstGeneration))
	})
})

This test validates that reconciliation is truly idempotent—running it twice without state changes produces no Kubernetes resource updates, evidenced by unchanged generation counters. Idempotency is critical because the reconciliation loop runs continuously, and any non-idempotent operation would cause resource thrashing.

Beyond basic idempotency, envtest excels at validating owner references, finalizer logic, and status subresource updates. Test that your operator correctly sets controller references to enable garbage collection, implements finalizers to clean up external resources before deletion, and updates status without triggering unnecessary reconciliation loops. These patterns are easy to get wrong and difficult to debug in production.
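Finalizer behavior in particular is easy to verify with envtest, because the API server really does hold deletion open until finalizers are cleared. A sketch in the same Ginkgo style, with an illustrative finalizer name:

```go
It("blocks deletion until the finalizer is removed", func() {
	ctx := context.Background()
	db := &v1alpha1.Database{
		ObjectMeta: metav1.ObjectMeta{
			Name:       "finalizer-db",
			Namespace:  "default",
			Finalizers: []string{"example.com/cleanup"}, // illustrative finalizer
		},
	}
	Expect(k8sClient.Create(ctx, db)).To(Succeed())
	Expect(k8sClient.Delete(ctx, db)).To(Succeed())

	// Deletion is pending, not complete: the object still exists with a
	// deletionTimestamp because no one has removed the finalizer yet.
	var pending v1alpha1.Database
	Expect(k8sClient.Get(ctx, client.ObjectKeyFromObject(db), &pending)).To(Succeed())
	Expect(pending.DeletionTimestamp.IsZero()).To(BeFalse())
})
```

Your reconciler's job is to observe that deletionTimestamp, run external cleanup, and then strip the finalizer; the follow-up assertion would be that the Get eventually returns NotFound.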

Simulating Failure Scenarios and Race Conditions

Operators must handle transient failures gracefully. Testing failure paths requires injecting errors into dependencies and simulating concurrent modifications that trigger race conditions.

failure_test.go
var _ = Describe("Backup Failure Handling", func() {
	It("retries failed backups with exponential backoff", func() {
		ctx := context.Background()
		// Inject a failing backup client
		failingClient := &fakeBackupClient{
			createBackupFunc: func(name string) error {
				return fmt.Errorf("storage unavailable")
			},
		}
		reconciler.backupClient = failingClient
		backup := &v1alpha1.Backup{
			ObjectMeta: metav1.ObjectMeta{
				Name:      "failed-backup",
				Namespace: "default",
			},
		}
		Expect(k8sClient.Create(ctx, backup)).To(Succeed())

		result, err := reconciler.Reconcile(ctx, reconcile.Request{
			NamespacedName: types.NamespacedName{
				Name:      "failed-backup",
				Namespace: "default",
			},
		})
		// Expect requeue with delay
		Expect(err).NotTo(HaveOccurred())
		Expect(result.RequeueAfter).To(BeNumerically(">", 0))

		// Verify status reflects failure
		var updatedBackup v1alpha1.Backup
		Expect(k8sClient.Get(ctx, client.ObjectKeyFromObject(backup), &updatedBackup)).To(Succeed())
		Expect(updatedBackup.Status.Phase).To(BeEquivalentTo("Failed"))
		Expect(updatedBackup.Status.Retries).To(BeNumerically("==", 1))
	})

	It("handles conflicting updates from concurrent reconciliations", func() {
		ctx := context.Background()
		database := &v1alpha1.Database{
			ObjectMeta: metav1.ObjectMeta{
				Name:      "concurrent-db",
				Namespace: "default",
			},
			Spec: v1alpha1.DatabaseSpec{Replicas: 3},
		}
		Expect(k8sClient.Create(ctx, database)).To(Succeed())

		// Simulate concurrent modification by updating the resource mid-reconciliation
		go func() {
			defer GinkgoRecover() // Gomega assertions in goroutines need this to fail the spec safely
			time.Sleep(50 * time.Millisecond)
			var db v1alpha1.Database
			Expect(k8sClient.Get(ctx, client.ObjectKeyFromObject(database), &db)).To(Succeed())
			db.Status.Phase = "Modified"
			Expect(k8sClient.Status().Update(ctx, &db)).To(Succeed())
		}()

		// Reconciliation should detect conflict and retry
		result, err := reconciler.Reconcile(ctx, reconcile.Request{
			NamespacedName: types.NamespacedName{
				Name:      "concurrent-db",
				Namespace: "default",
			},
		})
		// Controller should either succeed after retry or requeue
		Expect(err).NotTo(HaveOccurred())
		Expect(result.Requeue || result.RequeueAfter > 0).To(BeTrue())
	})
})

💡 Pro Tip: Use fake implementations of external dependencies rather than mocks. Fakes allow testing realistic failure sequences without brittle mock expectations, and they’re easier to maintain as your operator evolves.

Failure testing should also cover partial failures where some operations succeed while others fail. For example, if creating a StatefulSet succeeds but creating the corresponding Service fails, does your operator leave the system in a consistent state? Can it recover on the next reconciliation? These scenarios expose bugs in error handling and rollback logic that never surface in happy-path testing.

Integration Testing with Real Dependencies

While envtest provides a real Kubernetes API, testing operators that manage external resources requires integration with those systems. For a database operator, this means testing against actual database instances, not mocks.

integration_test.go
var _ = Describe("PostgreSQL Integration", func() {
	var testPostgres *PostgreSQLContainer

	BeforeEach(func() {
		// Start real PostgreSQL using testcontainers
		testPostgres = StartPostgreSQLContainer()
		reconciler.postgresClient = NewPostgresClient(testPostgres.ConnectionString())
	})

	AfterEach(func() {
		testPostgres.Terminate()
	})

	It("creates databases and users with correct permissions", func() {
		ctx := context.Background()
		dbUser := &v1alpha1.DatabaseUser{
			ObjectMeta: metav1.ObjectMeta{Name: "app-user", Namespace: "default"},
			Spec: v1alpha1.DatabaseUserSpec{
				Database:    "appdb",
				Permissions: []string{"SELECT", "INSERT", "UPDATE"},
			},
		}
		Expect(k8sClient.Create(ctx, dbUser)).To(Succeed())

		Eventually(func() error {
			_, err := reconciler.Reconcile(ctx, reconcile.Request{
				NamespacedName: client.ObjectKeyFromObject(dbUser),
			})
			return err
		}, "10s", "1s").Should(Succeed())

		// Verify actual database state
		grants, err := testPostgres.QueryGrants("app-user", "appdb")
		Expect(err).NotTo(HaveOccurred())
		Expect(grants).To(ConsistOf("SELECT", "INSERT", "UPDATE"))
	})
})

This approach catches issues that unit tests miss—authentication failures, schema migrations, connection pooling problems, and version compatibility issues with external systems. Integration tests are slower and more complex to set up, but they provide confidence that your operator works correctly with real infrastructure, not just idealized interfaces.

For cloud-managed services like RDS or Cloud SQL, consider using emulators or local alternatives during development (LocalStack for AWS, the GCP emulator suite) and reserve testing against real cloud APIs for CI/CD pipelines. This keeps local test runs fast while still validating cloud-specific behavior before deployment.

Chaos Engineering for Production Validation

For chaos engineering in staging environments, tools like Chaos Mesh inject network partitions, pod failures, and API server latency. Run your operator under these conditions to validate graceful degradation and eventual consistency guarantees.

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: operator-partition
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - operator-system
  direction: to
  target:
    mode: all
    selector:
      namespaces:
        - default
  duration: "30s"

After applying this chaos experiment, your operator should detect the network partition, back off retries, and eventually reconcile state once connectivity is restored. Monitor metrics like reconciliation duration, error rates, and requeue counts to identify degradation patterns.

Chaos testing reveals emergent behavior that only appears under stress—race conditions that manifest when reconciliation loops are delayed, deadlocks when multiple controllers compete for resources, and cascading failures when error handling isn’t defensive enough. Combine chaos experiments with observability tools to understand how your operator behaves when assumptions about network reliability and API availability break down.

With reconciliation semantics validated through comprehensive testing—from unit tests to chaos experiments—the next consideration is ensuring operators perform efficiently at scale.

Performance and Scale Considerations

Production operators frequently manage thousands of custom resources across multiple namespaces. Without careful optimization, watch cache memory consumption and reconciliation queue backlogs can degrade cluster stability. Understanding controller-runtime’s internal mechanisms is essential for building operators that scale.

Watch Cache and Memory Footprint

Controller-runtime maintains an in-memory cache of all watched resources, synchronized via Kubernetes list/watch APIs. For operators watching cluster-scoped resources or multiple resource types, this cache can consume gigabytes of memory. A single operator watching Pods, Services, and ConfigMaps across a 5,000-node cluster might cache 100,000+ objects.

Monitor cache size using controller-runtime metrics (workqueue_depth, client_go_request_duration_seconds). For operators managing specific resource subsets, use field selectors or label selectors in watch configurations to limit cached objects. Namespace-scoped operators should explicitly configure cache namespaces rather than defaulting to cluster-wide watches.
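Cache scoping is configured when the manager is created. A sketch using the `cache.Options` shape from controller-runtime v0.15+; the namespace names and label key are illustrative:

```go
// Watch only the namespaces this instance owns, and filter the
// high-cardinality Pod type by label instead of caching every Pod in scope.
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
	Cache: cache.Options{
		DefaultNamespaces: map[string]cache.Config{
			"team-a": {},
			"team-b": {},
		},
		ByObject: map[client.Object]cache.ByObject{
			&corev1.Pod{}: {
				Label: labels.SelectorFromSet(labels.Set{
					"app.kubernetes.io/managed-by": "my-operator", // illustrative label
				}),
			},
		},
	},
})
```

Note that the cache is also what your reconciler reads from: a filtered cache means filtered Get/List results, so make sure every object your reconciler needs still matches the selectors.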

When memory constraints are severe, consider deploying multiple operator instances with namespace-based sharding. Each instance watches only its assigned namespaces, distributing cache load across pods. Implement shard assignment through environment variables or dedicated sharding CRDs that control which operator instance owns which resources.
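Shard assignment needs to be deterministic so every replica independently agrees on ownership. A minimal sketch using a stable hash, where the shard index and count would come from per-replica environment variables (names are up to you):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// ownsNamespace reports whether this operator replica should reconcile
// resources in the given namespace. FNV-1a is stable across processes, so
// every replica computes the same namespace-to-shard mapping.
func ownsNamespace(namespace string, shardIndex, shardCount uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(namespace))
	return h.Sum32()%shardCount == shardIndex
}

func main() {
	// With 3 shards, each namespace maps deterministically to exactly one shard.
	for _, ns := range []string{"team-a", "team-b", "team-c"} {
		for shard := uint32(0); shard < 3; shard++ {
			if ownsNamespace(ns, shard, 3) {
				fmt.Printf("%s -> shard %d\n", ns, shard)
			}
		}
	}
}
```

Each replica would then feed only its owned namespaces into the cache configuration, or drop events for foreign namespaces in a predicate. Hash-based sharding avoids a coordination service, at the cost of reshuffling assignments when the shard count changes.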

Reconciliation Queue Tuning

The reconciliation queue serializes state transitions for each resource. The default worker configuration (MaxConcurrentReconciles defaults to 1 in controller-runtime) often bottlenecks operators managing high-churn resources. Tune MaxConcurrentReconciles based on reconciliation latency profiles: CPU-bound reconcilers benefit from worker counts matching available cores, while I/O-bound reconcilers (calling external APIs) scale better with higher concurrency.

Rate limiting prevents reconciliation storms when large resource batches arrive simultaneously. Controller-runtime's workqueue supports per-item and global rate limits. Configure per-item exponential backoff with NewItemExponentialFailureRateLimiter (for example, a 1s base delay with a 1000s maximum) for graceful retry behavior, and combine it via MaxOfRateLimiter with a global bucket limiter to prevent queue flooding during cluster-wide events like node failures.
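Both knobs are set at controller registration. A sketch using the client-go workqueue rate limiters; the specific numbers are illustrative starting points, not universal defaults:

```go
// Tune worker concurrency and retry behavior for this controller.
return ctrl.NewControllerManagedBy(mgr).
	For(&v1alpha1.Database{}).
	WithOptions(controller.Options{
		// I/O-bound reconciler calling external APIs: more workers than cores.
		MaxConcurrentReconciles: 8,
		RateLimiter: workqueue.NewMaxOfRateLimiter(
			// Per-item exponential backoff: 1s base delay, capped at 1000s.
			workqueue.NewItemExponentialFailureRateLimiter(time.Second, 1000*time.Second),
			// Global ceiling: at most 10 requeues/sec with bursts of 100.
			&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
		),
	}).
	Complete(r)
```

MaxOfRateLimiter takes the stricter of the two delays for each item, so a misbehaving resource backs off individually while the bucket limiter protects the queue as a whole.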

Predicate Filtering

Predicates eliminate unnecessary reconciliations by filtering watch events before queuing. A common anti-pattern is reconciling on every resource update, including status-only changes triggered by the operator itself. Implement GenerationChangedPredicate to ignore status updates, reducing reconciliation volume by 60-80% in typical workloads.

Custom predicates should filter based on annotation changes, specific field mutations, or resource lifecycle transitions. For operators coordinating multiple resources, use owner reference predicates to reconcile only when child resource states actually diverge from desired configurations.
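Predicates are attached when the watch is built. A sketch combining the built-in GenerationChangedPredicate with an illustrative custom annotation filter (the annotation key is an assumption, not a convention):

```go
// Reconcile when either the spec generation changes or a config-hash
// annotation changes; drop everything else (e.g. status-only updates
// written back by this operator itself).
configChanged := predicate.Funcs{
	UpdateFunc: func(e event.UpdateEvent) bool {
		return e.ObjectOld.GetAnnotations()["example.com/config-hash"] !=
			e.ObjectNew.GetAnnotations()["example.com/config-hash"]
	},
	// Create/Delete/Generic default to true for the zero-value Funcs fields.
}

return ctrl.NewControllerManagedBy(mgr).
	For(&v1alpha1.Database{}, builder.WithPredicates(
		predicate.Or(predicate.GenerationChangedPredicate{}, configChanged),
	)).
	Complete(r)
```

Because predicates run before events enter the workqueue, they cut reconciliation volume at the cheapest possible point, before any API reads or business logic execute.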

Key Takeaways

  • Design reconciliation loops as pure state transitions that can safely execute multiple times
  • Use finalizers and status subresources to track external state changes and prevent resource leaks
  • Test failure scenarios explicitly—network partitions, controller restarts, and API server delays
  • Implement structured observability from day one: reconciliation duration, queue depth, error rates by type
  • Choose multi-controller architectures only when you need different RBAC or scaling characteristics