Building Your First Kubernetes Operator with Kubebuilder: From Zero to Production
You’ve deployed dozens of StatefulSets and written countless Helm charts, but your application still needs a 3 AM page when the primary database fails over. You’ve automated the deployment, but someone still runs that backup script manually every night. The monitoring dashboards are pristine, yet scaling decisions require a human to interpret the metrics and kubectl their way to a solution.
This is the gap that Kubernetes doesn’t close on its own. Declarative configuration handles the what—three replicas, this container image, these resource limits—but it says nothing about the how of day-two operations. How should the system respond when a node becomes unreachable? What’s the correct sequence for upgrading a clustered database without data loss? When should we add replicas versus optimize queries?
Human operators carry this knowledge. We encode it in runbooks, share it in incident postmortems, and transfer it through pairing sessions. But this knowledge lives outside the cluster, accessible only when someone reads the documentation and executes the steps correctly under pressure.
The Kubernetes operator pattern offers a different approach: encode that operational expertise directly into a controller that watches your resources and responds to changes automatically. Instead of documenting “if the backup is older than 24 hours, trigger a new one,” you write code that enforces this invariant continuously. The operator becomes the on-call engineer who never sleeps, never forgets the runbook, and responds in milliseconds.
Building an operator sounds intimidating—and the ecosystem doesn’t help, with competing frameworks and scattered documentation. But the core concepts are surprisingly approachable once you understand what Kubernetes controllers actually do under the hood.
Why Operators Exist: Encoding Human Knowledge into Controllers
Kubernetes excels at managing stateless workloads. Define a Deployment, specify three replicas, and the built-in controllers handle the rest. But what happens when you need to run a PostgreSQL cluster with automated failover, or manage a distributed cache that requires careful node-by-node rolling upgrades?

This is where Kubernetes’ declarative model meets operational reality.
The Gap Between Declaration and Operation
Consider running a production database on Kubernetes. You declare that you want a three-node cluster, but the actual operation involves nuanced decisions: Which node becomes primary? How do you handle a split-brain scenario? When a node fails, do you immediately replace it or wait for potential recovery? These decisions require domain expertise that Kubernetes’ built-in controllers don’t possess.
Traditionally, this knowledge lived in runbooks, tribal documentation, and the heads of experienced operators who handled 3 AM pages. The operator pattern changes this by encoding that expertise directly into software—a controller that watches your custom resources and takes the same actions a skilled human operator would, but with machine speed and consistency.
Extending Kubernetes with Domain Knowledge
An operator combines two Kubernetes primitives:
Custom Resource Definitions (CRDs) extend the Kubernetes API with new resource types. Instead of managing raw StatefulSets and ConfigMaps, you declare a PostgresCluster or RedisCache with domain-specific fields like replicationMode or backupSchedule.
Custom Controllers watch these resources and reconcile the actual state with your declared intent. When you update your PostgresCluster spec to add a replica, the controller handles the intricate dance of provisioning storage, initializing the node, configuring replication, and updating connection routing.
The result is a Kubernetes-native experience for complex software: kubectl apply -f postgres-cluster.yaml and the operator handles everything else.
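To make that concrete, here is a sketch of what such a custom resource might look like. The PostgresCluster kind, its API group, and its field names are illustrative only, not taken from any real operator, though replicationMode and backupSchedule echo the domain-specific fields mentioned above:

```yaml
apiVersion: databases.example.com/v1alpha1
kind: PostgresCluster
metadata:
  name: orders-db
spec:
  replicas: 3                  # desired cluster size
  replicationMode: synchronous # domain-specific knob the operator interprets
  backupSchedule: "0 2 * * *"  # nightly backup at 02:00
```

A user applies this one manifest; the operator translates it into StatefulSets, Services, backup CronJobs, and whatever failover configuration the domain requires.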
When to Build vs. Buy
Building an operator represents significant investment. Before starting, ask yourself:
- Does a mature operator already exist? Check OperatorHub.io. For common databases, message queues, and monitoring tools, battle-tested operators already exist.
- Is your operational logic truly custom? If you’re running standard PostgreSQL, use an existing operator. If you’re running a proprietary system or have organization-specific operational requirements, building makes sense.
- Do you have ongoing operational burden? Operators shine when they automate repeated, error-prone tasks. A one-time deployment doesn’t justify the effort.
Pro Tip: Start by using existing operators to understand the pattern from a user’s perspective before building your own. Operating someone else’s operator teaches you what makes a good one.
With this foundation in place, let’s set up our development environment and initialize a Kubebuilder project.
Kubebuilder Project Initialization and Scaffolding
Before writing a single line of operator logic, you need a solid foundation. Kubebuilder generates this foundation through scaffolding—creating the project structure, Makefile targets, and boilerplate code that every operator needs. Getting this right from the start saves hours of debugging later.
Prerequisites
Kubebuilder requires Go 1.21 or later and access to a Kubernetes cluster (local or remote). Install Kubebuilder itself using the official installation script:
```bash
curl -L -o kubebuilder "https://go.kubebuilder.io/dl/latest/$(go env GOOS)/$(go env GOARCH)"
chmod +x kubebuilder
sudo mv kubebuilder /usr/local/bin/
```

Verify the installation:
```bash
kubebuilder version
# Version: main.version{KubeBuilderVersion:"4.3.1", ...}
```

You also need kubectl configured to communicate with your cluster and make for running build targets. For local development, tools like kind or minikube provide lightweight Kubernetes environments that work well with Kubebuilder’s development workflow.
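For instance, assuming kind is installed, a throwaway development cluster is one command away (the operator-dev name is arbitrary):

```bash
kind create cluster --name operator-dev
kubectl cluster-info --context kind-operator-dev   # confirm kubectl points at the new cluster
```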
Creating Your First Project
Initialize a new operator project with the init command. This creates the Go module structure and essential configuration files:
```bash
mkdir memcached-operator && cd memcached-operator
kubebuilder init --domain example.com --repo github.com/myorg/memcached-operator
```

The --domain flag sets the API group suffix for your CRDs (your resources will live under a group like cache.example.com). The --repo flag matches your Go module path—critical for imports to resolve correctly. Choose your domain carefully; changing it later requires regenerating most of the project files.
Kubebuilder also supports additional plugins for different project layouts. The default Go plugin works for most use cases, but you can select alternatives with the --plugins flag; the kubebuilder help output lists the available plugin keys if you have requirements the default layout doesn’t cover.
Understanding the Generated Structure
Kubebuilder generates a complete project skeleton:
```
memcached-operator/
├── cmd/
│   └── main.go          # Operator entrypoint
├── config/
│   ├── default/         # Kustomize base configuration
│   ├── manager/         # Controller manager deployment
│   ├── rbac/            # RBAC permissions
│   └── crd/             # CRD manifests (generated later)
├── internal/
│   └── controller/      # Your reconciliation logic lives here
├── Dockerfile           # Multi-stage build for the operator image
├── Makefile             # Build, test, and deploy targets
└── go.mod               # Go module definition
```

The config/ directory uses Kustomize overlays, allowing you to customize deployments for different environments without duplicating YAML. The internal/controller/ directory remains empty until you create your first API. The cmd/main.go file bootstraps the controller manager, setting up leader election, metrics endpoints, and health probes—all configured with sensible defaults that you can customize as your operator matures.
Pay particular attention to the config/rbac/ directory. Kubebuilder generates RBAC manifests based on markers in your controller code, ensuring your operator requests only the permissions it needs. This least-privilege approach is essential for production deployments where security constraints are strict.
Essential Makefile Targets
The generated Makefile provides everything you need for the development lifecycle:
```bash
make help          # List all available targets
make manifests     # Generate CRD manifests from Go types
make generate      # Generate code (DeepCopy methods)
make build         # Compile the operator binary
make test          # Run unit tests
make docker-build  # Build the container image
make install       # Install CRDs into the cluster
make deploy        # Deploy the operator to the cluster
```

Pro Tip: Run make manifests generate after every change to your API types. Forgetting this step is the most common cause of “my changes aren’t taking effect” confusion.
The Makefile also includes targets for linting (make lint), formatting (make fmt), and running the operator locally outside the cluster (make run). The make run target is particularly valuable during development—it starts your operator against the currently configured cluster without requiring a container build, enabling rapid iteration cycles.
The scaffolded project compiles immediately—run make build to confirm everything is wired correctly before adding custom logic. If the build succeeds, your development environment is properly configured and you can proceed with confidence.
With your project structure in place, the next step is defining the custom resource that your operator will manage. This is where you encode the desired state that users will declare in their YAML manifests.
Defining Your Custom Resource Definition
With Kubebuilder’s scaffolding complete, you now define what your custom resource looks like. The CRD acts as a contract between users and your operator—it specifies what users can configure (Spec) and what the operator reports back (Status). A well-designed CRD provides clear semantics, enforces constraints at admission time, and surfaces relevant information through kubectl.
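Kubebuilder creates the types file for you when you scaffold the API. Assuming the WebApp example used throughout this section (group apps, version v1), the invocation would look roughly like this; answer yes when prompted to generate both the resource and the controller:

```bash
kubebuilder create api --group apps --version v1 --kind WebApp
```

This produces api/v1/webapp_types.go for your Spec and Status definitions and internal/controller/webapp_controller.go for the reconciliation logic you’ll write later.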
Designing the Spec Struct
The Spec struct captures user intent. For a WebApp operator that manages deployments with optional Redis caching, open api/v1/webapp_types.go and define:
```go
// WebAppSpec defines the desired state of WebApp
type WebAppSpec struct {
    // Image is the container image to deploy
    // +kubebuilder:validation:Required
    // +kubebuilder:validation:MinLength=1
    Image string `json:"image"`

    // Replicas is the number of pods to run
    // +kubebuilder:validation:Minimum=1
    // +kubebuilder:validation:Maximum=10
    // +kubebuilder:default=1
    Replicas int32 `json:"replicas,omitempty"`

    // Port is the container port to expose
    // +kubebuilder:validation:Minimum=1
    // +kubebuilder:validation:Maximum=65535
    // +kubebuilder:default=8080
    Port int32 `json:"port,omitempty"`

    // EnableRedis deploys a Redis sidecar for caching
    // +kubebuilder:default=false
    EnableRedis bool `json:"enableRedis,omitempty"`
}
```

Those comment lines starting with +kubebuilder: are markers that controller-gen parses to generate OpenAPI validation schemas. The validation:Required marker makes a field mandatory, while validation:Minimum and validation:Maximum enforce numeric bounds. The default marker sets values when users omit them, reducing boilerplate in resource manifests.
When designing your Spec, follow the principle of declarative intent: users describe what they want, not how to achieve it. The Spec should contain only fields that users care about configuring. Implementation details like internal service names or generated labels belong in your controller logic, not in the API surface.
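As a quick illustration of how the defaults reduce boilerplate, a user manifest only has to state what the user cares about. The sketch below assumes the apps.example.com group used elsewhere in this article:

```yaml
apiVersion: apps.example.com/v1
kind: WebApp
metadata:
  name: storefront
spec:
  image: ghcr.io/example/storefront:1.4.2
  # replicas defaults to 1, port to 8080, enableRedis to false
```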
Designing the Status Struct
The Status struct reports observed state back to users. It answers questions like “Is my app healthy?” and “How many replicas are actually running?” Unlike Spec, which users write, Status is exclusively controller-managed and provides observability into the system’s current state.
```go
// WebAppStatus defines the observed state of WebApp
type WebAppStatus struct {
    // AvailableReplicas is the number of pods ready to serve traffic
    AvailableReplicas int32 `json:"availableReplicas,omitempty"`

    // Conditions represent the latest observations of the WebApp's state
    Conditions []metav1.Condition `json:"conditions,omitempty"`

    // RedisReady indicates if the Redis sidecar is operational
    RedisReady bool `json:"redisReady,omitempty"`
}
```

Using metav1.Condition follows Kubernetes conventions and integrates with tooling that understands standard condition types. Conditions provide a structured way to communicate multiple aspects of resource state—whether the deployment succeeded, whether dependencies are satisfied, or whether errors occurred. Each condition includes a type, status (True/False/Unknown), reason, and human-readable message, giving users and automation detailed insight into what’s happening.
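In practice a controller writes conditions through the helper in k8s.io/apimachinery/pkg/api/meta rather than appending to the slice by hand. A minimal sketch, assuming the WebApp type above and a hypothetical Ready condition:

```go
import (
    "k8s.io/apimachinery/pkg/api/meta"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setReadyCondition records whether the WebApp's Deployment is serving traffic.
// meta.SetStatusCondition updates an existing condition of the same type in
// place, bumping LastTransitionTime only when the status actually changes.
func setReadyCondition(webapp *WebApp, available bool) {
    status, reason := metav1.ConditionFalse, "DeploymentUnavailable"
    if available {
        status, reason = metav1.ConditionTrue, "DeploymentAvailable"
    }
    meta.SetStatusCondition(&webapp.Status.Conditions, metav1.Condition{
        Type:    "Ready",
        Status:  status,
        Reason:  reason,
        Message: "reflects whether the backing Deployment has available replicas",
    })
}
```

The reconciler then persists the change with r.Status().Update(ctx, webapp), writing through the /status subresource described below.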
Adding Printer Columns
When users run kubectl get webapps, they see only NAME and AGE by default. Printer column markers customize this output to surface the most operationally relevant fields at a glance:
```go
// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:printcolumn:name="Image",type=string,JSONPath=`.spec.image`
// +kubebuilder:printcolumn:name="Replicas",type=integer,JSONPath=`.spec.replicas`
// +kubebuilder:printcolumn:name="Available",type=integer,JSONPath=`.status.availableReplicas`
// +kubebuilder:printcolumn:name="Age",type=date,JSONPath=`.metadata.creationTimestamp`
type WebApp struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   WebAppSpec   `json:"spec,omitempty"`
    Status WebAppStatus `json:"status,omitempty"`
}
```

The subresource:status marker enables the /status subresource, so spec and status are written through separate API endpoints: a status update can’t modify spec and doesn’t bump metadata.generation. Combined with generation-based event filtering, this separation keeps your controller from endlessly re-reconciling its own status updates.
Choose printer columns that help operators quickly assess resource health. Comparing desired replicas against available replicas immediately reveals scaling issues, while displaying the image helps identify which version is deployed across your fleet.
Generating CRD Manifests
With your types defined, generate the CRD YAML:
```bash
make manifests
```

This invokes controller-gen, which reads your markers and produces config/crd/bases/apps.example.com_webapps.yaml (the filename follows the group and domain you chose). The generated manifest includes all validation rules as OpenAPI v3 schemas that the API server enforces at admission time. This means invalid resources are rejected before your controller ever sees them, providing fast feedback to users and reducing error handling complexity in your reconciliation logic.
Pro Tip: Run make manifests after every change to your type definitions. The generated CRD must stay synchronized with your Go structs, or you’ll encounter runtime validation mismatches that are difficult to debug.
Verify your CRD by examining the generated file:
```bash
cat config/crd/bases/apps.example.com_webapps.yaml | head -50
```

You’ll see your validation rules translated into OpenAPI properties with minimum, maximum, and required constraints. The API server uses these schemas to validate resources, meaning malformed requests fail immediately with descriptive error messages rather than silently creating invalid state.
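The exact output depends on your controller-gen version, but the relevant part of the schema looks roughly like this:

```yaml
openAPIV3Schema:
  properties:
    spec:
      properties:
        image:
          minLength: 1
          type: string
        replicas:
          default: 1
          format: int32
          maximum: 10
          minimum: 1
          type: integer
      required:
      - image
```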
Your CRD now provides a clean API surface with server-side validation, sensible defaults, and informative kubectl output. Next, you’ll implement the reconciliation loop that watches these resources and makes reality match the user’s declared intent.
The Reconciliation Loop: Heart of Every Operator
The reconciliation loop is where your operator transforms from a passive observer into an active controller. Every operator, regardless of complexity, follows the same fundamental pattern: watch for changes, compare desired state against actual state, and take action to close the gap. Understanding this pattern deeply is essential—it’s not merely an implementation detail but the philosophical foundation that makes Kubernetes operators reliable and self-healing.

How Controller-Runtime Manages Events
When you create a controller with Kubebuilder, the controller-runtime library handles the heavy lifting of watching Kubernetes resources. Here’s what happens under the hood:
- Informers watch the API server for changes to resources your controller cares about
- Events (create, update, delete) get pushed into a work queue
- The reconciler pulls items from the queue and processes them one at a time
The work queue provides critical guarantees: automatic deduplication (multiple rapid changes to the same object result in one reconciliation), rate limiting, and exponential backoff for failures. Your reconciler receives a simple Request containing only the namespace and name of the object that changed—you fetch the current state yourself. This design is intentional: by the time you process the request, the object may have changed again, so you always work with the latest state rather than stale event data.
```go
func (r *ApplicationReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    log := log.FromContext(ctx)

    // Fetch the Application instance
    var app webv1.Application
    if err := r.Get(ctx, req.NamespacedName, &app); err != nil {
        if apierrors.IsNotFound(err) {
            // Object deleted before reconciliation - nothing to do
            log.Info("Application resource not found, likely deleted")
            return ctrl.Result{}, nil
        }
        return ctrl.Result{}, err
    }

    // Your reconciliation logic goes here
    return ctrl.Result{}, nil
}
```

Notice the IsNotFound check. By the time your reconciler runs, the object triggering the event may already be gone. This is normal—always handle this case gracefully. Treating a missing resource as an error would cause unnecessary error logs and failed reconciliation metrics.
Writing Idempotent Reconciliation Logic
Idempotency is non-negotiable. Your reconciler will be called multiple times for the same object, and running it repeatedly must produce the same result as running it once. Never assume you know why you’re being called—the event could be a create, an update, a resync from the informer cache, or a requeue from a previous failed attempt.
The pattern that makes this achievable: always reconcile toward the desired state, never react to specific events. Don’t try to track whether this is a create, update, or delete. Instead, read the current desired state from your custom resource, observe the actual state of the world, and make whatever changes are necessary to align them.
```go
func (r *ApplicationReconciler) reconcileDeployment(ctx context.Context, app *webv1.Application) error {
    desired := r.constructDeployment(app)

    var existing appsv1.Deployment
    err := r.Get(ctx, types.NamespacedName{Name: desired.Name, Namespace: desired.Namespace}, &existing)

    if apierrors.IsNotFound(err) {
        // Deployment doesn't exist - create it
        return r.Create(ctx, desired)
    }
    if err != nil {
        return err
    }

    // Deployment exists - update if spec differs
    if !equality.Semantic.DeepEqual(existing.Spec, desired.Spec) {
        existing.Spec = desired.Spec
        return r.Update(ctx, &existing)
    }

    return nil
}
```

This code works correctly whether the Deployment already exists, was manually modified, or has never been created. Every path leads to the same outcome: a Deployment matching the desired specification. If someone manually edits the Deployment, your next reconciliation will detect the drift and correct it automatically. One caveat: comparing the full DeploymentSpec can be noisy in practice, because the API server fills in defaulted fields your constructed object never set; many controllers compare only the fields they actually manage, such as replicas, image, and labels.
Pro Tip: Use equality.Semantic.DeepEqual from k8s.io/apimachinery/pkg/api/equality for comparing Kubernetes objects. It registers custom comparisons for Kubernetes-specific types such as resource.Quantity and metav1.Time that trip up standard reflection-based comparison.
Handling Errors and Requeue Strategies
Return values from your reconciler control what happens next. Choosing the right return strategy affects both the reliability of your operator and the load it places on the API server:
```go
// Success - don't requeue unless a watched resource changes
return ctrl.Result{}, nil

// Requeue due to a transient error (work queue applies exponential backoff)
return ctrl.Result{}, err

// Requeue after a specific duration (e.g., polling an external resource)
return ctrl.Result{RequeueAfter: 30 * time.Second}, nil

// Requeue immediately without an error (rare, usually means incomplete work)
return ctrl.Result{Requeue: true}, nil
```

Returning an error triggers the work queue’s built-in exponential backoff, starting at a few milliseconds and increasing to several minutes. This protects both your operator and the API server from tight retry loops during outages. Reserve error returns for genuinely unexpected failures—transient network issues, permission problems, or bugs in your logic.
For operations depending on external systems—waiting for a cloud provider to provision a load balancer, for instance—use RequeueAfter with a reasonable interval rather than busy-waiting with immediate requeues. Thirty seconds to a few minutes is typical for external dependencies. This approach keeps your operator responsive without hammering external APIs or flooding your logs.
The Requeue: true option without an error is useful when your reconciliation made progress but isn’t complete—perhaps you’re waiting for a Deployment to become ready before proceeding to the next step. However, prefer watching the relevant resources instead when possible, as event-driven reconciliation is more efficient than polling.
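As a sketch of that trade-off, the fragment below (assumed to sit inside Reconcile, with app being the custom resource fetched earlier) polls a child Deployment’s readiness with RequeueAfter. With the Owns watch configured later in this article, simply returning ctrl.Result{} and letting the Deployment’s status updates retrigger reconciliation is usually the better choice:

```go
// Wait for the child Deployment to become ready before continuing.
var deploy appsv1.Deployment
if err := r.Get(ctx, types.NamespacedName{Name: app.Name, Namespace: app.Namespace}, &deploy); err != nil {
    return ctrl.Result{}, client.IgnoreNotFound(err)
}
if deploy.Spec.Replicas != nil && deploy.Status.ReadyReplicas < *deploy.Spec.Replicas {
    // Not ready yet - check again in 15 seconds rather than returning an error.
    return ctrl.Result{RequeueAfter: 15 * time.Second}, nil
}
```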
With the reconciliation pattern understood, the next challenge is managing the Kubernetes resources your operator creates. Ownership references ensure child resources get cleaned up automatically when your custom resource is deleted.
Managing Child Resources and Ownership
Most operators do more than manage a single custom resource—they create and manage dependent Kubernetes resources. A WebApp operator might create Deployments, Services, and ConfigMaps. A database operator might spin up StatefulSets and PersistentVolumeClaims. Understanding how to create these child resources and wire them together correctly is essential for building production-grade operators.
Creating Deployments from Your Controller
Let’s extend our controller to create a Deployment when a new WebApp resource appears. The controller constructs the Deployment spec programmatically, matching what you’d write in YAML but with the flexibility of Go.
```go
func (r *WebAppReconciler) reconcileDeployment(ctx context.Context, webapp *appsv1alpha1.WebApp) error {
    log := log.FromContext(ctx)

    // DeploymentSpec.Replicas is *int32, so keep a local copy to take its address.
    replicas := webapp.Spec.Replicas

    deployment := &appsv1.Deployment{
        ObjectMeta: metav1.ObjectMeta{
            Name:      webapp.Name,
            Namespace: webapp.Namespace,
        },
        Spec: appsv1.DeploymentSpec{
            Replicas: &replicas,
            Selector: &metav1.LabelSelector{
                MatchLabels: map[string]string{"app": webapp.Name},
            },
            Template: corev1.PodTemplateSpec{
                ObjectMeta: metav1.ObjectMeta{
                    Labels: map[string]string{"app": webapp.Name},
                },
                Spec: corev1.PodSpec{
                    Containers: []corev1.Container{{
                        Name:  "webapp",
                        Image: webapp.Spec.Image,
                        Ports: []corev1.ContainerPort{{
                            ContainerPort: webapp.Spec.Port, // port declared in the spec
                        }},
                    }},
                },
            },
        },
    }

    // Set owner reference for garbage collection
    if err := controllerutil.SetControllerReference(webapp, deployment, r.Scheme); err != nil {
        return err
    }

    // Create or update the deployment
    _, err := controllerutil.CreateOrUpdate(ctx, r.Client, deployment, func() error {
        deployment.Spec.Replicas = &replicas
        deployment.Spec.Template.Spec.Containers[0].Image = webapp.Spec.Image
        return nil
    })

    if err != nil {
        log.Error(err, "Failed to reconcile Deployment")
        return err
    }

    return nil
}
```

The controllerutil.CreateOrUpdate function handles the create-or-update logic elegantly. It fetches the existing resource if present, applies your mutations in the callback function, and performs the appropriate API call. This pattern eliminates the boilerplate of checking whether a resource exists before deciding to create or update it.
Creating Services to Expose Your Application
Deployments alone aren’t sufficient for a complete application stack. Your controller should also create a Service to expose the pods. Following the same pattern, you can add a service reconciliation method:
```go
func (r *WebAppReconciler) reconcileService(ctx context.Context, webapp *appsv1alpha1.WebApp) error {
    service := &corev1.Service{
        ObjectMeta: metav1.ObjectMeta{
            Name:      webapp.Name,
            Namespace: webapp.Namespace,
        },
        Spec: corev1.ServiceSpec{
            Selector: map[string]string{"app": webapp.Name},
            Ports: []corev1.ServicePort{{
                Port:       80,
                TargetPort: intstr.FromInt(int(webapp.Spec.Port)),
            }},
        },
    }

    if err := controllerutil.SetControllerReference(webapp, service, r.Scheme); err != nil {
        return err
    }

    _, err := controllerutil.CreateOrUpdate(ctx, r.Client, service, func() error { return nil })
    return err
}
```

Setting Owner References for Garbage Collection
The SetControllerReference call in the code above establishes a parent-child relationship between your WebApp and its Deployment. This owner reference tells Kubernetes that when the WebApp is deleted, the Deployment should be garbage collected automatically.
Owner references provide three guarantees:
- Cascading deletion: Child resources are cleaned up when the parent disappears
- Ownership tracking: You can identify which controller manages a resource
- Conflict prevention: Only one controller can be the owner of a resource
When Kubernetes processes a deletion request for your custom resource, the garbage collector finds dependent objects whose owner references point at it and deletes them. Under foreground cascading deletion, dependents whose owner reference sets blockOwnerDeletion to true keep the owner around until they themselves are removed. This ensures orderly cleanup of your entire resource tree.
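On the child Deployment, the relationship shows up as metadata the garbage collector reads; roughly:

```yaml
metadata:
  name: my-webapp
  ownerReferences:
  - apiVersion: apps.example.com/v1
    kind: WebApp
    name: my-webapp
    uid: 3f8c9a12-...        # UID of the owning WebApp
    controller: true          # only one controller owner is allowed
    blockOwnerDeletion: true  # owner waits for this child under foreground deletion
```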
Pro Tip: Always set owner references in the same namespace. Cross-namespace ownership isn’t supported by Kubernetes garbage collection. If your operator manages cluster-scoped resources, use finalizers instead for cleanup logic.
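Finalizers are beyond the scope of this article, but a hedged sketch of the usual controllerutil pattern shows the shape of that cleanup logic. Assume this fragment sits inside Reconcile after fetching the WebApp into webapp, and that cleanupExternalResources is a hypothetical helper you would write:

```go
const webAppFinalizer = "apps.example.com/finalizer" // illustrative finalizer name

if webapp.DeletionTimestamp.IsZero() {
    // Resource is live: make sure our finalizer is registered.
    if controllerutil.AddFinalizer(&webapp, webAppFinalizer) {
        if err := r.Update(ctx, &webapp); err != nil {
            return ctrl.Result{}, err
        }
    }
} else {
    // Resource is being deleted: release external state, then remove the finalizer.
    if controllerutil.ContainsFinalizer(&webapp, webAppFinalizer) {
        if err := r.cleanupExternalResources(ctx, &webapp); err != nil { // hypothetical helper
            return ctrl.Result{}, err
        }
        controllerutil.RemoveFinalizer(&webapp, webAppFinalizer)
        if err := r.Update(ctx, &webapp); err != nil {
            return ctrl.Result{}, err
        }
    }
    return ctrl.Result{}, nil
}
```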
Watching Secondary Resources
Your controller must watch the child resources it creates. When someone manually edits or deletes a Deployment your operator created, the controller needs to detect that change and reconcile back to the desired state. Without these watches, drift between desired and actual state would go unnoticed.
Register watches in your SetupWithManager function:
```go
func (r *WebAppReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&appsv1alpha1.WebApp{}).
        Owns(&appsv1.Deployment{}).
        Owns(&corev1.Service{}).
        Complete(r)
}
```

The Owns method tells the controller to watch Deployments and Services that have an owner reference pointing to a WebApp. When these secondary resources change, the controller automatically enqueues a reconciliation for the owning WebApp—not the Deployment itself. This design keeps your reconciliation logic focused on the parent resource while reacting to changes across the entire resource tree.
The watch mechanism leverages Kubernetes informers under the hood, maintaining a local cache of watched resources. This means your controller can detect changes without constantly polling the API server, reducing load on the cluster while maintaining responsiveness to configuration drift.
With child resource management in place, your operator can now create complete application stacks from a single custom resource. The next step is validating this behavior through testing before deploying to a real cluster.
Testing and Local Development Workflow
A fast feedback loop separates productive operator development from frustrating trial-and-error. Kubebuilder provides tooling that lets you iterate on your reconciliation logic in seconds rather than minutes, catching bugs before they reach your cluster. Understanding how to leverage these tools effectively will dramatically accelerate your development cycle.
Running Against a Local Cluster
The quickest way to test your operator is running it directly against a local Kubernetes cluster. With kind or minikube running, execute:
```bash
make install  # Install CRDs into the cluster
make run      # Run the controller locally
```

This starts your controller as a local process that connects to your cluster’s API server. You get immediate console output, can set breakpoints in your IDE, and see reconciliation happen in real-time. Changes to your Go code require only stopping and restarting the process—no container builds needed.
Apply a test resource and watch your controller respond:
```bash
kubectl apply -f config/samples/webapp_v1_guestbook.yaml
kubectl get guestbooks -w
```

The watch flag keeps your terminal updated as the controller modifies the resource status. Open a second terminal to monitor the pods, deployments, or other resources your operator creates. This dual-pane approach gives you visibility into both the custom resource lifecycle and the downstream effects of your reconciliation logic.
Writing envtest-Based Integration Tests
For automated testing, Kubebuilder scaffolds tests using the controller-runtime envtest package. This spins up a real API server and etcd without requiring a full cluster, giving you fast, isolated integration tests that run reliably in CI pipelines.
```go
var _ = Describe("Guestbook Controller", func() {
    Context("When reconciling a resource", func() {
        const resourceName = "test-guestbook"

        ctx := context.Background()
        typeNamespacedName := types.NamespacedName{
            Name:      resourceName,
            Namespace: "default",
        }

        BeforeEach(func() {
            guestbook := &webappv1.Guestbook{
                ObjectMeta: metav1.ObjectMeta{
                    Name:      resourceName,
                    Namespace: "default",
                },
                Spec: webappv1.GuestbookSpec{
                    Size: 3,
                },
            }
            Expect(k8sClient.Create(ctx, guestbook)).To(Succeed())
        })

        It("should create the correct number of replicas", func() {
            Eventually(func() int32 {
                deployment := &appsv1.Deployment{}
                err := k8sClient.Get(ctx, typeNamespacedName, deployment)
                if err != nil {
                    return 0
                }
                return *deployment.Spec.Replicas
            }, time.Second*10, time.Millisecond*250).Should(Equal(int32(3)))
        })
    })
})
```

Run your test suite with make test. The envtest framework handles API server lifecycle automatically. These tests complete in seconds, compared with the minutes a full end-to-end run against a real cluster can take, making them ideal for test-driven development workflows. Structure your test suites to cover both happy paths and error conditions—particularly focusing on edge cases around resource deletion, update conflicts, and partial failures.
Debugging with Structured Logging
When reconciliation behaves unexpectedly, structured logging reveals what your controller sees. Use the logger from the reconciler context:
```go
func (r *GuestbookReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    log := log.FromContext(ctx)

    var guestbook webappv1.Guestbook
    if err := r.Get(ctx, req.NamespacedName, &guestbook); err != nil {
        log.Error(err, "unable to fetch Guestbook")
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    log.Info("reconciling guestbook",
        "name", guestbook.Name,
        "desiredSize", guestbook.Spec.Size,
        "currentStatus", guestbook.Status.Phase)

    // Reconciliation logic...
    return ctrl.Result{}, nil
}
```

Increase verbosity by passing --zap-log-level=debug to the manager (for example, go run ./cmd/main.go --zap-log-level=debug) to see every API interaction. The structured format makes it straightforward to filter logs by resource name or trace a single reconciliation through your logic. For complex debugging sessions with JSON-formatted logs, consider piping the output through jq to parse and filter entries programmatically.
Pro Tip: Add a unique reconciliation ID to your log context early in the Reconcile function. This makes correlating log lines trivial when multiple resources reconcile simultaneously.
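A hedged sketch of that idea, using the UUID helper from apimachinery (recent controller-runtime versions already inject a similar reconcileID into the context logger for you):

```go
import (
    "context"

    "k8s.io/apimachinery/pkg/util/uuid"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/log"
)

func (r *GuestbookReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // Attach a unique ID to every log line emitted during this pass.
    logger := log.FromContext(ctx).WithValues("reconcileID", uuid.NewUUID())
    ctx = log.IntoContext(ctx, logger) // helpers calling log.FromContext(ctx) inherit the ID

    logger.Info("starting reconciliation", "resource", req.NamespacedName)

    // ... existing reconciliation logic ...
    return ctrl.Result{}, nil
}
```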
With your tests green and local development smooth, you’re ready to package your operator for production deployment.
Deploying Your Operator to Production
Your operator is tested and ready. Now it’s time to ship it to a real cluster where it can manage workloads at scale.
Building and Pushing the Controller Image
Kubebuilder generates a Makefile with everything you need to build and push your container image. First, set your image registry:
```make
IMG ?= ghcr.io/your-org/myapp-operator:v0.1.0
```

Build and push with a single command:
```bash
make docker-build docker-push IMG=ghcr.io/your-org/myapp-operator:v0.1.0
```

For production builds, use a specific tag rather than latest. Semantic versioning helps track which operator version manages your clusters. Consider implementing a CI/CD pipeline that automatically builds and tags images on each release, ensuring consistent and reproducible deployments across environments.
Installing with Kustomize
Kubebuilder scaffolds a complete kustomize setup in the config/ directory. Deploy your operator and all its dependencies:
```bash
make deploy IMG=ghcr.io/your-org/myapp-operator:v0.1.0
```

This command installs your CRDs, creates the operator namespace, deploys the controller, and applies RBAC rules. For more control, generate the manifests and pipe them through your GitOps tooling:
```bash
kustomize build config/default | kubectl apply -f -
```

If your organization uses Helm, you can wrap the kustomize output in a Helm chart or use tools like helmify to convert your manifests. Helm provides additional benefits like release management, rollback capabilities, and templated values for environment-specific configurations.
RBAC and Least-Privilege Principles
The controller needs permissions to watch and modify resources. Kubebuilder generates RBAC from the //+kubebuilder:rbac markers in your controller:
```go
//+kubebuilder:rbac:groups=apps.example.com,resources=myapps,verbs=get;list;watch;create;update;patch;delete
//+kubebuilder:rbac:groups=apps.example.com,resources=myapps/status,verbs=get;update;patch
//+kubebuilder:rbac:groups=apps,resources=deployments,verbs=get;list;watch;create;update;patch;delete
```

These markers generate the ClusterRole in config/rbac/role.yaml:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: manager-role
rules:
- apiGroups: ["apps.example.com"]
  resources: ["myapps"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["apps.example.com"]
  resources: ["myapps/status"]
  verbs: ["get", "update", "patch"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
```

Pro Tip: Grant only the permissions your controller actually needs. If your operator only creates Deployments in a single namespace, use a Role instead of a ClusterRole. Audit your markers periodically—permissions tend to accumulate during development.
For multi-tenant clusters, consider namespace-scoped operators. Restrict the manager’s watch scope to specific namespaces through controller-runtime’s cache options (see the sketch below), then use Role and RoleBinding instead of cluster-wide permissions. This approach reduces the blast radius of potential security vulnerabilities and simplifies compliance auditing.
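A minimal sketch of what that looks like in cmd/main.go, assuming controller-runtime v0.16 or newer; the WATCH_NAMESPACE environment variable is just a convention here, not something Kubebuilder wires up for you:

```go
import (
    "os"

    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/cache"
)

func managerOptions() ctrl.Options {
    opts := ctrl.Options{ /* scheme, metrics, leader election as scaffolded */ }

    // Restrict the manager's cache (and therefore all watches) to one namespace.
    if ns := os.Getenv("WATCH_NAMESPACE"); ns != "" {
        opts.Cache = cache.Options{
            DefaultNamespaces: map[string]cache.Config{ns: {}},
        }
    }
    return opts
}
```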
Production Checklist
Before going live, verify these essentials:
- Leader election enabled for high availability (--leader-elect=true)
- Resource limits set on the controller Deployment
- Pod disruption budgets configured
- Metrics endpoint secured or disabled
- Image pulled from a private registry with imagePullSecrets
- Health and readiness probes properly configured
- Logging configured with appropriate verbosity levels
Consider implementing alerting on controller restarts and reconciliation errors. Monitoring the controller_runtime_reconcile_total and controller_runtime_reconcile_errors_total metrics provides visibility into operator health and helps identify issues before they impact workloads.
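If you scrape those metrics with the Prometheus Operator, a hedged example of an alert on sustained reconcile failures might look like this; the rule name, job label, thresholds, and severity are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: webapp-operator-alerts
spec:
  groups:
  - name: webapp-operator
    rules:
    - alert: OperatorReconcileErrors
      expr: rate(controller_runtime_reconcile_errors_total{job="webapp-operator"}[5m]) > 0
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "WebApp operator reconciliations are failing"
```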
Your operator is now running in production, continuously reconciling your custom resources. The patterns you’ve learned—the reconciliation loop, owner references, and RBAC scoping—apply to operators of any complexity, from simple application deployers to sophisticated database controllers managing stateful workloads.
Key Takeaways
- Start with kubebuilder init and kubebuilder create api to scaffold your project, then focus on designing your CRD’s Spec and Status before writing reconciliation logic
- Always make your Reconcile function idempotent—check current state before taking action, and use owner references to let Kubernetes handle cleanup automatically
- Use envtest for integration testing during development, then validate against a real cluster before production deployment