
Running Spark at Scale: Production Kubernetes Deployments with the Spark Operator


You’ve containerized your Spark jobs and pointed them at a Kubernetes cluster, but now you’re managing a sprawl of kubectl commands, YAML files, and custom scripts just to submit jobs. When a job fails at 3 AM, you’re SSH-ing into pods to debug, parsing driver logs manually, and trying to figure out which executor died and why. Your CI/CD pipeline has grown into a fragile shell script that templates out pod specs, manages service accounts, and prays that resource quotas don’t change between environments.

The promise of Kubernetes was declarative infrastructure—define your desired state, let the control plane handle the rest. But spark-submit in client mode treats Kubernetes like a dumb container scheduler. You’re back to imperative commands: “run this job now, with these arguments, on that cluster.” There’s no drift detection, no self-healing, no native integration with Kubernetes RBAC or pod security policies. Your Spark applications exist outside the Kubernetes API, invisible to your monitoring stack and GitOps workflows.

This impedance mismatch isn’t just an operational annoyance—it’s a scaling barrier. When you’re running dozens of jobs daily, each one becomes a snowflake. Different teams maintain different submission scripts. Resource limits are hardcoded. Retry logic lives in cron jobs. You’ve built a parallel orchestration layer on top of Kubernetes instead of leveraging what the platform already provides.

The Spark Operator flips this model. Instead of submitting jobs imperatively, you define SparkApplication resources—native Kubernetes objects that describe your job’s desired state. The operator watches these resources and manages the full lifecycle: creating driver and executor pods, handling failures, exposing metrics, cleaning up resources. Your Spark jobs become first-class citizens in your cluster, with all the operational benefits that come with it.

Why Spark Operator Changes the Game

When you run Spark on Kubernetes with spark-submit, you’re essentially firing off imperative commands and hoping for the best. Each job submission creates a driver pod, which then spawns executor pods, but you’re left managing this lifecycle through shell scripts and kubectl commands. There’s no native way to track job state, handle restarts, or integrate with your existing Kubernetes tooling. For one-off analytics queries, this works. For production data pipelines running hundreds of jobs per day, it becomes a maintenance nightmare.

Visual: Spark Operator architecture diagram showing declarative SparkApplication resources being reconciled by the operator controller

The Spark Operator fundamentally changes this model by introducing SparkApplication as a Kubernetes Custom Resource Definition (CRD). Instead of imperative job submission, you declare the desired state of your Spark workload in YAML and let the operator handle the heavy lifting. This architectural shift brings the same benefits to Spark that Deployments brought to stateless applications: declarative configuration, automatic reconciliation, and native Kubernetes integration.

From Imperative to Declarative

Consider what happens when a driver pod crashes with spark-submit. You need external monitoring to detect the failure, custom retry logic, and manual intervention to clean up orphaned executor pods. The Spark Operator handles this automatically through its reconciliation loop. It monitors SparkApplication resources, manages the complete pod lifecycle from creation through cleanup, and applies restart policies without custom orchestration code.

This becomes critical for production requirements like automatic retries with exponential backoff, dependency management through init containers, and integration with cluster autoscaling. When your Spark job needs to wait for data availability or pull artifacts from S3 before starting, you express these requirements declaratively rather than wrapping spark-submit in increasingly complex bash scripts.
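The restart semantics can be pictured with a short sketch. This is an illustrative model only — the operator's real reconciliation logic lives in its Go controller, and the exponential-backoff formula here is a simplified stand-in for its retry handling:

```python
def should_restart(policy: str, exit_code: int, retries_used: int,
                   max_retries: int) -> bool:
    """Simplified model of the operator's restart policies:
    Always, Never, and OnFailure with a retry budget."""
    if policy == "Always":
        return True
    if policy == "Never":
        return False
    # OnFailure: restart only failed runs, up to the retry limit
    return exit_code != 0 and retries_used < max_retries

def backoff_intervals(retries: int, base_s: int = 10, factor: float = 2.0):
    """Wait times before each successive retry under exponential backoff."""
    return [base_s * factor ** i for i in range(retries)]

assert should_restart("OnFailure", exit_code=1, retries_used=0, max_retries=2)
assert not should_restart("OnFailure", exit_code=0, retries_used=0, max_retries=2)
assert backoff_intervals(3) == [10.0, 20.0, 40.0]
```

The point of the model: retry policy is declared once in the resource, not re-implemented in every team's submission script.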

Built-in Production Capabilities

The operator ships with production features that would take significant engineering effort to build yourself. It exposes Prometheus metrics for job state, execution time, and failure rates without custom exporters. It supports volume mounts for dependency JARs, ConfigMaps for application properties, and Secrets for credentials—all through standard Kubernetes primitives.

Resource management becomes transparent. You specify driver and executor resource requests and limits in the SparkApplication spec, and the operator translates these into appropriate pod configurations. When combined with cluster autoscaling, this enables true dynamic resource allocation: nodes scale up to handle large jobs and scale down during idle periods, all based on declarative resource requirements rather than manual capacity planning.

The monitoring story improves dramatically. Every SparkApplication has a status subresource that tracks submission state, driver pod name, and execution phase. You query job status through kubectl get sparkapplications rather than parsing driver logs. Failed jobs show termination reasons in the resource status, and the operator maintains an audit trail of state transitions.

For teams running GitOps workflows with ArgoCD or Flux, SparkApplication resources fit naturally into your existing deployment pipeline. Your Spark jobs live in git alongside your application code, go through the same review process, and deploy through the same continuous delivery mechanisms.

With the architectural foundation clear, the next step is getting the operator running in your cluster.

Installing the Spark Operator with Helm

The Spark Operator runs as a Kubernetes controller that watches for SparkApplication custom resources and manages their lifecycle. Helm provides the cleanest installation path, handling CRD registration, RBAC setup, and webhook configuration in a single deployment.

Adding the Repository and Installing

Start by adding the official Spark Operator Helm repository maintained by the Kubernetes Spark community:

install-operator.sh
helm repo add spark-operator https://kubeflow.github.io/spark-operator
helm repo update
helm install spark-operator spark-operator/spark-operator \
  --namespace spark-operator \
  --create-namespace \
  --set webhook.enable=true \
  --set webhook.port=8080 \
  --set sparkJobNamespace=spark-jobs

This installs the operator in its own spark-operator namespace while configuring it to watch for SparkApplications in the spark-jobs namespace. The webhook component handles pod mutation—automatically injecting sidecars, mounting volumes, and setting environment variables based on your SparkApplication spec.

The sparkJobNamespace parameter defines which namespace the operator monitors for custom resources. For multi-tenant environments, you can configure the operator to watch multiple namespaces or use cluster-wide scope by omitting this parameter entirely. However, namespace isolation provides stronger security boundaries and simplifies resource quotas and network policies.

Configuring RBAC and Service Accounts

The Helm chart creates default service accounts, but production deployments require explicit RBAC boundaries. Create a dedicated service account for your Spark applications with minimal required permissions:

create-spark-rbac.sh
kubectl create namespace spark-jobs
kubectl create serviceaccount spark-sa -n spark-jobs
kubectl create role spark-role -n spark-jobs \
  --verb=get,list,watch,create,delete,patch \
  --resource=pods,configmaps,services
kubectl create rolebinding spark-role-binding -n spark-jobs \
  --role=spark-role \
  --serviceaccount=spark-jobs:spark-sa

Reference this service account in your SparkApplication manifests using spec.driver.serviceAccount: spark-sa. This prevents Spark jobs from accessing resources outside their designated namespace—critical for multi-tenant clusters.

The permissions granted here allow Spark drivers to create executor pods, manage ConfigMaps for application configuration, and establish headless services for executor communication. Avoid granting broader permissions like cluster-admin or wildcard resource access, as this violates least-privilege principles and can expose your cluster to privilege escalation attacks if a Spark job is compromised.

Enterprise Registry Integration

Production Spark images typically live in private registries like Harbor, ECR, or Artifactory. Configure image pull secrets for authentication:

configure-registry.sh
kubectl create secret docker-registry regcred \
  --namespace spark-jobs \
  --docker-server=registry.company.io \
  --docker-username=spark-user \
  --docker-password='<registry-password>'
kubectl patch serviceaccount spark-sa -n spark-jobs \
  -p '{"imagePullSecrets": [{"name": "regcred"}]}'

The operator automatically propagates these secrets to driver and executor pods when the service account references them. For AWS ECR, consider using IAM roles for service accounts (IRSA) instead of static credentials. This approach rotates tokens automatically and eliminates the need to store long-lived credentials in Kubernetes secrets. GKE offers similar functionality through Workload Identity, and Azure provides AAD Pod Identity for AKS clusters.

If your organization uses multiple registries—for example, base images from Docker Hub and custom Spark images from a private registry—create separate secrets and reference them all in the service account’s imagePullSecrets array.

Webhook Certificate Management

The admission webhook requires TLS certificates for secure communication with the Kubernetes API server. The Helm chart handles certificate generation by default, but production clusters using service meshes like Istio need manual certificate injection:

verify-webhook.sh
kubectl get validatingwebhookconfigurations spark-operator-webhook
kubectl get mutatingwebhookconfigurations spark-operator-webhook

Both webhooks should show CABundle populated and FailurePolicy: Fail to reject malformed SparkApplications before they reach the operator. The validating webhook enforces schema correctness and prevents invalid configurations from entering the cluster, while the mutating webhook applies defaults and injects required sidecars.

For clusters using cert-manager, configure the Helm chart to use cert-manager-generated certificates by setting webhook.certManager.enable=true. This automates certificate rotation and integrates with your existing PKI infrastructure. Without automated rotation, webhook certificates expire after one year by default, causing all SparkApplication submissions to fail until you manually regenerate them.

💡 Pro Tip: Set webhook.namespaceSelector in your Helm values to limit webhook scope to specific namespaces. This reduces API server overhead and prevents the webhook from processing every pod creation across your entire cluster.

Validation

Confirm the operator is running and watching the correct namespaces:

check-operator.sh
kubectl get pods -n spark-operator
kubectl logs -n spark-operator deployment/spark-operator | grep "Starting workers"

You should see log entries indicating the controller started successfully and registered the sparkapplications.sparkoperator.k8s.io CRD. Test the installation by submitting a simple SparkApplication—if the operator processes it and creates driver pods, your installation is functioning correctly. With the operator deployed and RBAC configured, you’re ready to define your first declarative Spark workload.

Your First SparkApplication Resource

The SparkApplication Custom Resource Definition (CRD) transforms how you deploy Spark jobs on Kubernetes. Instead of executing spark-submit with a long chain of command-line flags, you define your entire application specification in declarative YAML that can be version-controlled, reviewed, and deployed through standard Kubernetes workflows.

Anatomy of a SparkApplication

A SparkApplication resource contains three primary components: the driver specification, executor specification, and application dependencies. Here’s a complete example that processes customer transaction data:

transaction-processor.yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: transaction-processor
  namespace: data-pipelines
spec:
  type: Scala
  mode: cluster
  image: "registry.company.io/spark-apps:3.5.0"
  imagePullPolicy: Always
  mainClass: com.company.analytics.TransactionProcessor
  mainApplicationFile: "local:///opt/spark/jars/transaction-processor.jar"
  sparkVersion: "3.5.0"
  driver:
    cores: 2
    memory: "4096m"
    labels:
      version: "1.2.0"
      team: "data-engineering"
    serviceAccount: spark-driver-sa
  executor:
    cores: 4
    instances: 5
    memory: "8192m"
    labels:
      version: "1.2.0"
      team: "data-engineering"

The driver section defines the Spark driver pod that coordinates job execution. Resource allocations (cores and memory) determine the pod’s resource requests and limits. The serviceAccount field controls RBAC permissions—the driver needs appropriate roles to create and manage executor pods. Labels propagate to all created pods, enabling tracking and grouping in monitoring systems.

The executor section specifies worker pods that process data partitions. The instances field sets the initial executor count, though this can be overridden by dynamic allocation. Each executor pod receives the specified CPU and memory resources. Unlike the driver, which runs as a single pod, executors scale horizontally to handle parallelism.

The mainApplicationFile path uses the local:// scheme because the JAR is pre-packaged in the container image at /opt/spark/jars/. Alternative schemes include s3://, gs://, or hdfs:// for applications stored in remote object storage or distributed filesystems.
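The scheme decides whether the artifact is already inside the image or must be fetched before startup. A small illustrative helper classifying paths by scheme (not operator code — the scheme list mirrors the ones mentioned above):

```python
from urllib.parse import urlparse

# Schemes discussed above: local:// lives in the image; the rest are remote.
REMOTE_SCHEMES = {"s3", "gs", "hdfs"}

def classify_artifact(path: str) -> str:
    """Classify a mainApplicationFile path by its URI scheme."""
    scheme = urlparse(path).scheme
    if scheme == "local":
        return "baked-into-image"
    if scheme in REMOTE_SCHEMES:
        return "fetched-at-startup"
    raise ValueError(f"unsupported scheme: {scheme!r}")

assert classify_artifact("local:///opt/spark/jars/app.jar") == "baked-into-image"
assert classify_artifact("s3://bucket/app.jar") == "fetched-at-startup"
```

Baking JARs into the image gives reproducible deployments; remote schemes trade that for faster iteration without image rebuilds.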

Converting from spark-submit

This SparkApplication replaces an imperative spark-submit command that would look like:

Terminal window
spark-submit \
  --master k8s://https://kubernetes.default.svc:443 \
  --deploy-mode cluster \
  --name transaction-processor \
  --class com.company.analytics.TransactionProcessor \
  --conf spark.executor.instances=5 \
  --conf spark.executor.cores=4 \
  --conf spark.executor.memory=8192m \
  --conf spark.driver.cores=2 \
  --conf spark.driver.memory=4096m \
  --conf spark.kubernetes.container.image=registry.company.io/spark-apps:3.5.0 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-driver-sa \
  local:///opt/spark/jars/transaction-processor.jar

The declarative approach provides immediate benefits: the YAML lives in Git alongside your application code, changes are tracked through pull requests, and deployments integrate with your existing CI/CD pipeline. Configuration drift between environments becomes visible in diff tools, and rollbacks are straightforward Git reverts rather than reconstructing command-line arguments from runbooks.

Volumes and Dependency Management

For applications requiring additional libraries, configuration files, or persistent storage, SparkApplication supports standard Kubernetes volume types:

volumes-example.yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: ml-training
  namespace: data-pipelines
spec:
  type: Python
  mode: cluster
  image: "registry.company.io/pyspark:3.5.0"
  mainApplicationFile: "local:///opt/spark/scripts/train_model.py"
  sparkVersion: "3.5.0"
  deps:
    jars:
      - "s3://company-artifacts/hadoop-aws-3.3.4.jar"
      - "s3://company-artifacts/aws-java-sdk-bundle-1.12.262.jar"
    pyFiles:
      - "s3://company-artifacts/preprocessing.py"
      - "s3://company-artifacts/feature_engineering.py"
  driver:
    cores: 2
    memory: "4096m"
    serviceAccount: spark-driver-sa
    volumeMounts:
      - name: config
        mountPath: /etc/spark/conf
      - name: checkpoint-dir
        mountPath: /mnt/checkpoints
  executor:
    cores: 4
    instances: 3
    memory: "8192m"
    volumeMounts:
      - name: config
        mountPath: /etc/spark/conf
      - name: checkpoint-dir
        mountPath: /mnt/checkpoints
  volumes:
    - name: config
      configMap:
        name: spark-config
    - name: checkpoint-dir
      persistentVolumeClaim:
        claimName: spark-checkpoints

The deps section downloads JARs and Python files before application startup, distributing them to both driver and executors. ConfigMaps inject read-only configuration, while PersistentVolumeClaims provide durable storage for checkpoints or incremental results that survive pod restarts.

Dynamic Resource Allocation

For production workloads with variable data volumes, dynamic executor allocation optimizes costs by scaling executors based on actual demand:

dynamic-allocation.yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: nightly-aggregation
  namespace: data-pipelines
spec:
  type: Python
  mode: cluster
  image: "registry.company.io/pyspark:3.5.0"
  mainApplicationFile: "local:///opt/spark/scripts/aggregation.py"
  sparkVersion: "3.5.0"
  dynamicAllocation:
    enabled: true
    initialExecutors: 2
    minExecutors: 2
    maxExecutors: 20
    shuffleTrackingTimeout: 60s
  driver:
    cores: 1
    memory: "2048m"
    serviceAccount: spark-driver-sa
  executor:
    cores: 4
    memory: "8192m"

This configuration starts with two executors and scales up to twenty based on pending tasks, then scales down during idle periods. The shuffleTrackingTimeout ensures executors aren’t removed prematurely when shuffle data is still being read. Dynamic allocation is particularly effective for ETL pipelines with distinct phases—data loading might require minimal parallelism while join operations temporarily spike executor demand.

Note that when dynamic allocation is enabled, the instances field in the executor spec is ignored. The operator uses initialExecutors instead to determine the starting executor count.
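The scaling behavior reduces to a clamp between the configured bounds. Spark's actual algorithm also weighs scheduler backlog timers and shuffle tracking; this sketch captures only the clamping, using the bounds from the spec above:

```python
import math

def target_executors(pending_tasks: int, tasks_per_executor: int,
                     min_execs: int, max_execs: int) -> int:
    """Desired executor count: enough to cover pending tasks,
    clamped to the configured [minExecutors, maxExecutors] range."""
    needed = math.ceil(pending_tasks / tasks_per_executor) if pending_tasks else 0
    return max(min_execs, min(needed, max_execs))

# Mirrors the spec above: minExecutors=2, maxExecutors=20, 4 cores/executor.
assert target_executors(0, 4, 2, 20) == 2     # idle: hold at the floor
assert target_executors(40, 4, 2, 20) == 10   # 40 pending tasks -> 10 executors
assert target_executors(400, 4, 2, 20) == 20  # demand spike: capped at the ceiling
```

The floor keeps warm executors for the next stage; the ceiling protects the rest of the cluster from a runaway job.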

Environment Variables and Secret Management

Production applications require credentials, API keys, and environment-specific configuration. The Spark Operator integrates seamlessly with Kubernetes secrets:

secure-application.yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: database-export
  namespace: data-pipelines
spec:
  type: Scala
  mode: cluster
  image: "registry.company.io/spark-apps:3.5.0"
  mainClass: com.company.exports.DatabaseExporter
  mainApplicationFile: "local:///opt/spark/jars/db-exporter.jar"
  sparkVersion: "3.5.0"
  driver:
    cores: 1
    memory: "2048m"
    env:
      - name: ENVIRONMENT
        value: "production"
      - name: DB_PASSWORD
        valueFrom:
          secretKeyRef:
            name: postgres-credentials
            key: password
      - name: S3_BUCKET
        value: "s3://company-data-lake/exports"
    envSecretKeyRefs:
      AWS_ACCESS_KEY_ID:
        name: aws-credentials
        key: access-key-id
      AWS_SECRET_ACCESS_KEY:
        name: aws-credentials
        key: secret-access-key
    serviceAccount: spark-driver-sa
  executor:
    cores: 2
    memory: "4096m"
    env:
      - name: ENVIRONMENT
        value: "production"
    envSecretKeyRefs:
      AWS_ACCESS_KEY_ID:
        name: aws-credentials
        key: access-key-id
      AWS_SECRET_ACCESS_KEY:
        name: aws-credentials
        key: secret-access-key

Environment variables can be injected directly or referenced from Kubernetes secrets. The envSecretKeyRefs field provides a cleaner syntax when an entire secret maps to environment variables. Secrets are injected at pod creation time—updating a secret requires resubmitting the SparkApplication to propagate changes.

For applications reading from S3, the AWS credentials in this example enable the Hadoop S3A connector. Executors receive the same credentials since they directly read input splits from S3. Alternatively, use IAM Roles for Service Accounts (IRSA) on EKS to avoid managing long-lived credentials entirely.

With your SparkApplication resources defined, you need strategies for scheduling these jobs reliably and managing cluster resources efficiently across multiple workloads.

Production Patterns: Scheduling and Resource Management

Moving Spark workloads to production means dealing with cost optimization, resource contention, and operational reliability. The Spark Operator provides declarative patterns for scheduling recurring jobs, enforcing resource limits, and leveraging cheaper compute options like spot instances—all integrated with Kubernetes’ native scheduling primitives.

Scheduled Workloads with ScheduledSparkApplication

For batch ETL pipelines and reporting jobs, the ScheduledSparkApplication CRD replaces traditional cron orchestrators. The operator handles job lifecycle management, automatic cleanup of completed runs, and concurrency control—eliminating the need for external schedulers like Airflow for simple recurring Spark jobs.

scheduled-etl-pipeline.yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: ScheduledSparkApplication
metadata:
  name: daily-user-aggregation
  namespace: data-pipelines
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid
  successfulRunHistoryLimit: 3
  failedRunHistoryLimit: 1
  template:
    type: Scala
    mode: cluster
    image: "registry.company.com/spark-jobs:3.5.0"
    imagePullPolicy: Always
    mainClass: com.company.pipelines.UserAggregation
    mainApplicationFile: "local:///opt/spark/jars/pipelines.jar"
    sparkVersion: "3.5.0"
    restartPolicy:
      type: OnFailure
      onFailureRetries: 2
      onFailureRetryInterval: 300
    driver:
      cores: 2
      memory: "4g"
      serviceAccount: spark-operator
    executor:
      cores: 4
      memory: "8g"
      instances: 10

The concurrencyPolicy: Forbid prevents overlapping runs when jobs exceed their expected duration—critical for jobs that write to the same output location. Alternative policies include Allow for parallel execution and Replace to terminate running jobs when a new schedule triggers. Set successfulRunHistoryLimit to retain recent executions for debugging without cluttering your namespace with hundreds of completed pods. The operator automatically creates a new SparkApplication resource for each scheduled run, appending a timestamp suffix to the name.

The schedule field uses standard cron syntax, but unlike traditional cron, the operator maintains execution history as Kubernetes resources. Query past runs with kubectl get sparkapplications -l schedule-name=daily-user-aggregation to audit job success rates and execution times. This native integration with Kubernetes means you can use standard monitoring tools to alert on failed scheduled jobs rather than parsing cron logs.
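To make the schedule concrete, here is a minimal next-run calculator restricted to fixed daily schedules like "0 2 * * *". The operator itself uses a full cron parser; this sketch handles only the minute and hour fields:

```python
from datetime import datetime, timedelta

def next_daily_run(schedule: str, now: datetime) -> datetime:
    """Next fire time for a fixed daily cron expression like '0 2 * * *'."""
    minute, hour, *rest = schedule.split()
    if rest != ["*", "*", "*"]:
        raise ValueError("sketch only supports fixed daily schedules")
    candidate = now.replace(hour=int(hour), minute=int(minute),
                            second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)   # today's slot already passed
    return candidate

# At 03:00 on Jan 1, "0 2 * * *" next fires at 02:00 on Jan 2.
assert next_daily_run("0 2 * * *",
                      datetime(2024, 1, 1, 3, 0)) == datetime(2024, 1, 2, 2, 0)
```

Running this against your cluster's timezone assumptions is a cheap way to catch "the job fired at the wrong hour" surprises before they hit production.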

Preventing Resource Saturation

Without guardrails, a single misconfigured Spark job requesting 1000 executors can starve other workloads. Kubernetes ResourceQuotas and LimitRanges constrain resource consumption at the namespace level, providing hard caps that the Spark Operator respects during pod creation.

spark-namespace-quotas.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: spark-compute-quota
  namespace: data-pipelines
spec:
  hard:
    requests.cpu: "500"
    requests.memory: "2Ti"
    limits.cpu: "800"
    limits.memory: "3Ti"
    pods: "300"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: spark-pod-limits
  namespace: data-pipelines
spec:
  limits:
    - max:
        cpu: "16"
        memory: "64Gi"
      min:
        cpu: "1"
        memory: "2Gi"
      type: Pod

These quotas establish a resource ceiling for all Spark applications in the namespace. The LimitRange prevents individual executor pods from requesting excessive resources while ensuring minimum allocations that prevent underprovisioned executors from causing OOM failures. When a SparkApplication exceeds quota, the driver pod starts successfully but executor creation fails with quota errors visible in kubectl describe sparkapplication. The application remains in a pending state rather than consuming partial resources, making quota violations immediately visible.

The distinction between requests and limits matters for overcommit strategies. Setting limits.cpu higher than requests.cpu allows burstable workloads to use idle CPU cycles while guaranteeing baseline capacity through requests. Memory limits should typically equal requests for Spark executors since the JVM allocates heap upfront—overcommitting memory leads to OOMKilled pods when the kernel reclaims pages during memory pressure.
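Before submitting, it's worth totalling an application's requests against the namespace quota. An illustrative helper with units simplified to whole CPUs and GiB (real quota accounting also counts concurrently running applications and Spark's memory overhead factor):

```python
def fits_quota(driver_cpu: int, driver_mem_gib: int,
               exec_cpu: int, exec_mem_gib: int, exec_instances: int,
               quota_cpu: int, quota_mem_gib: int) -> bool:
    """Does driver + all executors fit under the namespace request quota?"""
    total_cpu = driver_cpu + exec_cpu * exec_instances
    total_mem = driver_mem_gib + exec_mem_gib * exec_instances
    return total_cpu <= quota_cpu and total_mem <= quota_mem_gib

# A 2-core/4GiB driver with 4-core/16GiB executors against the
# 500-CPU / 2Ti (2048GiB) request quota shown above:
assert fits_quota(2, 4, 4, 16, 50, 500, 2048)       # 202 CPU, 804GiB: fits
assert not fits_quota(2, 4, 4, 16, 200, 500, 2048)  # 802 CPU: over the CPU quota
```

The same arithmetic, run in CI against proposed SparkApplication changes, turns quota violations into review-time failures instead of 3 AM pending pods.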

Heterogeneous Node Pools and Affinity

Production clusters often mix node types: memory-optimized instances for drivers processing large broadcast variables, CPU-optimized for executors running compute-intensive transformations, and GPU nodes for ML workloads. Node selectors and pod affinity rules direct Spark components to appropriate hardware, matching workload characteristics to instance capabilities.

multi-pool-spark-app.yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: ml-training-job
spec:
  driver:
    cores: 4
    memory: "16g"
    nodeSelector:
      workload-type: "memory-optimized"
      node.kubernetes.io/instance-type: "r6i.2xlarge"
  executor:
    cores: 8
    memory: "32g"
    instances: 20
    nodeSelector:
      workload-type: "compute-optimized"
      node.kubernetes.io/instance-type: "c6i.4xlarge"
    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                  - key: spark-role
                    operator: In
                    values: ["executor"]
              topologyKey: kubernetes.io/hostname

The anti-affinity rule spreads executors across nodes to improve fault tolerance and network bandwidth—losing a single node impacts fewer tasks. Use preferredDuringScheduling rather than requiredDuringScheduling to maintain scheduling flexibility when the cluster approaches capacity. A hard requirement would cause the application to hang indefinitely if insufficient nodes are available, while a preference degrades gracefully by co-locating executors when necessary.

For data-intensive pipelines reading from network-attached storage, use podAffinity to schedule executors near nodes with mounted volumes, reducing cross-AZ traffic costs. The topologyKey determines the scope of affinity rules: kubernetes.io/hostname spreads across nodes, while topology.kubernetes.io/zone spreads across availability zones for regional fault tolerance.

Cost Optimization with Spot Instances

Spot instances offer 60-80% cost savings but require handling interruptions gracefully. Tolerations allow Spark executors to run on spot node pools, while the driver remains on stable on-demand instances to prevent entire application failures during spot reclamation events.

spot-executor-config.yaml
spec:
  driver:
    cores: 2
    memory: "8g"
    nodeSelector:
      capacity-type: "on-demand"
  executor:
    cores: 4
    memory: "16g"
    instances: 50
    nodeSelector:
      capacity-type: "spot"
    tolerations:
      - key: "spot-instance"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"

When a spot node receives a termination notice, Kubernetes evicts the executor pods with a 2-minute grace period. Spark’s dynamic allocation automatically replaces lost executors on available nodes, and speculative execution re-runs tasks from interrupted executors—treating spot interruptions identically to hardware failures. Enable speculation with spark.speculation=true and tune spark.speculation.multiplier to balance redundant computation against recovery time.

This pattern works well for fault-tolerant batch workloads where the cost savings justify occasional task retries. Avoid spot instances for streaming jobs with tight SLA requirements or workloads with expensive shuffle operations that lose significant progress when executors fail. For maximum savings, combine spot executors with checkpointing to S3 or persistent volumes—completed stages survive executor loss, and Spark only recomputes lost partitions rather than restarting from scratch.

Monitor spot interruption rates through Kubernetes node events and correlate with job completion times. If interruptions cause excessive retries, reduce executor.instances to spread the workload across more spot capacity pools, decreasing the probability of simultaneous terminations.
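The "spread across more pools" intuition is simple probability. Assuming an independent per-executor interruption probability p over a job's runtime — a deliberate simplification, since reclamations within one capacity pool are correlated — the chance of losing at least one executor is:

```python
def p_any_interruption(p_per_executor: float, executors: int) -> float:
    """Probability that at least one of n independent executors is
    reclaimed during the job: 1 - (1 - p)^n."""
    return 1 - (1 - p_per_executor) ** executors

# With a 2% per-executor interruption chance, 50 executors make at least
# one loss very likely (~64%), which is why retries must be cheap.
assert round(p_any_interruption(0.02, 50), 2) == 0.64
assert round(p_any_interruption(0.02, 1), 10) == 0.02
```

The takeaway: at fleet scale, spot interruptions are a certainty to design for, not an edge case, so speculative execution and checkpointing are table stakes rather than optional tuning.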

With scheduling patterns, resource controls, and cost optimization strategies in place, the next critical piece is understanding what your Spark jobs are actually doing in production through comprehensive monitoring and observability.

Monitoring and Observability

Production Spark workloads demand comprehensive observability. Unlike traditional spark-submit deployments where monitoring requires custom tooling, the Spark Operator exposes metrics and status through native Kubernetes primitives, letting you leverage your existing monitoring stack. This integration provides real-time visibility into application health, resource utilization, and execution patterns without requiring separate monitoring infrastructure.

Prometheus Metrics Integration

The Spark Operator configures Prometheus metrics endpoints automatically when you enable the metrics exporter in your SparkApplication. Add this configuration to expose application-level metrics:

spark-job-with-metrics.yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: data-processing-job
  namespace: spark-jobs
spec:
  monitoring:
    exposeDriverMetrics: true
    exposeExecutorMetrics: true
    prometheus:
      jmxExporterJar: "/prometheus/jmx_prometheus_javaagent-0.17.2.jar"
      port: 8090
  driver:
    cores: 2
    memory: "4g"
    labels:
      metrics: "enabled"
    serviceAccount: spark-operator-spark
  executor:
    cores: 4
    memory: "8g"
    instances: 3
    labels:
      metrics: "enabled"

Configure Prometheus to scrape these endpoints using a ServiceMonitor resource. The Spark Operator creates services for driver pods automatically when metrics are enabled:

servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: spark-metrics
  namespace: spark-jobs
spec:
  selector:
    matchLabels:
      metrics: "enabled"
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics

This exposes critical JVM metrics, executor task metrics, and custom application metrics you define in your Spark code. Track memory pressure, task completion rates, and stage durations directly in your existing Grafana dashboards. Key metrics include garbage collection performance, heap utilization, task serialization time, and shuffle read/write throughput—all essential for identifying bottlenecks in distributed data processing pipelines.

The JMX exporter collects metrics from both the driver and executor JVMs, providing visibility into the entire application lifecycle. You can configure custom JMX rules to expose domain-specific metrics from your Spark applications, such as record processing rates, data quality scores, or business-specific KPIs that complement standard Spark metrics.

SparkApplication Status Tracking

The Spark Operator maintains detailed status conditions in the SparkApplication resource. Query application state programmatically:

check_spark_status.py
from kubernetes import client, config

config.load_kube_config()
custom_api = client.CustomObjectsApi()

app = custom_api.get_namespaced_custom_object(
    group="sparkoperator.k8s.io",
    version="v1beta2",
    namespace="spark-jobs",
    plural="sparkapplications",
    name="data-processing-job",
)

status = app.get('status', {})
app_state = status.get('applicationState', {}).get('state')
driver_info = status.get('driverInfo', {})

print(f"Application State: {app_state}")
print(f"Driver Pod: {driver_info.get('podName')}")
print(f"Submission Time: {status.get('submissionTime')}")
print(f"Completion Time: {status.get('terminationTime')}")

# Check for failures
if app_state == 'FAILED':
    error_message = status.get('applicationState', {}).get('errorMessage')
    print(f"Failure reason: {error_message}")

The status includes phases like SUBMITTED, RUNNING, COMPLETED, and FAILED, along with timestamps and error messages. Use these fields to build automated workflows that react to job completion or trigger alerts on failures. The operator updates status conditions continuously, allowing you to track progress through intermediate states like PENDING_RERUN for jobs configured with restart policies.

Status tracking integrates seamlessly with Kubernetes-native tools like kubectl and argo workflows. Watch for status changes using kubectl events or trigger downstream jobs based on completion status in your orchestration pipelines. The operator maintains a complete audit trail of state transitions, invaluable for debugging failed jobs or analyzing execution patterns across multiple runs.
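A downstream controller built on these status fields usually reduces to a small dispatch over the application state. An illustrative mapping over the status dict shape shown earlier (the state names match the operator's phases; the action names are invented for this sketch):

```python
def next_action(status: dict) -> str:
    """Map a SparkApplication status subresource to a pipeline action."""
    state = status.get("applicationState", {}).get("state")
    if state == "COMPLETED":
        return "trigger-downstream"
    if state == "FAILED":
        return "page-oncall"
    if state in ("SUBMITTED", "RUNNING", "PENDING_RERUN"):
        return "wait"
    return "investigate"  # unknown or missing state

assert next_action({"applicationState": {"state": "COMPLETED"}}) == "trigger-downstream"
assert next_action({"applicationState": {"state": "FAILED"}}) == "page-oncall"
assert next_action({}) == "investigate"
```

Keeping the "unknown state" branch explicit matters in practice: a new operator version can introduce phases your automation has never seen.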

Structured Logging with Fluent Bit

Centralize Spark logs by deploying Fluent Bit as a DaemonSet to collect container logs. Configure log parsing for Spark’s output format:

fluent-bit-spark-parser.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  parsers.conf: |
    [PARSER]
        Name        spark
        Format      regex
        Regex       ^(?<time>\d{2}\/\d{2}\/\d{2}\s+\d{2}:\d{2}:\d{2})\s+(?<level>\w+)\s+(?<class>[\w\.]+):\s+(?<message>.*)$
        Time_Key    time
        Time_Format %y/%m/%d %H:%M:%S

Add namespace and pod labels to distinguish between driver and executor logs. Filter logs by application name or job ID to trace execution flow across distributed executors. Structured logging enables powerful queries in your log aggregation backend—search for specific exception types, correlate errors across executor pods, or analyze task failure patterns that span multiple containers.

Forward parsed logs to Elasticsearch, Loki, or CloudWatch Logs for long-term retention and analysis. Configure log retention policies based on job criticality, keeping detailed logs for production workloads while aggressively pruning development job logs to control storage costs.
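One way to attach those namespace and pod labels is Fluent Bit's built-in kubernetes filter, which enriches each record with pod metadata (including labels like the operator's spark-role). A sketch of the filter section that would sit alongside the parser above — the Match pattern assumes the default kube.* tag from the tail input:

```ini
# Sketch: enrich container logs with Kubernetes metadata so driver and
# executor records can be distinguished by pod labels and namespace.
[FILTER]
    Name                kubernetes
    Match               kube.*
    Merge_Log           On
    Keep_Log            Off
    K8S-Logging.Parser  On
```

With this in place, queries in your backend can filter on `kubernetes.labels.spark-role` to separate driver from executor output.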

Alerting on Critical Conditions

Configure PrometheusRules to alert on job failures and resource exhaustion:

spark-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: spark-job-alerts
  namespace: spark-jobs
spec:
  groups:
    - name: spark
      interval: 30s
      rules:
        - alert: SparkJobFailed
          expr: kube_customresource_status{customresource_kind="SparkApplication",status_state="FAILED"} == 1
          for: 1m
          annotations:
            summary: "Spark job {{ $labels.name }} failed"
        - alert: SparkExecutorOOMKilled
          expr: rate(kube_pod_container_status_terminated_reason{reason="OOMKilled",namespace="spark-jobs"}[5m]) > 0
          annotations:
            summary: "Spark executor OOM killed in {{ $labels.namespace }}"

Extend these base alerts with application-specific conditions. Monitor job duration to detect performance regressions, track executor churn rates to identify resource contention, or alert when data processing lag exceeds acceptable thresholds. Combine Kubernetes resource metrics with Spark-specific metrics for comprehensive coverage—alert not just when a pod fails, but when task completion rates drop below expected levels even while the application continues running.
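A duration-regression rule might look like the sketch below. It uses kube_pod_start_time from kube-state-metrics; the 4-hour threshold and driver-pod name pattern are illustrative and depend on your naming scheme:

```yaml
# Sketch: alert when a Spark driver pod has been running longer than expected.
# Threshold and pod regex are assumptions, not operator defaults.
- alert: SparkJobRunningTooLong
  expr: (time() - kube_pod_start_time{namespace="spark-jobs", pod=~".*-driver"}) > 4 * 3600
  for: 5m
  annotations:
    summary: "Spark driver {{ $labels.pod }} has been running for over 4 hours"
```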

These monitoring foundations enable you to treat Spark jobs as first-class Kubernetes workloads. With metrics, status tracking, and structured logging in place, you can integrate Spark deployments into your GitOps workflow and maintain the same operational standards you apply to microservices and other cloud-native applications.

GitOps Workflow with ArgoCD

Managing Spark workloads through GitOps brings the same benefits you’ve come to expect from infrastructure-as-code: version control, peer review, automated deployments, and rollback capabilities. SparkApplication manifests are declarative Kubernetes resources, making them natural candidates for GitOps workflows with ArgoCD.

Repository Structure

Organize your SparkApplication manifests in a Git repository with environment-specific overlays using Kustomize or Helm:

spark-apps-repo/
spark-apps/
├── base/
│   ├── kustomization.yaml
│   └── daily-etl.yaml
└── overlays/
    ├── dev/
    │   ├── kustomization.yaml
    │   └── resources.yaml
    ├── staging/
    │   ├── kustomization.yaml
    │   └── resources.yaml
    └── prod/
        ├── kustomization.yaml
        └── resources.yaml

The base manifest defines your SparkApplication with environment-agnostic configurations, while overlays patch resource limits, executor counts, and environment-specific settings:

base/daily-etl.yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: daily-etl
  namespace: spark-apps
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/my-company/spark-etl:3.5.0"
  mainClass: com.company.etl.DailyProcessor
  mainApplicationFile: "local:///opt/spark/jars/etl-app.jar"
  sparkVersion: "3.5.0"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 2
    onFailureRetryInterval: 30
  driver:
    cores: 1
    memory: "2g"
    serviceAccount: spark-driver
  executor:
    instances: 5
    cores: 2
    memory: "4g"

overlays/prod/resources.yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: daily-etl
spec:
  executor:
    instances: 20
    cores: 4
    memory: "16g"
  driver:
    cores: 2
    memory: "8g"

The Kustomize approach keeps your base configuration DRY while allowing environment-specific customization without duplicating the entire manifest. For more complex scenarios with multiple applications sharing common patterns, Helm charts provide templating capabilities and values-based configuration.
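For concreteness, the wiring between base and overlay is just two small kustomization.yaml files. This is a sketch using Kustomize's `patches` field; file names match the tree above:

```yaml
# base/kustomization.yaml — sketch: the base lists its manifests.
resources:
  - daily-etl.yaml
```

```yaml
# overlays/prod/kustomization.yaml — sketch: pull in the base and apply
# the prod resource patch by strategic merge on the matching name/kind.
resources:
  - ../../base
patches:
  - path: resources.yaml
```

Running `kubectl kustomize overlays/prod` renders the merged SparkApplication, which is exactly what ArgoCD applies for that environment.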

ArgoCD Application Configuration

Create an ArgoCD Application resource for each environment. This example targets the production overlay:

argocd/spark-apps-prod.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: spark-apps-prod
  namespace: argocd
spec:
  project: data-platform
  source:
    repoURL: https://github.com/my-company/spark-applications.git
    targetRevision: main
    path: spark-apps/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: spark-apps
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

With automated sync enabled, ArgoCD deploys SparkApplication changes within seconds of merging to your main branch. The operator then handles the actual Spark cluster provisioning. The prune: true setting ensures deleted manifests are removed from the cluster, while selfHeal: true automatically corrects any manual changes, maintaining Git as the single source of truth.

Environment Promotion Patterns

A typical promotion workflow moves changes through environments by updating ArgoCD’s targetRevision field or using branch-based deployments. For branch-based promotion, point development environments at a develop branch, staging at release/* branches, and production at tagged releases:

argocd/spark-apps-dev.yaml
spec:
  source:
    targetRevision: develop # Auto-deploy latest development changes

argocd/spark-apps-prod.yaml
spec:
  source:
    targetRevision: v1.2.3 # Only deploy tagged releases

This approach enforces a gated promotion process where production deployments require explicit version tags, while lower environments receive continuous updates. When a release candidate passes staging validation, tag the commit and update the production Application’s targetRevision.

For organizations requiring manual approval gates, disable automated sync in production and use ArgoCD’s CLI or UI to trigger deployments after validation:

syncPolicy:
  automated: null # Disable auto-sync for manual promotion

ArgoCD Application Sets provide an alternative approach for managing identical applications across multiple environments. Instead of creating separate Application manifests, define a single ApplicationSet that generates Applications for each environment:

argocd/spark-apps-set.yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: spark-apps
  namespace: argocd
spec:
  generators:
    - list:
        elements:
          - env: dev
            cluster: https://kubernetes.default.svc
          - env: staging
            cluster: https://kubernetes.default.svc
          - env: prod
            cluster: https://prod-cluster-api.company.com
  template:
    metadata:
      name: 'spark-apps-{{env}}'
    spec:
      project: data-platform
      source:
        repoURL: https://github.com/my-company/spark-applications.git
        targetRevision: main
        path: 'spark-apps/overlays/{{env}}'
      destination:
        server: '{{cluster}}'
        namespace: spark-apps

This pattern scales particularly well when managing dozens of SparkApplications with consistent deployment patterns across environments.

Managing Secrets

Never commit database credentials or API keys directly to Git. Use Sealed Secrets or External Secrets Operator to manage sensitive data:

base/secrets.yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: spark-etl-secrets
  namespace: spark-apps
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: SecretStore
  target:
    name: spark-etl-secrets
  data:
    - secretKey: database-password
      remoteRef:
        key: prod/spark/etl-db-credentials
        property: password
    - secretKey: api-key
      remoteRef:
        key: prod/spark/external-api
        property: key

Reference these secrets in your SparkApplication using environment variables or mount them as files:

spec:
  driver:
    env:
      - name: DB_PASSWORD
        valueFrom:
          secretKeyRef:
            name: spark-etl-secrets
            key: database-password
      - name: API_KEY
        valueFrom:
          secretKeyRef:
            name: spark-etl-secrets
            key: api-key

The External Secrets Operator synchronizes secrets from external providers like AWS Secrets Manager, HashiCorp Vault, or Azure Key Vault, keeping sensitive data out of Git while maintaining GitOps principles. The refreshInterval ensures that when credentials are rotated in the external system, the updated values propagate to the cluster automatically.

For Sealed Secrets, encrypt secrets using the kubeseal CLI before committing them to Git. The SealedSecret controller running in your cluster decrypts them into standard Kubernetes Secrets:

kubectl create secret generic spark-etl-secrets \
--from-literal=database-password=secret123 \
--dry-run=client -o yaml | \
kubeseal --controller-namespace kube-system -o yaml > sealed-secret.yaml

💡 Pro Tip: Use ArgoCD’s Application Sets to manage multiple SparkApplications across environments with a single manifest, reducing duplication and configuration drift.

This GitOps approach transforms Spark deployments from manual kubectl apply commands into auditable, peer-reviewed changes with full rollback capabilities. Every production deployment becomes a Git commit, giving you the same rigor for data processing workloads that you apply to application code. When issues arise, reverting to a previous working state is as simple as rolling back a Git commit and letting ArgoCD reconcile the cluster state.

With monitoring and GitOps in place, the final piece is handling the inevitable production issues. The next section covers troubleshooting techniques and performance tuning strategies for production Spark workloads.

Troubleshooting and Performance Tuning

Production Spark workloads fail in predictable ways. The Spark Operator surfaces diagnostics through Kubernetes-native primitives, making troubleshooting faster than traditional spark-submit deployments.

Reading Operator Events for Root Cause Analysis

When a SparkApplication fails, start with kubectl describe sparkapplication <name>. The Status section shows the current state (FAILED, SUBMISSION_FAILED, INVALIDATING) and the Events section reveals why. Common patterns:

Visual: Troubleshooting workflow showing kubectl commands to diagnose SparkApplication failures and common error patterns

OOMKilled executors appear as pod evictions in events with exit code 137. This indicates your executor memory settings (spark.executor.memory) are too close to the Kubernetes pod memory limit. The JVM heap plus off-heap memory (for shuffles, caching, and overhead) must fit within the pod’s resource request. A safe rule: set pod memory limits 10-15% higher than spark.executor.memory to account for off-heap overhead.
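In SparkApplication terms, that headroom is expressed with the memoryOverhead field, from which the operator derives the pod's memory request and limit. The values below are illustrative:

```yaml
# Sketch: give each executor JVM 4g of heap plus ~15% off-heap headroom.
# The pod's memory sizing comes from memory + memoryOverhead.
spec:
  executor:
    memory: "4g"
    memoryOverhead: "614m"   # roughly 15% of 4g
```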

Missing dependencies manifest as ClassNotFoundException in driver logs. Unlike standalone clusters, Kubernetes doesn’t share a distributed filesystem by default. All JARs must be accessible via HTTP/S3, included in the container image, or mounted via ConfigMaps. The operator’s spec.deps.jars field handles remote dependencies, but building custom images with dependencies baked in eliminates runtime fetch failures.
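A sketch of that deps field for remote JARs — the bucket and artifact paths are placeholders:

```yaml
# Sketch: fetch application dependencies from object storage at submission
# time instead of baking them into the image. Paths are illustrative.
spec:
  deps:
    jars:
      - "s3a://my-artifacts/jars/etl-helpers.jar"
      - "https://repo.example.com/libs/common-utils.jar"
```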

Network timeouts during shuffle operations point to insufficient executor-to-executor bandwidth or aggressive timeout settings. The operator deploys executors as headless services, enabling direct pod-to-pod communication. Check that your CNI supports pod networking without NAT and increase spark.network.timeout from the default 120s to 300s or higher for large shuffles.
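Those timeout and retry settings live under sparkConf in the SparkApplication spec. The values below are illustrative starting points, not recommended defaults:

```yaml
# Sketch: loosen shuffle-related timeouts and retries for large shuffles.
spec:
  sparkConf:
    "spark.network.timeout": "300s"
    "spark.shuffle.io.maxRetries": "10"
    "spark.shuffle.io.retryWait": "30s"
```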

Right-Sizing Resource Requests

Start by dividing each node's allocatable CPU by the number of executors you want to pack per node; that quotient is your executor core count. For memory-intensive workloads, allocate 4-8GB per core; for CPU-bound jobs, 2-4GB per core suffices. Monitor actual usage with kubectl top pods during runs and adjust iteratively.
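The sizing arithmetic above is simple enough to script. This hypothetical helper applies those rules of thumb; the function name and the 4 GB-per-core default are assumptions for illustration, not anything the operator provides:

```python
# Sketch: derive per-executor resources from node capacity using the
# rules of thumb above. Purely illustrative arithmetic, no operator API.
def size_executors(node_cores, executors_per_node, gb_per_core=4):
    """Return (cores, memory_gb) for a single executor."""
    cores = node_cores // executors_per_node
    return cores, cores * gb_per_core

# A 16-core node split into 4 executors: 4 cores and 16 GB each.
print(size_executors(16, 4))  # (4, 16)
```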

💡 Pro Tip: Enable dynamic allocation (spark.dynamicAllocation.enabled: true) in the SparkApplication spec to let Kubernetes scale executors based on workload, reducing costs during idle periods.
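In the v1beta2 CRD, that tip translates to the dynamicAllocation block below; the executor bounds are illustrative:

```yaml
# Sketch: let the operator scale executors between 2 and 20 with the workload.
spec:
  dynamicAllocation:
    enabled: true
    initialExecutors: 2
    minExecutors: 2
    maxExecutors: 20
```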

Optimizing Shuffle Storage

Shuffle data writes to emptyDir volumes by default, using node ephemeral storage. For shuffle-heavy jobs exceeding 100GB of intermediate data, mount persistent volumes to /tmp in executor pods. This prevents node disk exhaustion and improves performance on I/O-optimized storage classes. Configure shuffle compression (spark.io.compression.codec: lz4) to reduce both storage and network pressure.
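Mounting a larger scratch volume is a volumes-plus-volumeMounts pairing in the spec. A sketch with illustrative names and sizes — note that Spark on Kubernetes treats volumes whose names start with spark-local-dir- as local scratch directories, so swapping the emptyDir for a persistentVolumeClaim follows the same shape:

```yaml
# Sketch: back executor shuffle space with a size-limited emptyDir;
# replace emptyDir with a PVC for shuffle-heavy jobs. Names illustrative.
spec:
  volumes:
    - name: spark-local-dir-1
      emptyDir:
        sizeLimit: "100Gi"
  executor:
    volumeMounts:
      - name: spark-local-dir-1
        mountPath: /tmp
```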

With these diagnostic techniques in place, maintaining production Spark clusters becomes a repeatable operational practice rather than ad-hoc firefighting.

Key Takeaways

  • Deploy Spark Operator with Helm and configure RBAC for secure multi-tenant environments
  • Replace imperative spark-submit workflows with declarative SparkApplication CRDs managed through GitOps
  • Implement dynamic executor allocation and spot instance tolerations to reduce compute costs by 40-70%
  • Integrate Prometheus metrics and structured logging to get observability on par with other Kubernetes workloads