Why Kubernetes for data
Kubernetes provides consistent infrastructure for spinning up ephemeral compute (Spark executors, Airflow workers) and managing long-running services (databases, Kafka) in a single control plane.
Spark on Kubernetes
Since Spark 2.3, Kubernetes has been a natively supported cluster manager. In cluster mode, spark-submit creates the driver pod, and the driver then requests executor pods directly from the Kubernetes API. For declarative job management, use the Kubeflow Spark Operator, which adds a SparkApplication custom resource.
spark-submit \
  --master k8s://https://my-cluster:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=my-spark:3.5 \
  s3a://bucket/jobs/transform.py
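The same job can be expressed declaratively for the Spark Operator as a SparkApplication resource. A minimal sketch, reusing the my-spark:3.5 image and bucket path from the command above; the namespace, service account, and resource sizes are placeholder assumptions:

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: transform
  namespace: data            # placeholder namespace
spec:
  type: Python
  mode: cluster
  image: my-spark:3.5
  mainApplicationFile: s3a://bucket/jobs/transform.py
  sparkVersion: "3.5.0"
  driver:
    cores: 1
    memory: 2g
    serviceAccount: spark    # placeholder service account with pod-create RBAC
  executor:
    instances: 2
    cores: 2
    memory: 4g
```

The operator watches for these resources and runs spark-submit on your behalf, so jobs can be managed with kubectl and GitOps tooling instead of ad hoc submit commands.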
Airflow on Kubernetes
Use the KubernetesExecutor: each task runs in its own pod, which is deleted on completion. There are no idle worker pods consuming resources, and every task is fully isolated.
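Enabling this is a scheduler-level setting. A sketch of the relevant airflow.cfg fragment, assuming Airflow 2.x (where the section is named kubernetes_executor); the namespace value is a placeholder:

```ini
[core]
executor = KubernetesExecutor

[kubernetes_executor]
namespace = airflow          # placeholder namespace for task pods
delete_worker_pods = True    # remove each pod once its task finishes
```

Individual tasks can still override their pod spec (image, resources) via the executor_config argument on the operator, so one DAG can mix lightweight and heavyweight tasks.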
Storage
Use S3/GCS object storage for large datasets; never keep data on pod-local ephemeral storage, which is lost when the pod is deleted. For stateful services (Postgres, Redis), use PersistentVolumeClaims backed by a high-IOPS block-storage class.
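As a sketch, a PersistentVolumeClaim for a Postgres instance might look like the following, assuming a hypothetical fast-ssd StorageClass provisioned on SSD-backed block storage; the name and size are illustrative:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
spec:
  accessModes:
    - ReadWriteOnce            # block volumes attach to one node at a time
  storageClassName: fast-ssd   # hypothetical high-IOPS storage class
  resources:
    requests:
      storage: 100Gi
```

ReadWriteOnce is the usual mode for databases: the volume follows the pod to whichever node schedules it, but cannot be mounted by multiple nodes concurrently.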