Why Kubernetes for data

Kubernetes provides consistent infrastructure for spinning up ephemeral compute (Spark executors, Airflow workers) and managing long-running services (databases, Kafka) in a single control plane.

Spark on Kubernetes

Since Spark 2.3, Kubernetes has been a natively supported cluster manager. In cluster deploy mode the driver runs as a pod and requests executor pods directly from the Kubernetes API. For declarative job management, use the Kubeflow Spark Operator, which introduces a SparkApplication custom resource.

spark-submit \
  --master k8s://https://my-cluster:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=my-spark:3.5 \
  s3a://bucket/jobs/transform.py
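The operator equivalent of the spark-submit above might look like the following. This is a sketch: the resource name, namespace, service account, and sizing values are illustrative assumptions, while the image and application file come from the command above.

```yaml
# Hypothetical SparkApplication for the Kubeflow Spark Operator.
# metadata, serviceAccount, and resource sizes are illustrative.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: transform
  namespace: data-jobs
spec:
  type: Python
  mode: cluster
  image: my-spark:3.5
  mainApplicationFile: s3a://bucket/jobs/transform.py
  sparkVersion: "3.5.0"
  driver:
    cores: 1
    memory: 2g
    serviceAccount: spark   # needs RBAC to create executor pods
  executor:
    instances: 4
    cores: 2
    memory: 4g
```

Because the job is now a Kubernetes resource, it can be versioned in Git and applied with kubectl, and the operator handles submission, restarts, and status reporting.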

Airflow on Kubernetes

Use the KubernetesExecutor: each task runs in its own pod, and the pod is deleted when the task completes. There are no idle worker pods consuming resources, and tasks are fully isolated from one another (each can use its own image and resource requests).
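A minimal airflow.cfg fragment enabling this might look like the following; the namespace and pod template path are illustrative assumptions (in Airflow versions before 2.7 the section is named [kubernetes] rather than [kubernetes_executor]):

```ini
[core]
executor = KubernetesExecutor

[kubernetes_executor]
# namespace and template path are assumptions for this sketch
namespace = airflow
pod_template_file = /opt/airflow/pod_template.yaml
# remove finished worker pods so they don't accumulate
delete_worker_pods = True
```

The pod_template_file lets you set defaults (image, resources, volumes) for every task pod, which individual tasks can then override.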

Storage

Use S3/GCS for large datasets — never store data on pod-local ephemeral storage. For stateful services (Postgres, Redis), use PersistentVolumeClaims backed by high-IOPS block storage.
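A PersistentVolumeClaim for a database along these lines might look like the following sketch; the claim name, size, and the fast-ssd storage class are assumptions, and a matching high-IOPS StorageClass must already exist in the cluster:

```yaml
# Illustrative PVC for a Postgres data directory.
# storageClassName assumes the cluster defines a high-IOPS class.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
spec:
  accessModes:
    - ReadWriteOnce   # single-node read-write, typical for a database volume
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 100Gi
```

The Postgres pod then mounts this claim as a volume, so the data survives pod restarts and rescheduling, unlike pod-local ephemeral storage.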