Why Kubernetes for data
Kubernetes provides consistent infrastructure for spinning up ephemeral compute (Spark executors, Airflow workers) and managing long-running services (databases, Kafka) in a single control plane.
Spark on Kubernetes
Since Spark 2.3, Kubernetes has been a natively supported cluster manager. In cluster mode, spark-submit creates the driver pod, and the driver then requests executor pods directly from the Kubernetes API. For declarative job management, use the Kubeflow Spark Operator, which adds a SparkApplication custom resource.
spark-submit \
  --master k8s://https://my-cluster:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=my-spark:3.5 \
  s3a://bucket/jobs/transform.py
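The same job can be expressed declaratively for the Spark Operator as a SparkApplication resource. A minimal sketch, reusing the my-spark:3.5 image and bucket path from the command above; the namespace, service account, and resource sizes are placeholder assumptions:

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: transform
  namespace: data            # placeholder namespace
spec:
  type: Python
  mode: cluster
  image: my-spark:3.5
  mainApplicationFile: s3a://bucket/jobs/transform.py
  sparkVersion: "3.5.0"
  driver:
    cores: 1
    memory: 2g
    serviceAccount: spark    # placeholder service account with pod-create RBAC
  executor:
    instances: 2
    cores: 2
    memory: 4g
```

The operator watches for these resources and runs spark-submit on your behalf, so jobs can be managed with kubectl and GitOps tooling instead of ad hoc submit commands.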
Airflow on Kubernetes
Use the KubernetesExecutor: each task runs in its own pod, which is deleted on completion. There are no idle worker pods consuming resources, and every task is fully isolated.
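Enabling this is a scheduler-level setting. A sketch of the relevant airflow.cfg fragment, assuming Airflow 2.x (where the section is named kubernetes_executor); the namespace value is a placeholder:

```ini
[core]
executor = KubernetesExecutor

[kubernetes_executor]
namespace = airflow          # placeholder namespace for task pods
delete_worker_pods = True    # remove each pod once its task finishes
```

Individual tasks can still override their pod spec (image, resources) via the executor_config argument on the operator, so one DAG can mix lightweight and heavyweight tasks.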
Storage
Use S3/GCS object storage for large datasets; never keep data on pod-local ephemeral storage, which is lost when the pod is deleted. For stateful services (Postgres, Redis), use PersistentVolumeClaims backed by a high-IOPS block-storage class.
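As a sketch, a PersistentVolumeClaim for a Postgres instance might look like the following, assuming a hypothetical fast-ssd StorageClass provisioned on SSD-backed block storage; the name and size are illustrative:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
spec:
  accessModes:
    - ReadWriteOnce            # block volumes attach to one node at a time
  storageClassName: fast-ssd   # hypothetical high-IOPS storage class
  resources:
    requests:
      storage: 100Gi
```

ReadWriteOnce is the usual mode for databases: the volume follows the pod to whichever node schedules it, but cannot be mounted by multiple nodes concurrently.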