What Spark is

Spark is a distributed compute engine for large-scale data processing. It parallelises work across a cluster, keeping intermediate data in memory where possible to avoid disk I/O.

DataFrames over RDDs

Always prefer the DataFrame API over raw RDDs for new code. DataFrames go through the Catalyst query optimizer and the Tungsten execution engine, so they are typically faster than hand-written RDD code and easier to read.

Spark SQL

Register DataFrames as temporary views and query them with standard SQL. This works well for ELT pipelines and for teams more comfortable with SQL than Python.

# assumes an active SparkSession `spark` and a DataFrame `df`
# with `date` and `amount` columns
df.createOrReplaceTempView("orders")
spark.sql("SELECT date, SUM(amount) AS total FROM orders GROUP BY date").show()

When Spark is the right tool

  • Dataset does not fit in a single machine's RAM.
  • ML training on datasets > ~50 GB.
  • Complex multi-stage transformations on hundreds of millions of rows.

When Spark is overkill

For datasets under a few gigabytes, DuckDB or pandas will typically run the same query in a fraction of the time, with zero cluster overhead.
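As a sketch of that small-data path, here is the same date/amount aggregation from the Spark SQL section done in pandas, with no cluster at all (the sample rows are illustrative):

```python
import pandas as pd

# A few in-memory rows stand in for a small dataset.
orders = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "amount": [10.0, 5.0, 7.5],
})

# Equivalent of: SELECT date, SUM(amount) FROM orders GROUP BY date
totals = orders.groupby("date", as_index=False)["amount"].sum()
print(totals)
```

At this scale the whole job is a single function call; there is no session to start, no cluster to size, and nothing to shut down.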