What DuckDB is

DuckDB is an embedded columnar OLAP database. It runs in-process (no server), reads Parquet/CSV/JSON directly, and executes vectorised SQL queries that saturate all CPU cores.
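Because it runs in-process, a query is just a function call against local files — there is no server to start or connect to. A minimal sketch (the file name is a placeholder):

SELECT *
FROM read_csv_auto('events.csv')
LIMIT 10;

The same pattern works for Parquet and JSON via read_parquet and read_json_auto.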

Why it is surprisingly fast

  • Vectorised columnar execution — operators process batches of roughly two thousand values per call, keeping data in CPU caches and letting the compiler emit SIMD instructions.
  • Parallel query execution using all available cores.
  • Projection and filter pushdown into Parquet reads — row groups whose min/max statistics rule out a predicate are skipped without being loaded.
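The pushdown behaviour can be inspected with EXPLAIN: the predicate typically appears inside the scan operator itself rather than as a separate filter step. An illustrative sketch, assuming a local sales.parquet file with a year column:

EXPLAIN
SELECT revenue
FROM read_parquet('sales.parquet')
WHERE year = 2023;

-- The plan usually shows the filter attached to the PARQUET_SCAN,
-- meaning row groups whose statistics exclude 2023 are never read.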

Query S3 directly

INSTALL httpfs; LOAD httpfs;
SELECT year, SUM(revenue) AS total_revenue
FROM read_parquet('s3://my-bucket/data/year=*/sales.parquet')
GROUP BY year;
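Private buckets need credentials before the query runs. One way to supply them is via the httpfs configuration settings (region and key values below are placeholders):

SET s3_region = 'us-east-1';
SET s3_access_key_id = '...';
SET s3_secret_access_key = '...';

Recent DuckDB versions also support CREATE SECRET for managing credentials; check the httpfs documentation for the mechanism your version prefers.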

When to use DuckDB instead of Spark

  • Dataset fits on one machine (under ~500 GB depending on RAM).
  • Ad-hoc exploration — DuckDB starts in milliseconds; Spark takes minutes.
  • CI/CD pipeline unit tests on data transformations.
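For the unit-test case, a transformation can be exercised entirely in an in-memory database, so the test needs no fixtures on disk. A hedged sketch — table and column names are illustrative:

CREATE TABLE raw_sales (region VARCHAR, revenue DOUBLE);
INSERT INTO raw_sales VALUES ('eu', 100), ('eu', 50), ('us', 200);

-- The transformation under test: revenue per region
SELECT region, SUM(revenue) AS total
FROM raw_sales
GROUP BY region
ORDER BY region;
-- Expected: ('eu', 150.0), ('us', 200.0)

Because DuckDB starts in milliseconds, such a test adds essentially no overhead to a CI run.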

Limitations

Single node only — no distributed compute. For datasets requiring cluster processing or concurrent write workloads, use Spark or a warehouse.