What DuckDB is
DuckDB is an embedded, columnar OLAP database. It runs in-process with no server to manage, reads Parquet, CSV, and JSON files directly, and executes vectorised SQL queries in parallel across all available CPU cores.
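Because DuckDB queries files in place, a quick aggregation needs no import step at all. A minimal sketch (the file name and columns here are hypothetical; DuckDB infers the schema from the file):

```sql
-- 'sales.csv' is a hypothetical local file; DuckDB detects its schema on read
SELECT product, SUM(amount) AS total
FROM 'sales.csv'
GROUP BY product
ORDER BY total DESC;
```

The same query works unchanged against a Parquet or JSON file — only the path changes.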
Why it is surprisingly fast
- Columnar, vectorised execution — operators process batches of values (vectors) at a time rather than one row per call, which keeps data in CPU caches and lets the compiler emit SIMD instructions.
- Parallel query execution using all available cores.
- Filter and projection pushdown into Parquet scans — row-group min/max statistics let it skip irrelevant row groups without reading them from disk.
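You can see the pushdown in the query plan. A sketch (the file and column names are assumptions for illustration):

```sql
-- 'events.parquet' is a hypothetical file; EXPLAIN shows the filter
-- applied inside the Parquet scan operator rather than in a separate
-- filter step, so pruned row groups are never read
EXPLAIN
SELECT user_id, event_type
FROM read_parquet('events.parquet')
WHERE event_date = DATE '2024-01-01';
```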
Query S3 directly
INSTALL httpfs; LOAD httpfs;
SELECT year, SUM(revenue)
FROM read_parquet('s3://my-bucket/data/year=*/sales.parquet')
GROUP BY year;
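For private buckets, credentials must be configured before the query runs. A sketch using DuckDB's session settings (the region and key values are placeholders; recent DuckDB versions also offer a CREATE SECRET mechanism for the same purpose):

```sql
-- Placeholder credentials — substitute your own; region is an assumption
SET s3_region = 'us-east-1';
SET s3_access_key_id = 'YOUR_KEY_ID';
SET s3_secret_access_key = 'YOUR_SECRET';
```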
When to use DuckDB instead of Spark
- The dataset fits on one machine — roughly up to a few hundred gigabytes; DuckDB can spill intermediate results to disk, so it is not strictly limited by RAM.
- Ad-hoc exploration — DuckDB starts in milliseconds, while spinning up a Spark session typically takes tens of seconds or more.
- CI/CD pipeline unit tests on data transformations.
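A CI data test can be expressed directly in SQL. A sketch of one such assertion (the output path, column name, and use of DuckDB's `error()` function to fail the query are assumptions about your pipeline):

```sql
-- Hypothetical smoke test: fail the CI job if the transformation output
-- contains duplicate order IDs
SELECT CASE
         WHEN COUNT(*) = COUNT(DISTINCT order_id) THEN 'ok'
         ELSE error('duplicate order_id values found')
       END AS check_result
FROM read_parquet('output/orders.parquet');
```

Because the query raises an error on failure, running it through the DuckDB CLI returns a non-zero exit code, which is enough to fail most CI steps.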
Limitations
Single node only — no distributed compute. DuckDB also permits only one read-write process per database at a time, so for datasets that need cluster-scale processing or highly concurrent write workloads, use Spark or a data warehouse.