What DuckDB is
DuckDB is an embedded, columnar OLAP database. It runs in-process with no server to manage, reads Parquet, CSV, and JSON files directly, and executes vectorised SQL queries in parallel across all available CPU cores.
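Because DuckDB queries files in place, a quick aggregation needs no import step at all. A minimal sketch (the file name and columns here are hypothetical; DuckDB infers the schema from the file):

```sql
-- 'sales.csv' is a hypothetical local file; DuckDB detects its schema on read
SELECT product, SUM(amount) AS total
FROM 'sales.csv'
GROUP BY product
ORDER BY total DESC;
```

The same query works unchanged against a Parquet or JSON file — only the path changes.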
Why it is surprisingly fast
- Columnar, vectorised execution — operators process batches of values (vectors) at a time rather than one row per call, which keeps data in CPU caches and lets the compiler emit SIMD instructions.
- Parallel query execution using all available cores.
- Filter and projection pushdown into Parquet scans — row-group min/max statistics let it skip irrelevant row groups without reading them from disk.
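You can see the pushdown in the query plan. A sketch (the file and column names are assumptions for illustration):

```sql
-- 'events.parquet' is a hypothetical file; EXPLAIN shows the filter
-- applied inside the Parquet scan operator rather than in a separate
-- filter step, so pruned row groups are never read
EXPLAIN
SELECT user_id, event_type
FROM read_parquet('events.parquet')
WHERE event_date = DATE '2024-01-01';
```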
Query S3 directly
INSTALL httpfs; LOAD httpfs;
SELECT year, SUM(revenue)
FROM read_parquet('s3://my-bucket/data/year=*/sales.parquet')
GROUP BY year;
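For private buckets, credentials must be configured before the query runs. A sketch using DuckDB's session settings (the region and key values are placeholders; recent DuckDB versions also offer a CREATE SECRET mechanism for the same purpose):

```sql
-- Placeholder credentials — substitute your own; region is an assumption
SET s3_region = 'us-east-1';
SET s3_access_key_id = 'YOUR_KEY_ID';
SET s3_secret_access_key = 'YOUR_SECRET';
```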
When to use DuckDB instead of Spark
- The dataset fits on one machine — roughly up to a few hundred gigabytes; DuckDB can spill intermediate results to disk, so it is not strictly limited by RAM.
- Ad-hoc exploration — DuckDB starts in milliseconds, while spinning up a Spark session typically takes tens of seconds or more.
- CI/CD pipeline unit tests on data transformations.
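A CI data test can be expressed directly in SQL. A sketch of one such assertion (the output path, column name, and use of DuckDB's `error()` function to fail the query are assumptions about your pipeline):

```sql
-- Hypothetical smoke test: fail the CI job if the transformation output
-- contains duplicate order IDs
SELECT CASE
         WHEN COUNT(*) = COUNT(DISTINCT order_id) THEN 'ok'
         ELSE error('duplicate order_id values found')
       END AS check_result
FROM read_parquet('output/orders.parquet');
```

Because the query raises an error on failure, running it through the DuckDB CLI returns a non-zero exit code, which is enough to fail most CI steps.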
Limitations
Single node only — no distributed compute. DuckDB also permits only one read-write process per database at a time, so for datasets that need cluster-scale processing or highly concurrent write workloads, use Spark or a data warehouse.