Row vs columnar storage
CSV stores all fields for row 1, then row 2, and so on. Parquet stores all values for column 1 together, then column 2. An analytical query that reads 3 columns of a 200-column table therefore touches roughly 1.5% of the bytes a full CSV scan would.
Parquet advantages
- Compression — columns of the same type compress far better than mixed-type rows. Typical 5–10× size reduction vs uncompressed CSV.
- Predicate pushdown — query engines skip row groups whose column statistics (per-column min/max) prove no rows can match the filter. The skipped data is never read from disk, not merely read and discarded.
- Schema enforcement — data types are embedded in the file. No silent type coercion from CSV string parsing.
When CSV is still appropriate
- Human-readable interchange where the recipient has no Parquet tooling.
- Very small files where the Parquet overhead is not worth it.
- Source system export format you cannot control.
Converting CSV to Parquet
import pandas as pd

df = pd.read_csv("data.csv")
# snappy is a good default: fast to decompress with a solid size reduction
df.to_parquet("data.parquet", engine="pyarrow", compression="snappy")