Row vs columnar storage

CSV stores all the fields of row 1, then row 2, and so on. Parquet stores all values for column 1 together, then column 2. For an analytical query that reads 3 columns from a 200-column table, columnar storage touches roughly 1.5% of the bytes a full CSV scan would.
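The two layouts can be sketched in plain Python; the table and column names here are invented for illustration:

```python
# Row-oriented (CSV-like): each record is stored contiguously.
rows = [
    (1, "alpha", 10.0),
    (2, "beta", 20.0),
    (3, "gamma", 30.0),
    (4, "delta", 40.0),
]

# Column-oriented (Parquet-like): each column is stored contiguously.
columns = {
    "id": [1, 2, 3, 4],
    "name": ["alpha", "beta", "gamma", "delta"],
    "score": [10.0, 20.0, 30.0, 40.0],
}

# Reading one column from the row layout touches every record...
names_from_rows = [r[1] for r in rows]
# ...while the columnar layout touches only that column's values.
names_from_cols = columns["name"]
assert names_from_rows == names_from_cols
```

The same principle scales: a query over 3 of 200 columns skips the other 197 entirely in the columnar layout.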

Parquet advantages

  • Compression — columns of the same type compress far better than mixed-type rows. Typical 5–10× size reduction vs uncompressed CSV.
  • Predicate pushdown — query engines skip row groups whose column statistics (min/max) prove no rows can match the filter. The skipped data is never read from disk, not merely filtered out after reading.
  • Schema enforcement — data types are embedded in the file. No silent type coercion from CSV string parsing.
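The compression point can be demonstrated with only the standard library. This is a synthetic sketch, not a Parquet internal: the same values are serialized row-wise and column-wise, and zlib compresses the column layout smaller because like values sit next to each other.

```python
import zlib

# Synthetic table: an increasing id, a low-cardinality category, a float.
n = 10_000
ids = [str(i) for i in range(n)]
cats = [("alpha", "beta", "gamma")[i % 3] for i in range(n)]
vals = [f"{i * 0.1:.3f}" for i in range(n)]

# Row layout: the fields of each record interleaved, CSV-style.
row_bytes = "\n".join(f"{i},{c},{v}" for i, c, v in zip(ids, cats, vals)).encode()

# Column layout: each column's values stored contiguously.
col_bytes = "\n".join(["\n".join(ids), "\n".join(cats), "\n".join(vals)]).encode()

# Same characters either way, but grouping like values gives the
# compressor longer, closer repeats to exploit.
row_size = len(zlib.compress(row_bytes))
col_size = len(zlib.compress(col_bytes))
assert col_size < row_size
```

Parquet goes further than this sketch: it also applies type-specific encodings (dictionary, run-length, delta) per column before general-purpose compression.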

When CSV is still appropriate

  • Human-readable interchange where the recipient has no Parquet tooling.
  • Very small files, where Parquet's metadata and row-group overhead outweighs its benefits.
  • Source system export format you cannot control.

Converting CSV to Parquet

import pandas as pd

# Read the CSV (column types are inferred from the text),
# then write Parquet with Snappy compression via the pyarrow engine.
df = pd.read_csv("data.csv")
df.to_parquet("data.parquet", engine="pyarrow", compression="snappy")