Why lineage matters

  • Debugging: a metric changed. Which upstream table, transformation, or source system caused it?
  • GDPR/compliance: an individual requests deletion. Which tables and columns contain their PII? With column-level lineage this becomes a graph query rather than a manual audit.
  • Impact analysis: a source table's schema is changing. Which downstream models and dashboards break?
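
Debugging and impact analysis both reduce to walking a lineage graph, upstream in one case and downstream in the other. The sketch below is a minimal illustration under stated assumptions: the table names and the edge list are hypothetical, and a real deployment would load edges from a lineage store rather than a hard-coded list.

```python
from collections import defaultdict, deque

# Hypothetical table-level edges: (upstream, downstream).
EDGES = [
    ("crm.customers", "staging.customers"),
    ("shop.orders", "staging.orders"),
    ("staging.customers", "marts.revenue_by_customer"),
    ("staging.orders", "marts.revenue_by_customer"),
    ("marts.revenue_by_customer", "dashboards.weekly_revenue"),
]


def build_index(edges):
    """Return (downstream, upstream) adjacency maps for the edge list."""
    downstream = defaultdict(set)
    upstream = defaultdict(set)
    for src, dst in edges:
        downstream[src].add(dst)
        upstream[dst].add(src)
    return downstream, upstream


def reachable(start, adjacency):
    """Breadth-first search: every node reachable from start, excluding start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}


DOWNSTREAM, UPSTREAM = build_index(EDGES)

# Impact analysis: what is affected if shop.orders changes schema?
impacted = reachable("shop.orders", DOWNSTREAM)

# Debugging: what feeds the dashboard whose metric moved?
suspects = reachable("dashboards.weekly_revenue", UPSTREAM)
```

Keeping both adjacency maps means either question is a single traversal; which direction you walk is the only difference between the two use cases.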

Table-level lineage

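The baseline. Parse the SQL that your transformation layer runs (dbt models, Spark SQL jobs) to extract source → target table relationships; an orchestrator such as Airflow can report which task ran which statement. DataHub and the OpenLineage integrations can collect this automatically for many common engines.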
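
To make the idea of extracting source → target relationships from SQL concrete, here is a deliberately naive sketch using only regular expressions. It is an assumption-heavy illustration, not how DataHub or OpenLineage do it: it handles a single INSERT ... SELECT or CREATE TABLE ... AS statement and ignores CTEs, quoting, comments, and dialect differences. In practice a real SQL parser (sqlglot is one option) is the safer choice. The function name table_lineage and the sample query are hypothetical.

```python
import re

# Naive patterns: they ignore CTEs, quoted identifiers, comments, and dialects.
TARGET_RE = re.compile(
    r"\b(?:insert\s+into|create\s+(?:or\s+replace\s+)?table)\s+([\w.]+)",
    re.IGNORECASE,
)
SOURCE_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def table_lineage(sql):
    """Return (sources, target) for one INSERT ... SELECT or CTAS statement."""
    target_match = TARGET_RE.search(sql)
    target = target_match.group(1) if target_match else None
    # Every table named after FROM or JOIN is a source, except the target itself.
    sources = set(SOURCE_RE.findall(sql)) - {target}
    return sources, target


SQL = """
INSERT INTO marts.revenue_by_customer
SELECT c.customer_id, SUM(o.amount) AS revenue
FROM staging.customers AS c
JOIN staging.orders AS o ON o.customer_id = c.customer_id
GROUP BY c.customer_id
"""

sources, target = table_lineage(SQL)
```

Each (source, target) pair found this way becomes one edge in the table-level lineage graph.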

Column-level lineage

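More powerful but harder to derive, because each output column has to be traced back through expressions, joins, and aliases. dbt Core does not produce column-level lineage on its own; dbt Cloud offers it on some plans, and catalogs such as DataHub can derive it by parsing the compiled SQL. For Spark, tools like Spline capture it at runtime via Spark's query execution listener API.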
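
Once column-level edges exist, the GDPR question of which tables and columns hold an individual's PII can be answered by propagating a PII tag along them. The following is a minimal sketch under stated assumptions: the column names, the edge list, and the rule that PII is tagged only at the source are all hypothetical, and the edges would normally come from a lineage tool rather than being written by hand.

```python
from collections import defaultdict

# Hypothetical column-level edges: (upstream column, downstream column),
# each written as "schema.table.column".
COLUMN_EDGES = [
    ("crm.customers.email", "staging.customers.email"),
    ("crm.customers.id", "staging.customers.customer_id"),
    ("staging.customers.email", "marts.contacts.email_hash"),
    ("staging.customers.customer_id", "marts.revenue_by_customer.customer_id"),
    ("shop.orders.amount", "marts.revenue_by_customer.revenue"),
]

# Assumption: PII is tagged at the source system only.
PII_SOURCES = {"crm.customers.email"}


def propagate_pii(edges, pii_sources):
    """Return every column derived, directly or transitively, from a PII source."""
    children = defaultdict(set)
    for src, dst in edges:
        children[src].add(dst)
    tainted = set(pii_sources)
    stack = list(pii_sources)
    while stack:
        col = stack.pop()
        for nxt in children.get(col, ()):
            if nxt not in tainted:
                tainted.add(nxt)
                stack.append(nxt)
    return tainted


def by_table(columns):
    """Group 'schema.table.column' strings into {table: sorted column names}."""
    grouped = defaultdict(list)
    for col in columns:
        table, _, name = col.rpartition(".")
        grouped[table].append(name)
    return {table: sorted(names) for table, names in grouped.items()}


pii_columns = by_table(propagate_pii(COLUMN_EDGES, PII_SOURCES))
```

Note that the sketch marks the hashed email as PII-derived. Whether a derived column still counts as personal data is a policy decision that lineage can inform but not make.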

OpenLineage

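OpenLineage is an open standard for lineage events. Integrations exist for Airflow, Spark, dbt, and Flink, so each of them can emit OpenLineage events. Marquez (the open-source reference implementation) and DataHub both consume them. Adopt the standard rather than a vendor-specific format, so that producers and backends can be swapped independently.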
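
For orientation, the sketch below builds a minimal run event shaped like the OpenLineage object model (a run, a job, and input and output datasets) as a plain dictionary. Treat it as an illustration rather than a reference: the helper name run_event, the namespace, the producer URL, and the dataset names are hypothetical, and the schemaURL version shown should be checked against the current spec. In real pipelines the existing integrations or the official client libraries would emit these events for you.

```python
import json
import uuid
from datetime import datetime, timezone


def run_event(event_type, run_id, job_name, inputs, outputs,
              namespace="example-namespace"):
    """Build a minimal OpenLineage-style run event as a plain dict."""

    def dataset(name):
        return {"namespace": namespace, "name": name}

    return {
        "eventType": event_type,  # e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [dataset(n) for n in inputs],
        "outputs": [dataset(n) for n in outputs],
        # Hypothetical producer URI identifying the emitting code.
        "producer": "https://example.com/my-pipeline",
        # Assumption: verify the spec version against the current OpenLineage release.
        "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent",
    }


# START and COMPLETE events for the same run share one runId.
run_id = str(uuid.uuid4())
start = run_event("START", run_id, "build_revenue_by_customer",
                  ["staging.customers", "staging.orders"],
                  ["marts.revenue_by_customer"])
complete = run_event("COMPLETE", run_id, "build_revenue_by_customer",
                     ["staging.customers", "staging.orders"],
                     ["marts.revenue_by_customer"])

payload = json.dumps(complete, indent=2)
```

Because the payload is plain JSON, the same event can be sent to any backend that accepts OpenLineage events, which is the practical benefit of adopting the standard.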