Monitoring and Alerting for Data Pipelines

What to monitor, SLIs/SLOs for data, and building effective alerting.

Updated July 23, 2026

Pipeline SLIs

Freshness — time since last successful run. Alert if > 1.5× the expected interval.
Row count delta — rows loaded this run vs 7-day average. Alert when the deviation exceeds ±30%.
Error rate — failed tasks / total tasks. Alert on > 1%.
Duration — P95 job duration. Alert when P95 exceeds 2× the baseline.
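
The four thresholds above can be checked in one pass after each run. The following is a minimal sketch, not a definitive implementation: `RunStats`, its field names, and `evaluate_slis` are hypothetical, and it assumes the scheduler or metadata store already provides these per-run numbers.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class RunStats:
    """Hypothetical per-run summary; all field names are illustrative."""
    last_success: datetime        # end time of the last successful run
    rows_loaded: int              # rows loaded by the current run
    rows_last_7_days: list[int]   # rows loaded per run over the past 7 days
    failed_tasks: int
    total_tasks: int
    p95_duration_s: float         # P95 job duration, in seconds
    baseline_duration_s: float    # agreed baseline duration, in seconds


def evaluate_slis(stats: RunStats, expected_interval: timedelta,
                  now: datetime) -> list[str]:
    """Return the names of the SLIs that breach the thresholds listed above."""
    breaches = []

    # Freshness: time since last success is more than 1.5x the expected interval.
    if now - stats.last_success > 1.5 * expected_interval:
        breaches.append("freshness")

    # Row count delta: more than 30% above or below the 7-day average.
    avg_rows = mean(stats.rows_last_7_days)
    if avg_rows > 0 and abs(stats.rows_loaded - avg_rows) / avg_rows > 0.30:
        breaches.append("row_count_delta")

    # Error rate: failed tasks / total tasks above 1%.
    if stats.total_tasks > 0 and stats.failed_tasks / stats.total_tasks > 0.01:
        breaches.append("error_rate")

    # Duration: P95 more than 2x the baseline.
    if stats.p95_duration_s > 2 * stats.baseline_duration_s:
        breaches.append("duration")

    return breaches
```

Returning the list of breached SLI names, rather than raising on the first breach, lets one run report several problems at once, which is usually what the on-call engineer wants to see.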

Infrastructure SLIs

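Kafka consumer group lag — alert when lag keeps growing across consecutive checks instead of draining, which means consumers cannot keep up with producers.
Dead letter queue depth — any message in the DLQ requires investigation, so alert on depth > 0.
Replication lag — alert when the replica falls further behind the primary than downstream readers can tolerate.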
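
A fixed lag threshold tends to fire on harmless bursts, so the consumer lag alert is better expressed as a trend. The sketch below treats "growing" as strictly increasing over the last few samples, which is one possible interpretation rather than a standard definition. The function names and the default of five samples are assumptions, and collecting the lag samples from Kafka is out of scope here.

```python
def lag_is_growing(samples: list[int], min_samples: int = 5) -> bool:
    """Return True when each of the last `min_samples` lag readings is
    strictly larger than the one before it.

    `samples` is a hypothetical list of consumer group lag readings,
    oldest first, taken at a fixed interval.
    """
    if len(samples) < min_samples:
        return False  # not enough history to call it a trend
    window = samples[-min_samples:]
    return all(later > earlier for earlier, later in zip(window, window[1:]))


def dlq_needs_investigation(depth: int) -> bool:
    """Any message in the dead letter queue is worth a look."""
    return depth > 0
```

A single dip inside the window resets the trend, so a consumer that is slow but still catching up does not page anyone.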

Tooling

Use Prometheus + Grafana for infrastructure metrics. Airflow can emit its built-in metrics to StatsD. Custom data quality metrics produced with Great Expectations or re_data can be pushed to Prometheus.
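
As an illustration of how little is needed to emit a custom metric, here is a minimal sketch that sends a gauge in the plain StatsD line format over UDP using only the standard library. Note that this takes the StatsD route rather than pushing to Prometheus directly, so it assumes a StatsD listener (for example a StatsD-to-Prometheus exporter) is running on the target host. The metric name, host, and function names are hypothetical; 8125 is the conventional StatsD port.

```python
import socket


def format_gauge(name: str, value: float) -> bytes:
    """Encode a gauge in the plain StatsD line format: <name>:<value>|g."""
    return f"{name}:{value}|g".encode("ascii")


def send_gauge(name: str, value: float,
               host: str = "localhost", port: int = 8125) -> None:
    """Send one gauge datagram over UDP.

    UDP is fire-and-forget: no error is raised if nothing is listening,
    so a metrics outage cannot fail the pipeline run itself.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_gauge(name, value), (host, port))
```

For example, a data quality step could call `send_gauge("pipeline.orders.null_ratio", 0.02)` after computing the share of null values in a column.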

On-call runbooks

Every alert must have a linked runbook. The runbook answers: what is broken, what is the user impact, how do I diagnose, and how do I fix it. Without runbooks, alerts create panic instead of resolution.
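
The runbook rule can be enforced mechanically, for instance as a check that runs whenever alert definitions change. The sketch below is one way to do that under stated assumptions: the `Alert` class, its fields, and the section keys are hypothetical and simply mirror the four questions above.

```python
from dataclasses import dataclass, field

# The four questions a runbook must answer, as hypothetical section keys.
REQUIRED_SECTIONS = ("what_is_broken", "user_impact", "diagnosis", "fix")


@dataclass
class Alert:
    """Hypothetical alert definition with its linked runbook."""
    name: str
    runbook_url: str = ""
    runbook_sections: dict[str, str] = field(default_factory=dict)


def runbook_problems(alert: Alert) -> list[str]:
    """Return a list of problems; an empty list means the alert passes."""
    problems = []
    if not alert.runbook_url:
        problems.append("missing runbook link")
    for section in REQUIRED_SECTIONS:
        # A section that is absent or only whitespace counts as unanswered.
        if not alert.runbook_sections.get(section, "").strip():
            problems.append(f"runbook section empty: {section}")
    return problems
```

Failing the check on any non-empty result keeps alerts without a usable runbook from reaching the on-call rotation.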
