Pipeline SLIs
- Freshness — time since last successful run. Alert if > 1.5× the expected interval.
- Row count delta — rows loaded this run vs the 7-day average. Alert when the deviation exceeds 30% in either direction.
- Error rate — failed tasks / total tasks. Alert on > 1%.
- Duration — P95 job duration. Alert when P95 exceeds 2× the baseline.
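The four pipeline thresholds above can be evaluated in one place. The following is a minimal sketch, not a reference implementation: the `RunStats` container, its field names, and the `breached_slis` function are hypothetical, and it assumes the 7-day average and the P95 baseline are computed elsewhere.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class RunStats:
    """Hypothetical snapshot of one pipeline's recent behaviour."""
    last_success: datetime        # end time of the last successful run
    expected_interval: timedelta  # how often a successful run is expected
    rows_loaded: int              # rows loaded by the current run
    rows_7d_avg: float            # average rows per run over the last 7 days
    failed_tasks: int
    total_tasks: int
    p95_duration_s: float         # current P95 job duration, in seconds
    baseline_p95_s: float         # agreed baseline P95, in seconds


def breached_slis(stats: RunStats, now: datetime) -> list[str]:
    """Return the names of the SLIs whose alert threshold is exceeded."""
    breached = []

    # Freshness: more than 1.5x the expected interval since the last success.
    if now - stats.last_success > 1.5 * stats.expected_interval:
        breached.append("freshness")

    # Row count delta: more than 30% above or below the 7-day average.
    if stats.rows_7d_avg > 0:
        delta = abs(stats.rows_loaded - stats.rows_7d_avg) / stats.rows_7d_avg
        if delta > 0.30:
            breached.append("row_count_delta")

    # Error rate: more than 1% of tasks failed.
    if stats.total_tasks > 0 and stats.failed_tasks / stats.total_tasks > 0.01:
        breached.append("error_rate")

    # Duration: P95 more than 2x the baseline.
    if stats.p95_duration_s > 2 * stats.baseline_p95_s:
        breached.append("duration")

    return breached


# Illustrative values only: an hourly pipeline that last succeeded 2 hours ago
# and loaded 40% fewer rows than usual.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
example = RunStats(
    last_success=now - timedelta(hours=2),
    expected_interval=timedelta(hours=1),
    rows_loaded=600,
    rows_7d_avg=1000.0,
    failed_tasks=0,
    total_tasks=50,
    p95_duration_s=300.0,
    baseline_p95_s=200.0,
)
print(breached_slis(example, now))  # ['freshness', 'row_count_delta']
```

Returning the list of breached SLI names, rather than raising on the first breach, lets a single check report every threshold that is exceeded in the same run.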
Infrastructure SLIs
- Kafka consumer group lag — alert when lag keeps growing across consecutive checks instead of draining back toward zero.
- Dead letter queue depth — any messages in DLQ require investigation.
- Replication lag — alert when the replica falls behind the primary by more than an agreed threshold.
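One way to turn the infrastructure checks above into code is sketched below. Everything specific here is an assumption made for illustration: "growing" is read as lag rising at every step across the last five samples, the replication threshold defaults to 30 seconds, and the function and alert names are hypothetical. The lag samples, DLQ depth, and replication lag are assumed to come from whatever collector is already in place.

```python
def lag_is_growing(samples: list[int], min_samples: int = 5) -> bool:
    """True if lag rose at every step across the last `min_samples` readings."""
    if len(samples) < min_samples:
        return False  # not enough history to judge a trend
    window = samples[-min_samples:]
    return all(later > earlier for earlier, later in zip(window, window[1:]))


def infra_alerts(
    lag_samples: list[int],
    dlq_depth: int,
    replication_lag_s: float,
    max_replication_lag_s: float = 30.0,  # assumed threshold, tune per system
) -> list[str]:
    """Return the names of the infrastructure alerts that should fire."""
    alerts = []
    if lag_is_growing(lag_samples):
        alerts.append("consumer_lag_growing")
    if dlq_depth > 0:
        # Any message in the dead letter queue warrants investigation.
        alerts.append("dlq_not_empty")
    if replication_lag_s > max_replication_lag_s:
        alerts.append("replication_lag")
    return alerts


# Illustrative values only: lag climbing on every check, empty DLQ, healthy replica.
print(infra_alerts([100, 180, 260, 400, 900], dlq_depth=0, replication_lag_s=4.2))
# ['consumer_lag_growing']
```

Checking the trend rather than an absolute lag value avoids paging on a short burst that the consumers drain on their own.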
Tooling
Use Prometheus and Grafana for infrastructure metrics. Airflow's built-in metrics can be exported to StatsD. Custom data quality metrics can be produced with Great Expectations or re_data and pushed to Prometheus.
On-call runbooks
Every alert must have a linked runbook. The runbook answers: what is broken, what is the user impact, how do I diagnose, and how do I fix it. Without runbooks, alerts create panic instead of resolution.
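The runbook rule can be enforced mechanically, for example as a check in CI. The sketch below is one possible shape under stated assumptions: alert definitions are plain dictionaries with hypothetical `name` and `runbook_url` fields, and a runbook is a dictionary whose hypothetical section keys mirror the four questions above.

```python
# Hypothetical section keys, one per question a runbook should answer.
REQUIRED_SECTIONS = ("what_is_broken", "user_impact", "diagnosis", "fix")


def alerts_missing_runbook(alerts: list[dict]) -> list[str]:
    """Return the names of alerts that have no runbook link."""
    return [
        alert["name"]
        for alert in alerts
        if not (alert.get("runbook_url") or "").strip()
    ]


def missing_runbook_sections(runbook: dict) -> list[str]:
    """Return the required sections that are absent or empty in a runbook."""
    return [
        section
        for section in REQUIRED_SECTIONS
        if not (runbook.get(section) or "").strip()
    ]


# Illustrative alert definitions; names and URL are made up.
alerts = [
    {"name": "pipeline_freshness", "runbook_url": "https://wiki.example.com/runbooks/freshness"},
    {"name": "dlq_not_empty", "runbook_url": ""},
    {"name": "replication_lag"},
]
print(alerts_missing_runbook(alerts))  # ['dlq_not_empty', 'replication_lag']
```

Failing a build when either function returns a non-empty list keeps new alerts from shipping without a runbook.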