Pipeline SLIs

  • Freshness — time since last successful run. Alert if > 1.5× the expected interval.
  • Row count delta — rows loaded this run vs 7-day average. Alert when the deviation exceeds 30% in either direction.
  • Error rate — failed tasks / total tasks. Alert on > 1%.
  • Duration — P95 job duration. Alert when P95 exceeds 2× the baseline.
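
The four checks above can be sketched as one function. This is a minimal illustration, not a prescribed implementation: the RunStats fields and the pipeline_alerts name are hypothetical, and the thresholds are simply the ones listed in the bullets.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunStats:
    """Hypothetical per-run measurements; field names are illustrative."""
    seconds_since_success: float  # time since the last successful run
    expected_interval_s: float    # scheduled interval between runs
    rows_loaded: int              # rows loaded by this run
    rows_last_7_days: list[int]   # row counts for the trailing 7 days
    failed_tasks: int
    total_tasks: int
    p95_duration_s: float         # current P95 job duration
    baseline_p95_s: float         # agreed baseline P95 duration


def pipeline_alerts(s: RunStats) -> list[str]:
    """Return the names of the SLIs that breach their threshold."""
    alerts = []
    # Freshness: more than 1.5x the expected interval since last success.
    if s.seconds_since_success > 1.5 * s.expected_interval_s:
        alerts.append("freshness")
    # Row count delta: more than 30% away from the 7-day average.
    avg = mean(s.rows_last_7_days)
    if avg > 0 and abs(s.rows_loaded - avg) / avg > 0.30:
        alerts.append("row_count_delta")
    # Error rate: more than 1% of tasks failed.
    if s.total_tasks > 0 and s.failed_tasks / s.total_tasks > 0.01:
        alerts.append("error_rate")
    # Duration: P95 above 2x the baseline.
    if s.p95_duration_s > 2 * s.baseline_p95_s:
        alerts.append("duration")
    return alerts
```

A run that loads 600 rows against a 7-day average of 1,000 deviates by 40% and would trip the row count check; a run that loads 800 rows (20% below average) would not.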

Infrastructure SLIs

  • Kafka consumer group lag — alert when lag keeps growing over time instead of draining.
  • Dead letter queue depth — any messages in DLQ require investigation.
  • Replication lag — alert when a replica falls behind the primary by more than an agreed threshold.
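
One way to make "lag is growing" concrete is to check whether the most recent lag samples are strictly increasing. The sketch below assumes lag is sampled at a fixed interval; the function names and the window of 5 samples are illustrative choices, not values from these notes.

```python
def lag_is_growing(samples: list[int], window: int = 5) -> bool:
    """True if the last `window` lag samples are strictly increasing.

    Assumption: samples are ordered oldest to newest and taken at a
    fixed interval. A consumer that is catching up, or whose lag
    fluctuates, does not trigger.
    """
    if len(samples) < window:
        return False  # not enough history to judge a trend
    recent = samples[-window:]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


def dlq_needs_investigation(depth: int) -> bool:
    """Any message in the dead letter queue warrants a look."""
    return depth > 0
```

A strict-increase check is deliberately simple; a slope over a longer window would tolerate brief dips at the cost of slower detection.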

Tooling

Use Prometheus and Grafana for infrastructure metrics. Airflow's built-in metrics can be exported to StatsD. Custom data quality metrics, computed with Great Expectations or re_data, are pushed to Prometheus.
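
As a rough sketch of emitting a custom data quality metric, the snippet below formats a gauge in the plain StatsD line format (name:value|g) and sends it over UDP using only the standard library. It assumes a StatsD-compatible listener on 127.0.0.1:8125 (for example, a StatsD-to-Prometheus bridge); the metric name and both function names are hypothetical.

```python
import socket


def statsd_gauge_line(name: str, value: float) -> str:
    """Format a gauge in the StatsD line protocol: <name>:<value>|g."""
    return f"{name}:{value}|g"


def send_gauge(name: str, value: float,
               host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one gauge datagram. Host and port are assumptions."""
    payload = statsd_gauge_line(name, value).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # UDP is fire-and-forget: no delivery confirmation is returned.
        sock.sendto(payload, (host, port))


# Hypothetical metric: share of null values found by a quality check.
line = statsd_gauge_line("data_quality.orders.null_rate", 0.02)
```

Because UDP gives no acknowledgement, a silent listener outage loses metrics without raising an error, which is one reason to alert on metric absence as well as on metric values.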

On-call runbooks

Every alert must have a linked runbook. The runbook answers four questions: what is broken, what is the user impact, how do I diagnose it, and how do I fix it. Without runbooks, alerts create panic instead of resolution.
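
The "every alert has a runbook" rule can be enforced mechanically, for instance in CI. The sketch below assumes alerts are described as plain dictionaries with a hypothetical "runbook" entry holding the four answers; the key names are illustrative and not tied to any particular alerting tool.

```python
# The four questions a runbook must answer (key names are hypothetical).
REQUIRED_SECTIONS = ("what_is_broken", "user_impact", "diagnosis", "fix")


def alerts_missing_runbooks(alerts: dict[str, dict]) -> list[str]:
    """Return the names of alerts whose runbook is absent or incomplete."""
    missing = []
    for name, alert in alerts.items():
        runbook = alert.get("runbook") or {}
        # Every section must be present and non-empty.
        if not all(runbook.get(section) for section in REQUIRED_SECTIONS):
            missing.append(name)
    return sorted(missing)


example_alerts = {
    "freshness": {
        "runbook": {
            "what_is_broken": "The nightly load has not completed.",
            "user_impact": "Dashboards show stale data.",
            "diagnosis": "Check the scheduler and the last task logs.",
            "fix": "Re-run the failed task after resolving the cause.",
        }
    },
    "dlq_depth": {"runbook": {"what_is_broken": "Messages are in the DLQ."}},
    "replication_lag": {},
}
```

With the example data, the check would flag dlq_depth (incomplete runbook) and replication_lag (no runbook), while freshness passes.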