The five pillars of data observability
- Freshness — is data arriving on schedule? A table that should update hourly but hasn't been touched in six hours is a silent failure.
- Volume — is the row count within expected bounds? A 90% drop is a pipeline failure; a 200% spike is worth investigating.
- Distribution — are column value distributions stable? Null rate jumping from 1% to 30% is a schema change or source bug.
- Schema — did columns appear, disappear, or change type?
- Lineage — when something breaks, which upstream source or job introduced the bad data, and which downstream tables, dashboards, and models consume it?
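Freshness and volume, the first two pillars, reduce to simple queries over a table's metadata. A minimal sketch in Python, using an in-memory SQLite table as a stand-in for a warehouse table (the `orders` table and its `updated_at` column are illustrative assumptions, not from any particular tool):

```python
import sqlite3
from datetime import datetime, timedelta, timezone

def check_freshness(conn, table, max_age):
    # Freshness: the newest updated_at must be within max_age of now.
    (latest,) = conn.execute(f"SELECT MAX(updated_at) FROM {table}").fetchone()
    return datetime.now(timezone.utc) - datetime.fromisoformat(latest) <= max_age

def check_volume(conn, table, lo, hi):
    # Volume: the row count must fall within expected bounds.
    (count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    return lo <= count <= hi

# Illustrative data: 100 rows, all updated five minutes ago.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
now = datetime.now(timezone.utc)
rows = [(i, (now - timedelta(minutes=5)).isoformat()) for i in range(100)]
conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)

fresh = check_freshness(conn, "orders", max_age=timedelta(hours=1))
volume_ok = check_volume(conn, "orders", lo=50, hi=200)
print(fresh, volume_ok)  # True True
```

In production the same two queries would run on a schedule against the warehouse's information schema or load logs rather than the table itself.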
Implementation approach
Start with freshness and volume monitors — in practice they catch the large majority of incidents. Add distribution monitors for high-value tables. Tools: Monte Carlo, Bigeye, Metaplane (commercial); re_data or custom Great Expectations checkpoints (open-source).
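A distribution monitor can be as simple as comparing a column's current null rate against a historical baseline. A hedged sketch of a custom check — the 1% baseline and the 5x tolerance are illustrative assumptions, not defaults from any of the tools above:

```python
def null_rate(values):
    # Fraction of values that are None.
    return sum(v is None for v in values) / len(values)

def distribution_alert(values, baseline_rate, tolerance=5.0):
    # Alert when the null rate exceeds the baseline by more than
    # `tolerance`x — a jump from a 1% baseline to 30% fires immediately.
    return null_rate(values) > baseline_rate * tolerance

healthy = [1] * 99 + [None]        # 1% nulls, matches baseline
broken = [1] * 70 + [None] * 30    # 30% nulls, likely a source bug
print(distribution_alert(healthy, baseline_rate=0.01))  # False
print(distribution_alert(broken, baseline_rate=0.01))   # True
```

The same pattern extends to other distribution statistics (mean, cardinality, percentage in an expected set) by swapping the metric function.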
Alert fatigue
Tune thresholds with ML-based anomaly detection rather than static rules. A table that legitimately has zero rows on Sundays should not page the on-call engineer every Sunday morning.
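One way to encode that seasonality without a full ML model is a per-weekday baseline: compare today's row count against the historical mean and standard deviation for the same weekday. A minimal sketch with synthetic history — the 3-sigma threshold and the sample counts are assumptions for illustration:

```python
import statistics
from collections import defaultdict

def weekday_baselines(history):
    # history: list of (weekday, row_count) pairs from past runs.
    by_day = defaultdict(list)
    for weekday, count in history:
        by_day[weekday].append(count)
    return {
        day: (statistics.mean(counts), statistics.pstdev(counts))
        for day, counts in by_day.items()
    }

def should_page(weekday, count, baselines, sigmas=3.0):
    # Page only when the count deviates from this weekday's own baseline;
    # the epsilon keeps zero-variance days (e.g. always-zero Sundays) quiet.
    mean, stdev = baselines[weekday]
    return abs(count - mean) > sigmas * stdev + 1e-9

# Synthetic history: ~10k rows on weekdays (0-4), zero rows on Sunday (6).
history = [(d, 10_000 + d * 10 + i) for d in range(5) for i in range(8)]
history += [(6, 0)] * 8

baselines = weekday_baselines(history)
print(should_page(6, 0, baselines))  # False: zero rows is normal on Sunday
print(should_page(2, 0, baselines))  # True: zero rows on Wednesday is an outage
```

Commercial monitors learn these seasonal profiles automatically; the point of the sketch is that the alert condition is relative to the table's own history, not a fixed row-count floor.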