The five pillars of data observability

  • Freshness — is data arriving on schedule? A table that should update hourly but hasn't in six hours is failing silently.
  • Volume — is the row count within expected bounds? A 90% drop is a pipeline failure; a 200% spike is worth investigating.
  • Distribution — are column value distributions stable? A null rate jumping from 1% to 30% usually signals a schema change or a source bug.
  • Schema — did columns appear, disappear, or change type?
  • Lineage — when something breaks, which upstream source caused it, and which downstream consumers are affected?
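The first two pillars are the easiest to express in code. A minimal sketch, assuming you can fetch a table's last-load timestamp and current row count from your warehouse (the function names, thresholds, and return values here are illustrative, not any vendor's API):

```python
from datetime import datetime, timedelta, timezone

def check_freshness(last_loaded_at: datetime, expected_interval: timedelta) -> bool:
    """Freshness: has the table been updated within its expected interval?"""
    return datetime.now(timezone.utc) - last_loaded_at <= expected_interval

def check_volume(row_count: int, baseline: int,
                 drop_pct: float = 0.9, spike_pct: float = 2.0) -> str:
    """Volume: compare the latest count to a rolling baseline."""
    if row_count <= baseline * (1 - drop_pct):
        return "failure"      # a 90% drop usually means the pipeline broke
    if row_count >= baseline * (1 + spike_pct):
        return "investigate"  # a 200% spike may be duplicates or a backfill
    return "ok"

# An hourly table last loaded 6 hours ago fails the freshness check.
stale = not check_freshness(
    datetime.now(timezone.utc) - timedelta(hours=6), timedelta(hours=1)
)
```

The two thresholds mirror the rule of thumb above: a near-total drop pages immediately, while a spike only opens a ticket.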

Implementation approach

Start with freshness and volume monitors; in practice they catch roughly 80% of issues. Add distribution monitors for high-value tables. Tooling options include Monte Carlo, Bigeye, and Metaplane (commercial), or re_data and custom Great Expectations checkpoints (open source).
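A custom distribution check does not need a framework to get started. A sketch of a null-rate monitor using an in-memory SQLite table as a stand-in for a warehouse connection (the table, column, and thresholds are invented for the example):

```python
import sqlite3

def null_rate(conn: sqlite3.Connection, table: str, column: str) -> float:
    """Fraction of rows in `table` where `column` is NULL."""
    total, = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    nulls, = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    ).fetchone()
    return nulls / total if total else 0.0

# Stand-in data: a warehouse table with one null out of four user_ids.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER)")
conn.executemany("INSERT INTO events VALUES (?)", [(1,), (2,), (None,), (4,)])

rate = null_rate(conn, "events", "user_id")
baseline, alert_threshold = 0.01, 0.10  # alert well before a 1% -> 30% jump
should_alert = rate > max(baseline, alert_threshold)
```

The same query pattern extends to other distribution metrics (min/max bounds, cardinality, value frequencies) by swapping the inner SELECT.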

Alert fatigue

Tune thresholds with ML-based anomaly detection rather than static rules. A table that legitimately has zero rows on Sundays should not page on-call every Sunday morning.
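The Sunday example can be handled even without full ML: condition the threshold on day of week. A sketch using per-weekday history (the history values are invented; a real monitor would read them from metadata):

```python
from statistics import mean, stdev

# Recent row counts keyed by weekday (0=Monday .. 6=Sunday); invented data.
history = {
    6: [0, 0, 0, 0],                  # Sundays legitimately load zero rows
    0: [9800, 10100, 9950, 10050],    # typical Monday volumes
}

def is_anomalous(weekday: int, row_count: int, k: float = 3.0) -> bool:
    """Flag counts more than k standard deviations from that weekday's mean."""
    past = history[weekday]
    mu, sigma = mean(past), stdev(past)
    # Floor sigma so a perfectly stable history doesn't page on tiny noise.
    return abs(row_count - mu) > k * max(sigma, 1.0)
```

Zero rows on a Sunday sits exactly on that weekday's mean and stays quiet, while zero rows on a Monday is many standard deviations out and pages. ML-based detectors generalize the same idea to trends and multiple seasonalities.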