Principles
- Product over plumbing: data exists to drive outcomes, not just pipelines.
- Privacy by design: minimize, anonymize, and purpose-bind from day one.
- Contracts everywhere: typed schemas, versioned datasets, testable SLAs.
- Observability built-in: lineage, quality, and cost are first-class signals.
Pipelines
- Ingest: CDC/stream + batch; define source truth and retain raw, append-only.
- Transform: ELT with SQL-first models (dbt-style) + Python where needed.
- Serve: marts per domain (analytics), feature store (ML), and APIs for apps.
- SLOs: freshness, completeness, and schema-stability with alerts.
Vector Stores (for RAG/Similarity)
- Indexing: choose HNSW/IVF/ScaNN per workload; store
embedding,text,metadata,source_id,timestamp. - Hybrid: vector + BM25; re-rank with cross-encoder; keep doc chunk boundaries.
- Lifecycle: TTL for stale docs; re-embed on model change via backfill jobs.
- Governance: access tags/row-level filters; immutable source links for citations.
Data Modeling & Governance
- Layers: raw (immutable) → staging (cleaned) → curated (modeled) → marts.
- Schemas: JSON/Avro/Parquet with evolution rules; semantic layer for metrics.
- Ownership: domains own models + SLAs; platform provides tooling + guardrails.
- Catalog: searchable glossary, data contracts, lineage graphs, usage stats.
Privacy & Compliance
- Minimization: collect only what’s needed; drop sensitive columns at the edge.
- Pseudonymization: salted hashes; tokenization for joins; reversible keys vaulted.
- Differential privacy: bins + noise where aggregate reporting is enough.
- Policy-as-code: purpose-binding, consent flags, retention TTLs, subject request tooling.
Data Quality & Lineage
- Checks: schema, null %, range, uniqueness, referential integrity; block bad publishes.
- Lineage: column-level where possible; impact analysis before merges.
- Drift: monitor distribution shifts; canary new sources with backfills.
Security & Access
- AuthZ: least privilege, row/column-level security, attribute-based access control.
- Secrets: KMS + secrets manager; rotate; never in code or notebooks.
- Encryption: TLS in transit; at-rest with per-tenant keys where feasible.
Platform & Orchestration
Sources → Ingest (CDC/Streams/Batch) → Lake/Warehouse
→ Transform (dbt/SQL + Py) → Semantic Layer
→ Serving (Marts/Feature Store/Vector DB/APIs)
↔ Observability (lineage, quality, cost, usage)
- Orchestrate: DAGs with retries/backoff; idempotent tasks; data-aware scheduling.
- Envs: dev/stage/prod with promotion; reproducible containers; infra as code.
Cost & FinOps
- Visibility: per-table and per-query cost; tag by team/product/feature.
- Controls: partition/prune; materialize only hot sets; auto-vacuum/compaction.
- Budgets: alert on spend drift; kill switches for runaway jobs.
Data SRE & Reliability
- SLOs: freshness and availability by dataset; error budgets for change velocity.
- Runbooks: on-call dashboards, replay steps, backfill patterns.
- Incidents: clear severities; blameless postmortems with action items.
30-60-90 Day Ramp
- 30d: catalog + lineage on top 10 datasets; raw→staging→curated pattern; basic quality checks.
- 60d: semantic layer for key metrics; vector store wired to doc pipeline; dataset SLOs + alerts.
- 90d: privacy policy-as-code (retention, masking); per-team cost dashboards; replay/backfill tooling.