Data & Platform
Pipelines, vector stores, governance, and privacy-first data design.

Principles

  • Product over plumbing: data exists to drive outcomes, not just pipelines.
  • Privacy by design: minimize, anonymize, and purpose-bind from day one.
  • Contracts everywhere: typed schemas, versioned datasets, testable SLAs.
  • Observability built-in: lineage, quality, and cost are first-class signals.

Pipelines

  • Ingest: CDC/stream + batch; define the source of truth and keep raw data append-only.
  • Transform: ELT with SQL-first models (dbt-style) + Python where needed.
  • Serve: marts per domain (analytics), feature store (ML), and APIs for apps.
  • SLOs: freshness, completeness, and schema stability, with alerting; a freshness check is sketched below.
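
A minimal freshness-SLO check, standard library only; the table names, SLO windows, and a warehouse query that yields max_loaded_at are assumptions:

    import logging
    from datetime import datetime, timedelta, timezone

    log = logging.getLogger("freshness")

    FRESHNESS_SLO = {  # per-dataset targets; values are illustrative
        "orders": timedelta(minutes=30),
        "customers": timedelta(hours=6),
    }

    def check_freshness(table: str, max_loaded_at: datetime) -> bool:
        """True if the newest loaded row is inside the table's SLO window."""
        lag = datetime.now(timezone.utc) - max_loaded_at
        if lag > FRESHNESS_SLO[table]:
            log.error("freshness breach: %s lag=%s slo=%s",
                      table, lag, FRESHNESS_SLO[table])
            return False
        return True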

Vector Stores (for RAG/Similarity)

  • Indexing: choose HNSW/IVF/ScaNN per workload; store the embedding, text, metadata, source_id, and timestamp.
  • Hybrid: vector + BM25; re-rank with a cross-encoder; keep doc chunk boundaries (score fusion sketched after this list).
  • Lifecycle: TTL for stale docs; re-embed on model change via backfill jobs.
  • Governance: access tags/row-level filters; immutable source links for citations.
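
The hybrid bullet above leaves score fusion open; one common, calibration-free option is reciprocal rank fusion (RRF), shown here as an assumption rather than a prescribed method (doc IDs and k are illustrative):

    from collections import defaultdict

    def rrf_fuse(result_lists: list[list[str]], k: int = 60) -> list[str]:
        """Merge ranked doc-ID lists by rank, so raw scores need no calibration."""
        scores: dict[str, float] = defaultdict(float)
        for results in result_lists:
            for rank, doc_id in enumerate(results, start=1):
                scores[doc_id] += 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    # e.g. rrf_fuse([vector_hits, bm25_hits])[:20] feeds the cross-encoder.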

Data Modeling & Governance

  • Layers: raw (immutable) → staging (cleaned) → curated (modeled) → marts.
  • Schemas: JSON/Avro/Parquet with explicit evolution rules; semantic layer for metrics (contract sketch after this list).
  • Ownership: domains own models + SLAs; platform provides tooling + guardrails.
  • Catalog: searchable glossary, data contracts, lineage graphs, usage stats.
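
A data-contract sketch using only the standard library; the column set and the add-only evolution rule are illustrative, not a fixed policy:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Column:
        name: str
        dtype: str
        nullable: bool = False

    # Versioned contract for a hypothetical curated model.
    ORDERS_V2 = [
        Column("order_id", "string"),
        Column("customer_id", "string"),
        Column("amount_usd", "decimal(18,2)"),
        Column("cancelled_at", "timestamp", nullable=True),
    ]

    def breaking_changes(old: list[Column], new: list[Column]) -> list[str]:
        """Evolution rule: columns may be added, never removed or retyped."""
        new_by_name = {c.name: c for c in new}
        issues = []
        for col in old:
            if col.name not in new_by_name:
                issues.append(f"removed column: {col.name}")
            elif new_by_name[col.name].dtype != col.dtype:
                issues.append(f"retyped column: {col.name}")
        return issues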

Privacy & Compliance

  • Minimization: collect only what’s needed; drop sensitive columns at the edge.
  • Pseudonymization: salted hashes; tokenization for joins; reversible keys vaulted (keyed-hash sketch after this list).
  • Differential privacy: bins + noise where aggregate reporting is enough.
  • Policy-as-code: purpose-binding, consent flags, retention TTLs, subject request tooling.
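
A pseudonymization sketch with a keyed hash (HMAC-SHA256): deterministic, so tokens stay joinable across tables, yet irreversible without the vaulted key. The inline key is illustrative; in practice it would come from the secrets manager:

    import hashlib
    import hmac

    def pseudonymize(value: str, salt_key: bytes) -> str:
        """Deterministic keyed hash: same input + key -> same token."""
        return hmac.new(salt_key, value.encode("utf-8"), hashlib.sha256).hexdigest()

    # Rotating salt_key severs linkage to previously issued tokens.
    token = pseudonymize("alice@example.com", salt_key=b"fetched-from-vault")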

Data Quality & Lineage

  • Checks: schema, null %, range, uniqueness, referential integrity; block bad publishes (gate sketched after this list).
  • Lineage: column-level where possible; impact analysis before merges.
  • Drift: monitor distribution shifts; canary new sources with backfills.
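
A minimal publish gate, assuming rows arrive as dicts; the column names and the 1% null threshold are illustrative:

    def quality_gate(rows: list[dict]) -> list[str]:
        """Run blocking checks; any returned failure aborts the publish."""
        failures = []
        n = len(rows)
        if n == 0:
            return ["completeness: zero rows"]
        # Null rate on the primary key.
        null_rate = sum(r.get("order_id") is None for r in rows) / n
        if null_rate > 0.01:
            failures.append(f"null rate on order_id: {null_rate:.2%}")
        # Uniqueness of the primary key.
        if len({r.get("order_id") for r in rows}) < n:
            failures.append("duplicate order_id values")
        # Range check on a numeric column.
        if any(r.get("amount_usd", 0) < 0 for r in rows):
            failures.append("negative amount_usd")
        return failures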

Security & Access

  • AuthZ: least privilege, row/column-level security, attribute-based access control (toy ABAC check after this list).
  • Secrets: KMS + secrets manager; rotate; never in code or notebooks.
  • Encryption: TLS in transit; at-rest with per-tenant keys where feasible.
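
A toy attribute-based access check to make the ABAC bullet concrete; the attribute names and policy are assumptions, not a recommended schema:

    def can_read(user_attrs: dict, row_attrs: dict) -> bool:
        """ABAC: decide per row from attributes, not static role grants."""
        same_region = user_attrs.get("region") == row_attrs.get("region")
        cleared = user_attrs.get("clearance", 0) >= row_attrs.get("sensitivity", 0)
        purpose_ok = row_attrs.get("purpose") in user_attrs.get("purposes", ())
        return same_region and cleared and purpose_ok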

Platform & Orchestration

  Sources → Ingest (CDC/Streams/Batch) → Lake/Warehouse
          → Transform (dbt/SQL + Py) → Semantic Layer
          → Serving (Marts/Feature Store/Vector DB/APIs)
          ↔ Observability (lineage, quality, cost, usage)
  • Orchestrate: DAGs with retries/backoff; idempotent tasks; data-aware scheduling (backoff sketch below).
  • Envs: dev/stage/prod with promotion; reproducible containers; infra as code.
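
A retry-with-exponential-backoff wrapper, standard library only; attempt counts and delays are illustrative, and the wrapped task must be idempotent for retries to be safe:

    import random
    import time

    def with_retries(task, max_attempts: int = 5, base_delay: float = 1.0):
        """Call task() with exponential backoff plus full jitter."""
        for attempt in range(1, max_attempts + 1):
            try:
                return task()
            except Exception:
                if attempt == max_attempts:
                    raise
                # Full jitter: sleep somewhere in [0, base * 2^attempt).
                time.sleep(random.uniform(0, base_delay * 2 ** attempt))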

Cost & FinOps

  • Visibility: per-table and per-query cost; tag by team/product/feature.
  • Controls: partition/prune; materialize only hot sets; auto-vacuum/compaction.
  • Budgets: alert on spend drift; kill switches for runaway jobs.
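
A simple spend-drift alarm comparing today's cost to a trailing baseline; the 7-day window and 1.5x threshold are illustrative:

    from statistics import mean

    def spend_drifted(daily_costs: list[float], today: float,
                      factor: float = 1.5) -> bool:
        """Flag when today's spend exceeds factor x the trailing 7-day mean."""
        return today > factor * mean(daily_costs[-7:])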

Data SRE & Reliability

  • SLOs: freshness and availability per dataset; error budgets to pace change velocity (budget math after this list).
  • Runbooks: on-call dashboards, replay steps, backfill patterns.
  • Incidents: clear severities; blameless postmortems with action items.
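
Error-budget math for a dataset SLO, sketched with an assumed 99% freshness target over a 30-day window:

    def error_budget_remaining(slo_target: float, window_minutes: int,
                               bad_minutes: int) -> float:
        """Fraction of the window's allowed 'bad minutes' still unspent."""
        allowed = (1.0 - slo_target) * window_minutes
        return max(0.0, (allowed - bad_minutes) / allowed)

    # 99% over 30 days allows 432 bad minutes;
    # error_budget_remaining(0.99, 30 * 24 * 60, 200) -> ~0.54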

30-60-90 Day Ramp

  1. 30d: catalog + lineage on top 10 datasets; raw → staging → curated pattern; basic quality checks.
  2. 60d: semantic layer for key metrics; vector store wired to doc pipeline; dataset SLOs + alerts.
  3. 90d: privacy policy-as-code (retention, masking); per-team cost dashboards; replay/backfill tooling.