Data & Platform

Data & Platform

Pipelines, vector stores, governance, and privacy-first data design.

Principles

Product over plumbing: data exists to drive outcomes, not just pipelines.
Privacy by design: minimize, anonymize, and purpose-bind from day one.
Contracts everywhere: typed schemas, versioned datasets, testable SLAs.
Observability built-in: lineage, quality, and cost are first-class signals.

Pipelines

Ingest: CDC/stream + batch; define source truth and retain raw, append-only.
Transform: ELT with SQL-first models (dbt-style) + Python where needed.
Serve: marts per domain (analytics), feature store (ML), and APIs for apps.
SLOs: freshness, completeness, and schema-stability with alerts.

Vector Stores (for RAG/Similarity)

Indexing: choose HNSW/IVF/ScaNN per workload; store embedding, text, metadata, source_id, timestamp.
Hybrid: vector + BM25; re-rank with cross-encoder; keep doc chunk boundaries.
Lifecycle: TTL for stale docs; re-embed on model change via backfill jobs.
Governance: access tags/row-level filters; immutable source links for citations.

Data Modeling & Governance

Layers: raw (immutable) → staging (cleaned) → curated (modeled) → marts.
Schemas: JSON/Avro/Parquet with evolution rules; semantic layer for metrics.
Ownership: domains own models + SLAs; platform provides tooling + guardrails.
Catalog: searchable glossary, data contracts, lineage graphs, usage stats.

Privacy & Compliance

Minimization: collect only what’s needed; drop sensitive columns at the edge.
Pseudonymization: salted hashes; tokenization for joins; reversible keys vaulted.
Differential privacy: bins + noise where aggregate reporting is enough.
Policy-as-code: purpose-binding, consent flags, retention TTLs, subject request tooling.

Data Quality & Lineage

Checks: schema, null %, range, uniqueness, referential integrity; block bad publishes.
Lineage: column-level where possible; impact analysis before merges.
Drift: monitor distribution shifts; canary new sources with backfills.

Security & Access

AuthZ: least privilege, row/column-level security, attribute-based access control.
Secrets: KMS + secrets manager; rotate; never in code or notebooks.
Encryption: TLS in transit; at-rest with per-tenant keys where feasible.

Platform & Orchestration

Sources → Ingest (CDC/Streams/Batch) → Lake/Warehouse

→ Transform (dbt/SQL + Py) → Semantic Layer

→ Serving (Marts/Feature Store/Vector DB/APIs)

↔ Observability (lineage, quality, cost, usage)

Orchestrate: DAGs with retries/backoff; idempotent tasks; data-aware scheduling.
Envs: dev/stage/prod with promotion; reproducible containers; infra as code.

Cost & FinOps

Visibility: per-table and per-query cost; tag by team/product/feature.
Controls: partition/prune; materialize only hot sets; auto-vacuum/compaction.
Budgets: alert on spend drift; kill switches for runaway jobs.

Data SRE & Reliability

SLOs: freshness and availability by dataset; error budgets for change velocity.
Runbooks: on-call dashboards, replay steps, backfill patterns.
Incidents: clear severities; blameless postmortems with action items.

30-60-90 Day Ramp

30d: catalog + lineage on top 10 datasets; raw→staging→curated pattern; basic quality checks.
60d: semantic layer for key metrics; vector store wired to doc pipeline; dataset SLOs + alerts.
90d: privacy policy-as-code (retention, masking); per-team cost dashboards; replay/backfill tooling.