Principles
- Answer questions, don't wander: every prototype ties to a decision.
- Small bets, fast loops: hours → days, not weeks.
- Repro over heroics: one command to rerun; seeds logged.
- Show your work: notebooks, notes, and plots live with code & data refs.
Workflow at a Glance
Question → Hypothesis → Minimal Experiment → Measure → Decide/Next
Inputs
dataset slice, baseline, success metric
Experiment
1 variable at a time (ablation)
Outputs
table, plot, conclusion, next step
Experiments
- Design: pre-register the question, hypothesis, metric, stop rule.
- Baselines: always compare to a strong/simple baseline.
- Seeds: run 3–5 seeds for stochastic systems; report mean ± std and best.
- Artifacts: save configs, logs, and model/plot artifacts with run IDs.
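The seed guidance above (3–5 seeds, mean ± std and best) can be sketched as a small aggregator; the function name and metric values here are hypothetical:

```python
import statistics

def summarize_seeds(scores):
    """Aggregate one config across seeds: mean, sample std, and best.

    `scores` maps seed -> primary metric for that run.
    """
    vals = list(scores.values())
    return {
        "mean": statistics.mean(vals),
        "std": statistics.stdev(vals) if len(vals) > 1 else 0.0,
        "best": max(vals),
        "n_seeds": len(vals),
    }

# Example: F1 from 5 seeds of the same config
summary = summarize_seeds({1: 0.81, 2: 0.83, 3: 0.80, 4: 0.82, 5: 0.84})
print(f"{summary['mean']:.3f} ± {summary['std']:.3f} (best {summary['best']:.2f})")
```

Reporting mean ± std alongside the best seed keeps a lucky single run from driving the decision.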
Ablations (1-Var Changes)
- Change one factor per run: data size, feature set, architecture knob, prompt, learning rate, etc.
- Use grid for small spaces; Bayesian/ASHA for wider sweeps.
- Visualize deltas (Δ = metric − baseline) with waterfall or spider plots.
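One-variable-at-a-time ablations can be generated mechanically from a baseline config; this is a minimal sketch with hypothetical knobs and names:

```python
BASE = {"lr": 3e-4, "batch_size": 32, "features": "full", "prompt": "v1"}

def one_var_ablations(base, sweeps):
    """Yield (run_name, config) pairs that differ from `base` in exactly one factor.

    `sweeps` maps a factor name to its alternative values.
    """
    for factor, values in sweeps.items():
        for v in values:
            if v == base[factor]:
                continue  # skip the baseline setting itself
            # copy the base config and change only this one factor
            yield f"abl_{factor}={v}", dict(base, **{factor: v})

runs = dict(one_var_ablations(BASE, {"lr": [1e-4, 3e-4, 1e-3],
                                     "features": ["full", "no_ngrams"]}))
```

Because every config differs from the baseline in exactly one factor, any metric delta is attributable to that factor alone.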
Metrics & Rigor
- Choose once: pick one primary metric per task (EM/F1/ROUGE/Acc/Latency/Cost) and don't switch mid-study.
- Confidence: bootstrap CIs or t-tests (paired when possible).
- Leakage checks: no overlap between train/dev/test (hash at ingest).
- Power: ensure sample size supports detecting your target delta.
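A percentile bootstrap CI on per-example scores is one concrete way to get the confidence bullet above; a stdlib-only sketch (function name and data are illustrative):

```python
import random

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-example scores."""
    rng = random.Random(seed)  # seeded for reproducibility
    n = len(scores)
    # resample with replacement, record the mean of each resample
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Per-example correctness (0/1) from a hypothetical 100-example eval set
scores = [1] * 78 + [0] * 22
lo, hi = bootstrap_ci(scores)  # 95% CI around the 0.78 mean
```

For comparing two systems on the same examples, bootstrap the paired per-example differences instead of each system separately.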
Notebooks & Data Hygiene
- Determinism: fixed seeds; record package versions; pin Docker image.
- Datasets: immutable snapshots; manifest with checksums and license.
- Notebooks: keep idempotent; top cell sets env/paths; export to HTML/PDF for sharing.
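The determinism and manifest bullets can be sketched with the stdlib; the function names are illustrative, and `set_seed` should be extended to numpy/torch if those are in play:

```python
import hashlib
import os
import random

def set_seed(seed: int) -> None:
    """Fix Python-level randomness for a reproducible run."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)

def manifest_entry(path: str) -> dict:
    """Checksum a dataset file for an immutable-snapshot manifest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # hash in 1 MiB chunks so large files don't load into memory
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return {"path": path, "sha256": h.hexdigest(), "bytes": os.path.getsize(path)}
```

Writing one `manifest_entry` per file into a JSON manifest lets any later run verify it is reading the exact snapshot the experiment used.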
Prototype → Product
- Promote only when the experiment improves the KPI on a representative eval set.
- Harden the path: configs → module → tests → CI runner → behind a flag.
- Instrument early: logs, metrics, traces; canary while watching error budgets.
Reporting That Decides
- One-page memo: question, approach, results table, plots, risks, decision/next.
- Always attach repro command and link to artifacts (runs, configs, data snapshot).
- Traffic-light summary: ✅ ship, 🟨 needs more data, 🔴 stop.
Repo Layout
research/
data/ # manifests, snapshots (read-only)
notebooks/ # EDA, reports (HTML exports in /reports)
experiments/ # configs, sweeps, ablations
src/ # reusable modules
runs/ # logs, metrics, artifacts
reports/ # one-pagers, charts
Makefile # make seed=1 exp=abl_lr run
Tooling Stack (suggested)
- Tracking: Weights & Biases / MLflow (runs, artifacts, params).
- Compute: dockerized; GPU when needed; job sweeps via Ray/SLURM.
- Data: DVC/LakeFS; parquet; dataset cards with licenses.
- Viz: Matplotlib/Altair/Plotly; seaborn for quick EDA.
Templates (copy/paste)
# Hypothesis
We believe that <change> will improve <metric> on <task> by >= <delta>.
# Repro
make exp=<name> seed=1..5 run
# Stop Rule
Stop after <n> runs or when the CI crosses <threshold>.
# Report Snippet (Markdown table)
| Run | Setting | Metric | Δ vs Base | Latency | Cost |
|-----|---------|--------|-----------|---------|------|
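Rows for the table above can be emitted straight from run results so Δ vs Base is computed, never hand-typed; a minimal sketch with hypothetical values:

```python
def report_row(run, setting, metric, base, latency_ms, cost):
    """Format one row of the decision-memo table, computing the delta vs baseline."""
    delta = metric - base
    return (f"| {run} | {setting} | {metric:.3f} | {delta:+.3f} "
            f"| {latency_ms} ms | ${cost:.2f} |")

header = "| Run | Setting | Metric | Δ vs Base | Latency | Cost |"
row = report_row("abl_lr", "lr=1e-3", 0.842, 0.815, 120, 0.04)
```

Keeping the formatter in code means every memo table shares the same columns and sign conventions.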
Field Checklist
- ❶ Clear question & success metric
- ❷ Baseline established
- ❸ Single-var ablation plan
- ❹ Seeds & CI logged
- ❺ CIs / stats reported
- ❻ One-page decision memo
30-60-90 Day Ramp
- 30d: repo scaffold, dataset snapshot, baseline runs, reporting template.
- 60d: automated sweeps + tracking, ablation library, decision memos adopted.
- 90d: CI for experiments, nightly evals, promotion path to product behind flags.