Retrieval (RAG)
Goal: minimize hallucinations, maximize answerability by grounding the model in your own data.
Core loop: Chunk → Embed → Index → Retrieve (hybrid BM25 + vector) → Re-rank → Generate.
- Chunk by semantic boundaries (headings/bullets), ~400 tokens (200–800 is fine) with ~50-token overlap.
- Dual retrieval: dense (vector) + sparse (BM25). Retrieve top-50, re-rank down to top-5 with a cross-encoder.
- Normalize/clean docs; store source_id + timestamps for traceability and citations.
- Precompute document embeddings; compute query embeddings on the fly.
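The chunking defaults above can be sketched in a few lines; whitespace tokens stand in for a real tokenizer, and the name chunk_tokens is illustrative:

```python
def chunk_tokens(tokens, size=400, overlap=50):
    """Fixed-size chunks with overlap; stride = size - overlap."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks, step = [], size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

tokens = [str(i) for i in range(1000)]   # stand-in for real tokenizer output
chunks = chunk_tokens(tokens)            # 3 overlapping chunks
```

In practice you would chunk at semantic boundaries first and only fall back to fixed windows inside long sections.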
Metrics: recall@k, MRR/NDCG, grounding rate, citation precision.
Gotchas: stale indexes, over-chunking, embedding/model mismatch, multilingual drift.
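The first two retrieval metrics are cheap to compute offline; a sketch assuming `retrieved` holds ranked doc-ID lists per query and `relevant` holds gold-ID sets:

```python
def recall_at_k(retrieved, relevant, k=5):
    """Share of queries with at least one relevant doc in the top k."""
    hits = sum(1 for ranked, rel in zip(retrieved, relevant)
               if set(ranked[:k]) & rel)
    return hits / len(retrieved)

def mrr(retrieved, relevant):
    """Mean reciprocal rank of the first relevant doc (0 if absent)."""
    total = 0.0
    for ranked, rel in zip(retrieved, relevant):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(retrieved)

retrieved = [["d3", "d1", "d9"], ["d2", "d4", "d7"]]   # ranked IDs per query
relevant = [{"d1"}, {"d5"}]                            # gold IDs per query
```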
Agents
Shape: function calling → planner → executor → memory/store.
- Use typed tool schemas, strict timeouts, and retry/backoff.
- Constrain planning horizon (max steps) and add reflection only when needed.
- Log every tool call + inputs/outputs; redact secrets at the edge.
Metrics: task success rate, steps-per-success, tool error rate, latency, cost/task.
Gotchas: wrappers that return untyped strings, infinite loops, prompt rot.
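A minimal sketch of the retry/backoff wrapper around a tool call (names are illustrative; real timeout enforcement would add a proper deadline, e.g. via concurrent.futures):

```python
import time

def call_tool(fn, args, retries=3, backoff=0.5):
    """Call a tool with bounded retries and exponential backoff."""
    last_err = None
    for attempt in range(retries):
        try:
            return fn(**args)
        except Exception as e:
            last_err = e
            time.sleep(backoff * (2 ** attempt))
    raise RuntimeError(f"tool failed after {retries} attempts: {last_err}")

# Hypothetical flaky tool: fails twice, then succeeds.
calls = {"n": 0}
def flaky_search(query):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient upstream error")
    return f"results for {query}"

result = call_tool(flaky_search, {"query": "latest pricing"}, backoff=0.01)
```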
LLM Fine-Tuning
Use when: you need tone/style alignment, structured outputs, or domain jargon. Not for missing facts — use RAG for that.
- Quality beats quantity: 1–5k strong pairs > 50k noisy.
- Mix instructions, edge cases, and negative examples. Avoid data leakage.
- Prefer adapters/LoRA before full fine-tunes. Keep tokenizer frozen.
Watch: train/val loss gap, exact-match, schema-F1; ensure base-model reasoning doesn’t degrade.
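Why adapters come first: LoRA trains two rank-r factors instead of the full weight update, so the trainable fraction is tiny. A back-of-envelope sketch (dimensions are illustrative):

```python
def lora_params(d_in, d_out, rank):
    """LoRA trains two low-rank factors A (d_in x rank) and
    B (rank x d_out) in place of the full d_in x d_out update."""
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return lora, lora / full

# One hypothetical 4096x4096 attention projection at rank 8:
trainable, fraction = lora_params(4096, 4096, rank=8)
```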
Evals (offline & online)
- Golden sets: hand-curated, versioned, cover key user journeys.
- Rubrics: LLM grading + human spot checks; canary sets for regressions.
- Online: shadow traffic → A/B with guardrails ON; rollback is first-class.
Metrics: EM/F1, citation accuracy, refusal correctness, jailbreak rate, latency, cost.
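Minimal EM and token-F1 scorers for a golden set (whitespace tokenization; a real harness would also normalize punctuation and articles):

```python
def exact_match(pred, gold):
    """1 if normalized strings match exactly, else 0."""
    return int(pred.strip().lower() == gold.strip().lower())

def token_f1(pred, gold):
    """Token-overlap F1 on whitespace tokens (duplicates matched once each)."""
    p, g = pred.lower().split(), gold.lower().split()
    g_counts = {}
    for t in g:
        g_counts[t] = g_counts.get(t, 0) + 1
    common = 0
    for t in p:
        if g_counts.get(t, 0) > 0:
            common += 1
            g_counts[t] -= 1
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

em = exact_match("Paris", "paris")
f1 = token_f1("paris is the capital", "the capital is paris")
```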
Guardrails (that actually hold)
- Input: PII scrubbing, prompt-injection filters, MIME/type checks, rate limiting.
- Generation: JSON Schema validation, regex/structured decoders, max-tokens & stop sequences.
- Policy: safety classifiers (pre/post-gen), allow/deny lists per tool, context budget caps.
- Observability: store sources, prompts, model/version, features used for each response.
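The JSON-validation guardrail with one auto-repair pass can be sketched like this (required keys and repairs are illustrative; production would layer jsonschema plus a structured decoder on top):

```python
import json

def parse_json_output(raw, required_keys):
    """Parse model output as JSON; try one cheap repair pass
    (strip markdown fences, drop trailing commas) before giving up."""
    def attempt(s):
        obj = json.loads(s)
        missing = [k for k in required_keys if k not in obj]
        if missing:
            raise ValueError(f"missing keys: {missing}")
        return obj
    try:
        return attempt(raw)
    except (json.JSONDecodeError, ValueError):
        repaired = raw.strip().strip("`")
        if repaired.startswith("json"):
            repaired = repaired[4:]
        repaired = repaired.replace(",}", "}").replace(",]", "]")
        return attempt(repaired)  # still raises if unrecoverable

raw = '```json\n{"answer": "42", "sources": ["doc_17"]}\n```'
out = parse_json_output(raw, ["answer", "sources"])
```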
Production Architecture (at a glance)
- Retrieval: vector + BM25 + re-ranker
- Agents: typed tools, timeouts
- Resilience: fallbacks, circuit breakers
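The resilience layer reduces to a small circuit breaker; a sketch with illustrative thresholds:

```python
import time

class CircuitBreaker:
    """Trip open after `threshold` consecutive failures; serve the
    fallback until `cooldown` seconds pass, then allow one retry."""
    def __init__(self, threshold=3, cooldown=30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                return fallback()            # open: fail fast
            self.opened_at = None            # half-open: let one call through
        try:
            result = fn()
            self.failures = 0                # success closes the circuit
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            return fallback()

calls = {"n": 0}
def flaky_backend():
    calls["n"] += 1
    raise RuntimeError("backend down")

cb = CircuitBreaker(threshold=2, cooldown=60.0)
first = cb.call(flaky_backend, lambda: "cached answer")
second = cb.call(flaky_backend, lambda: "cached answer")
third = cb.call(flaky_backend, lambda: "cached answer")  # open: backend not touched
```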
Sane Defaults
- Hybrid search (BM25 + cosine); retrieve k=50 → re-rank to 5.
- Chunk ~400 tokens, 50-token overlap.
- Cross-encoder re-ranker (e.g., MiniLM) for precision.
- Enforce JSON output with schema validation + auto-repair pass.
- Latency budget: P50 < 1.5s, P95 < 4s for RAG answers.
- Rollout: 10% canary with guardrails ON; compare to holdout, then ramp.
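Checking the latency budget above needs only a nearest-rank percentile; a sketch with hypothetical per-request latencies:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: value at rank ceil(p/100 * n)."""
    ranked = sorted(samples)
    k = max(1, math.ceil(p / 100 * len(ranked)))
    return ranked[k - 1]

latencies_ms = list(range(1, 101))   # hypothetical per-request latencies
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
within_budget = p50 < 1500 and p95 < 4000
```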
Quick Roadmap
- Define success: 20–50 golden questions/user journeys.
- Build doc pipeline + hybrid index; wire re-ranker.
- Ship one agent with 2–3 high-value tools (e.g., search, DB lookup, calendar).
- Add offline evals → shadow → A/B.
- Consider LoRA fine-tune only after gaps persist beyond prompt/RAG fixes.
- Add observability & red-team canaries; automate regression checks in CI.
Common Traps (and exits)
- “Let’s fine-tune first.” → Do RAG/prompting + evals first.
- No source links. → Mandatory citations with doc IDs.
- Unbounded agents. → Cap steps; require tool schemas; kill switches.
- Silent failures. → Circuit breakers + fallbacks; graceful user errors.