Applied AI & ML in Production
Retrieval, agents, LLM fine-tuning, evals & guardrails that hold up in prod.

Retrieval (RAG)

Goal: minimize hallucinations, maximize answerability by grounding the model in your own data.

Core loop: Chunk → Embed → Index → Retrieve (hybrid BM25 + vector) → Re-rank → Generate.

  • Chunk by semantic boundaries (headings/bullets), ~400 tokens (200–800 is fine) with ~50-token overlap.
  • Dual retrieval: dense (vector) + sparse (BM25). Retrieve top-50, re-rank down to top-5 with a cross-encoder.
  • Normalize/clean docs; store source_id + timestamps for traceability and citations.
  • Precompute document embeddings; compute query embeddings on the fly.
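The chunking step above can be sketched as follows. This is a minimal illustration, not a production chunker: whitespace `split()` stands in for a real tokenizer (swap in your embedding model's tokenizer), and "semantic boundaries" are approximated by markdown headings. `MAX_TOKENS` and `OVERLAP` mirror the defaults above.

```python
# Sketch: split a markdown-ish doc on headings, then window each section
# into ~400-token chunks with ~50-token overlap. Whitespace "tokens" are a
# stand-in for a real tokenizer.

MAX_TOKENS = 400
OVERLAP = 50

def chunk(doc: str, max_tokens: int = MAX_TOKENS, overlap: int = OVERLAP):
    # First split on semantic boundaries (here: markdown headings).
    sections, current = [], []
    for line in doc.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))

    # Then window each section with overlap so no chunk exceeds the budget.
    chunks = []
    for section in sections:
        tokens = section.split()
        step = max_tokens - overlap
        for start in range(0, max(len(tokens), 1), step):
            piece = tokens[start:start + max_tokens]
            if piece:
                chunks.append(" ".join(piece))
    return chunks
```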

Metrics: recall@k, MRR/NDCG, grounding rate, citation precision.
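The first two retrieval metrics above are cheap to compute offline. A minimal sketch, assuming each eval query is labeled with a set of relevant doc ids and `ranked` is the retriever's ordered result list:

```python
# Sketch: recall@k and MRR over a labeled retrieval eval set.

def recall_at_k(ranked: list, relevant: set, k: int) -> float:
    # Fraction of relevant docs that appear in the top-k results.
    return len(set(ranked[:k]) & relevant) / len(relevant)

def mrr(ranked: list, relevant: set) -> float:
    # Reciprocal rank of the first relevant hit (0.0 if none retrieved).
    for i, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / i
    return 0.0
```

For example, `recall_at_k(["d3", "d1", "d9"], {"d1", "d2"}, 2)` returns 0.5, and so does `mrr` on the same list (first hit at rank 2). Average both across the golden set.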

Gotchas: stale indexes, over-chunking, embedding/model mismatch, multilingual drift.

Agents

Shape: function calling → planner → executor → memory/store.

  • Use typed tool schemas, strict timeouts, and retry/backoff.
  • Constrain planning horizon (max steps) and add reflection only when needed.
  • Log every tool call + inputs/outputs; redact secrets at the edge.
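The three bullets above can be sketched together: a typed tool schema, retry with exponential backoff, a hard step cap on the planner, and a history that logs every call with inputs and outputs. Names (`ToolSpec`, `run_agent`, `plan_next`) are illustrative, and per-call wall-clock timeouts are omitted for brevity.

```python
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class ToolSpec:
    name: str
    args_schema: dict          # arg name -> expected Python type
    fn: Callable[..., str]
    max_retries: int = 2
    backoff_s: float = 0.5

def call_tool(tool: ToolSpec, args: dict) -> str:
    # Validate against the typed schema before touching the tool.
    for key, typ in tool.args_schema.items():
        if not isinstance(args.get(key), typ):
            raise TypeError(f"{tool.name}: bad arg {key!r}")
    for attempt in range(tool.max_retries + 1):
        try:
            return tool.fn(**args)
        except Exception:
            if attempt == tool.max_retries:
                raise
            time.sleep(tool.backoff_s * (2 ** attempt))  # exponential backoff

def run_agent(plan_next, tools: dict, max_steps: int = 8):
    # Constrained planning horizon: hard cap on steps, then bail out.
    history = []
    for _ in range(max_steps):
        step = plan_next(history)            # planner returns a step or None
        if step is None:
            break
        name, args = step
        result = call_tool(tools[name], args)
        history.append((name, args, result))  # log every call + I/O
    return history
```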

Metrics: task success rate, steps-per-success, tool error rate, latency, cost/task.

Gotchas: wrappers that return untyped strings, infinite loops, prompt rot.

LLM Fine-Tuning

Use when: you need tone/style alignment, structured outputs, or domain jargon. Not for missing facts — use RAG for that.

  • Quality beats quantity: 1–5k strong pairs > 50k noisy.
  • Mix instructions, edge cases, and negative examples. Avoid data leakage.
  • Prefer adapters/LoRA before full fine-tunes. Keep tokenizer frozen.
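One concrete way to act on "avoid data leakage" is a pre-training check for train/val overlap. A minimal sketch that catches exact and whitespace/case-level duplicates; a real pipeline would also check embedding or MinHash similarity for paraphrase-level leakage:

```python
# Sketch: flag val examples whose prompt also appears in the train split.
# Pairs are (prompt, completion) tuples.

def normalize(text: str) -> str:
    return " ".join(text.lower().split())

def leaked(train_pairs, val_pairs):
    seen = {normalize(p) for p, _ in train_pairs}
    return [(p, c) for p, c in val_pairs if normalize(p) in seen]
```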

Watch: the gap between train and val loss, exact-match, schema-F1; ensure base-model reasoning doesn’t degrade.

Evals (offline & online)

  • Golden sets: hand-curated, versioned, cover key user journeys.
  • Rubrics: LLM grading + human spot checks; canary sets for regressions.
  • Online: shadow traffic → A/B with guardrails ON; rollback is first-class.

Metrics: EM/F1, citation accuracy, refusal correctness, jailbreak rate, latency, cost.
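EM and token-level F1 from the metrics line above are simple to implement; a minimal sketch with deliberately light normalization (lowercase + whitespace), which you would extend with punctuation/article stripping for real QA evals:

```python
import collections

def exact_match(pred: str, gold: str) -> bool:
    return pred.strip().lower() == gold.strip().lower()

def token_f1(pred: str, gold: str) -> float:
    # Harmonic mean of token precision and recall against the gold answer.
    p, g = pred.lower().split(), gold.lower().split()
    common = collections.Counter(p) & collections.Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```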

Guardrails (that actually hold)

  • Input: PII scrubbing, prompt-injection filters, MIME/type checks, rate limiting.
  • Generation: JSON Schema validation, regex/structured decoders, max-tokens & stop sequences.
  • Policy: safety classifiers (pre/post-gen), allow/deny lists per tool, context budget caps.
  • Observability: store sources, prompts, model/version, features used for each response.
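The "JSON Schema validation" bullet plus the auto-repair pass from Sane Defaults can be sketched as below. `SCHEMA` is an illustrative required-keys/types map, not a full JSON Schema; a production version would use a real validator library.

```python
import json

SCHEMA = {"answer": str, "sources": list}  # illustrative contract

def validate(obj) -> bool:
    return isinstance(obj, dict) and all(
        isinstance(obj.get(k), t) for k, t in SCHEMA.items()
    )

def parse_model_json(raw: str):
    # Try raw output first, then one repair pass; None => fall back/re-prompt.
    for attempt in (raw, _repair(raw)):
        try:
            obj = json.loads(attempt)
        except json.JSONDecodeError:
            continue
        if validate(obj):
            return obj
    return None

def _repair(raw: str) -> str:
    # Strip markdown fences and slice to the outermost {...} object.
    cleaned = raw.replace("```json", "").replace("```", "")
    start, end = cleaned.find("{"), cleaned.rfind("}")
    return cleaned[start:end + 1] if start != -1 and end > start else cleaned
```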

Production Architecture (at a glance)

Gateway (auth, quotas) → Policy Filter → Orchestrator, which fans out to:

  • Retrieval Service: vector + BM25 + re-ranker
  • Tool/Agent Runtime: typed tools, timeouts
  • LLM Clients: fallbacks, circuit breakers

Data Plane: embeddings store, doc pipeline, event/log bus
Control Plane: eval service, canary/rollback, model registry, prompt/version store, analytics
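The "fallbacks, circuit breakers" piece of the LLM-client hop can be sketched as a minimal consecutive-failure breaker. Names and thresholds are illustrative: after `threshold` consecutive failures the circuit opens and calls go straight to the fallback until `cooldown_s` elapses, then one probe is allowed through.

```python
import time

class CircuitBreaker:
    def __init__(self, threshold: int = 3, cooldown_s: float = 30.0):
        self.threshold, self.cooldown_s = threshold, cooldown_s
        self.failures, self.opened_at = 0, None

    def call(self, primary, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                return fallback()                    # open: skip the primary
            self.opened_at, self.failures = None, 0  # half-open: allow a probe
        try:
            result = primary()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()    # trip the breaker
            return fallback()                        # graceful degradation
```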

Sane Defaults

  • Hybrid search (BM25 + cosine); retrieve k=50 → re-rank to 5.
  • Chunk ~400 tokens, 50-token overlap.
  • Cross-encoder re-ranker (e.g., MiniLM) for precision.
  • Enforce JSON output with schema validation + auto-repair pass.
  • Latency budget: P50 < 1.5s, P95 < 4s for RAG answers.
  • Rollout: 10% canary with guardrails ON; compare to holdout, then ramp.

Quick Roadmap

  1. Define success: 20–50 golden questions/user journeys.
  2. Build doc pipeline + hybrid index; wire re-ranker.
  3. Ship one agent with 2–3 high-value tools (e.g., search, DB, calendar).
  4. Add offline evals → shadow → A/B.
  5. Consider LoRA fine-tune only after gaps persist beyond prompt/RAG fixes.
  6. Add observability & red-team canaries; automate regression checks in CI.

Common Traps (and exits)

  • “Let’s fine-tune first.” → Do RAG/prompting + evals first.
  • No source links. → Mandatory citations with doc IDs.
  • Unbounded agents. → Cap steps; require tool schemas; kill switches.
  • Silent failures. → Circuit breakers + fallbacks; graceful user errors.