Applied AI & ML in Production
Retrieval, agents, LLM fine-tuning, evals & guardrails that hold up in prod.

Retrieval (RAG)

Goal: minimize hallucinations, maximize answerability by grounding the model in your own data.

Core loop: Chunk → Embed → Index → Retrieve (hybrid BM25 + vector) → Re-rank → Generate.

  • Chunk by semantic boundaries (headings/bullets), ~400 tokens (200–800 is fine) with ~50-token overlap.
  • Dual retrieval: dense (vector) + sparse (BM25). Retrieve top-50, re-rank down to top-5 with a cross-encoder.
  • Normalize/clean docs; store source_id + timestamps for traceability and citations.
  • Precompute document embeddings; compute query embeddings on the fly.
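The chunking step above can be sketched as follows. This is a minimal illustration, not a production chunker: whitespace `split()` stands in for a real tokenizer (swap in your embedding model's tokenizer), and "semantic boundaries" are approximated by markdown headings. `MAX_TOKENS` and `OVERLAP` mirror the defaults above.

```python
# Sketch: split a markdown-ish doc on headings, then window each section
# into ~400-token chunks with ~50-token overlap. Whitespace "tokens" are a
# stand-in for a real tokenizer.

MAX_TOKENS = 400
OVERLAP = 50

def chunk(doc: str, max_tokens: int = MAX_TOKENS, overlap: int = OVERLAP):
    # First split on semantic boundaries (here: markdown headings).
    sections, current = [], []
    for line in doc.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))

    # Then window each section with overlap so no chunk exceeds the budget.
    chunks = []
    for section in sections:
        tokens = section.split()
        step = max_tokens - overlap
        for start in range(0, max(len(tokens), 1), step):
            piece = tokens[start:start + max_tokens]
            if piece:
                chunks.append(" ".join(piece))
    return chunks
```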

Metrics: recall@k, MRR/NDCG, grounding rate, citation precision.
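The first two retrieval metrics above are cheap to compute offline. A minimal sketch, assuming each eval query is labeled with a set of relevant doc ids and `ranked` is the retriever's ordered result list:

```python
# Sketch: recall@k and MRR over a labeled retrieval eval set.

def recall_at_k(ranked: list, relevant: set, k: int) -> float:
    # Fraction of relevant docs that appear in the top-k results.
    return len(set(ranked[:k]) & relevant) / len(relevant)

def mrr(ranked: list, relevant: set) -> float:
    # Reciprocal rank of the first relevant hit (0.0 if none retrieved).
    for i, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / i
    return 0.0
```

For example, `recall_at_k(["d3", "d1", "d9"], {"d1", "d2"}, 2)` returns 0.5, and so does `mrr` on the same list (first hit at rank 2). Average both across the golden set.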

Gotchas: stale indexes, over-chunking, embedding/model mismatch, multilingual drift.

Agents

Shape: function calling → planner → executor → memory/store.

  • Use typed tool schemas, strict timeouts, and retry/backoff.
  • Constrain planning horizon (max steps) and add reflection only when needed.
  • Log every tool call + inputs/outputs; redact secrets at the edge.
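The three bullets above can be sketched together: a typed tool schema, retry with exponential backoff, a hard step cap on the planner, and a history that logs every call with inputs and outputs. Names (`ToolSpec`, `run_agent`, `plan_next`) are illustrative, and per-call wall-clock timeouts are omitted for brevity.

```python
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class ToolSpec:
    name: str
    args_schema: dict          # arg name -> expected Python type
    fn: Callable[..., str]
    max_retries: int = 2
    backoff_s: float = 0.5

def call_tool(tool: ToolSpec, args: dict) -> str:
    # Validate against the typed schema before touching the tool.
    for key, typ in tool.args_schema.items():
        if not isinstance(args.get(key), typ):
            raise TypeError(f"{tool.name}: bad arg {key!r}")
    for attempt in range(tool.max_retries + 1):
        try:
            return tool.fn(**args)
        except Exception:
            if attempt == tool.max_retries:
                raise
            time.sleep(tool.backoff_s * (2 ** attempt))  # exponential backoff

def run_agent(plan_next, tools: dict, max_steps: int = 8):
    # Constrained planning horizon: hard cap on steps, then bail out.
    history = []
    for _ in range(max_steps):
        step = plan_next(history)            # planner returns a step or None
        if step is None:
            break
        name, args = step
        result = call_tool(tools[name], args)
        history.append((name, args, result))  # log every call + I/O
    return history
```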

Metrics: task success rate, steps-per-success, tool error rate, latency, cost/task.

Gotchas: wrappers that return untyped strings, infinite loops, prompt rot.

LLM Fine-Tuning

Use when: you need tone/style alignment, structured outputs, or domain jargon. Not for missing facts — use RAG for that.

  • Quality beats quantity: 1–5k strong pairs > 50k noisy.
  • Mix instructions, edge cases, and negative examples. Avoid data leakage.
  • Prefer adapters/LoRA before full fine-tunes. Keep tokenizer frozen.
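One concrete way to act on "avoid data leakage" is a pre-training check for train/val overlap. A minimal sketch that catches exact and whitespace/case-level duplicates; a real pipeline would also check embedding or MinHash similarity for paraphrase-level leakage:

```python
# Sketch: flag val examples whose prompt also appears in the train split.
# Pairs are (prompt, completion) tuples.

def normalize(text: str) -> str:
    return " ".join(text.lower().split())

def leaked(train_pairs, val_pairs):
    seen = {normalize(p) for p, _ in train_pairs}
    return [(p, c) for p, c in val_pairs if normalize(p) in seen]
```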

Watch: the gap between train and val loss, exact-match, schema-F1; ensure base-model reasoning doesn’t degrade.

Evals (offline & online)

  • Golden sets: hand-curated, versioned, cover key user journeys.
  • Rubrics: LLM grading + human spot checks; canary sets for regressions.
  • Online: shadow traffic → A/B with guardrails ON; rollback is first-class.

Metrics: EM/F1, citation accuracy, refusal correctness, jailbreak rate, latency, cost.
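EM and token-level F1 from the metrics line above are simple to implement; a minimal sketch with deliberately light normalization (lowercase + whitespace), which you would extend with punctuation/article stripping for real QA evals:

```python
import collections

def exact_match(pred: str, gold: str) -> bool:
    return pred.strip().lower() == gold.strip().lower()

def token_f1(pred: str, gold: str) -> float:
    # Harmonic mean of token precision and recall against the gold answer.
    p, g = pred.lower().split(), gold.lower().split()
    common = collections.Counter(p) & collections.Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```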

Guardrails (that actually hold)

  • Input: PII scrubbing, prompt-injection filters, MIME/type checks, rate limiting.
  • Generation: JSON Schema validation, regex/structured decoders, max-tokens & stop sequences.
  • Policy: safety classifiers (pre/post-gen), allow/deny lists per tool, context budget caps.
  • Observability: store sources, prompts, model/version, features used for each response.
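The "JSON Schema validation" bullet plus the auto-repair pass from Sane Defaults can be sketched as below. `SCHEMA` is an illustrative required-keys/types map, not a full JSON Schema; a production version would use a real validator library.

```python
import json

SCHEMA = {"answer": str, "sources": list}  # illustrative contract

def validate(obj) -> bool:
    return isinstance(obj, dict) and all(
        isinstance(obj.get(k), t) for k, t in SCHEMA.items()
    )

def parse_model_json(raw: str):
    # Try raw output first, then one repair pass; None => fall back/re-prompt.
    for attempt in (raw, _repair(raw)):
        try:
            obj = json.loads(attempt)
        except json.JSONDecodeError:
            continue
        if validate(obj):
            return obj
    return None

def _repair(raw: str) -> str:
    # Strip markdown fences and slice to the outermost {...} object.
    cleaned = raw.replace("```json", "").replace("```", "")
    start, end = cleaned.find("{"), cleaned.rfind("}")
    return cleaned[start:end + 1] if start != -1 and end > start else cleaned
```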

Production Architecture (at a glance)

Gateway (auth, quotas) → Policy Filter → Orchestrator, which fans out to:

  • Retrieval Service: vector + BM25 + re-ranker
  • Tool/Agent Runtime: typed tools, timeouts
  • LLM Clients: fallbacks, circuit breakers

Data Plane: embeddings store, doc pipeline, event/log bus
Control Plane: eval service, canary/rollback, model registry, prompt/version store, analytics
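The "fallbacks, circuit breakers" piece of the LLM-client hop can be sketched as a minimal consecutive-failure breaker. Names and thresholds are illustrative: after `threshold` consecutive failures the circuit opens and calls go straight to the fallback until `cooldown_s` elapses, then one probe is allowed through.

```python
import time

class CircuitBreaker:
    def __init__(self, threshold: int = 3, cooldown_s: float = 30.0):
        self.threshold, self.cooldown_s = threshold, cooldown_s
        self.failures, self.opened_at = 0, None

    def call(self, primary, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                return fallback()                    # open: skip the primary
            self.opened_at, self.failures = None, 0  # half-open: allow a probe
        try:
            result = primary()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()    # trip the breaker
            return fallback()                        # graceful degradation
```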

Sane Defaults

  • Hybrid search (BM25 + cosine); retrieve k=50 → re-rank to 5.
  • Chunk ~400 tokens, 50-token overlap.
  • Cross-encoder re-ranker (e.g., MiniLM) for precision.
  • Enforce JSON output with schema validation + auto-repair pass.
  • Latency budget: P50 < 1.5s, P95 < 4s for RAG answers.
  • Rollout: 10% canary with guardrails ON; compare to holdout, then ramp.

Quick Roadmap

  1. Define success: 20–50 golden questions/user journeys.
  2. Build doc pipeline + hybrid index; wire re-ranker.
  3. Ship one agent with 2–3 high-value tools (e.g., search, DB, calendar).
  4. Add offline evals → shadow → A/B.
  5. Consider LoRA fine-tune only after gaps persist beyond prompt/RAG fixes.
  6. Add observability & red-team canaries; automate regression checks in CI.

Common Traps (and exits)

  • “Let’s fine-tune first.” → Do RAG/prompting + evals first.
  • No source links. → Mandatory citations with doc IDs.
  • Unbounded agents. → Cap steps; require tool schemas; kill switches.
  • Silent failures. → Circuit breakers + fallbacks; graceful user errors.