Overview

Retrieval-Augmented Generation (RAG) combines a large language model with a live search step so the model can cite specific, up-to-date documents instead of relying solely on its trained weights.

How it works

  1. Ingestion — Source documents are chunked, embedded with a vector model, and stored in a vector database (Pinecone, pgvector, Weaviate, etc.).
  2. Retrieval — At query time the question is embedded and a nearest-neighbour search returns the most relevant chunks.
  3. Generation — The LLM receives the retrieved chunks as context and synthesises a grounded answer.
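The three steps above can be sketched end to end. This is a toy illustration, not a production pipeline: the bag-of-words "embedding" stands in for a real vector model, and the in-memory list stands in for a vector database; the final prompt would be sent to an LLM.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; a real pipeline would call a
    # vector model (e.g. a sentence-transformer) here instead.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 1. Ingestion: chunk source documents and embed each chunk.
chunks = [
    "RAG retrieves documents at query time.",
    "Fine-tuning bakes knowledge into model weights.",
    "Vector databases store embeddings for similarity search.",
]
index = [(c, embed(c)) for c in chunks]

def retrieve(query, k=2):
    # 2. Retrieval: embed the query, rank chunks by similarity.
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [c for c, _ in ranked[:k]]

def build_prompt(query):
    # 3. Generation: pass the retrieved chunks to the LLM as context.
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(build_prompt("How does RAG use documents at query time?"))
```

Swapping in a real embedding model and vector store changes only `embed` and `index`; the retrieve-then-prompt shape stays the same.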

When to use RAG vs fine-tuning

Use RAG when your corpus changes frequently, must be citable, or is too large to bake into model weights. Fine-tuning is better suited to style and tone adaptation, or to narrow-domain tasks where the latency of a retrieval step is unacceptable.

Intersysop approach

We implement hybrid sparse+dense retrieval (BM25 + embeddings), add re-ranking, and instrument every pipeline with evals so you can track retrieval quality over time.
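One common way to merge the sparse (BM25) and dense (embedding) result lists is Reciprocal Rank Fusion; the sketch below assumes that fusion method and uses hypothetical document IDs, with the conventional constant k=60.

```python
def rrf(rankings, k=60):
    # Reciprocal Rank Fusion: each retriever contributes
    # 1 / (k + rank) per document; higher fused score wins.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

sparse = ["d3", "d1", "d7"]   # BM25 ranking (hypothetical IDs)
dense = ["d1", "d4", "d3"]    # embedding nearest-neighbour ranking
print(rrf([sparse, dense]))
```

A cross-encoder re-ranker would then re-score the top of the fused list before the chunks reach the LLM.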