Overview
Retrieval-Augmented Generation (RAG) combines a large language model with a retrieval step at query time, so the model can cite specific, up-to-date documents instead of relying solely on the knowledge frozen into its weights.
How it works
- Ingestion — Source documents are chunked, embedded with an embedding model, and stored in a vector database (Pinecone, pgvector, Weaviate, etc.).
- Retrieval — At query time the question is embedded and a nearest-neighbour search returns the most relevant chunks.
- Generation — The LLM receives the retrieved chunks as context and synthesises a grounded answer; a minimal sketch of all three steps follows this list.
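To make the steps concrete, here is a minimal Python sketch, assuming sentence-transformers for the embeddings and an in-memory NumPy matrix standing in for the vector database; the model name, chunk sizes, and the final LLM call are illustrative placeholders rather than a recommended stack.

```python
# pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model


def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size character chunks with a small overlap (illustrative strategy)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


# 1. Ingestion: chunk the corpus, embed every chunk, keep the vectors in memory.
documents = ["...long source document...", "...another document..."]
chunks = [c for doc in documents for c in chunk(doc)]
index = embedder.encode(chunks, normalize_embeddings=True)  # stands in for the vector DB


# 2. Retrieval: embed the question and take the top-k most similar chunks
#    (cosine similarity reduces to a dot product on normalised vectors).
def retrieve(question: str, k: int = 4) -> list[str]:
    q = embedder.encode([question], normalize_embeddings=True)[0]
    top = np.argsort(index @ q)[::-1][:k]
    return [chunks[i] for i in top]


# 3. Generation: hand the retrieved chunks to the LLM as numbered sources.
def build_prompt(question: str, context: list[str]) -> str:
    sources = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context))
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )


question = "What changed in the Q3 release?"
prompt = build_prompt(question, retrieve(question))
# answer = llm_client.complete(prompt)  # hypothetical call to any chat/completions API
```

In production the NumPy matrix would be replaced by an approximate-nearest-neighbour index in one of the databases listed above, and the chunking and prompt template would be tuned to the corpus.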
When to use RAG vs fine-tuning
Use RAG when your corpus changes frequently or is too large to bake into model weights. Fine-tuning is the better fit for style and tone adaptation, or for narrow-domain tasks where the extra retrieval hop would add too much latency.
Intersysop approach
We implement hybrid sparse+dense retrieval (BM25 + embeddings), add re-ranking, and instrument every pipeline with evals so you can track retrieval quality over time.
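As an illustration of what that hybrid step can look like (not a description of our production pipeline), the sketch below fuses BM25 rankings from the rank_bm25 package with dense rankings via reciprocal rank fusion, then re-ranks the fused shortlist with a cross-encoder; the library choices, model names, and RRF constant are assumptions made for the example.

```python
# pip install rank-bm25 sentence-transformers numpy
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder, SentenceTransformer

chunks = ["chunk one ...", "chunk two ...", "chunk three ..."]    # output of ingestion
embedder = SentenceTransformer("all-MiniLM-L6-v2")                # assumed models
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

bm25 = BM25Okapi([c.lower().split() for c in chunks])             # sparse (lexical) index
dense_index = embedder.encode(chunks, normalize_embeddings=True)  # dense index


def hybrid_retrieve(question: str, k: int = 4, rrf_k: int = 60) -> list[str]:
    # Score every chunk independently with each retriever.
    sparse_scores = bm25.get_scores(question.lower().split())
    dense_scores = dense_index @ embedder.encode([question], normalize_embeddings=True)[0]

    # Reciprocal rank fusion: score(d) = sum over rankers of 1 / (rrf_k + rank(d)),
    # which merges the two rankings without calibrating their score scales.
    fused = np.zeros(len(chunks))
    for ranking in (np.argsort(sparse_scores)[::-1], np.argsort(dense_scores)[::-1]):
        for position, idx in enumerate(ranking):
            fused[idx] += 1.0 / (rrf_k + position + 1)

    # Re-rank a fused shortlist with a cross-encoder over (question, chunk) pairs.
    shortlist = np.argsort(fused)[::-1][: k * 3]
    ce_scores = reranker.predict([(question, chunks[i]) for i in shortlist])
    best = shortlist[np.argsort(ce_scores)[::-1][:k]]
    return [chunks[i] for i in best]
```

Reciprocal rank fusion merges the sparse and dense rankings without having to reconcile their score scales, and the cross-encoder pass spends its extra latency only on the small fused shortlist.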