Retrieval-Augmented Generation (RAG) is one of the most important patterns in modern AI engineering. It solves one of the biggest limitations of LLMs: their inability to know about information that wasn't in their training data or that has changed since training.
The Problem RAG Solves
LLMs are trained on data up to a cutoff date. They don't know about:
- Your company's internal documents and policies
- Events that happened after their training cutoff
- Your personal notes, emails, or files
- Real-time data like stock prices or weather
- Confidential information that wasn't published publicly
Without RAG, asking an LLM about these topics results in hallucinations (confidently wrong answers) or refusals.
The RAG Pipeline
RAG has two phases: an indexing phase (done offline, ahead of time, and repeated when the documents change) and a retrieval and generation phase (done online, for each query).
Phase 1: Indexing (Offline)
- Load documents: PDFs, Word docs, web pages, databases, code files
- Chunk: Split documents into smaller pieces (500-1000 tokens each)
- Embed: Convert each chunk into a vector (list of numbers) using an embedding model
- Store: Save vectors + original text in a vector database (Pinecone, Chroma, Weaviate)
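The four indexing steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a production pipeline: `chunk_words`, `toy_embed`, and `build_index` are hypothetical names, words stand in for tokens, the `overlap` parameter is a common practice that is not described above, and the "vector database" is just a Python list. Most importantly, `toy_embed` is a hashed bag-of-words stand-in for a real embedding model, so it does not capture meaning the way a trained model would.

```python
import hashlib
import math

def chunk_words(text, chunk_size=8, overlap=2):
    """Split text into overlapping chunks, counting words as a rough proxy for tokens."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

def toy_embed(text, dims=64):
    """ASSUMPTION: stand-in for an embedding model (hashed bag-of-words, L2-normalized)."""
    vec = [0.0] * dims
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dims
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def build_index(documents, chunk_size=8, overlap=2):
    """Load -> chunk -> embed -> store. The store keeps each vector next to its text."""
    index = []
    for doc_id, text in documents.items():
        for n, chunk in enumerate(chunk_words(text, chunk_size, overlap)):
            index.append({"doc": doc_id, "chunk": n, "text": chunk,
                          "vector": toy_embed(chunk)})
    return index

# A tiny in-memory "document collection" (illustrative text only).
documents = {
    "policy.txt": "Employees may work remotely up to three days per week "
                  "with manager approval and must be reachable during core hours",
}
index = build_index(documents)
```

The chunk size is tiny here so the overlap is visible on a one-sentence document; with real documents the same function would be called with sizes in the range given above.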
Phase 2: Retrieval & Generation (Online, per query)
- Embed query: Convert the user's question into a vector using the same embedding model
- Search: Find the most similar chunks in the vector database (cosine similarity)
- Retrieve: Get the top-K most relevant chunks (typically 3-10)
- Prompt: Add retrieved chunks as context to the LLM prompt
- Generate: The LLM answers based on retrieved context + its training
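The query-time steps above can be sketched the same way. The assumptions are again explicit: `toy_embed` is a hashed bag-of-words stand-in for a real embedding model, `retrieve` and `build_prompt` are hypothetical helpers, the prompt wording is only one possible template, and the final call to an LLM is left out because it depends on the provider.

```python
import hashlib
import math

def toy_embed(text, dims=1024):
    """ASSUMPTION: stand-in for the same embedding model used at indexing time."""
    vec = [0.0] * dims
    for word in text.lower().replace("?", " ").replace(".", " ").split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dims
        vec[bucket] += 1.0
    return vec

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, index, top_k=3):
    """Embed the query, score every stored chunk, and return the top_k texts."""
    query_vec = toy_embed(query)
    scored = [(cosine(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]

def build_prompt(query, chunks):
    """Place the retrieved chunks in front of the question as numbered context."""
    context = "\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return ("Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")

# Illustrative chunks; in practice these come from the indexing phase.
chunks = [
    "Remote work is allowed up to three days per week.",
    "Expense reports are due by the fifth of each month.",
    "The office is closed on public holidays.",
]
index = [(text, toy_embed(text)) for text in chunks]

query = "How many days of remote work are allowed?"
top = retrieve(query, index, top_k=2)
prompt = build_prompt(query, top)  # this string would be sent to the LLM
```

Numbering the chunks in the prompt is a design choice that makes it easier to ask the model to cite which chunk supports its answer.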
What Are Embeddings?
An embedding is a mathematical representation of text as a vector of numbers. Similar pieces of text produce similar vectors. This allows us to find relevant documents by measuring mathematical distance, with no keyword matching needed.
For example, "How do I reset my password?" and "I forgot my login credentials" would have very similar embeddings even though they share no content words (only "I" and "my"). A keyword search would miss this connection; vector search finds it.
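To make "similar vectors" concrete, here is a small sketch of cosine similarity, a common way to compare embeddings, next to a naive keyword comparison. The three-dimensional vectors are invented by hand for illustration (real embedding models output hundreds or thousands of dimensions), and the stopword list and function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def keyword_overlap(a, b, stopwords=frozenset({"how", "do", "i", "my"})):
    """Content words shared by two sentences, which is what keyword search relies on."""
    def content_words(sentence):
        return {w.strip("?.,!").lower() for w in sentence.split()} - stopwords
    return content_words(a) & content_words(b)

# ASSUMPTION: hand-written vectors, not the output of any real embedding model.
reset_password = [0.9, 0.1, 0.3]   # "How do I reset my password?"
forgot_login = [0.8, 0.2, 0.4]     # "I forgot my login credentials"
weather_today = [0.1, 0.9, 0.2]    # an unrelated sentence about the weather

shared = keyword_overlap("How do I reset my password?",
                         "I forgot my login credentials")
related = cosine_similarity(reset_password, forgot_login)
unrelated = cosine_similarity(reset_password, weather_today)

print("shared keywords:", shared)
print("related pair:", round(related, 3))
print("unrelated pair:", round(unrelated, 3))
```

With these made-up vectors the two login-related sentences score far higher than the unrelated pair, while the keyword comparison finds nothing in common.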
What is a Vector Database?
A vector database is purpose-built for storing and searching high-dimensional vectors. It supports Approximate Nearest Neighbor (ANN) search: finding vectors close to a query vector very quickly, even across millions of entries, at the cost of occasionally missing the exact nearest match.
- Pinecone: Fully managed cloud service. Production-ready with excellent scalability.
- Chroma: Open-source and easy to run locally inside your application. Great for development and small projects.
- Weaviate: Open-source with rich metadata filtering and hybrid search.
- pgvector: PostgreSQL extension. Best if you already use Postgres and want to keep your stack simple.
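To give a feel for what "approximate" means in ANN search, the sketch below uses random-hyperplane locality-sensitive hashing, one simple ANN technique chosen here for brevity. It is not a claim about how any of the databases above work internally, and `HyperplaneLSH`, its methods, and the random test vectors are all hypothetical.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class HyperplaneLSH:
    """Toy ANN index: vectors on the same side of every random hyperplane share a bucket."""

    def __init__(self, dims, n_planes=4, seed=0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0, 1) for _ in range(dims)] for _ in range(n_planes)]
        self.buckets = {}

    def _signature(self, vec):
        # One boolean per hyperplane: which side of the plane the vector falls on.
        return tuple(sum(p * v for p, v in zip(plane, vec)) >= 0 for plane in self.planes)

    def add(self, key, vec):
        self.buckets.setdefault(self._signature(vec), []).append((key, vec))

    def query(self, vec, top_k=3):
        # Only the query's own bucket is scanned. That is the speed-up, and also why
        # the result is approximate: a true neighbor in another bucket is missed.
        candidates = self.buckets.get(self._signature(vec), [])
        ranked = sorted(candidates, key=lambda item: cosine(vec, item[1]), reverse=True)
        return [key for key, _ in ranked[:top_k]]

# ASSUMPTION: random vectors stand in for real chunk embeddings.
rng = random.Random(1)
vectors = {f"chunk-{i}": [rng.gauss(0, 1) for _ in range(8)] for i in range(200)}

index = HyperplaneLSH(dims=8)
for key, vec in vectors.items():
    index.add(key, vec)

hits = index.query(vectors["chunk-42"], top_k=3)
scanned = len(index.buckets[index._signature(vectors["chunk-42"])])
print(hits, "found after scanning", scanned, "of", len(vectors), "vectors")
```

Adding more hyperplanes makes buckets smaller and queries faster but increases the chance of missing a true neighbor, which is the speed versus recall trade-off that every ANN index has to tune.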
RAG vs Fine-Tuning: When to Use Which?
Use RAG when:
- Knowledge changes frequently
- You need source attribution
- Data is too large to fit in the context window
- You need to update knowledge without retraining
- Cost and time are constraints
Use Fine-Tuning when:
- You need to change the model's behavior, style, or tone
- Knowledge is static and specialized
- You need consistent domain language
- You want to reduce prompt length
- Latency is critical
Real-World RAG Applications
- Customer support bots that answer from your documentation, not hallucinated information
- Legal research assistants that cite specific case law and statutes
- Internal knowledge bases that let employees query company policies and procedures
- Medical information systems grounded in clinical guidelines and research papers
- Code assistants that understand your codebase's specific patterns and conventions
- News and research tools that work with current information
Key Takeaways
- RAG = retrieve relevant documents → add to prompt → LLM generates grounded answer
- Reduces LLM hallucination for domain-specific and up-to-date knowledge
- Indexing: chunk documents → embed → store in vector DB
- Retrieval: embed query → similarity search → fetch top-K chunks
- Embeddings capture semantic meaning; vector DBs enable fast similarity search
- Choose RAG for frequently changing knowledge; fine-tuning for style/behavior changes