🟢 Basic

What is RAG?
Retrieval-Augmented Generation Explained

🔍 RAG · ⏱ 9 min read · 🗓 May 2026

Retrieval-Augmented Generation (RAG) is one of the most important patterns in modern AI engineering. It addresses one of the biggest limitations of LLMs: their inability to know about information that wasn't in their training data or that has changed since training.

The Problem RAG Solves

LLMs are trained on data up to a cutoff date. They don't know about:

Without RAG, asking an LLM about these topics results in hallucinations (confidently wrong answers) or refusals.

The RAG idea: Instead of relying on the model's internal "memory," we retrieve relevant information from an external knowledge base and provide it as context in the prompt. The model then uses this retrieved information to generate an accurate, grounded answer.

The RAG Pipeline

RAG has two phases: an indexing phase (done offline, ahead of time, and repeated when the documents change) and a retrieval and generation phase (done online, at query time).

Phase 1: Indexing (Offline)

📄 Documents → ✂️ Chunking → 🔢 Embedding → 🗃️ Vector Store

  1. Load documents: PDFs, Word docs, web pages, databases, code files
  2. Chunk: Split documents into smaller pieces (500–1000 tokens each)
  3. Embed: Convert each chunk into a vector (list of numbers) using an embedding model
  4. Store: Save vectors + original text in a vector database (Pinecone, Chroma, Weaviate)
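
The four indexing steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production pipeline, and it rests on several assumptions that are not part of the article: `chunk_words` counts whitespace-separated words rather than real tokens, the `overlap` parameter is an added convenience, `toy_embed` is a hashed bag-of-words stand-in for a real embedding model, and the "vector store" is just a Python list of hypothetical `Record` objects.

```python
import hashlib
import math
from dataclasses import dataclass


def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into word windows of chunk_size, each sharing `overlap` words with the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Hashed bag-of-words vector, L2-normalised. A placeholder for a real embedding model."""
    vec = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


@dataclass
class Record:
    chunk_id: str        # "<document id>#<chunk number>"
    text: str            # original chunk text, kept so it can be placed in a prompt later
    vector: list[float]  # embedding of the chunk


def build_index(documents: dict[str, str], chunk_size: int = 50, overlap: int = 10) -> list[Record]:
    """Load -> chunk -> embed -> store, using an in-memory list as the store."""
    index = []
    for doc_id, text in documents.items():
        for i, chunk in enumerate(chunk_words(text, chunk_size, overlap)):
            index.append(Record(f"{doc_id}#{i}", chunk, toy_embed(chunk)))
    return index


if __name__ == "__main__":
    sample = {"handbook": "word " * 120}  # made-up placeholder document
    index = build_index(sample)
    print(len(index), index[0].chunk_id)
```

In a real system the list would be replaced by one of the vector databases named above, and `toy_embed` by calls to an actual embedding model; the shape of the loop stays the same.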

Phase 2: Retrieval & Generation (Online, per query)

❓ User Query → 🔢 Embed Query → 🔍 Similarity Search → 📋 Top-K Chunks → 🤖 LLM + Context → 💬 Answer

  1. Embed query: Convert the user's question into a vector using the same embedding model used during indexing
  2. Search: Find the most similar chunks in the vector database (commonly by cosine similarity)
  3. Retrieve: Get the top-K most relevant chunks (typically 3–10)
  4. Prompt: Add retrieved chunks as context to the LLM prompt
  5. Generate: The LLM answers based on retrieved context + its training
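
The search, retrieve, and prompt steps above might look roughly like this. The sketch assumes the query has already been embedded and that all vectors are unit-normalised, so a plain dot product equals cosine similarity. The function names `top_k` and `build_prompt`, the prompt wording, the sample chunks, and the two-dimensional vectors are all made up for illustration; the final call to an LLM is left out because it depends on the provider.

```python
import heapq


def dot(a, b):
    """Dot product; equals cosine similarity when both vectors have length 1."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vec, index, k=3):
    """Return the k highest-scoring (score, text) pairs from a list of (text, unit_vector) entries."""
    scored = ((dot(query_vec, vec), text) for text, vec in index)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


def build_prompt(question, chunks):
    """Place the retrieved chunks in the prompt as numbered context ahead of the question."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hand-written unit vectors and sample chunks, purely illustrative.
index = [
    ("Passwords can be reset from the account settings page.", [1.0, 0.0]),
    ("Invoices are emailed on the first day of each month.", [0.0, 1.0]),
    ("Locked accounts are unlocked after identity verification.", [0.8, 0.6]),
]
query_vec = [0.96, 0.28]  # stands in for the embedded user question

hits = top_k(query_vec, index, k=2)
prompt = build_prompt("How do I reset my password?", [text for _, text in hits])
print(prompt)
```

Numbering the chunks in the prompt is one simple way to support source attribution, since the model can be asked to cite the bracketed numbers it relied on.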

What Are Embeddings?

An embedding is a mathematical representation of text as a vector of numbers. Texts with similar meaning produce vectors that are close together, so we can find relevant documents by measuring the distance between vectors, with no keyword matching needed.

For example, "How do I reset my password?" and "I forgot my login credentials" would typically have very similar embeddings even though they share no keywords. A keyword search would miss this connection; vector search finds it.
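
To make "similar vectors" concrete, here is a small sketch of cosine similarity, a measure commonly used to compare embeddings. The vectors are tiny hand-picked examples rather than the output of any real embedding model, which would normally produce hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1 for the same direction, 0 for orthogonal ones."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # very close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # exactly 0
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])       # exactly -1
print(same_direction, orthogonal, opposite)
```

Because the measure depends only on direction and not on vector length, a long document chunk and a short question can still score as highly similar.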

What is a Vector Database?

A vector database is purpose-built for storing and searching high-dimensional vectors. It supports Approximate Nearest Neighbor (ANN) search: finding the vectors closest to a query vector very quickly, even with millions of entries, by accepting a small loss of exactness in exchange for speed.
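
One way to get a feel for approximate search is random-hyperplane hashing, sketched below. It is a deliberately simple technique chosen for illustration; production vector databases typically rely on more sophisticated index structures (HNSW graphs, for example), and nothing here reflects how any particular product is implemented. The class name `HyperplaneLSH` and its parameters are hypothetical.

```python
import math
import random
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class HyperplaneLSH:
    """Toy ANN index: vectors on the same side of every random hyperplane share a bucket."""

    def __init__(self, dim, num_planes=8, seed=0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]
        self.buckets = defaultdict(list)

    def _signature(self, vec):
        # One boolean per hyperplane: which side of the plane the vector falls on.
        return tuple(sum(p * v for p, v in zip(plane, vec)) >= 0.0 for plane in self.planes)

    def add(self, item_id, vec):
        self.buckets[self._signature(vec)].append((item_id, vec))

    def query(self, vec, k=3):
        # Only the query's own bucket is scanned, which is why the result is approximate:
        # a true neighbour that landed in a different bucket is missed.
        candidates = self.buckets.get(self._signature(vec), [])
        ranked = sorted(candidates, key=lambda item: cosine(vec, item[1]), reverse=True)
        return [item_id for item_id, _ in ranked[:k]]


rng = random.Random(1)
vectors = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(200)]  # random stand-in data
lsh = HyperplaneLSH(dim=16, num_planes=8, seed=0)
for i, v in enumerate(vectors):
    lsh.add(i, v)
print(lsh.query(vectors[0], k=3))
```

The trade-off is visible in `query`: scanning one bucket instead of all 200 vectors is faster, but nearby vectors separated by a hyperplane are never considered. Adding more planes makes buckets smaller and faster to scan while raising the chance of such misses.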

Pinecone

Fully managed cloud service. Production-ready with excellent scalability.

Chroma

Open-source, easy to embed locally. Great for development and small projects.

Weaviate

Open-source with rich metadata filtering and hybrid search.

pgvector

PostgreSQL extension. Best if you already use Postgres and want to keep your stack simple.

RAG vs Fine-Tuning: When to Use Which?

Use RAG when:

• Knowledge changes frequently
• You need source attribution
• Data is too large to fit in the context window
• You need to update knowledge without retraining
• Cost and time are constraints

Use Fine-Tuning when:

• You need to change behavior, style, or tone
• Knowledge is static and specialized
• You need consistent domain language
• You want to reduce prompt length
• Latency is critical

Real-World RAG Applications

Key Takeaways