What is RAG? — The Agentic AI Academy

Retrieval-Augmented Generation (RAG) is one of the most important patterns in modern AI engineering. It addresses a major limitation of LLMs: they cannot know about information that was absent from their training data or that has changed since training.

The Problem RAG Solves

LLMs are trained on data up to a cutoff date. They don't know about:

Your company's internal documents and policies
Events that happened after their training cutoff
Your personal notes, emails, or files
Real-time data like stock prices or weather
Confidential information that wasn't published publicly

Without RAG, asking an LLM about these topics results in hallucinations (confidently wrong answers) or refusals.

The RAG idea: Instead of relying on the model's internal "memory," we retrieve relevant information from an external knowledge base and provide it as context in the prompt. The model then uses this retrieved information to generate an accurate, grounded answer.
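As a rough illustration of "provide it as context in the prompt," the sketch below assembles a prompt from a question and some already-retrieved chunks. The function name `build_prompt`, the instruction wording, and the sample policy text are all invented for this example; real systems vary the template considerably.

```python
def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Assemble a prompt that places retrieved text ahead of the question."""
    # Number each chunk so an answer can point back to its source.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up sample chunks standing in for whatever retrieval returned.
prompt = build_prompt(
    "How many vacation days do new employees get?",
    [
        "New employees receive 20 vacation days per year.",
        "Vacation requests must be submitted two weeks in advance.",
    ],
)
print(prompt)
```

The resulting string is what would be sent to the LLM; the model call itself is left out because it depends on the provider.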

The RAG Pipeline

RAG has two phases: an indexing phase (done offline, and rerun whenever the documents change) and a retrieval and generation phase (done online, at query time).

Phase 1: Indexing (Offline)

📄 Documents → ✂️ Chunking → 🔢 Embedding → 🗃️ Vector Store

Load documents: PDFs, Word docs, web pages, databases, code files
Chunk: Split documents into smaller pieces (commonly 500–1000 tokens each; the best size depends on the content and the embedding model)
Embed: Convert each chunk into a vector (list of numbers) using an embedding model
Store: Save vectors + original text in a vector database (for example Pinecone, Chroma, Weaviate, or pgvector)
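The chunking step above can be sketched in a few lines. This is a minimal version under stated assumptions: it counts whitespace-separated words as a stand-in for tokens (a real pipeline would use the embedding model's tokenizer), and it adds a small overlap between neighboring chunks, which is a common practice not described above. The name `chunk_words` and its parameters are hypothetical.

```python
def chunk_words(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most chunk_size words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    boundary still appears whole in one of the two chunks.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()  # assumption: words approximate tokens
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


# Synthetic 500-word document, just to show the chunk boundaries.
document = " ".join(f"word{i}" for i in range(500))
chunks = chunk_words(document, chunk_size=200, overlap=20)
print(len(chunks), [len(c.split()) for c in chunks])
```

Each chunk would then be passed to the embedding model and saved, together with its text, in the vector store.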

Phase 2: Retrieval & Generation (Online, per query)

❓ User Query → 🔢 Embed Query → 🔍 Similarity Search → 📋 Top-K Chunks → 🤖 LLM + Context → 💬 Answer

Embed query: Convert the user's question into a vector using the same embedding model
Search: Find the most similar chunks in the vector database (commonly ranked by cosine similarity)
Retrieve: Get the top-K most relevant chunks (typically 3–10)
Prompt: Add retrieved chunks as context to the LLM prompt
Generate: The LLM answers using the retrieved context together with what it learned in training
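The query-time steps above can be traced end to end with a toy example. Important assumption: `embed` below is not a real embedding model. It is a word-count stand-in so the example runs without any external service, which means it only matches on shared words and does not capture meaning the way a trained model does. The sample chunks, the function names, and the prompt layout are likewise invented for illustration, and the final LLM call is omitted.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, index: list[tuple[Counter, str]], k: int = 3) -> list[str]:
    query_vec = embed(query)                                   # embed query
    scored = [(cosine(query_vec, vec), text) for vec, text in index]  # search
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]                    # top-K chunks


chunks = [
    "Passwords can be reset from the account settings page.",
    "The office is closed on public holidays.",
    "Contact support if the password reset email does not arrive.",
]
# Built once during indexing; shown inline here to keep the example short.
index = [(embed(chunk), chunk) for chunk in chunks]

question = "How do I reset my password?"
top = retrieve(question, index, k=2)

# Prompt step: retrieved chunks become the context for the LLM.
prompt = "Context:\n" + "\n".join(top) + f"\n\nQuestion: {question}\nAnswer:"
print(prompt)
```

Swapping `embed` for a call to a real embedding model, and the list scan for a vector database query, gives the shape of a production pipeline.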

What Are Embeddings?

An embedding is a mathematical representation of text as a vector of numbers. Pieces of text with similar meaning produce vectors that lie close together. This lets us find relevant documents by measuring the distance between vectors, with no exact keyword match required.

For example, "How do I reset my password?" and "I forgot my login credentials" would have very similar embeddings even though they share no meaningful keywords. A keyword search would miss this connection; vector search finds it.
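To make "similar vectors" concrete, the sketch below computes cosine similarity, one common closeness measure, on three hand-written vectors. The numbers are made up for illustration and were not produced by any embedding model; real embeddings typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional "embeddings" (invented values).
reset_password = [0.9, 0.1, 0.2]   # "How do I reset my password?"
forgot_login = [0.8, 0.2, 0.3]     # "I forgot my login credentials"
weather_today = [0.1, 0.9, 0.1]    # "What's the weather today?"

similar = cosine_similarity(reset_password, forgot_login)
unrelated = cosine_similarity(reset_password, weather_today)
print(f"similar pair: {similar:.2f}, unrelated pair: {unrelated:.2f}")
```

The two account-related questions point in nearly the same direction, so their score is close to 1, while the weather question scores much lower.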

What is a Vector Database?

A vector database is purpose-built for storing and searching high-dimensional vectors. It supports Approximate Nearest Neighbor (ANN) search, which finds vectors close to a query vector very quickly, even across millions of entries, by accepting a small loss in accuracy instead of comparing against every stored vector.
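It can help to see the exact search that ANN approximates. The class below is a hypothetical in-memory store, not the API of any product listed here: it compares the query against every stored vector, which is simple and exact but takes time proportional to the number of entries. ANN indexes exist to avoid that full scan.

```python
import heapq
import math


def _unit(vector: list[float]) -> list[float]:
    """Scale a vector to length 1 so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("zero vectors have no direction")
    return [x / norm for x in vector]


class TinyVectorStore:
    """Illustrative brute-force store: exact search, one pass over all items."""

    def __init__(self) -> None:
        self._items: list[tuple[list[float], str]] = []

    def add(self, vector: list[float], text: str) -> None:
        # Normalize once at insert time so search only needs a dot product.
        self._items.append((_unit(vector), text))

    def search(self, query: list[float], k: int = 3) -> list[tuple[float, str]]:
        q = _unit(query)
        scored = (
            (sum(a * b for a, b in zip(q, vec)), text)
            for vec, text in self._items
        )
        # Keep only the k highest-scoring entries.
        return heapq.nlargest(k, scored, key=lambda pair: pair[0])


store = TinyVectorStore()
store.add([1.0, 0.0], "chunk about passwords")
store.add([0.0, 1.0], "chunk about holidays")
store.add([0.7, 0.7], "chunk about account security")

results = store.search([0.9, 0.1], k=2)
for score, text in results:
    print(f"{score:.2f}  {text}")
```

The two-dimensional vectors are invented so the ranking is easy to check by eye.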

Pinecone

Fully managed cloud service. Production-ready with excellent scalability.

Chroma

Open-source, easy to embed locally. Great for development and small projects.

Weaviate

Open-source with rich metadata filtering and hybrid search.

pgvector

PostgreSQL extension. A good fit if you already use Postgres and want to avoid adding a separate database.

RAG vs Fine-Tuning: When to Use Which?

Use RAG when:

• Knowledge changes frequently
• You need source attribution
• Data is too large to fit in the context window
• You need to update knowledge without retraining
• Cost and time are constraints

Use Fine-Tuning when:

• You need to change behavior, style, or tone
• Knowledge is static and specialized
• You need consistent domain language
• You want to reduce prompt length
• Latency is critical

Real-World RAG Applications

Customer support bots that answer from your documentation, not hallucinated information
Legal research assistants that cite specific case law and statutes
Internal knowledge bases that let employees query company policies and procedures
Medical information systems grounded in clinical guidelines and research papers
Code assistants that understand your codebase's specific patterns and conventions
News and research tools that work with current information

Key Takeaways

RAG = retrieve relevant documents → add to prompt → LLM generates grounded answer
Reduces LLM hallucination for domain-specific and up-to-date knowledge
Indexing: chunk documents → embed → store in vector DB
Retrieval: embed query → similarity search → fetch top-K chunks
Embeddings capture semantic meaning; vector DBs enable fast similarity search
Choose RAG for frequently changing knowledge; fine-tuning for style/behavior changes
