Basic RAG (embed, search, generate) gets you 70% of the way. The remaining 30% of quality comes from advanced patterns that address specific failure modes: poor query-document alignment, multi-hop reasoning, information freshness, and answer grounding. This article covers the patterns that production RAG systems rely on.
Query-Side Patterns
1. HyDE (Hypothetical Document Embeddings)
Instead of embedding the query directly, ask the LLM to generate a hypothetical document that would answer the query, then embed and search with that. The hypothetical document uses vocabulary and phrasing that better match real documents.
from langchain_core.prompts import ChatPromptTemplate
from langchain_anthropic import ChatAnthropic

def hyde_retrieve(query: str, retriever, llm) -> list:
    # Step 1: Generate a hypothetical document
    hyde_prompt = ChatPromptTemplate.from_template(
        """Write a 2-paragraph technical document that directly answers:
        '{query}'
        Be specific and use domain-appropriate terminology."""
    )
    hypothetical_doc = llm.invoke(
        hyde_prompt.format_messages(query=query)
    ).content

    # Step 2: Retrieve using the hypothetical doc
    results = retriever.invoke(hypothetical_doc)

    # Optionally also retrieve using the original query and merge
    original_results = retriever.invoke(query)
    all_results = results + [r for r in original_results if r not in results]
    return all_results[:4]  # Top 4 deduplicated
When to use: Short/vague queries, technical domains, when queries use different vocabulary than documents.
2. Query Decomposition (Multi-Hop RAG)
Decompose the query into sub-questions, retrieve for each, then synthesize the final answer.
def multi_hop_rag(complex_query: str, retriever, llm) -> str:
    # Step 1: Decompose into sub-questions
    decompose_prompt = f"""Break this question into 2-4 simpler sub-questions
    that can each be answered independently:
    Question: {complex_query}
    Sub-questions (one per line):"""
    raw_lines = llm.invoke(decompose_prompt).content.strip().split("\n")
    # Drop blank lines so they are not sent to the retriever as queries
    sub_questions = [line.strip() for line in raw_lines if line.strip()]

    # Step 2: Retrieve and answer each sub-question
    sub_answers = []
    for sq in sub_questions:
        docs = retriever.invoke(sq)
        context = "\n\n".join([d.page_content for d in docs])
        answer = llm.invoke(f"Context: {context}\n\nQuestion: {sq}\nAnswer:").content
        sub_answers.append(f"Q: {sq}\nA: {answer}")

    # Step 3: Synthesize final answer
    synthesis = llm.invoke(f"""
    Based on these research findings, answer the original question:
    Research findings:
    {chr(10).join(sub_answers)}
    Original question: {complex_query}
    Final answer:""").content
    return synthesis
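Models often return sub-questions with numbering, bullets, or blank lines rather than one clean question per line. The helper below is a minimal sketch of how that output could be normalized before retrieval. The name `parse_sub_questions` and the `max_questions` cap are illustrative assumptions, not part of any library.

```python
import re

def parse_sub_questions(raw: str, max_questions: int = 4) -> list[str]:
    """Turn a line-per-question LLM reply into a clean list of questions."""
    questions = []
    for line in raw.splitlines():
        # Drop leading numbering or bullets such as "1.", "2)", "-", "*"
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*•])\s*", "", line).strip()
        if cleaned:  # skip blank lines
            questions.append(cleaned)
    # Cap the list so a verbose reply cannot trigger unbounded retrieval calls
    return questions[:max_questions]

raw_reply = "1. What is X?\n\n2) How does Y work?\n- Why Z?\n"
print(parse_sub_questions(raw_reply))
```

Capping the number of sub-questions also bounds the number of LLM and retriever calls made per user query.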
3. Step-Back Prompting
Ask a "step-back" question (a more general version of the specific query), retrieve for both, and combine contexts.
def stepback_retrieve(specific_query: str, retriever, llm) -> list:
    # Generate a more abstract version of the query
    stepback_prompt = f"""Given the specific question: '{specific_query}'
    What is the underlying general principle or concept being asked about?
    Write a more general question:"""
    general_query = llm.invoke(stepback_prompt).content

    # Retrieve for both
    specific_docs = retriever.invoke(specific_query)
    general_docs = retriever.invoke(general_query)

    # Combine and deduplicate
    all_docs = specific_docs + [d for d in general_docs if d not in specific_docs]
    return all_docs[:6]
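Deduplicating with `not in` relies on object equality, so two copies of the same passage may both survive if they differ in metadata or whitespace. One possible alternative, sketched here with a hypothetical `Doc` stand-in class and an illustrative `merge_unique` helper, is to deduplicate on normalized text instead.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    # Hypothetical stand-in for a retrieved document object
    page_content: str

def merge_unique(*result_lists, limit: int = 6) -> list:
    """Concatenate result lists in order, skipping documents with repeated text."""
    seen = set()
    merged = []
    for results in result_lists:
        for doc in results:
            # Normalize whitespace and case so near-identical copies collapse
            key = " ".join(doc.page_content.split()).lower()
            if key not in seen:
                seen.add(key)
                merged.append(doc)
    return merged[:limit]

specific = [Doc("A b"), Doc("C")]
general = [Doc("a  B"), Doc("D")]
print([d.page_content for d in merge_unique(specific, general)])
```

Earlier lists take priority, so passing the specific results first keeps them ahead of the general ones when the limit truncates the output.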
Retrieval-Side Patterns
4. Hybrid Search (Keyword + Semantic)
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# Assumes `chunks` (a list of split documents) and `vectorstore`
# (an already populated vector store) were created earlier.

# BM25 retriever (keyword-based, no embeddings needed)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 4

# Vector retriever (semantic)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Ensemble with weighted combination
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6]  # Weight semantic slightly higher
)
results = hybrid_retriever.invoke("GDPR Article 17 right to erasure")
When to use: Always. Hybrid search consistently outperforms either approach alone, and it is especially critical for queries with specific names, codes, or technical terms.
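To make the weighted combination concrete, here is a minimal sketch of weighted Reciprocal Rank Fusion (RRF), one common way to merge ranked lists from a keyword retriever and a vector retriever. It illustrates the idea only and is not presented as the internals of any particular library. The function name, the document IDs, and the smoothing constant `k = 60` are assumptions chosen for the example.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, k: int = 60, top_n: int = 4):
    """Fuse ranked lists of document IDs; a higher fused score ranks first."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes weight / (k + rank) for every document it returns
            scores[doc_id] += weight / (k + rank)
    # Sort by score descending; break ties by ID for deterministic output
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:top_n]]

bm25_ranking = ["d1", "d2", "d3"]     # hypothetical keyword results
vector_ranking = ["d3", "d1", "d4"]   # hypothetical semantic results
print(weighted_rrf([bm25_ranking, vector_ranking], weights=[0.4, 0.6]))
```

Because RRF works on ranks rather than raw scores, it avoids having to normalize BM25 scores and cosine similarities onto a common scale.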
5. Re-Ranking with Cross-Encoders
from sentence_transformers import CrossEncoder

# Cross-encoder re-ranker: evaluates query-document pairs jointly
# (much slower but much more accurate than bi-encoder similarity)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(query: str, retriever, top_k: int = 4) -> list:
    # First pass: retrieve many candidates (25 is a common choice).
    # The count is configured on the retriever itself, e.g. search_kwargs={"k": 25}
    candidates = retriever.invoke(query)

    # Second pass: re-rank with cross-encoder
    pairs = [(query, doc.page_content) for doc in candidates]
    scores = reranker.predict(pairs)

    # Sort by re-rank score only; without a key, tied scores would fall back
    # to comparing the documents themselves and raise a TypeError
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]
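The two-stage flow can be tried without downloading a model by swapping in a toy scorer. The sketch below uses a hypothetical `overlap_score` function (token overlap) purely as a stand-in for a cross-encoder. Real re-rankers score far more accurately, but the control flow is the same.

```python
def overlap_score(query: str, text: str) -> float:
    """Toy relevance score: fraction of query tokens that appear in the text."""
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    return len(query_tokens & text_tokens) / max(len(query_tokens), 1)

def rerank(query: str, candidates: list[str], score_fn=overlap_score, top_k: int = 2) -> list[str]:
    scored = [(score_fn(query, text), text) for text in candidates]
    # Sort on the score only; the sort is stable, so ties keep first-pass order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]

candidates = [
    "vector databases store embeddings",
    "right to erasure under GDPR",
    "GDPR article 17 right to erasure explained",
]
print(rerank("GDPR Article 17 right to erasure", candidates))
```

Passing the scorer as a parameter keeps the ranking logic independent of the model, so a real cross-encoder can be dropped in later.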
6. Parent-Child Chunking (Small-to-Big Retrieval)
Index small chunks for precise retrieval, but return the larger parent chunk to the LLM for full context.
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Assumes `vectorstore` and `docs` (the source documents) already exist.

# Child chunks (small, for precise retrieval)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=300)
# Parent chunks (large, for full context)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)

doc_store = InMemoryStore()
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=doc_store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retriever.add_documents(docs)
# Now: retrieves small chunks -> returns large parents to LLM
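The small-to-big idea can also be shown without any framework. The sketch below is a simplified illustration under stated assumptions: a naive fixed-width splitter stands in for a real text splitter, and keyword counting stands in for vector search. All function names are hypothetical.

```python
def split_text(text: str, size: int) -> list[str]:
    """Naive fixed-width splitter (stand-in for a real text splitter)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def build_index(documents: list[str], parent_size: int = 200, child_size: int = 50):
    parents = {}   # parent_id -> parent text
    children = []  # (child text, parent_id) pairs
    for doc in documents:
        for parent in split_text(doc, parent_size):
            parent_id = len(parents)
            parents[parent_id] = parent
            children.extend((child, parent_id) for child in split_text(parent, child_size))
    return parents, children

def retrieve_parents(query: str, parents, children, k: int = 1) -> list[str]:
    # Toy matching: rank children by how many query words they contain
    words = query.lower().split()
    ranked = sorted(children, key=lambda c: sum(w in c[0].lower() for w in words), reverse=True)
    parent_ids = []
    for _, parent_id in ranked:
        if parent_id not in parent_ids:  # several children can share one parent
            parent_ids.append(parent_id)
    return [parents[pid] for pid in parent_ids[:k]]

documents = [
    "the cat sat on the mat and purred loudly",
    "dogs bark at night when strangers pass",
]
parents, children = build_index(documents, parent_size=40, child_size=10)
print(retrieve_parents("loudly", parents, children))
```

The match happens on a 10-character child, but the caller receives the full 40-character parent, which is the behavior the pattern is named for.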
Generation-Side Patterns
7. Self-RAG
Self-RAG trains (or prompts) the model to decide: should I retrieve? → retrieve → is this relevant? → use it → is my answer grounded? The model self-critiques at each step.
def self_rag(query: str, retriever, llm) -> str:
    # Step 1: Decide if retrieval is needed
    need_retrieval = llm.invoke(
        f"Does answering '{query}' require external documents? (yes/no)"
    ).content.strip().lower()
    # startswith tolerates replies such as "No." or "no, it does not"
    if need_retrieval.startswith("no"):
        return llm.invoke(query).content

    # Step 2: Retrieve
    docs = retriever.invoke(query)

    # Step 3: Grade each retrieved doc for relevance
    relevant_docs = []
    for doc in docs:
        relevance = llm.invoke(
            f"Is this document relevant to '{query}'? (yes/no)\n\n{doc.page_content[:500]}"
        ).content.strip().lower()
        if relevance.startswith("yes"):
            relevant_docs.append(doc)
    if not relevant_docs:
        return "I couldn't find relevant information to answer this question."

    # Step 4: Generate answer with relevant docs
    context = "\n\n".join([d.page_content for d in relevant_docs])
    answer = llm.invoke(f"Context: {context}\n\nQuestion: {query}\nAnswer:").content

    # Step 5: Check for hallucination (grounding check)
    is_grounded = llm.invoke(
        f"Is this answer fully supported by the context? (yes/no)\n\nContext: {context}\n\nAnswer: {answer}"
    ).content.strip().lower()
    if is_grounded.startswith("no"):
        answer = llm.invoke(
            f"Revise this answer to only include claims supported by:\n{context}\n\nOriginal answer: {answer}"
        ).content
    return answer
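Each grading step depends on reading a yes/no verdict out of free-form model text, and models do not always reply with a bare word. A small parser along these lines could make the checks more tolerant. The name `parse_yes_no` and the `default` fallback are illustrative assumptions.

```python
import re

def parse_yes_no(reply: str, default: bool = True) -> bool:
    """Map a free-form grader reply to a boolean, falling back to `default`."""
    # Take the first standalone "yes" or "no" in the reply
    match = re.search(r"\b(yes|no)\b", reply.strip().lower())
    if match is None:
        return default  # the verdict is unclear
    return match.group(1) == "yes"

print(parse_yes_no("Yes."))
print(parse_yes_no("No, it does not."))
print(parse_yes_no("Unclear", default=False))
```

The sensible default differs per step: defaulting to "retrieve" when the first check is unclear costs only latency, while defaulting to "grounded" when the last check is unclear risks passing an unsupported answer.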
8. GraphRAG (Microsoft)
GraphRAG builds a knowledge graph from documents (entities, relationships, communities) and uses graph traversal + community summarization for queries about themes, relationships, and global patterns.
# Install: pip install graphrag
# CLI approach:
#   graphrag index --root ./my_project
#   graphrag query --root ./my_project --method global "What are the main themes?"

# Programmatic approach (simplified): a hand-rolled sketch of the
# graph-building step using networkx, not the graphrag library's own API
import json
import networkx as nx

def build_knowledge_graph(docs, llm):
    G = nx.Graph()
    for doc in docs:
        # Extract entities and relationships with LLM
        extraction_prompt = f"""Extract entities and relationships from this text.
        Return as JSON: {{"entities": [{{"name": ..., "type": ...}}], "relationships": [{{"from": ..., "to": ..., "type": ...}}]}}
        Text: {doc.page_content}"""
        # json.loads raises if the model wraps the JSON in any extra text
        data = json.loads(llm.invoke(extraction_prompt).content)
        for entity in data["entities"]:
            G.add_node(entity["name"], type=entity["type"])
        for rel in data["relationships"]:
            G.add_edge(rel["from"], rel["to"], type=rel["type"])
    return G
When to use: Large knowledge bases where you need to answer questions about themes, trends, or cross-document relationships. Not suitable for simple factual lookups.
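Entity extraction like this depends on the model returning parseable JSON, and replies frequently arrive wrapped in code fences or introductory text. The helper below is a minimal sketch of a more forgiving parse. The name `extract_json` and the empty-result fallback are assumptions for illustration.

```python
import json

EMPTY_RESULT = {"entities": [], "relationships": []}

def extract_json(reply: str) -> dict:
    """Parse the outermost {...} span in an LLM reply, ignoring surrounding text."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return dict(EMPTY_RESULT)  # no JSON object found
    try:
        return json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return dict(EMPTY_RESULT)  # malformed JSON: skip this document

fence = "`" * 3  # Markdown code fence, built this way to keep the example readable
reply = (
    "Here is the result:\n" + fence + "json\n"
    '{"entities": [{"name": "GDPR", "type": "law"}], "relationships": []}\n' + fence
)
print(extract_json(reply)["entities"])
```

Returning an empty result instead of raising lets one badly formatted reply be skipped without aborting the whole indexing run, at the cost of silently losing that document's entities.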
All 15 Advanced RAG Patterns at a Glance
1. HyDE: Generate hypothetical answer to improve query-doc alignment
2. Query Decomposition: Split complex queries into sub-questions
3. Step-Back: Generalize query to retrieve foundational context
4. Hybrid Search: Combine keyword BM25 + semantic vector search
5. Re-Ranking: Cross-encoder second pass for accuracy
6. Parent-Child: Retrieve small, return large for context
7. Self-RAG: Model self-critiques retrieval and generation
8. GraphRAG: Knowledge graph for global reasoning
9. RAPTOR: Hierarchical clustering + summarization for multi-scale retrieval
10. Adaptive RAG: Route queries to different strategies based on type
11. Corrective RAG: Detect poor retrieval, web-search as fallback
12. Fusion RAG: Generate multiple queries, retrieve all, merge with RRF
13. FLARE: Active retrieval (retrieve mid-generation when uncertain)
14. Agentic RAG: LLM agent decides when/what to retrieve iteratively
15. Contextual Compression: LLM extracts only relevant sentences from retrieved chunks
Key Takeaways
- HyDE bridges the query-document vocabulary gap with hypothetical generation
- Query decomposition enables multi-hop reasoning across documents
- Hybrid search (BM25 + vectors) consistently beats either alone
- Re-ranking with cross-encoders adds significant precision at reasonable cost
- Parent-child chunking balances retrieval precision with answer context
- Self-RAG adds decision-making and grounding checks at generation time
- GraphRAG excels for global/thematic queries; overkill for simple lookups