Advanced RAG Patterns — The Agentic AI Academy

Basic RAG (embed, search, generate) gets you roughly 70% of the way. The remaining 30% of quality comes from advanced patterns that each address a specific failure mode: poor query-document alignment, questions that need multi-hop reasoning, stale information, and weakly grounded answers. This article covers the patterns that production RAG systems rely on.

Query-Side Patterns

1. HyDE — Hypothetical Document Embeddings

Problem: User queries are often short and vague; documents are detailed and specific. The semantic gap causes retrieval misses.

Instead of embedding the query directly, ask the LLM to generate a hypothetical document that would answer the query — then embed and search with that. The hypothetical document uses vocabulary and phrasing that better matches real documents.

from langchain_core.prompts import ChatPromptTemplate
from langchain_anthropic import ChatAnthropic

def hyde_retrieve(query: str, retriever, llm) -> list:
    # Step 1: Generate a hypothetical document
    hyde_prompt = ChatPromptTemplate.from_template(
        """Write a 2-paragraph technical document that directly answers:
        '{query}'
        Be specific and use domain-appropriate terminology."""
    )
    hypothetical_doc = llm.invoke(
        hyde_prompt.format_messages(query=query)
    ).content

    # Step 2: Retrieve using the hypothetical doc
    results = retriever.invoke(hypothetical_doc)

    # Optionally also retrieve using the original query and merge
    original_results = retriever.invoke(query)
    all_results = results + [r for r in original_results if r not in results]
    return all_results[:4]  # Top 4 deduplicated

When to use: Short/vague queries, technical domains, when queries use different vocabulary than documents.
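One detail of the merge step in hyde_retrieve is worth a closer look: if the retriever already returns four or more hits for the hypothetical document, concatenating and then slicing to four never lets an original-query hit through. A minimal sketch of an interleaving merge that deduplicates on text content follows. `Doc` is a hypothetical stand-in for a retrieved document, and `merge_dedup` is an illustrative name, not a library function.

```python
from dataclasses import dataclass
from itertools import chain, zip_longest


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str


def merge_dedup(primary: list, secondary: list, k: int = 4) -> list:
    """Interleave two ranked lists, keeping the first copy of each text."""
    seen = set()
    merged = []
    # zip_longest alternates primary/secondary; None pads the shorter list
    for doc in chain.from_iterable(zip_longest(primary, secondary)):
        if doc is None:
            continue
        key = doc.page_content.strip()
        if key not in seen:
            seen.add(key)
            merged.append(doc)
    return merged[:k]


hyde_hits = [Doc("vector indexes"), Doc("HNSW graphs")]
query_hits = [Doc("HNSW graphs"), Doc("BM25 scoring")]
merged = merge_dedup(hyde_hits, query_hits)
print([d.page_content for d in merged])
```

Interleaving is a design choice, not the only option: it guarantees the top hit from each list survives the final cut, at the cost of treating both rankings as equally trustworthy.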

2. Query Decomposition (Multi-Hop RAG)

Problem: Complex queries require information from multiple documents that can't be retrieved in a single search.

Decompose the query into sub-questions, retrieve for each, then synthesize the final answer.

def multi_hop_rag(complex_query: str, retriever, llm) -> str:
    # Step 1: Decompose into sub-questions
    decompose_prompt = f"""Break this question into 2-4 simpler sub-questions
    that can each be answered independently:
    Question: {complex_query}
    Sub-questions (one per line):"""

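    reply = llm.invoke(decompose_prompt).content
    # Drop blank lines so empty strings are never sent to the retriever
    sub_questions = [line.strip() for line in reply.splitlines() if line.strip()]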

    # Step 2: Retrieve and answer each sub-question
    sub_answers = []
    for sq in sub_questions:
        docs = retriever.invoke(sq)
        context = "\n\n".join([d.page_content for d in docs])
        answer = llm.invoke(f"Context: {context}\n\nQuestion: {sq}\nAnswer:").content
        sub_answers.append(f"Q: {sq}\nA: {answer}")

    # Step 3: Synthesize final answer
    synthesis = llm.invoke(f"""
    Based on these research findings, answer the original question:

    Research findings:
    {chr(10).join(sub_answers)}

    Original question: {complex_query}
    Final answer:""").content

    return synthesis
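Models often number or bullet the sub-questions even when asked for one per line, and those prefixes end up inside the retrieval query. The sketch below shows one way to clean the reply before retrieval. `parse_sub_questions` is a hypothetical helper, and the set of prefixes it strips is an assumption about typical output, not an exhaustive list.

```python
import re


def parse_sub_questions(raw: str, max_questions: int = 4) -> list[str]:
    """Turn a line-per-question LLM reply into a clean list of questions."""
    questions = []
    for line in raw.splitlines():
        # Drop leading bullets or numbering such as "1.", "2)", "-", "*"
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*•])\s*", "", line).strip()
        if cleaned:
            questions.append(cleaned)
    # Cap the count so a verbose reply cannot multiply LLM and retriever calls
    return questions[:max_questions]


reply = "1. Who founded the company?\n\n2) When did it go public?\n- What was the IPO price?"
print(parse_sub_questions(reply))
```

The cap of four mirrors the "2-4 sub-questions" instruction in the decomposition prompt.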

3. Step-Back Prompting

Problem: Specific queries fail because the relevant document discusses the general principle, not the specific case.

Ask a "step-back" question — a more general version of the specific query — retrieve for both, and combine contexts.

def stepback_retrieve(specific_query: str, retriever, llm) -> list:
    # Generate a more abstract version of the query
    stepback_prompt = f"""Given the specific question: '{specific_query}'
    What is the underlying general principle or concept being asked about?
    Write a more general question:"""

    general_query = llm.invoke(stepback_prompt).content

    # Retrieve for both
    specific_docs = retriever.invoke(specific_query)
    general_docs = retriever.invoke(general_query)

    # Combine and deduplicate
    all_docs = specific_docs + [d for d in general_docs if d not in specific_docs]
    return all_docs[:6]

Retrieval-Side Patterns

4. Hybrid Search (Keyword + Semantic)

Problem: Pure vector search misses exact keyword matches; pure BM25 misses semantic matches. You need both.

from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# BM25 retriever (keyword-based, no embeddings needed)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 4

# Vector retriever (semantic)
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Ensemble with weighted combination
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6]  # Weight semantic slightly higher
)

results = hybrid_retriever.invoke("GDPR Article 17 right to erasure")

When to use: By default. Hybrid search usually outperforms either approach alone, and it matters most for queries that contain specific names, codes, or technical terms.
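To see what the weights do, it helps to look at rank fusion directly. LangChain's documentation describes EnsembleRetriever as combining results with Reciprocal Rank Fusion (RRF); the sketch below is a standalone weighted RRF over document IDs, not the library's code. The function name `weighted_rrf` and the constant c = 60 (a conventional default) are assumptions made for illustration.

```python
from collections import defaultdict


def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs; a higher fused score ranks first."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes weight / (c + rank) for every document it returns
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]
fused = weighted_rrf([bm25_ranking, vector_ranking], weights=[0.4, 0.6])
print(fused)
```

Because RRF works on ranks instead of raw scores, it sidesteps the problem that BM25 scores and cosine similarities live on different scales. Documents found by both retrievers accumulate score from each list, which is why doc_b and doc_a rise above documents found by only one.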

5. Re-Ranking with Cross-Encoders

Problem: First-stage retrieval (bi-encoder) trades accuracy for speed. Re-ranking adds a slower, more accurate second pass.

from sentence_transformers import CrossEncoder

# Cross-encoder re-ranker — evaluates query-document pairs jointly
# (much slower but much more accurate than bi-encoder similarity)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(query: str, retriever, top_k: int = 4) -> list:
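    # First pass: retrieve many candidates. The candidate count comes from the
    # retriever's own configuration, e.g. search_kwargs={"k": 25} (a common choice)
    candidates = retriever.invoke(query)

    # Second pass: re-rank with cross-encoder
    pairs = [(query, doc.page_content) for doc in candidates]
    scores = reranker.predict(pairs)

    # Sort on the score only. Sorting the raw (score, doc) tuples would compare
    # the documents themselves whenever two scores tie, which raises TypeError
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]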

6. Parent-Child Chunking (Small-to-Big Retrieval)

Problem: Small chunks retrieve precisely but lack context; large chunks have context but match poorly.

Index small chunks for precise retrieval, but return the larger parent chunk to the LLM for full context.

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Child chunks (small, for precise retrieval)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=300)

# Parent chunks (large, for full context)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)

doc_store = InMemoryStore()

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=doc_store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

retriever.add_documents(docs)
# Now: retrieves small chunks → returns large parents to LLM
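The mechanism behind small-to-big retrieval is a simple mapping from each child chunk back to its parent. The following is a dependency-free sketch of that mapping, not how ParentDocumentRetriever is implemented: word-count splitting and term-overlap scoring stand in for a real text splitter and vector search, and all function names are hypothetical.

```python
def split_words(text: str, size: int) -> list[str]:
    """Naive splitter: chunks of `size` words (stand-in for a real splitter)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_index(documents: list[str], parent_size: int, child_size: int):
    """Return (parents, children); each child records the ID of its parent."""
    parents = {}   # parent_id -> parent text
    children = []  # (child text, parent_id)
    for document in documents:
        for parent_text in split_words(document, parent_size):
            parent_id = len(parents)
            parents[parent_id] = parent_text
            for child_text in split_words(parent_text, child_size):
                children.append((child_text, parent_id))
    return parents, children


def small_to_big(query: str, parents: dict, children: list, k: int = 2) -> list[str]:
    """Rank children by term overlap, then return their distinct parents."""
    terms = set(query.lower().split())
    scored = sorted(
        children,
        key=lambda child: len(terms & set(child[0].lower().split())),
        reverse=True,
    )
    parent_ids = []
    for _, parent_id in scored:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
        if len(parent_ids) == k:
            break
    return [parents[pid] for pid in parent_ids]


doc_a = "Refunds are issued within ten days. Contact billing for help with late payments."
doc_b = "Shipping takes three days worldwide. Express shipping costs extra."
parents, children = build_index([doc_a, doc_b], parent_size=20, child_size=4)
print(small_to_big("when are refunds issued", parents, children, k=1))
```

The query matches only the four-word child "Refunds are issued within", yet the caller receives the whole parent, including the billing sentence the child did not contain.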

Generation-Side Patterns

7. Self-RAG

Problem: The model blindly uses retrieved docs even when they're irrelevant — or generates without retrieval when it should retrieve.

Self-RAG trains (or prompts) the model to decide: should I retrieve? → retrieve → is this relevant? → use it → is my answer grounded? The model self-critiques at each step.

def self_rag(query: str, retriever, llm) -> str:
    # Step 1: Decide if retrieval is needed
    need_retrieval = llm.invoke(
        f"Does answering '{query}' require external documents? (yes/no)"
    ).content.strip().lower()

    if need_retrieval == "no":
        return llm.invoke(query).content

    # Step 2: Retrieve
    docs = retriever.invoke(query)

    # Step 3: Grade each retrieved doc for relevance
    relevant_docs = []
    for doc in docs:
        relevance = llm.invoke(
            f"Is this document relevant to '{query}'? (yes/no)\n\n{doc.page_content[:500]}"
        ).content.strip().lower()
        if relevance == "yes":
            relevant_docs.append(doc)

    if not relevant_docs:
        return "I couldn't find relevant information to answer this question."

    # Step 4: Generate answer with relevant docs
    context = "\n\n".join([d.page_content for d in relevant_docs])
    answer = llm.invoke(f"Context: {context}\n\nQuestion: {query}\nAnswer:").content

    # Step 5: Check for hallucination (grounding check)
    is_grounded = llm.invoke(
        f"Is this answer fully supported by the context? (yes/no)\n\nContext: {context}\n\nAnswer: {answer}"
    ).content.strip().lower()

    if is_grounded == "no":
        answer = llm.invoke(
            f"Revise this answer to only include claims supported by:\n{context}\n\nOriginal answer: {answer}"
        ).content

    return answer
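The exact string comparisons in self_rag are brittle: a reply such as "Yes." or "No, the context does not mention it" matches neither "yes" nor "no". A small parsing helper, sketched below under the assumption that the grader puts its verdict in the first word, makes each decision point more tolerant. `parse_yes_no` is a hypothetical name, not part of any library.

```python
import re


def parse_yes_no(reply: str, default: bool = False) -> bool:
    """Map a free-form grader reply to a boolean.

    Only the first word is inspected, ignoring case and surrounding
    punctuation; any other reply falls back to `default`.
    """
    match = re.match(r"\W*(yes|no)\b", reply.strip(), flags=re.IGNORECASE)
    if match is None:
        return default
    return match.group(1).lower() == "yes"


print(parse_yes_no("Yes."))
print(parse_yes_no("no, the context does not mention it"))
print(parse_yes_no("It depends", default=True))
```

The default deserves thought at each step. For the "is retrieval needed?" check, defaulting to True errs toward retrieving; for relevance grading, defaulting to False drops documents the grader could not clearly endorse.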

8. GraphRAG (Microsoft)

Problem: Vector search finds locally relevant chunks but misses global relationships and themes across the knowledge base.

GraphRAG builds a knowledge graph from documents (entities, relationships, communities) and uses graph traversal + community summarization for queries about themes, relationships, and global patterns.

# Install: pip install graphrag
# CLI approach:
# graphrag index --root ./my_project
# graphrag query --root ./my_project --method global --query "What are the main themes?"

# Programmatic approach (simplified)
import json
import networkx as nx

def build_knowledge_graph(docs, llm):
    G = nx.Graph()

    for doc in docs:
        # Extract entities and relationships with LLM
        extraction_prompt = f"""Extract entities and relationships from this text.
        Return only JSON in this shape:
        {{"entities": [{{"name": "...", "type": "..."}}],
          "relationships": [{{"from": "...", "to": "...", "type": "..."}}]}}
        Text: {doc.page_content}"""

        data = json.loads(llm.invoke(extraction_prompt).content)

        for entity in data["entities"]:
            G.add_node(entity["name"], type=entity["type"])

        for rel in data["relationships"]:
            G.add_edge(rel["from"], rel["to"], type=rel["type"])

    return G

When to use: Large knowledge bases where you need to answer questions about themes, trends, or cross-document relationships. Not suitable for simple factual lookups.
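Once a graph exists, the traversal half of the idea can be illustrated without any library. The sketch below expands outward from the entities mentioned in a query and collects everything within a fixed number of hops; those entities' source passages would then be passed to the LLM. It is not Microsoft's GraphRAG implementation, which also relies on community summaries. The adjacency dictionary stands in for the networkx graph, and `neighborhood` and the sample entities are hypothetical.

```python
from collections import deque


def neighborhood(adjacency: dict[str, set[str]], seeds: list[str], max_hops: int = 1) -> set[str]:
    """Breadth-first expansion from seed entities, up to max_hops edges away."""
    visited = {seed for seed in seeds if seed in adjacency}
    queue = deque((seed, 0) for seed in visited)
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, depth + 1))
    return visited


graph = {
    "Acme": {"Widget", "Jane Doe"},
    "Widget": {"Acme", "Patent 123"},
    "Jane Doe": {"Acme"},
    "Patent 123": {"Widget"},
}
print(sorted(neighborhood(graph, ["Acme"], max_hops=1)))
print(sorted(neighborhood(graph, ["Acme"], max_hops=2)))
```

The hop limit is the main tuning knob: one hop keeps context tight, while two or more hops surface indirect relationships (here, the patent linked to Acme only through Widget) at the cost of a larger context.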

All 15 Advanced RAG Patterns at a Glance

1. HyDE: Generate hypothetical answer to improve query-doc alignment
2. Query Decomposition: Split complex queries into sub-questions
3. Step-Back: Generalize query to retrieve foundational context
4. Hybrid Search: Combine keyword BM25 + semantic vector search
5. Re-Ranking: Cross-encoder second pass for accuracy
6. Parent-Child: Retrieve small, return large for context
7. Self-RAG: Model self-critiques retrieval and generation
8. GraphRAG: Knowledge graph for global reasoning
9. RAPTOR: Hierarchical clustering + summarization for multi-scale retrieval
10. Adaptive RAG: Route queries to different strategies based on type
11. Corrective RAG: Detect poor retrieval, web-search as fallback
12. Fusion RAG: Generate multiple queries, retrieve all, merge with RRF
13. FLARE: Active retrieval, i.e. retrieve mid-generation when uncertain
14. Agentic RAG: LLM agent decides when/what to retrieve iteratively
15. Contextual Compression: LLM extracts only relevant sentences from retrieved chunks

Production guidance: Start with Hybrid Search + Re-Ranking — these give the biggest quality lift for the least complexity. Add HyDE and Query Decomposition next. GraphRAG and Self-RAG are for when you've maxed out simpler approaches and need the extra 5-10%.

Key Takeaways

HyDE bridges the query-document vocabulary gap with hypothetical generation
Query decomposition enables multi-hop reasoning across documents
Hybrid search (BM25 + vectors) consistently beats either alone
Re-ranking with cross-encoders adds significant precision at reasonable cost
Parent-child chunking balances retrieval precision with answer context
Self-RAG adds decision-making and grounding checks at generation time
GraphRAG excels for global/thematic queries; overkill for simple lookups
