🔴 Advanced

Production AI Systems in 2026:
The Full-Stack AI Engineer's Toolkit

๐Ÿ› ๏ธ AI Engineering Toolsโฑ 20 min read๐Ÿ—“ May 2026

Building production AI systems is now a distinct engineering discipline. This article covers the full toolkit that AI engineers use in 2026, from LLM gateways and observability to CI/CD for AI and model evaluation frameworks.

1. LLM Gateway: LiteLLM

LiteLLM provides a unified OpenAI-compatible API across 100+ LLM providers. It enables fallbacks, load balancing, cost tracking, and provider switching without changing application code.

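# Deploy LiteLLM as a gateway server
# docker-compose.yml
version: '3'
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports: ["4000:4000"]
    volumes:
      - ./config.yaml:/app/config.yaml  # Mount the config referenced by --config
    environment:
      LITELLM_MASTER_KEY: ${LITELLM_MASTER_KEY}  # Key that clients send as api_key
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}
      GEMINI_API_KEY: ${GEMINI_API_KEY}
    command: --config /app/config.yaml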

# litellm config.yaml
model_list:
  - model_name: "gpt-4o"
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

  - model_name: "claude-opus"
    litellm_params:
      model: anthropic/claude-opus-4-6
      api_key: os.environ/ANTHROPIC_API_KEY

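  - model_name: "gemini-pro"
    litellm_params:
      model: gemini/gemini-pro  # Placeholder; substitute a current Gemini model ID
      api_key: os.environ/GEMINI_API_KEY

  - model_name: "smart-fallback"
    litellm_params:
      model: anthropic/claude-opus-4-6
      api_key: os.environ/ANTHROPIC_API_KEY

# Fallbacks are declared per model group rather than inside litellm_params.
# Key placement follows the LiteLLM proxy docs; verify it against your version.
litellm_settings:
  fallbacks: [{"smart-fallback": ["gpt-4o", "gemini-pro"]}]  # Try in order on failure

router_settings:
  routing_strategy: "least-busy"  # or "latency-based-routing", "cost-based-routing"
  enable_pre_call_checks: true

# Your application code: same API regardless of provider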
from openai import OpenAI

# Point to LiteLLM instead of OpenAI
client = OpenAI(
    base_url="http://localhost:4000",
    api_key="your-litellm-master-key"
)

# Seamlessly switch between models
response = client.chat.completions.create(
    model="smart-fallback",  # LiteLLM handles provider selection
    messages=[{"role": "user", "content": "Hello!"}]
)

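# Track costs: the proxy records spend once a database is configured (Postgres via DATABASE_URL)
# curl -H "Authorization: Bearer <master-key>" http://localhost:4000/spend/logs
# returns logged spend records that can be aggregated per user and per model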
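
To make the fallback behavior concrete, the following is a minimal pure-Python sketch of ordered fallback. It is not LiteLLM's implementation: `call_with_fallbacks`, `ProviderError`, and the two fake provider functions are hypothetical names used only for illustration.

```python
from typing import Callable, Sequence, Tuple

class ProviderError(Exception):
    """Hypothetical error standing in for a provider outage or rate limit."""

def call_with_fallbacks(
    prompt: str,
    providers: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try each (name, call) pair in order; return (name, response) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")  # remember why this provider was skipped
    raise ProviderError("all providers failed: " + "; ".join(errors))

# Fake providers for demonstration only; real calls would go through the gateway.
def flaky_primary(prompt: str) -> str:
    raise ProviderError("rate limited")

def healthy_secondary(prompt: str) -> str:
    return f"echo: {prompt}"

used, reply = call_with_fallbacks(
    "Hello!",
    [("claude-opus", flaky_primary), ("gpt-4o", healthy_secondary)],
)
print(used, reply)
```

A gateway applies the same idea on the server side, which is why the application code above never needs to know which provider answered.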

2. Observability: Langfuse

Langfuse is an open-source LLM observability platform. It traces every LLM call, tool use, and chain execution, giving you full visibility into production AI behavior.

# pip install "langfuse<3"  (langfuse.decorators is the v2 Python SDK interface; v3 relocated these helpers)
from langfuse.decorators import observe, langfuse_context
from langfuse import Langfuse

langfuse = Langfuse(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    host="https://cloud.langfuse.com"  # or self-hosted
)

@observe(name="rag-pipeline")  # Trace this function automatically
def answer_question(user_id: str, question: str) -> str:
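    # Retrieval step: recorded as a child span if retrieve_documents is also decorated with @observe
    docs = retrieve_documents(question)

    # Add custom context to this trace; store the retrieved docs (JSON-serializable)
    # so that offline evaluations can read them back from the trace metadata
    langfuse_context.update_current_trace(
        user_id=user_id,
        tags=["rag", "production"],
        metadata={"question_length": len(question), "retrieved_docs": docs}
    )

    # Generation step: likewise a child span when generate_answer is decorated
    answer = generate_answer(question, docs)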

    # Score the quality (can be done async by evaluators)
    langfuse_context.score_current_trace(
        name="answer_quality",
        value=0.9,  # From automated evaluation
        comment="High relevance, well-cited"
    )

    return answer

# Run evaluations on production data
def run_evals():
    # Fetch recent traces
    traces = langfuse.fetch_traces(limit=100, tags=["rag"]).data

    for trace in traces:
        # Score each trace for faithfulness using an LLM judge
        score = llm_judge_faithfulness(
            question=trace.input,
            answer=trace.output,
            context=(trace.metadata or {}).get("retrieved_docs")  # metadata may be missing on older traces
        )
        langfuse.score(
            trace_id=trace.id,
            name="faithfulness",
            value=score
        )
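
The evaluation loop above calls `llm_judge_faithfulness` without defining it. As a rough stand-in for wiring up the pipeline before a real LLM judge exists, here is a crude lexical-overlap proxy with the same arguments. It is not an LLM judge and will miss paraphrases; the name `lexical_faithfulness` and the four-character word filter are assumptions made for this sketch.

```python
import re
from typing import Optional, Sequence

_WORD = re.compile(r"[a-z0-9]+")

def _content_words(text: str) -> set:
    """Lowercased words longer than three characters, as a cheap stopword filter."""
    return {w for w in _WORD.findall(text.lower()) if len(w) > 3}

def lexical_faithfulness(
    question: str,
    answer: str,
    context: Optional[Sequence[str]],
) -> float:
    """Share of the answer's content words that also appear in the retrieved context.

    `question` is unused; it is kept so the signature matches the judge it stands in for.
    """
    answer_words = _content_words(answer or "")
    if not answer_words or not context:
        return 0.0
    context_words = _content_words(" ".join(context))
    return len(answer_words & context_words) / len(answer_words)

score = lexical_faithfulness(
    "What's the refund policy?",
    "Refunds take ninety days",
    ["Refunds within thirty days"],
)
print(score)
```

A score near 1.0 only says the answer reuses the context's vocabulary, so treat it as a smoke test and replace it with a model-based judge before trusting the numbers.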

3. Evaluation Framework: RAGAS + Braintrust

# RAGAS for RAG-specific evaluation
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall
from datasets import Dataset

# Your test dataset (rag_system is assumed to expose answer() and retrieve())
questions = ["What's the refund policy?", "How to cancel?"]
test_data = {
    "question": questions,
    "answer": [rag_system.answer(q) for q in questions],
    "contexts": [rag_system.retrieve(q) for q in questions],  # One list of passages per question
    "ground_truth": ["30-day full refund...", "Cancel in settings..."]
}

results = evaluate(
    Dataset.from_dict(test_data),
    metrics=[faithfulness, answer_relevancy, context_recall]
)
print(results.to_pandas())

# Braintrust for A/B testing prompts
import braintrust

@braintrust.traced
def generate_with_prompt_v1(question: str) -> str:
    return llm.invoke(PROMPT_V1.format(question=question)).content

@braintrust.traced
def generate_with_prompt_v2(question: str) -> str:
    return llm.invoke(PROMPT_V2.format(question=question)).content

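# Run one experiment per prompt version so the two can be compared side by side
# (llm, PROMPT_V1, PROMPT_V2, test_cases, and score_answer are assumed to be defined elsewhere)
variants = {"prompt-v1": generate_with_prompt_v1, "prompt-v2": generate_with_prompt_v2}
for name, generate in variants.items():
    experiment = braintrust.init(project="rag-system", experiment=name)
    for question, ground_truth in test_cases:
        with experiment.start_span(name="eval-case") as span:
            output = generate(question)

            # Log this variant's output and score on the enclosing span
            span.log(
                input=question,
                output=output,
                expected=ground_truth,
                scores={"accuracy": score_answer(output, ground_truth)}
            )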
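
The experiment code relies on a `score_answer` helper that is not shown. One possible sketch, assuming a token-level F1 between the output and the expected answer is an acceptable accuracy proxy (the original may use a different scorer):

```python
from collections import Counter

def score_answer(output: str, expected: str) -> float:
    """Token-level F1 between a model output and the expected answer, in [0, 1]."""
    out_tokens = output.lower().split()
    exp_tokens = expected.lower().split()
    if not out_tokens or not exp_tokens:
        # Two empty strings match perfectly; one empty string matches nothing.
        return float(out_tokens == exp_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(out_tokens) & Counter(exp_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(out_tokens)
    recall = overlap / len(exp_tokens)
    return 2 * precision * recall / (precision + recall)

print(score_answer("full refund", "30-day full refund"))
```

Token F1 rewards shared wording rather than shared meaning, so it suits short factual answers better than free-form explanations.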

4. CI/CD for AI Systems

# .github/workflows/ai-eval.yml
name: AI System Evaluation

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'rag/**'
      - 'agents/**'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
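      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'

      - name: Install dependencies
        run: pip install -r requirements.txt  # Assumes dependencies are listed in requirements.txt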

      - name: Run evaluation suite
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          python -m pytest tests/evals/ -v \
            --benchmark-enable \
            --benchmark-compare=baseline.json

      - name: Check regression threshold
        run: |
          python scripts/check_regression.py \
            --metric faithfulness \
            --threshold 0.85 \
            --fail-below

      - name: Run safety checks
        run: |
          python scripts/safety_eval.py \
            --prompt-dir prompts/ \
            --adversarial-dataset tests/adversarial_prompts.jsonl

      - name: Cost estimation
        run: |
          python scripts/estimate_costs.py \
            --usage-profile tests/usage_profile.json \
            --alert-if-above 1000  # Alert if estimated monthly cost > $1000
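
The workflow calls `scripts/check_regression.py`, which is not shown. A minimal sketch of what such a script might look like follows. The `--metric`, `--threshold`, and `--fail-below` flags come from the workflow above; the `--results` flag, the `eval_results.json` file name, and its flat `{"metric": value}` layout are assumptions for illustration.

```python
import argparse
import json
import tempfile
from pathlib import Path
from typing import Optional, Sequence

def metric_passes(results: dict, metric: str, threshold: float) -> bool:
    """True when the named metric is present and at or above the threshold."""
    value = results.get(metric)
    return value is not None and float(value) >= threshold

def main(argv: Optional[Sequence[str]] = None) -> int:
    parser = argparse.ArgumentParser(description="Fail CI when an eval metric regresses.")
    parser.add_argument("--metric", required=True)
    parser.add_argument("--threshold", type=float, required=True)
    parser.add_argument("--fail-below", action="store_true",
                        help="return a non-zero exit code when the metric is below the threshold")
    # Hypothetical flag: where the evaluation step wrote its aggregate scores.
    parser.add_argument("--results", default="eval_results.json")
    args = parser.parse_args(argv)

    results = json.loads(Path(args.results).read_text())
    ok = metric_passes(results, args.metric, args.threshold)
    print(f"{args.metric}={results.get(args.metric)} threshold={args.threshold} ok={ok}")
    return 1 if (args.fail_below and not ok) else 0

# Demonstration with a temporary results file. As a real script, the file would
# instead end with `raise SystemExit(main())` so a non-zero code fails the job.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "eval_results.json"
    path.write_text(json.dumps({"faithfulness": 0.81}))
    exit_code = main(["--metric", "faithfulness", "--threshold", "0.85",
                      "--fail-below", "--results", str(path)])
print("exit code:", exit_code)
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a renamed or dropped metric should block the merge rather than pass silently.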

5. Cost Optimization Strategies

# Strategy 1: Prompt caching (Anthropic)
import anthropic

client = anthropic.Anthropic()

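# Cache the large system prompt: on repeated calls, cached input tokens are billed at roughly 10% of the base rate
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LARGE_SYSTEM_PROMPT,  # Must exceed the model's minimum cacheable length (1,024+ tokens, higher on some models)
            "cache_control": {"type": "ephemeral"}  # Default TTL is 5 minutes, refreshed on each cache hit
        }
    ],
    messages=[{"role": "user", "content": user_message}]
)
# Cache hit: the cached tokens cost about 10× less, and long prompts return faster.
# The first call pays a cache-write premium, so caching only pays off when the prompt is reused.
# Check response.usage.cache_read_input_tokens to confirm that hits are occurring.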

# Strategy 2: Model routing by task complexity
def route_to_model(task_type: str, complexity: str) -> str:
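    # Per-token prices change between model versions; confirm them on the provider's pricing page
    routing = {
        ("classification", "simple"): "claude-haiku-4-5-20251001",  # Cheapest tier, about $0.001/1K input
        ("extraction", "moderate"): "claude-sonnet-4-6",            # Mid tier, about $0.003/1K input
        ("reasoning", "complex"): "claude-opus-4-6",                # Most capable and most expensive tier
    }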
    return routing.get((task_type, complexity), "claude-sonnet-4-6")

# Strategy 3: Batch API for async workloads (50% discount)
# Use for: nightly report generation, embedding updates, bulk analysis
batch = client.messages.batches.create(
    requests=[
        {"custom_id": f"req-{i}", "params": {
            "model": "claude-haiku-4-5-20251001",
            "max_tokens": 256,
            "messages": [{"role": "user", "content": text}]
        }}
        for i, text in enumerate(large_dataset)
    ]
)
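# Results arrive asynchronously (within 24 hours) at 50% of the regular price.
# Poll client.messages.batches.retrieve(batch.id).processing_status until it is "ended",
# then read the outputs with client.messages.batches.results(batch.id).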
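
To see when prompt caching from Strategy 1 actually pays off, the arithmetic can be sketched as follows. The multipliers are assumptions based on published 5-minute cache pricing (a write costs about 1.25× the base input rate, a read about 0.1×), and `prompt_cost` is a hypothetical helper that covers only the cached prompt's input tokens, not user messages or output.

```python
def prompt_cost(
    prompt_tokens: int,
    calls: int,
    price_per_1k: float,
    use_cache: bool,
    write_mult: float = 1.25,  # assumed cache-write premium
    read_mult: float = 0.10,   # assumed cache-read discount
) -> float:
    """Input cost of sending the same prompt `calls` times, all within the cache TTL."""
    base = prompt_tokens / 1000 * price_per_1k
    if not use_cache or calls == 0:
        return base * calls
    # The first call writes the cache; every later call reads it.
    return base * (write_mult + (calls - 1) * read_mult)

for calls in (1, 2, 100):
    plain = prompt_cost(2000, calls, 0.003, use_cache=False)
    cached = prompt_cost(2000, calls, 0.003, use_cache=True)
    print(f"{calls:>3} calls: uncached ${plain:.4f}, cached ${cached:.4f}")
```

Under these assumed multipliers a single call is slightly more expensive with caching, the second call already breaks even, and the saving approaches 90% only as the number of reuses grows.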

6. The AI Engineer's 2026 Stack

The AI engineering mindset: LLMs are probabilistic components in deterministic systems. Every non-deterministic component needs input validation, output validation, retry logic, fallbacks, observability, and regression testing. Build them like you'd build any distributed system, except that the failure mode is a "plausible wrong answer" rather than an "error message."
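
As a rough illustration of that checklist, the sketch below wraps a model call with input validation, output validation, retries, and a fallback. All names (`guarded_call`, `parse_answer`, the fake models) and the JSON `{"answer": ...}` output contract are assumptions for this example; production code would also add backoff, logging, and tracing.

```python
import json
from typing import Callable

def parse_answer(raw: str) -> str:
    """Output validation: require a JSON object with a non-empty string 'answer'."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on malformed output
    answer = data.get("answer") if isinstance(data, dict) else None
    if not isinstance(answer, str) or not answer.strip():
        raise ValueError("missing or empty 'answer' field")
    return answer

def guarded_call(
    prompt: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    max_attempts: int = 3,
) -> dict:
    # Input validation: reject requests that cannot produce a useful answer.
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    # Retry the primary model when it errors or returns invalid output.
    for attempt in range(1, max_attempts + 1):
        try:
            return {"answer": parse_answer(primary(prompt)),
                    "source": "primary", "attempts": attempt}
        except (ValueError, ConnectionError):
            continue
    # Fallback: a second model (or canned response) once the primary is exhausted.
    return {"answer": parse_answer(fallback(prompt)),
            "source": "fallback", "attempts": max_attempts}

# Fake models for demonstration: the primary returns plausible non-JSON text twice.
calls = {"n": 0}

def flaky_model(prompt: str) -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        return "Sure! Here's the answer..."
    return json.dumps({"answer": "42"})

def safe_fallback(prompt: str) -> str:
    return json.dumps({"answer": "I don't know."})

result = guarded_call("What is 6 x 7?", flaky_model, safe_fallback)
print(result)
```

The validator catches malformed output, but not a well-formed wrong answer, which is why the observability and regression-testing layers from the earlier sections are still needed.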

Key Takeaways