Building production AI systems is now a distinct engineering discipline. This article covers the full toolkit that AI engineers use in 2026, from LLM gateways and observability to CI/CD for AI and model evaluation frameworks.
1. LLM Gateway: LiteLLM
LiteLLM provides a unified OpenAI-compatible API across 100+ LLM providers. It enables fallbacks, load balancing, cost tracking, and provider switching without changing application code.
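# Deploy LiteLLM as a gateway server
# docker-compose.yml
version: '3'
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports: ["4000:4000"]
    volumes:
      - ./config.yaml:/app/config.yaml  # Mount the config file that --config points to
    environment:
      LITELLM_MASTER_KEY: ${LITELLM_MASTER_KEY}  # Key that client applications authenticate with
      OPENAI_API_KEY: ${OPENAI_API_KEY}
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}
      GEMINI_API_KEY: ${GEMINI_API_KEY}
    command: --config /app/config.yaml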
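# litellm config.yaml
model_list:
  - model_name: "gpt-4o"
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: "claude-opus"
    litellm_params:
      model: anthropic/claude-opus-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: "gemini-pro"  # Defined so the fallback list below can resolve it (assumed model id)
    litellm_params:
      model: gemini/gemini-pro
      api_key: os.environ/GEMINI_API_KEY
  - model_name: "smart-fallback"
    litellm_params:
      model: anthropic/claude-opus-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
litellm_settings:
  # Fallbacks are keyed by model_name and tried in order on failure
  fallbacks: [{"smart-fallback": ["gpt-4o", "gemini-pro"]}]
router_settings:
  routing_strategy: "least-busy"  # or "latency-based-routing", "cost-based-routing"
  enable_pre_call_checks: true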
# Your application code: same API regardless of provider
from openai import OpenAI

# Point to LiteLLM instead of OpenAI
client = OpenAI(
    base_url="http://localhost:4000",
    api_key="your-litellm-master-key"
)

# Seamlessly switch between models
response = client.chat.completions.create(
    model="smart-fallback",  # LiteLLM handles provider selection
    messages=[{"role": "user", "content": "Hello!"}]
)
# Track costs: the proxy records spend once a database is configured for it.
# Its spend endpoints (for example /spend/logs) then report costs per key and per model.
2. Observability: Langfuse
Langfuse is an open-source LLM observability platform. It traces every LLM call, tool use, and chain execution, giving you full visibility into production AI behavior.
# pip install langfuse  (the langfuse.decorators import below is the v2 Python SDK API)
from langfuse.decorators import observe, langfuse_context
from langfuse import Langfuse

langfuse = Langfuse(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    host="https://cloud.langfuse.com"  # or self-hosted
)

@observe(name="rag-pipeline")  # Trace this function automatically
def answer_question(user_id: str, question: str) -> str:
    # Add custom context to this trace
    langfuse_context.update_current_trace(
        user_id=user_id,
        tags=["rag", "production"],
        metadata={"question_length": len(question)}
    )
    # Retrieval step: recorded as a child span when retrieve_documents is also decorated with @observe
    docs = retrieve_documents(question)
    # Generation step: likewise a child span when generate_answer is decorated with @observe
    answer = generate_answer(question, docs)
    # Score the quality (can be done async by evaluators)
    langfuse_context.score_current_trace(
        name="answer_quality",
        value=0.9,  # Placeholder; in practice this comes from automated evaluation
        comment="High relevance, well-cited"
    )
    return answer
# Run evaluations on production data
def run_evals():
    # Fetch recent traces
    traces = langfuse.fetch_traces(limit=100, tags=["rag"]).data
    for trace in traces:
        # Score each trace for faithfulness using an LLM judge
        # (the retrieved documents must have been stored in the trace metadata at request time)
        score = llm_judge_faithfulness(
            question=trace.input,
            answer=trace.output,
            context=(trace.metadata or {}).get("retrieved_docs")
        )
        langfuse.score(
            trace_id=trace.id,
            name="faithfulness",
            value=score
        )
3. Evaluation Framework: RAGAS + Braintrust
# RAGAS for RAG-specific evaluation
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall
from datasets import Dataset

# Your test dataset
questions = ["What's the refund policy?", "How to cancel?"]
test_data = {
    "question": questions,
    "answer": [rag_system.answer(q) for q in questions],
    "contexts": [rag_system.retrieve(q) for q in questions],  # one list of strings per question
    "ground_truth": ["30-day full refund...", "Cancel in settings..."]
}
results = evaluate(
    Dataset.from_dict(test_data),
    metrics=[faithfulness, answer_relevancy, context_recall]
)
print(results.to_pandas())
# Braintrust for A/B testing prompts
import braintrust

@braintrust.traced
def generate_with_prompt_v1(question: str) -> str:
    return llm.invoke(PROMPT_V1.format(question=question)).content

@braintrust.traced
def generate_with_prompt_v2(question: str) -> str:
    return llm.invoke(PROMPT_V2.format(question=question)).content

# Run one experiment per prompt variant so the two can be compared side by side
variants = {"prompt-v1": generate_with_prompt_v1, "prompt-v2": generate_with_prompt_v2}
for variant_name, generate in variants.items():
    experiment = braintrust.init(project="rag-system", experiment=variant_name)
    for question, ground_truth in test_cases:
        with experiment.start_span(name="eval-case") as span:
            output = generate(question)
            span.log(
                input=question,
                output=output,
                expected=ground_truth,
                scores={"accuracy": score_answer(output, ground_truth)}
            )
    print(experiment.summarize())
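The score_answer helper used in the Braintrust experiment above is not defined in this article. One possible stand-in, offered as a sketch and not as Braintrust's own scorer, is a token-overlap F1 between the model output and the ground truth. The tokenization rule (lowercase alphanumeric runs) is an assumption chosen for simplicity.

```python
import re
from collections import Counter


def score_answer(output: str, ground_truth: str) -> float:
    """Token-level F1 between a model output and a reference answer (0.0 to 1.0).

    Hypothetical scorer for illustration; swap in an LLM judge or exact match as needed.
    """
    def tokenize(text: str) -> list:
        # Assumption: lowercase alphanumeric runs are good enough as tokens
        return re.findall(r"[a-z0-9]+", text.lower())

    out_tokens = tokenize(output)
    ref_tokens = tokenize(ground_truth)
    if not out_tokens or not ref_tokens:
        return 0.0

    # Multiset intersection counts each shared token at most as often as it appears in both
    overlap = sum((Counter(out_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(out_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Token overlap rewards shared wording, not correctness, so it works best as a cheap first signal alongside a stricter judge.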
4. CI/CD for AI Systems
# .github/workflows/ai-eval.yml
name: AI System Evaluation
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'rag/**'
      - 'agents/**'
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - name: Install dependencies
        # Assumes a requirements.txt that includes pytest and pytest-benchmark
        run: pip install -r requirements.txt
      - name: Run evaluation suite
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          python -m pytest tests/evals/ -v \
            --benchmark-enable \
            --benchmark-compare=baseline.json
      - name: Check regression threshold
        run: |
          python scripts/check_regression.py \
            --metric faithfulness \
            --threshold 0.85 \
            --fail-below
      - name: Run safety checks
        run: |
          python scripts/safety_eval.py \
            --prompt-dir prompts/ \
            --adversarial-dataset tests/adversarial_prompts.jsonl
      - name: Cost estimation
        run: |
          python scripts/estimate_costs.py \
            --usage-profile tests/usage_profile.json \
            --alert-if-above 1000  # Alert if estimated monthly cost > $1000
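The workflow calls scripts/check_regression.py without showing it. The following is a minimal sketch of what such a script might look like, not the author's implementation. It assumes the evaluation suite writes its aggregate metrics to a JSON file; the --results flag and the eval_results.json default are hypothetical additions, while --metric, --threshold, and --fail-below mirror the flags in the workflow.

```python
import argparse
import json
import sys
from pathlib import Path


def regression_exit_code(results: dict, metric: str, threshold: float, fail_below: bool) -> int:
    """Return 0 when the metric passes, 1 when it is missing or below the threshold."""
    if metric not in results:
        print(f"metric {metric!r} missing from results", file=sys.stderr)
        return 1
    value = float(results[metric])
    print(f"{metric}: {value:.3f} (threshold {threshold:.3f})")
    if fail_below and value < threshold:
        return 1
    return 0


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Fail CI when an eval metric regresses")
    parser.add_argument("--metric", required=True)
    parser.add_argument("--threshold", type=float, required=True)
    parser.add_argument("--fail-below", action="store_true")
    # Hypothetical: path to a JSON object such as {"faithfulness": 0.91}
    parser.add_argument("--results", default="eval_results.json")
    args = parser.parse_args(argv)

    results = json.loads(Path(args.results).read_text())
    return regression_exit_code(results, args.metric, args.threshold, args.fail_below)
```

To use it as a script, end the file with `raise SystemExit(main())`. A non-zero exit code fails the GitHub Actions step and blocks the pull request.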
5. Cost Optimization Strategies
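# Strategy 1: Prompt caching (Anthropic)
import anthropic

client = anthropic.Anthropic()

# Cache the large system prompt: cache reads are billed at about 10% of the base input price,
# so repeated calls cost up to 90% less for the cached portion. Prompts shorter than the
# model's minimum cacheable length (1024 tokens or more, depending on the model) are not cached.
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LARGE_SYSTEM_PROMPT,  # 2000+ token system prompt
            "cache_control": {"type": "ephemeral"}  # Default lifetime is 5 minutes, refreshed on each hit
        }
    ],
    messages=[{"role": "user", "content": user_message}]
)
# Cache hit: cached tokens cost about 10× less than regular input tokens,
# and latency also drops, most noticeably for long prompts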
# Strategy 2: Model routing by task complexity
def route_to_model(task_type: str, complexity: str) -> str:
    # Input prices rise with each tier; check current pricing before fixing a routing table
    routing = {
        ("classification", "simple"): "claude-haiku-4-5-20251001",  # lowest-cost tier
        ("extraction", "moderate"): "claude-sonnet-4-6",  # mid tier
        ("reasoning", "complex"): "claude-opus-4-6",  # highest-cost tier
    }
    # Unknown task/complexity pairs default to the mid tier
    return routing.get((task_type, complexity), "claude-sonnet-4-6")
# Strategy 3: Batch API for async workloads (50% discount)
# Use for: nightly report generation, embedding updates, bulk analysis
batch = client.messages.batches.create(
    requests=[
        {"custom_id": f"req-{i}", "params": {
            "model": "claude-haiku-4-5-20251001",
            "max_tokens": 256,
            "messages": [{"role": "user", "content": text}]
        }}
        for i, text in enumerate(large_dataset)
    ]
)
# Results available within 24 hours, at 50% of regular price
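How much prompt caching saves depends on how often the cached prefix is reused. The sketch below works through that arithmetic under stated assumptions: a cache write is billed at 1.25 times the base input price, a cache read at 0.10 times, and every call arrives within the cache lifetime so only the first call pays the write. The multipliers and the $3 per million tokens price are illustrative inputs, not quoted rates, so check current pricing before relying on the numbers.

```python
def cached_input_cost(
    prompt_tokens: int,
    calls: int,
    price_per_mtok: float,
    write_mult: float = 1.25,  # assumed cache-write multiplier
    read_mult: float = 0.10,   # assumed cache-read multiplier
) -> tuple:
    """Return (uncached_cost, cached_cost) in dollars for a shared prompt prefix.

    Assumes all calls land within the cache lifetime: one write, then reads.
    """
    if calls < 1:
        raise ValueError("calls must be at least 1")
    base = prompt_tokens / 1_000_000 * price_per_mtok  # cost of sending the prefix once
    uncached = base * calls
    cached = base * write_mult + base * read_mult * (calls - 1)
    return uncached, cached


# Illustrative inputs: a 2000-token system prompt reused across 100 calls
uncached, cached = cached_input_cost(2000, 100, price_per_mtok=3.0)
savings = 1 - cached / uncached
print(f"uncached ${uncached:.4f}, cached ${cached:.4f}, savings {savings:.1%}")
```

With these inputs the saving on the cached portion is about 89%, approaching 90% as reuse grows. A prompt sent only once costs more with caching than without, because the write premium is never recovered.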
6. The AI Engineer's 2026 Stack
- LLM Gateway: LiteLLM for provider abstraction, fallbacks, cost tracking
- Frameworks: LangChain + LangGraph for chains, agents, stateful workflows
- Vector DB: Pinecone (managed) or Weaviate (self-hosted)
- Embeddings: Voyage AI (quality) or BAAI/bge (free/local)
- Observability: Langfuse for tracing, evaluation, cost monitoring
- Evaluation: RAGAS (RAG quality) + Braintrust (A/B experiments)
- CI/CD: GitHub Actions with automated evals on every PR
- Serving: FastAPI + Celery + Redis for async agent execution
- Deployment: Docker + Kubernetes with HPA on GPU/CPU
- Secrets: AWS Secrets Manager or Vault; never hardcode API keys
The AI engineering mindset: LLMs are probabilistic components in deterministic systems. Every non-deterministic component needs: input validation, output validation, retry logic, fallbacks, observability, and regression testing. Build them like you'd build any distributed system, except the failure mode is "plausible wrong answer" rather than "error message."
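To make that checklist concrete, here is a minimal sketch of a guarded LLM call that combines input validation, output validation, retries, and an ordered fallback chain. Every name in it (call_with_guards, validate_output, LLMCallError, and the two stand-in model functions) is hypothetical, and the JSON "answer" contract is an assumption made for the example. Tracing is left out; in practice the collected errors would go to an observability tool.

```python
import json
import time
from typing import Callable, Sequence


class LLMCallError(Exception):
    """Raised when every model in the chain failed or returned invalid output."""


def validate_output(raw: str) -> dict:
    """Output validation: require a JSON object with a non-empty string 'answer'."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("output is not a JSON object")
    answer = data.get("answer")
    if not isinstance(answer, str) or not answer.strip():
        raise ValueError("missing or empty 'answer' field")
    return data


def call_with_guards(
    prompt: str,
    models: Sequence[Callable[[str], str]],
    retries_per_model: int = 2,
    backoff_seconds: float = 0.0,
    max_prompt_chars: int = 8000,
) -> dict:
    # Input validation: reject requests that cannot succeed before spending tokens
    if not prompt.strip():
        raise ValueError("prompt is empty")
    if len(prompt) > max_prompt_chars:
        raise ValueError("prompt exceeds max_prompt_chars")

    errors = []
    for model in models:  # fallback chain, tried in order
        for attempt in range(retries_per_model):  # retry logic per model
            try:
                return validate_output(model(prompt))
            except Exception as exc:  # broad on purpose: provider errors vary widely
                name = getattr(model, "__name__", "model")
                errors.append(f"{name} attempt {attempt + 1}: {exc}")
                time.sleep(backoff_seconds * (2 ** attempt))  # exponential backoff
    raise LLMCallError("; ".join(errors))


# Stand-in models for illustration: the first always times out, the second succeeds
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("provider timeout")


def secondary(prompt: str) -> str:
    return json.dumps({"answer": f"echo: {prompt}"})


result = call_with_guards("Hello", [flaky_primary, secondary])
```

Treating an invalid output the same way as a transport error is the key design choice: a response that parses but fails validation triggers the same retry and fallback path as a timeout.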
Key Takeaways
- LiteLLM gives you provider-agnostic LLM access with automatic fallbacks and cost tracking
- Langfuse traces every AI call, which is essential for debugging and quality monitoring
- Run automated evaluations on every prompt/code change via CI/CD
- Prompt caching (Anthropic) cuts the cost of a repeated system prompt by up to 90% on cache hits
- Route tasks to cheaper models: use Haiku for classification, Opus for complex reasoning
- Batch API offers 50% discount for async, non-time-critical workloads
- The full AI engineering stack is distinct from ML engineering: it's systems engineering with probabilistic components