How Large Language Models Work — The Agentic AI Academy

Large Language Models (LLMs) like Claude, GPT-4, and Gemini have fundamentally changed what software can do. But how do they actually work? Understanding the mechanics — from tokenization to attention to RLHF — will make you a far more effective AI practitioner.

Step 1: Tokenization

LLMs don't read text the way humans do. They process tokens: chunks of text that average roughly 3–4 characters in English. A short word like "cat" might be one token, while "tokenization" might be split into ["token", "ization"].

# Example tokenization with tiktoken (OpenAI's tokenizer)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("The Agentic AI Academy teaches advanced AI skills.")
print(tokens)       # [791, 8125, 292, 15592, 17174, 45696, 11084, 15592, 7512, 13]
print(len(tokens))  # 10 tokens for this sentence

Modern LLMs typically use vocabularies of roughly 32,000 to a few hundred thousand tokens (cl100k_base, used above, has about 100,000). The tokenizer maps text to integer IDs, which become the raw input to the model.
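To make the text-to-ID mapping concrete, here is a minimal sketch of a greedy longest-match tokenizer. It is a simplification for illustration only: production tokenizers such as tiktoken use byte-pair encoding, and the vocabulary `TOY_VOCAB` and the function `toy_tokenize` are hypothetical names invented for this example.

```python
# Toy greedy longest-match subword tokenizer (illustrative only, not BPE).
# TOY_VOCAB is a hypothetical vocabulary invented for this example.
TOY_VOCAB = {"token": 0, "ization": 1, "cat": 2, "s": 3, " ": 4}

def toy_tokenize(text, vocab=TOY_VOCAB):
    """Split text into the longest vocabulary pieces, scanning left to right."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at i first, then shrink it.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {i}")
    return pieces

pieces = toy_tokenize("tokenization cats")
ids = [TOY_VOCAB[p] for p in pieces]
print(pieces)  # ['token', 'ization', ' ', 'cat', 's']
print(ids)     # [0, 1, 4, 2, 3]
```

The integer list is what the model actually receives; the strings exist only on the tokenizer side.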

Step 2: Embeddings

Each token ID is converted into a high-dimensional vector called an embedding. This transforms discrete tokens into a continuous numerical space where similar concepts cluster together.

For example, in a well-trained embedding space, the vector for "king" minus the vector for "man" plus the vector for "woman" lands close to the vector for "queen". The model learns these relationships from statistical patterns across billions of text examples.

Why embeddings matter: They allow the model to represent meaning mathematically. The geometry of the embedding space encodes semantic relationships that drive the model's understanding.
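The analogy arithmetic can be sketched with a handful of hand-picked vectors. Everything here is an assumption made for illustration: the 3-dimensional `EMBEDDINGS` table is invented, and real embeddings are learned, far higher-dimensional, and have no labeled axes.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration.
# Axes are loosely (royalty, maleness, femaleness).
EMBEDDINGS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.8, 0.1],
    "woman": [0.1, 0.1, 0.8],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(u, v):
    """Cosine similarity: 1.0 means the vectors point in the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, table=EMBEDDINGS):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(table[a], table[b], table[c])]
    candidates = [w for w in table if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, table[w]))

print(analogy("king", "man", "woman"))  # queen
```

Cosine similarity is used rather than Euclidean distance because it compares direction and ignores vector length, which is a common choice for embedding comparisons.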

Step 3: The Transformer Architecture

The Transformer (introduced in "Attention Is All You Need", 2017) is the engine of nearly every modern LLM. Its core building blocks are:

Self-Attention

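Self-attention lets every token in the input "look at" the other tokens simultaneously and decide how relevant each is. (In decoder-only LLMs such as the GPT family, a causal mask limits each token to itself and earlier tokens.) This is computed as: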

Each token creates a Query (what am I looking for?)
Each token creates a Key (what do I contain?)
Each token creates a Value (what should I contribute?)
Attention weights = softmax(Q·Kᵀ / √d_k), a probability distribution over the tokens for each query (d_k is the key dimension)
Output = weighted sum of Values

# Conceptual self-attention (simplified)
import torch
import torch.nn.functional as F

def attention(Q, K, V, d_k):
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    weights = F.softmax(scores, dim=-1)  # probability distribution
    return torch.matmul(weights, V)      # weighted combination of values

Multi-Head Attention

Instead of one attention function, Transformers use multiple "heads" in parallel — each learning to attend to different kinds of relationships (syntax, coreference, semantics, etc.). Their outputs are concatenated and projected.
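The split, attend, and concatenate pattern can be sketched in plain Python. This is a deliberately reduced version: it sets Q = K = V to each head's slice of the input, whereas real models apply learned projections per head and another projection after concatenation. The function names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

def multi_head_attention(X, num_heads):
    """Split each token vector into num_heads slices, attend per head, concatenate.

    Simplification: Q = K = V = the slice itself (no learned projections).
    """
    d_model = len(X[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        sl = [x[h * d_head:(h + 1) * d_head] for x in X]
        head_outputs.append(attention(sl, sl, sl))
    # Concatenate the head outputs position by position.
    return [sum((head[i] for head in head_outputs), []) for i in range(len(X))]

X = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0, 0.0]]
out = multi_head_attention(X, num_heads=2)
print(len(out), len(out[0]))  # 3 4
```

Each head sees only its own slice of the vector, so the heads can weight the same tokens differently; the output keeps the input's shape.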

Feed-Forward Layers

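After attention, each position passes through a position-wise feed-forward network: two linear transformations with a nonlinear activation (ReLU or GELU) between them. In the standard design these layers hold the majority of each block's parameters, and they are thought to be where much of the model's "knowledge" is stored.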
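A minimal sketch of the position-wise feed-forward step follows. The weights are hypothetical and tiny (model width 2, hidden width 4); the 4x expansion mirrors the ratio used in the original Transformer, and GELU is computed exactly with `math.erf`.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(x, W, b):
    """y = W x + b, where W is a list of rows."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b_j
            for row, b_j in zip(W, b)]

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: expand, apply the nonlinearity, project back."""
    hidden = [gelu(h) for h in linear(x, W1, b1)]
    return linear(hidden, W2, b2)

# Hypothetical weights: d_model = 2, hidden size = 4.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]
b2 = [0.0, 0.0]

tokens = [[1.0, 2.0], [0.0, 0.0]]
# The same weights are applied to every position independently.
outputs = [feed_forward(t, W1, b1, W2, b2) for t in tokens]
print(outputs)
```

Unlike attention, this step never mixes information between positions: each token vector is transformed on its own.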

Step 4: Pre-training (Next Token Prediction)

LLMs are pre-trained on vast corpora of internet text using a simple objective: predict the next token. Given "The sky is", the model learns to predict "blue" (or similar tokens) with high probability.

Collect data: Billions of web pages, books, code, scientific papers (~trillions of tokens)

Initialize: Random weights across dozens to over a hundred transformer layers and billions of parameters

Train: For each sequence, compute predicted probabilities; minimize cross-entropy loss via gradient descent

Scale: More data + more parameters + more compute → emergent capabilities appear (reasoning, coding, etc.)
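The training objective in the steps above can be sketched as a cross-entropy calculation. The 4-token vocabulary and the logits are invented for illustration; a real model would produce the logits and a framework would compute gradients from this loss.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, target_ids):
    """Average cross-entropy: -log p(correct next token) at each position."""
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Hypothetical 4-token vocabulary: 0="The", 1="sky", 2="is", 3="blue".
# For the input "The sky is", the targets are the same text shifted by one token.
targets = [1, 2, 3]

uniform = [[0.0, 0.0, 0.0, 0.0]] * 3   # no preference, like an untrained model
confident = [[0.0, 5.0, 0.0, 0.0],
             [0.0, 0.0, 5.0, 0.0],
             [0.0, 0.0, 0.0, 5.0]]     # most probability on the correct token

print(next_token_loss(uniform, targets))    # log(4), about 1.386
print(next_token_loss(confident, targets))  # much smaller
```

Gradient descent pushes the model from the first situation toward the second: the loss falls as more probability lands on the token that actually came next.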

OpenAI has not disclosed GPT-4's size; unofficial estimates put it at around 1 trillion parameters or more. Training runs at this scale occupy thousands of GPUs for months, and the reported cost of training GPT-4 exceeds $100 million.

Step 5: Fine-tuning & RLHF

A pre-trained model is a raw language predictor: it continues text rather than following instructions, and it could complete "Tell me how to make a bomb" with enthusiastic instructions. Fine-tuning shapes it into a helpful, safe assistant.

Supervised Fine-Tuning (SFT)

Human trainers write example conversations showing the desired assistant behavior. The model is fine-tuned on this dataset.

Reinforcement Learning from Human Feedback (RLHF)

The model generates multiple responses to the same prompt
Human raters rank which response is better
A reward model is trained to predict human preferences
The LLM is fine-tuned using RL (typically PPO) to maximize the reward model's score, usually with a penalty that keeps it close to the SFT model
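The reward-model step above is commonly trained with a pairwise preference loss, as in the InstructGPT paper. The article does not say which loss is used, so treat this as one standard formulation and not the definitive method; the function name is hypothetical.

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log sigmoid(r_chosen - r_rejected), in a numerically stable form."""
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(d)) equals log(1 + exp(-d)); branch to avoid overflow.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

# Reward model agrees with the human ranking: small loss.
print(pairwise_preference_loss(2.0, -1.0))
# Reward model disagrees with the human ranking: large loss.
print(pairwise_preference_loss(-1.0, 2.0))
# Reward model is indifferent: loss is log(2), about 0.693.
print(pairwise_preference_loss(0.5, 0.5))
```

Minimizing this loss over many ranked pairs teaches the reward model to score the human-preferred response higher, which gives the RL step a signal to optimize.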

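RLHF is why ChatGPT felt different: earlier LLMs were raw predictors, and RLHF aligned them with human preferences, making them more helpful, harmless, and honest. Claude is trained with Constitutional AI (CAI), Anthropic's variant in which part of the human feedback is replaced by AI feedback guided by a written set of principles.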

Step 6: Inference — How LLMs Generate Text

At inference time, the model generates text one token at a time. Each new token becomes part of the context for predicting the next token — this is called autoregressive generation.

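# Autoregressive generation (sketch: model, tokenize, and detokenize are supplied by the caller)
import numpy as np

def generate(model, tokenize, detokenize, prompt, max_tokens, temp, eos_token):
    rng = np.random.default_rng()
    context = list(tokenize(prompt))
    for _ in range(max_tokens):
        logits = np.asarray(model(context), dtype=float)  # shape: [vocab_size]
        if temp == 0:
            next_token = int(np.argmax(logits))            # greedy decoding
        else:
            scaled = logits / temp                         # temperature controls randomness
            probs = np.exp(scaled - scaled.max())
            probs /= probs.sum()                           # softmax
            next_token = int(rng.choice(len(probs), p=probs))  # sample
        context.append(next_token)
        if next_token == eos_token:
            break
    return detokenize(context)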

Temperature controls creativity: lower (0.0–0.3) = more deterministic, higher (0.8–1.5) = more diverse/creative.
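The effect of temperature is easy to see on a fixed set of logits. The three logits below are made-up scores for three candidate tokens, and the helper name is hypothetical.

```python
import math

def softmax_with_temperature(logits, temp):
    """Divide logits by temp before softmax; temp must be positive."""
    if temp <= 0:
        raise ValueError("temp must be positive; use argmax for greedy decoding")
    scaled = [x / temp for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for temp in (0.2, 1.0, 1.5):
    probs = softmax_with_temperature(logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

At low temperature nearly all probability collapses onto the top-scoring token; at high temperature the distribution flattens, so lower-ranked tokens are sampled more often.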

Context Window & KV Cache

The context window is the maximum number of tokens the model can process at once. GPT-3 launched with a window of about 2K tokens; some current models support 128K–2M tokens.

The KV (key-value) cache is an optimization: previously computed attention keys and values are cached so they don't need recomputation for each new token. This makes inference dramatically faster.
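A small sketch shows that caching changes the cost, not the result. The per-token query, key, and value vectors are hypothetical constants here; in a real model they come from learned projections, and recomputing without a cache means rerunning those projections for every earlier token at every step.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Attention output for one query over all available keys and values."""
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

class KVCache:
    """Stores keys and values of past tokens so each step adds only one row."""
    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, q, k, v):
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)

# Hypothetical (query, key, value) vectors for three generated tokens.
tokens = [
    ([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]),
    ([0.0, 1.0], [0.0, 1.0], [3.0, 4.0]),
    ([1.0, 1.0], [1.0, 1.0], [5.0, 6.0]),
]

cache = KVCache()
cached_outputs = [cache.step(q, k, v) for q, k, v in tokens]

# Without a cache: rebuild the full key and value lists at every step.
recomputed = []
for t in range(len(tokens)):
    keys = [k for _, k, _ in tokens[: t + 1]]
    values = [v for _, _, v in tokens[: t + 1]]
    recomputed.append(attend(tokens[t][0], keys, values))

print(cached_outputs == recomputed)  # True
```

The trade-off is memory: the cache grows by one key and one value per layer and head for every token in the context.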

Why Do LLMs "Hallucinate"?

LLMs generate statistically likely next tokens; they don't "look up facts." When queried about something poorly covered by their training data, they can confidently generate plausible-sounding text that is wrong. Mitigations include:

RAG (Retrieval-Augmented Generation): fetch real documents before generating
Tool use: give the model a search tool to look up current facts
Structured outputs: force a JSON schema to reduce freeform hallucination
Temperature = 0: more deterministic outputs for factual tasks (this improves repeatability, not accuracy)
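The retrieval idea can be sketched without any external service. This is a toy under stated assumptions: `DOCUMENTS` is an invented in-memory corpus, word overlap stands in for embedding similarity, and the prompt wording is only one possible template. A real system would use a search index or vector database and then send the prompt to a model.

```python
# Hypothetical in-memory corpus for illustration.
DOCUMENTS = [
    "The KV cache stores attention keys and values from earlier tokens.",
    "Chinchilla showed that smaller models trained on more data can be compute-optimal.",
    "Tokenizers map text to integer IDs.",
]

def words(text):
    """Lowercase the text, drop simple punctuation, and return its word set."""
    return set(text.lower().replace(".", "").replace("?", "").split())

def retrieve(question, documents, k=1):
    """Rank documents by word overlap with the question."""
    q = words(question)
    ranked = sorted(documents, key=lambda d: len(q & words(d)), reverse=True)
    return ranked[:k]

def build_prompt(question, documents, k=1):
    """Place the retrieved text ahead of the question so the model can cite it."""
    context = "\n".join(f"- {d}" for d in retrieve(question, documents, k))
    return (
        "Answer using only the context below. If the answer is not there, say so.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
    )

prompt = build_prompt("What does the KV cache store?", DOCUMENTS)
print(prompt)
```

Grounding the answer in retrieved text gives the model something to copy from instead of something to guess, which is why retrieval reduces hallucination without retraining.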

The Scaling Laws

Kaplan et al. (2020) at OpenAI found that LLM loss falls as a predictable power law in model size (N), dataset size (D), and compute (C). Chinchilla (Hoffmann et al., 2022, DeepMind) refined this: for a fixed compute budget, parameters and training tokens should grow together, at roughly 20 tokens per parameter, so a smaller model trained on more data beats a larger model trained on less.
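A back-of-the-envelope version of the Chinchilla trade-off can be written in a few lines. Two assumptions are baked in and neither comes from this article: training compute is approximated as C ≈ 6·N·D FLOPs, and the compute-optimal ratio is taken as about 20 training tokens per parameter. Both are widely used rules of thumb, not exact laws.

```python
import math

def compute_optimal_split(compute_flops, tokens_per_param=20.0):
    """Solve C = 6 * N * D with D = tokens_per_param * N for N and D."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Roughly Chinchilla's own budget: 70B parameters trained on 1.4T tokens.
budget = 6.0 * 70e9 * 1.4e12
n, d = compute_optimal_split(budget)
print(f"{n:.3g} parameters, {d:.3g} tokens")
```

Because both N and D scale with the square root of C under these assumptions, a 100x larger compute budget suggests a model about 10x larger trained on about 10x more tokens.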

GPT-2 (2019)

1.5B parameters. Shocked researchers with coherent long-form text.

GPT-3 (2020)

175B parameters. Few-shot learning emerged as a surprising capability.

Claude 3 (2024)

Undisclosed size. State-of-the-art reasoning, coding, and safety.

Llama 3.1 (2024)

8B–405B parameters. Open-weight, competitive with proprietary models.

Key Takeaways

LLMs tokenize text → embed tokens → pass through transformer layers → predict next tokens
Self-attention is the core mechanism: every token attends to every other token
Pre-training on massive corpora gives broad knowledge; RLHF aligns the model to be helpful
LLMs generate text autoregressively — one token at a time
Hallucination stems from statistical generation, not factual recall; RAG and tools mitigate it but do not eliminate it
Scaling laws: bigger models + more data + more compute → more capable AI
