🟡 Intermediate

How Large Language Models Work:
From Tokens to Intelligence

📚 AI Foundations ⏱ 14 min read 🗓 May 2026

Large Language Models (LLMs) like Claude, GPT-4, and Gemini have fundamentally changed what software can do. But how do they actually work? Understanding the mechanics, from tokenization to attention to RLHF, will make you a far more effective AI practitioner.

Step 1: Tokenization

LLMs don't read text the way humans do. They process tokens: chunks of text that average roughly 3 to 4 characters each in English. A word like "cat" might be a single token, while "tokenization" might be split into ["token", "ization"].

# Example tokenization with tiktoken (OpenAI's tokenizer)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("The Agentic AI Academy teaches advanced AI skills.")
print(tokens)       # [791, 8125, 292, 15592, 17174, 45696, 11084, 15592, 7512, 13]
print(len(tokens))  # 10 tokens for this sentence

Many modern LLMs use a vocabulary of roughly 32,000 to 100,000 tokens, and some use more. The tokenizer maps text to integer IDs, which become the raw input to the model.
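To make the text-to-ID mapping concrete without depending on a downloaded vocabulary, here is a minimal sketch of a greedy longest-match tokenizer. `TOY_VOCAB`, `toy_encode`, and `toy_decode` are hypothetical names invented for this illustration; production tokenizers such as BPE learn their vocabulary from data and apply merge rules rather than this simple matching.

```python
# Toy greedy longest-match tokenizer. TOY_VOCAB is a made-up vocabulary for
# illustration only; real tokenizers learn theirs from a large corpus.
TOY_VOCAB = {"token": 0, "ization": 1, "cat": 2, " ": 3, "s": 4, "<unk>": 5}


def toy_encode(text, vocab=TOY_VOCAB):
    """Split text into the longest matching vocabulary pieces, left to right."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for end in range(len(text), i, -1):
            piece = text[i:end]
            if piece in vocab:
                ids.append(vocab[piece])
                i = end
                break
        else:
            # Nothing matched: emit the unknown token and skip one character.
            ids.append(vocab["<unk>"])
            i += 1
    return ids


def toy_decode(ids, vocab=TOY_VOCAB):
    """Map IDs back to text by inverting the vocabulary."""
    id_to_piece = {idx: piece for piece, idx in vocab.items()}
    return "".join(id_to_piece[idx] for idx in ids)


ids = toy_encode("cats tokenization")
print(ids)              # [2, 4, 3, 0, 1]
print(toy_decode(ids))  # cats tokenization
```

Note how "cats" becomes two tokens ("cat" + "s") because the whole word is not in the toy vocabulary, which mirrors how rarer words get split into several pieces.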

Step 2: Embeddings

Each token ID is converted into a high-dimensional vector called an embedding. This transforms discrete tokens into a continuous numerical space where similar concepts cluster together.

For example, in a well-trained embedding space, the vectors for "king" − "man" + "woman" ≈ "queen". The model learns these relationships from statistical patterns across billions of text examples.

Why embeddings matter: They allow the model to represent meaning mathematically. The geometry of the embedding space encodes semantic relationships that drive the model's understanding.
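The vector arithmetic above can be sketched with a tiny hand-built table. The 2-D vectors in `EMBEDDINGS` are assumptions picked so the analogy works out; real embeddings are learned during training and have hundreds or thousands of dimensions, and the helper names `cosine` and `analogy` are illustrative.

```python
import math

# Hand-picked 2-D vectors (axis 0: "royalty", axis 1: "gender"). These are
# illustrative assumptions, not learned embeddings.
EMBEDDINGS = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.1, 1.0],
    "woman": [0.1, -1.0],
    "apple": [-0.5, 0.2],
}


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, -1.0 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, table=EMBEDDINGS):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(table[a], table[b], table[c])]
    candidates = [w for w in table if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, table[w]))


print(analogy("king", "man", "woman"))  # queen
```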

Step 3: The Transformer Architecture

The Transformer (introduced in "Attention Is All You Need", 2017) is the engine of nearly every modern LLM. Its core building blocks are:

Self-Attention

Self-attention lets every token in the input "look at" every other token simultaneously and decide how relevant each is. This is computed as:

# Conceptual self-attention (simplified)
import torch
import torch.nn.functional as F

def attention(Q, K, V, d_k):
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    weights = F.softmax(scores, dim=-1)  # probability distribution
    return torch.matmul(weights, V)      # weighted combination of values
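For readers who want to trace the numbers by hand, the following is a dependency-free sketch of the same scaled dot-product computation on two tokens. The matrices `Q`, `K`, and `V` are hand-chosen for illustration; in a real model they come from learned projections of the token embeddings.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(Q, K):
    """Row i holds how strongly token i attends to each token j."""
    d_k = len(K[0])
    return [
        softmax([sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k)
                 for k_row in K])
        for q_row in Q
    ]


# Two tokens with 2-dimensional queries, keys, and values (chosen by hand).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]

W = attention_weights(Q, K)
# Each output row is a weighted mix of the value vectors.
out = [[sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
       for row in W]

for weights_row, out_row in zip(W, out):
    print([round(w, 3) for w in weights_row], [round(o, 3) for o in out_row])
```

Each row of weights sums to 1, and because each query here matches its own key best, each token places most of its weight on itself.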

Multi-Head Attention

Instead of one attention function, Transformers use multiple "heads" in parallel, each learning to attend to different kinds of relationships (syntax, coreference, semantics, etc.). Their outputs are concatenated and projected.
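The split-attend-concatenate pattern can be sketched as follows. This is a simplification under stated assumptions: it slices the vectors into heads directly and omits the learned per-head input projections and the final output projection that a real Transformer applies. The function names are illustrative.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                           for k in K])
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


def split_heads(X, num_heads):
    """Slice each token vector into num_heads equal chunks."""
    d_head = len(X[0]) // num_heads
    return [[row[h * d_head:(h + 1) * d_head] for row in X]
            for h in range(num_heads)]


def multi_head_attention(Q, K, V, num_heads):
    """Run attention per head, then concatenate head outputs per token.

    Learned projections (W_Q, W_K, W_V, W_O) are omitted in this sketch.
    """
    heads = [attend(q, k, v)
             for q, k, v in zip(split_heads(Q, num_heads),
                                split_heads(K, num_heads),
                                split_heads(V, num_heads))]
    return [sum((head[t] for head in heads), []) for t in range(len(Q))]


X = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5]]  # 2 tokens, d_model = 4
result = multi_head_attention(X, X, X, num_heads=2)
print(len(result), len(result[0]))  # 2 4
```

The output keeps the input shape (2 tokens, 4 dimensions), but each half of every output vector was computed by a head that saw only its own slice.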

Feed-Forward Layers

After attention, each position passes through a position-wise feed-forward network: two linear transformations with a nonlinear activation (such as ReLU or GELU) between them. These layers are thought to be where much of the model's "knowledge" is stored.
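A minimal sketch of such a feed-forward block, using ReLU and hand-set weights, is shown below. The weights `W1`, `b1`, `W2`, and `b2` are assumptions chosen so the result is easy to verify by hand (this particular network reproduces its input); trained weights would of course compute something far less trivial.

```python
def relu(x):
    return max(0.0, x)


def linear(x, W, b):
    """W[j] is the weight vector for output unit j."""
    return [sum(w * xi for w, xi in zip(W[j], x)) + b[j] for j in range(len(W))]


def feed_forward(x, W1, b1, W2, b2):
    """Expand to the hidden width, apply ReLU, project back down."""
    hidden = [relu(h) for h in linear(x, W1, b1)]
    return linear(hidden, W2, b2)


# Hand-set weights (illustrative): 2 -> 4 -> 2 dimensions.
W1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
b2 = [0.0, 0.0]

# "Position-wise" means the same network is applied to each token independently.
sequence = [[0.5, -2.0], [3.0, 1.0]]
outputs = [feed_forward(x, W1, b1, W2, b2) for x in sequence]
print(outputs)  # [[0.5, -2.0], [3.0, 1.0]]
```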

Step 4: Pre-training (Next Token Prediction)

LLMs are pre-trained on vast corpora of internet text using a simple objective: predict the next token. Given "The sky is", the model learns to predict "blue" (or similar tokens) with high probability.

  1. Collect data: Billions of web pages, books, code, and scientific papers (on the order of trillions of tokens)
  2. Initialize: Random weights across dozens to over a hundred transformer layers and billions of parameters
  3. Train: For each sequence, compute predicted probabilities; minimize cross-entropy loss via gradient descent
  4. Scale: More data + more parameters + more compute → emergent capabilities appear (reasoning, coding, etc.)
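The training objective in the steps above comes down to one number per position: the cross-entropy between the predicted distribution and the token that actually came next. The sketch below computes it for a toy four-word vocabulary; `VOCAB` and the logit values are made up for illustration.

```python
import math


def next_token_loss(logits, target_id):
    """Cross-entropy for one position: -log softmax(logits)[target_id]."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_id]


VOCAB = ["blue", "green", "falling", "the"]  # toy vocabulary (assumed)
target = VOCAB.index("blue")                 # true next token after "The sky is"

untrained = [0.0, 0.0, 0.0, 0.0]   # uniform guess: every token equally likely
trained = [4.0, 1.0, 0.5, -1.0]    # most of the probability mass on "blue"

print(round(next_token_loss(untrained, target), 3))  # 1.386, i.e. log(4)
print(next_token_loss(trained, target) < next_token_loss(untrained, target))  # True
```

Gradient descent adjusts the weights so that this loss, averaged over every position in the corpus, goes down.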

OpenAI has not disclosed GPT-4's size; unofficial estimates put it at roughly a trillion parameters or more, trained on thousands of GPUs for months. The reported training cost: more than $100 million.

Step 5: Fine-tuning & RLHF

A pre-trained model is a raw language predictor: given "Tell me how to make a bomb", it may simply continue the text with instructions. Fine-tuning shapes it into a helpful, safe assistant.

Supervised Fine-Tuning (SFT)

Human trainers write example conversations showing the desired assistant behavior. The model is fine-tuned on this dataset.

Reinforcement Learning from Human Feedback (RLHF)

  1. The model generates multiple responses to the same prompt
  2. Human raters rank the responses from best to worst
  3. A reward model is trained to predict human preferences
  4. The LLM is fine-tuned using RL (PPO) to maximize the reward model's score

RLHF is why ChatGPT felt different: Prior LLMs were raw predictors. RLHF aligned them with human preferences, aiming to make them helpful, harmless, and honest. Claude is trained with Constitutional AI (CAI), Anthropic's method that extends RLHF with AI feedback guided by a written set of principles.
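Step 3, training the reward model, is often implemented with a pairwise (Bradley-Terry style) loss: the reward for the preferred response should exceed the reward for the rejected one. The source does not specify a loss, so treat the sketch below as one common choice rather than the method any particular lab uses; `preference_loss` is an illustrative name.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected)).

    Written as log(1 + exp(-margin)) in a numerically stable form.
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The reward model cannot tell the responses apart: loss is log(2).
print(round(preference_loss(0.0, 0.0), 3))  # 0.693

# The reward model agrees with the human ranking: small loss.
print(preference_loss(3.0, -1.0) < preference_loss(0.0, 0.0))  # True

# The reward model disagrees with the human ranking: large loss.
print(preference_loss(-1.0, 3.0) > preference_loss(0.0, 0.0))  # True
```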

Step 6: Inference (How LLMs Generate Text)

At inference time, the model generates text one token at a time. Each new token becomes part of the context for predicting the next token; this is called autoregressive generation.

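# Pseudocode for autoregressive generation
# (tokenize, model, softmax, sample, detokenize, EOS_TOKEN are placeholders)
def generate(prompt, max_tokens, temp=1.0):
    context = tokenize(prompt)
    for _ in range(max_tokens):
        logits = model(context)           # next-token logits, shape: [vocab_size]
        probs = softmax(logits / temp)    # temperature controls randomness (temp > 0)
        next_token = sample(probs)        # or argmax(logits) for greedy decoding
        if next_token == EOS_TOKEN:       # stop before appending the end marker
            break
        context.append(next_token)
    return detokenize(context)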

Temperature controls creativity: lower values (0.0 to 0.3) make output more deterministic, with 0 usually treated as greedy decoding; higher values (0.8 to 1.5) make it more diverse and creative.
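The effect of temperature is easy to see on a small set of logits. In the sketch below the logit values are made up, and treating a temperature of 0 as an argmax is a common convention assumed here, not something every API guarantees.

```python
import math
import random


def softmax_with_temperature(logits, temp):
    """Divide logits by temp before softmax; requires temp > 0."""
    if temp <= 0:
        raise ValueError("temp must be positive; use argmax for greedy decoding")
    scaled = [x / temp for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temp, rng):
    """Pick a token index; temp == 0 falls back to greedy (argmax)."""
    if temp == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temp)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # toy next-token logits (assumed values)
for temp in (0.2, 1.0, 1.5):
    probs = softmax_with_temperature(logits, temp)
    print(temp, [round(p, 3) for p in probs])

rng = random.Random(0)  # seeded so repeated runs draw the same samples
print(sample_token(logits, 0, rng))  # 0 (greedy always picks the top logit)
```

At low temperature nearly all the probability lands on the top token; at high temperature the distribution flattens and less likely tokens get sampled more often.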

Context Window & KV Cache

The context window is the maximum number of tokens the model can process at once. Early GPT models had context windows of 2K tokens or fewer; many modern models support 128K to 2M tokens.

The KV (key-value) cache is an optimization: previously computed attention keys and values are cached so they don't need recomputation for each new token. This makes inference dramatically faster.
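A minimal single-head sketch of the idea: each decoding step projects only the new token, appends its key and value to the cache, and attends over everything stored so far. The `project_*` functions are hypothetical stand-ins for learned weight matrices, and the class name `KVCache` is illustrative.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Attention output for a single query over all keys and values."""
    d_k = len(query)
    weights = softmax([sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
                       for key in keys])
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]


# Hypothetical stand-ins for the learned query/key/value projections.
def project_query(x):
    return list(x)


def project_key(x):
    return [2.0 * xi for xi in x]


def project_value(x):
    return [xi + 1.0 for xi in x]


class KVCache:
    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, x):
        """Process one new token, reusing previously cached keys and values."""
        self.keys.append(project_key(x))
        self.values.append(project_value(x))
        return attend(project_query(x), self.keys, self.values)


def step_without_cache(tokens):
    """Recompute every key and value from scratch for the latest token."""
    keys = [project_key(t) for t in tokens]
    values = [project_value(t) for t in tokens]
    return attend(project_query(tokens[-1]), keys, values)


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = KVCache()
matches = []
for t in range(len(tokens)):
    cached = cache.step(tokens[t])
    full = step_without_cache(tokens[: t + 1])
    matches.append(all(math.isclose(a, b) for a, b in zip(cached, full)))

print(matches)  # [True, True, True]
```

Both paths give the same output, but the cached path performs one key and one value projection per step, while the uncached path repeats them for the whole prefix every time.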

Why Do LLMs "Hallucinate"?

LLMs generate statistically likely next tokens; they don't "look up facts." When queried about something outside their training distribution, they can confidently generate plausible-sounding text that is wrong. Solutions include:

The Scaling Laws

Kaplan et al. (2020) at OpenAI found that LLM performance follows predictable power laws with respect to model size (N), dataset size (D), and compute (C). The Chinchilla paper (Hoffmann et al., 2022) refined this: for a given compute budget, model size and training tokens should grow in roughly equal proportion, which in practice meant training a smaller model on more data rather than a large model on less data.
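As a back-of-the-envelope illustration, two widely quoted approximations can turn a compute budget into a model size and token count: training compute C ≈ 6·N·D FLOPs, and roughly 20 training tokens per parameter. Both constants are rules of thumb assumed here, not exact results, and `compute_optimal` is an illustrative helper.

```python
import math

# Rule-of-thumb constants (assumptions, not exact laws).
FLOPS_PER_PARAM_TOKEN = 6.0   # C is approximately 6 * N * D
TOKENS_PER_PARAM = 20.0       # D is approximately 20 * N at the compute optimum


def compute_optimal(compute_flops):
    """Solve C = 6 * N * D with D = 20 * N for N (parameters) and D (tokens)."""
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


n, d = compute_optimal(5.88e23)
print(f"{n:.2e} parameters, {d:.2e} tokens")  # 7.00e+10 parameters, 1.40e+12 tokens
```

The example budget corresponds to a 70B-parameter model trained on 1.4T tokens, which matches Chinchilla's own configuration.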

GPT-2 (2019)

1.5B parameters. Shocked researchers with coherent long-form text.

GPT-3 (2020)

175B parameters. Few-shot learning emerged as a surprising capability.

Claude 3 (2024)

Undisclosed size. Strong reasoning and coding performance at release, with an emphasis on safety.

Llama 3.1 (2024)

8B to 405B parameters. Open-weight, competitive with proprietary models.

Key Takeaways