Large Language Models (LLMs) like Claude, GPT-4, and Gemini have fundamentally changed what software can do. But how do they actually work? Understanding the mechanics, from tokenization to attention to RLHF, will make you a far more effective AI practitioner.
Step 1: Tokenization
LLMs don't read text the way humans do. They process tokens: chunks of text that average roughly 3-4 characters in English. A short word like "cat" might be one token, while "tokenization" might be split into ["token", "ization"].
# Example tokenization with tiktoken (OpenAI's tokenizer)
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("The Agentic AI Academy teaches advanced AI skills.")
print(tokens) # [791, 8125, 292, 15592, 17174, 45696, 11084, 15592, 7512, 13]
print(len(tokens)) # 10 tokens for this sentence
Modern LLMs use vocabularies ranging from roughly 32,000 to well over 100,000 tokens. The tokenizer maps text to integer IDs, which become the raw input to the model.
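To make the text-to-ID mapping concrete without any external library, here is a minimal sketch of a greedy longest-match tokenizer. The vocabulary, the `<unk>` fallback, and the function names `toy_encode` and `toy_decode` are all made up for illustration. Real tokenizers such as BPE learn their vocabulary and merge rules from data, so treat this only as a picture of the idea.

```python
# Toy greedy longest-match tokenizer. The vocabulary below is invented for
# illustration; real tokenizers learn theirs (for example via BPE) from data.
TOY_VOCAB = {"token": 0, "ization": 1, "cat": 2, " ": 3, "s": 4, "<unk>": 5}


def toy_encode(text, vocab=TOY_VOCAB):
    """Split text into the longest matching vocabulary pieces, left to right."""
    ids = []
    i = 0
    max_len = max(len(piece) for piece in vocab)
    while i < len(text):
        # Try the longest candidate piece first, then shrink.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                ids.append(vocab[piece])
                i += length
                break
        else:
            # Nothing matched: emit the unknown token and skip one character.
            ids.append(vocab["<unk>"])
            i += 1
    return ids


def toy_decode(ids, vocab=TOY_VOCAB):
    """Map IDs back to their text pieces and join them."""
    inverse = {idx: piece for piece, idx in vocab.items()}
    return "".join(inverse[i] for i in ids)


print(toy_encode("tokenization"))  # [0, 1]
print(toy_encode("cats"))          # [2, 4]
```

Decoding reverses the lookup, so `toy_decode(toy_encode("cats token"))` returns the original string as long as every character is covered by the vocabulary.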
Step 2: Embeddings
Each token ID is converted into a high-dimensional vector called an embedding. This transforms discrete tokens into a continuous numerical space where similar concepts cluster together.
For example, in a well-trained embedding space, "king" - "man" + "woman" ≈ "queen": subtracting and adding the corresponding vectors lands near the vector for "queen". The model learns these relationships from statistical patterns across billions of text examples.
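The analogy can be reproduced with a handful of hand-picked vectors. This is a sketch under strong assumptions: the 3-dimensional vectors below are invented so that the axes loosely mean "royalty", "male", and "female", whereas real embeddings are learned and have hundreds or thousands of dimensions. `EMBEDDINGS`, `cosine`, and `nearest` are illustrative names.

```python
import math

# Hand-picked toy vectors (assumption: axes loosely mean royalty, male, female).
EMBEDDINGS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(vector, exclude=()):
    """Return the vocabulary word whose embedding is most similar to vector."""
    candidates = [word for word in EMBEDDINGS if word not in exclude]
    return max(candidates, key=lambda word: cosine(vector, EMBEDDINGS[word]))


# king - man + woman, computed component by component.
analogy = [
    k - m + w
    for k, m, w in zip(EMBEDDINGS["king"], EMBEDDINGS["man"], EMBEDDINGS["woman"])
]
print(nearest(analogy, exclude=("king", "man", "woman")))  # queen
```

The three input words are excluded from the search, which is the usual convention for this kind of analogy test.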
Step 3: The Transformer Architecture
The Transformer (introduced in "Attention Is All You Need", 2017) is the engine of virtually every modern LLM. Its core building blocks are:
Self-Attention
Self-attention lets every token in the input "look at" the other tokens in parallel and decide how relevant each one is. (In decoder-only LLMs, a causal mask limits each token to itself and earlier positions.) It is computed as follows:
- Each token creates a Query (what am I looking for?)
- Each token creates a Key (what do I contain?)
- Each token creates a Value (what should I contribute?)
- Attention scores = softmax(Q·Kᵀ / √d), a probability distribution over all tokens (d is the key dimension, d_k in the code)
- Output = weighted sum of Values
# Conceptual self-attention (simplified)
import torch
import torch.nn.functional as F

def attention(Q, K, V, d_k):
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    weights = F.softmax(scores, dim=-1)  # probability distribution over keys
    return torch.matmul(weights, V)      # weighted combination of values
Multi-Head Attention
Instead of one attention function, Transformers use multiple "heads" in parallel, each learning to attend to different kinds of relationships (syntax, coreference, semantics, etc.). Their outputs are concatenated and projected back to the model dimension.
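The split, attend, concatenate, and project sequence can be sketched in plain Python. This is an illustration only: the projection matrices are random stand-ins for learned parameters, no causal mask is applied, and the helper names (`matmul`, `random_matrix`, `multi_head_attention`) are assumptions made for the example.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention for a single head."""
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v)


def random_matrix(rows, cols, rng):
    """Random stand-in for a learned weight matrix."""
    return [[rng.gauss(0, 1) / math.sqrt(rows) for _ in range(cols)] for _ in range(rows)]


def multi_head_attention(x, num_heads, rng):
    """x has shape [seq_len][d_model]; returns the same shape."""
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    head_outputs = []
    for _ in range(num_heads):
        # Each head gets its own query, key, and value projections.
        w_q = random_matrix(d_model, d_head, rng)
        w_k = random_matrix(d_model, d_head, rng)
        w_v = random_matrix(d_model, d_head, rng)
        head_outputs.append(attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)))
    # Concatenate the heads along the feature axis, then apply the output projection.
    concat = [sum((head[t] for head in head_outputs), []) for t in range(len(x))]
    w_o = random_matrix(d_model, d_model, rng)
    return matmul(concat, w_o)


rng = random.Random(0)
x = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]  # 4 tokens, d_model = 8
out = multi_head_attention(x, num_heads=2, rng=rng)
print(len(out), len(out[0]))  # 4 8
```

Because each head works in a smaller subspace (`d_model / num_heads` features), the total cost stays close to that of a single full-width head.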
Feed-Forward Layers
After attention, each position passes through a position-wise feed-forward network: two linear transformations with a nonlinear activation (ReLU or GELU) in between. These layers are thought to be where much of the model's "knowledge" is stored.
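A position-wise feed-forward block is small enough to write out by hand. The sketch below assumes toy sizes (model width 2, hidden width 4) and hand-written weights. In real models the weights are learned and the hidden width is commonly around four times the model width. `gelu`, `linear`, and `feed_forward` are illustrative names.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(vec, weight, bias):
    """weight has shape [out_dim][in_dim]; returns weight @ vec + bias."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def feed_forward(vec, w1, b1, w2, b2):
    """Expand to the hidden width, apply GELU, project back to the model width."""
    hidden = [gelu(h) for h in linear(vec, w1, b1)]
    return linear(hidden, w2, b2)


# Toy parameters (assumption): d_model = 2, hidden width = 4.
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 1.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
w2 = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]
b2 = [0.0, 0.0]

# "Position-wise" means the same weights are applied to each token independently.
sequence = [[1.0, 2.0], [0.0, 0.0]]
outputs = [feed_forward(token_vec, w1, b1, w2, b2) for token_vec in sequence]
print(outputs)
```

Unlike attention, this block never mixes information between positions. Each token vector is transformed on its own.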
Step 4: Pre-training (Next Token Prediction)
LLMs are pre-trained on vast corpora of internet text using a simple objective: predict the next token. Given "The sky is", the model learns to predict "blue" (or similar tokens) with high probability.
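The next-token objective is usually implemented as a cross-entropy loss between the model's predicted distribution at position t and the actual token at position t+1. The sketch below assumes a hypothetical 4-token vocabulary and hand-written logits in place of a real model. `next_token_loss` is an illustrative name.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(logits_per_position, token_ids):
    """Average cross-entropy of predicting token t+1 from the logits at position t."""
    targets = token_ids[1:]  # shift by one: each position predicts its successor
    losses = []
    for logits, target in zip(logits_per_position, targets):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


# Hypothetical vocabulary: 0 = "The", 1 = " sky", 2 = " is", 3 = " blue"
token_ids = [0, 1, 2, 3]

# Made-up logits: one model strongly favors the correct next token, one is clueless.
confident = [[0, 5, 0, 0], [0, 0, 5, 0], [0, 0, 0, 5]]
uniform = [[0, 0, 0, 0]] * 3

confident_loss = next_token_loss(confident, token_ids)
uniform_loss = next_token_loss(uniform, token_ids)
print(confident_loss < uniform_loss)  # True
```

A model that spreads probability evenly over the 4 tokens scores exactly log(4). Training pushes the loss below that by shifting probability toward the tokens that actually follow.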
OpenAI has not disclosed GPT-4's size or training details. Unofficial estimates put it at roughly a trillion parameters or more, trained on thousands of GPUs for months, at a reported cost of over $100 million.
Step 5: Fine-tuning & RLHF
A pre-trained model is a raw language predictor: it simply continues whatever text it is given, so it might complete "Tell me how to make a bomb" with instructions. Fine-tuning shapes it into a helpful, safe assistant.
Supervised Fine-Tuning (SFT)
Human trainers write example conversations showing the desired assistant behavior. The model is fine-tuned on this dataset.
Reinforcement Learning from Human Feedback (RLHF)
- The model generates multiple responses to the same prompt
- Human raters rank the responses from best to worst
- A reward model is trained to predict human preferences
- The LLM is fine-tuned with reinforcement learning (commonly PPO) to maximize the reward model's score
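The reward-model step in the list above is often trained with a pairwise (Bradley-Terry style) loss: the reward of the preferred response should exceed the reward of the rejected one. The text does not specify a loss, so this formulation is an assumption, and the scalar rewards below are made-up numbers standing in for a reward model's outputs.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), in a numerically stable form."""
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(d)) equals log(1 + exp(-d)); branch to avoid overflow in exp.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def preference_probability(reward_chosen, reward_rejected):
    """Probability the reward model assigns to the human's preferred ordering."""
    return math.exp(-pairwise_preference_loss(reward_chosen, reward_rejected))


# Made-up rewards for two responses to the same prompt.
agrees_with_rater = pairwise_preference_loss(2.0, 0.5)     # small loss
disagrees_with_rater = pairwise_preference_loss(0.5, 2.0)  # large loss
print(agrees_with_rater < disagrees_with_rater)  # True
```

Minimizing this loss over many ranked pairs teaches the reward model to score responses the way raters would, and that score is what the RL step then optimizes.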
Step 6: Inference - How LLMs Generate Text
At inference time, the model generates text one token at a time. Each new token becomes part of the context for predicting the next token. This is called autoregressive generation.
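# Autoregressive generation (sketch: model, tokenize, and detokenize are placeholder callables)
import torch

def generate(model, tokenize, detokenize, prompt, max_tokens=50, temp=1.0, eos_token=None):
    context = tokenize(prompt)                        # list of token IDs
    for _ in range(max_tokens):
        logits = model(context)                       # next-token logits, shape: [vocab_size]
        if temp == 0:
            next_token = int(torch.argmax(logits))    # greedy decoding
        else:
            probs = torch.softmax(logits / temp, dim=-1)    # temperature controls randomness
            next_token = int(torch.multinomial(probs, 1))   # sample from the distribution
        context.append(next_token)
        if next_token == eos_token:
            break
    return detokenize(context)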
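Temperature controls randomness: lower values (0.0-0.3) give more deterministic output, higher values (0.8-1.5) give more diverse, creative output. A temperature of exactly 0 is treated as greedy decoding (argmax), since dividing the logits by zero is undefined.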
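The effect of temperature is easy to see on a fixed set of logits. The sketch below uses three made-up scores and illustrative function names. It shows how dividing by a small temperature sharpens the distribution while a large one flattens it.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature (temperature must be > 0)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Greedy argmax when temperature is 0, otherwise sample from the softmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.2, 1.0, 1.5):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])

print(sample_token(logits, 0, random.Random(0)))  # 0 (always the top-scoring token)
```

At 0.2 nearly all probability mass sits on the top token. At 1.5 the lower-ranked tokens get a meaningful share, which is where the extra diversity comes from.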
Context Window & KV Cache
The context window is the maximum number of tokens the model can process at once. GPT-2 had a 1K-token window and GPT-3 started at 2K; modern models support 128K to 2M tokens.
The KV (key-value) cache is an optimization: the attention keys and values computed for earlier tokens are cached, so each generation step only computes them for the newest token instead of reprocessing the whole context. This makes inference dramatically faster.
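A minimal sketch of the caching idea, for a single attention head in plain Python. The query, key, and value vectors are made-up numbers. In a real model they come from learned projections of each token, and every layer and head keeps its own cache. `KVCache` and `full_recompute` are illustrative names.

```python
import math


def attend(query, keys, values):
    """Attention output for one query over a list of keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e / total * v[j] for e, v in zip(exps, values)) for j in range(dim)]


class KVCache:
    """Keeps the keys and values of earlier tokens so they are computed only once."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        # Only the newest token's key and value are added; older ones are reused.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)


def full_recompute(queries, keys, values, t):
    """Reference result for position t using all positions up to and including t."""
    return attend(queries[t], keys[: t + 1], values[: t + 1])


# Made-up per-token vectors (assumption: already projected to Q, K, V).
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

cache = KVCache()
for t in range(3):
    cached_out = cache.step(queries[t], keys[t], values[t])
    reference = full_recompute(queries, keys, values, t)
    assert all(abs(a - b) < 1e-12 for a, b in zip(cached_out, reference))
print(len(cache.keys))  # 3
```

The cached path gives the same outputs as recomputing from scratch. The saving is that the key and value projections for old tokens never have to be run again, at the cost of memory that grows with the context length.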
Why Do LLMs "Hallucinate"?
LLMs generate statistically likely next tokens; they don't "look up facts." When queried about something outside their training distribution, they can confidently generate plausible-sounding text that is wrong. Mitigations include:
- RAG (Retrieval-Augmented Generation): fetch real documents before generating
- Tool use: give the model a search tool to look up current facts
- Structured outputs: enforce a JSON schema to reduce freeform hallucination
- Temperature = 0: more deterministic outputs for factual tasks
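The first item in the list, RAG, boils down to "retrieve, then prompt". The sketch below is deliberately simplistic: it ranks a few made-up documents by word overlap as a stand-in for real vector search, and it stops at building the prompt rather than calling any particular LLM API. `DOCUMENTS`, `retrieve`, and `build_prompt` are illustrative names.

```python
import re

# A tiny in-memory "knowledge base" (made-up documents, for illustration only).
DOCUMENTS = [
    "The KV cache stores attention keys and values from earlier tokens.",
    "Temperature rescales logits before sampling the next token.",
    "Chinchilla showed that smaller models trained on more data can win.",
]


def words(text):
    """Lowercased set of alphanumeric words in text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, top_k=1):
    """Rank documents by word overlap with the question (stand-in for vector search)."""
    question_words = words(question)
    ranked = sorted(documents, key=lambda doc: len(question_words & words(doc)), reverse=True)
    return ranked[:top_k]


def build_prompt(question, documents):
    """Put the retrieved text in front of the question so the model can ground its answer."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents))
    return (
        "Answer using only the context below. If the answer is not there, say so.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
    )


prompt = build_prompt("What does the KV cache store?", DOCUMENTS)
print(prompt)  # this string would then be sent to the LLM
```

The instruction to answer only from the supplied context is what shifts the model from recalling to reading, which is why retrieval tends to reduce hallucination on factual questions.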
The Scaling Laws
Kaplan et al. (2020) at OpenAI found that LLM performance follows predictable power laws with respect to model size (N), dataset size (D), and compute (C). Chinchilla (Hoffmann et al., 2022, at DeepMind) refined this: for a given compute budget, model size and training data should grow together, which in practice means training a smaller model on more data rather than a large model on less data.
Milestones along the way:
- GPT-2 (2019): 1.5B parameters. Shocked researchers with coherent long-form text.
- GPT-3 (2020): 175B parameters. Few-shot learning emerged as a surprising capability.
- Claude 3 (2024): Undisclosed size. State-of-the-art reasoning, coding, and safety at release.
- Llama 3.1 (2024): 8B-405B parameters. Open-weight, competitive with proprietary models.
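To put rough numbers on the Chinchilla trade-off, the sketch below relies on two widely used approximations that are assumptions here, not exact laws: training compute C ≈ 6·N·D FLOPs, and a compute-optimal ratio of about 20 training tokens per parameter. It takes a GPT-3-scale budget (175B parameters, roughly 300B training tokens) and asks how the same compute would be split under that rule of thumb.

```python
import math


def training_flops(params, tokens):
    """Common approximation: about 6 FLOPs per parameter per training token."""
    return 6.0 * params * tokens


def compute_optimal_split(flops_budget, tokens_per_param=20.0):
    """Solve 6 * N * D = C with D = tokens_per_param * N (rule-of-thumb ratio)."""
    params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    return params, tokens_per_param * params


# GPT-3-scale budget: 175B parameters trained on roughly 300B tokens.
budget = training_flops(175e9, 300e9)
n_opt, d_opt = compute_optimal_split(budget)

print(f"budget: {budget:.2e} FLOPs")
print(f"params: {n_opt / 1e9:.0f}B")
print(f"tokens: {d_opt / 1e12:.2f}T")
```

Under these assumptions the same budget favors a model of roughly 51B parameters trained on about 1T tokens: less than a third of the size, more than three times the data.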
Key Takeaways
- LLMs tokenize text → embed tokens → pass through transformer layers → predict next tokens
- Self-attention is the core mechanism: every token attends to every other token
- Pre-training on massive corpora gives broad knowledge; RLHF aligns the model to be helpful
- LLMs generate text autoregressively, one token at a time
- Hallucination stems from statistical generation, not factual recall; RAG and tools are the main mitigations
- Scaling laws: bigger models + more data + more compute → more capable AI