Embeddings are the foundation of modern semantic search, RAG, recommendation systems, and clustering. They're also one of the most elegant ideas in machine learning. This article explains what they are, why they work, and how to use them practically.
What is an Embedding?
An embedding is a fixed-size vector of numbers that represents a piece of data (text, image, audio) in a mathematical space where similar things are close together.
For text, a well-trained embedding model converts "The dog ran fast" into something like [0.12, -0.45, 0.87, ..., 0.33], a list of 384, 768, or 1536 numbers depending on the model. The magic: semantically similar sentences produce numerically similar vectors.
The Famous Word2Vec Analogy
Early embedding models demonstrated remarkable algebraic properties:
- embedding("king") - embedding("man") + embedding("woman") ≈ embedding("queen")
- embedding("Paris") - embedding("France") + embedding("Italy") ≈ embedding("Rome")
- embedding("walking") - embedding("walk") ≈ embedding("swimming") - embedding("swim")
These relationships emerge from the statistical co-occurrence patterns in the training data: the model learns that "king" and "queen" appear in similar contexts, just with different gendered words nearby.
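The arithmetic can be tried on a toy example. The sketch below uses hand-made 2-dimensional vectors, with one axis loosely standing for "royalty" and the other for "gender". The values and the helper names `cosine` and `analogy` are invented for illustration; they are not output from Word2Vec or any real model.

```python
import math

# Hypothetical 2-D "embeddings": [royalty, gender]. Invented values, not real model output.
toy_vectors = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.0],
}

def cosine(a, b):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the three inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

print(analogy("king", "man", "woman", toy_vectors))  # queen
```

Real models have hundreds of dimensions and the match is only approximate, which is why the relationships are written with ≈ rather than =.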
How Embedding Models Work
Modern text embedding models use Transformer encoders (like BERT). The model processes the input text and outputs a dense vector that captures its semantic meaning in a high-dimensional space.
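How the encoder's per-token outputs become one fixed-size vector depends on the model. One common approach is mean pooling: average the token vectors while skipping padding. The sketch below shows that step on made-up numbers. The function name `mean_pool` and the toy values are illustrative assumptions, and some models use the first token's vector instead.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1, ignoring padding (mask 0)."""
    dims = len(token_vectors[0])
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask excludes every token")
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dims)]

# Three tokens with 4-dimensional outputs; the last one is padding.
token_vectors = [
    [1.0, 2.0, 0.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [9.0, 9.0, 9.0, 9.0],  # padding token, excluded by the mask
]
attention_mask = [1, 1, 0]

sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)  # [2.0, 1.0, 1.0, 2.0]
```

Whatever the input length, the result has the same number of dimensions, which is what makes the vector "fixed-size".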
from sentence_transformers import SentenceTransformer
import numpy as np
# Load a pre-trained embedding model
model = SentenceTransformer("BAAI/bge-small-en-v1.5") # 33M params, free, fast
# Embed a sentence
sentence = "The quick brown fox jumps over the lazy dog"
embedding = model.encode(sentence)
print(f"Shape: {embedding.shape}") # (384,) -> 384 dimensions
print(f"Type: {embedding.dtype}") # float32
print(f"First 5 values: {embedding[:5]}") # [-0.12, 0.45, 0.03, ...]
# Embed multiple sentences at once (batched for speed)
sentences = [
    "I love cats.",
    "My favorite animal is a cat.",
    "I'm learning Python programming.",
    "The stock market fell today.",
]
embeddings = model.encode(sentences, batch_size=32)
print(f"Shape: {embeddings.shape}") # (4, 384)
Measuring Similarity: Cosine Similarity
The most common similarity metric is cosine similarity, which measures the angle between two vectors and ignores their magnitude. A score of 1 means identical direction (same meaning), 0 means orthogonal (unrelated), and -1 means opposite direction.
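For reference, the definition is the dot product of the two vectors divided by the product of their lengths. A minimal standard-library sketch (the helper name `cosine_sim` is an illustrative choice, not part of any library):

```python
import math

def cosine_sim(a, b):
    """cos(theta) = (a . b) / (|a| * |b|) for two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_sim([1.0, 2.0], [2.0, 4.0]))    # same direction, different length: close to 1
print(cosine_sim([1.0, 0.0], [0.0, 1.0]))    # orthogonal: 0.0
print(cosine_sim([1.0, 2.0], [-1.0, -2.0]))  # opposite direction: close to -1
```

The first pair differs only in length, so the score stays at 1 up to floating-point rounding. That is the sense in which cosine similarity ignores magnitude.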
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
sentences = [
    "I love cats.",                     # Reference
    "My favorite animal is a cat.",     # Should be high similarity
    "Dogs are great pets.",             # Medium similarity
    "The economy is growing rapidly.",  # Should be low similarity
]
embeddings = model.encode(sentences)
# Compare all to the first sentence
reference = embeddings[0].reshape(1, -1)
similarities = cosine_similarity(reference, embeddings[1:])
for sent, score in zip(sentences[1:], similarities[0]):
    print(f"{score:.3f} | {sent}")
# Output (illustrative; exact scores depend on the model):
# 0.923 | My favorite animal is a cat.     <- Very similar
# 0.687 | Dogs are great pets.             <- Moderately similar
# 0.143 | The economy is growing rapidly.  <- Unrelated
Building a Simple Semantic Search System
import numpy as np
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("BAAI/bge-small-en-v1.5")
# Your knowledge base
documents = [
    "Return policy: Items can be returned within 30 days for a full refund.",
    "Shipping: We offer free shipping on orders over $50.",
    "Payment: We accept Visa, Mastercard, PayPal, and Apple Pay.",
    "Customer service is available Monday-Friday 9am-6pm EST.",
    "Our premium membership costs $9.99/month and includes free returns.",
]
# Embed all documents (do this once, store the result)
doc_embeddings = model.encode(documents, normalize_embeddings=True)
def semantic_search(query: str, top_k: int = 3) -> list[dict]:
    """Find the most relevant documents for a query."""
    query_embedding = model.encode([query], normalize_embeddings=True)
    # Both sides are unit-length, so the dot product equals cosine similarity
    scores = (query_embedding @ doc_embeddings.T)[0]
    top_indices = np.argsort(scores)[::-1][:top_k]  # Highest scores first
    return [
        {"document": documents[i], "score": float(scores[i])}
        for i in top_indices
    ]
# Test it
results = semantic_search("How do I send something back?")
for r in results:
    print(f"Score: {r['score']:.3f} | {r['document'][:80]}")
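Since document embeddings only need to be computed once, it can help to save them together with the name of the model that produced them, so a later query is never compared against vectors from a different model. The sketch below uses JSON from the standard library to stay dependency-free. The file layout, the helper names `save_index` and `load_index`, and the placeholder vectors are assumptions for illustration. With NumPy arrays you would call `.tolist()` before saving.

```python
import json
import os
import tempfile

def save_index(path, model_name, documents, vectors):
    """Write documents, their vectors, and the producing model's name to one JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"model": model_name, "documents": documents, "vectors": vectors}, f)

def load_index(path, expected_model):
    """Load an index, refusing to use it if it was built with a different model."""
    with open(path, encoding="utf-8") as f:
        index = json.load(f)
    if index["model"] != expected_model:
        raise ValueError(
            f"Index built with {index['model']!r}, but queries use {expected_model!r}"
        )
    return index["documents"], index["vectors"]

# Placeholder vectors stand in for real model output.
docs = ["doc A", "doc B"]
vecs = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

path = os.path.join(tempfile.mkdtemp(), "index.json")
save_index(path, "BAAI/bge-small-en-v1.5", docs, vecs)

loaded_docs, loaded_vecs = load_index(path, "BAAI/bge-small-en-v1.5")
print(len(loaded_docs), len(loaded_vecs[0]))  # 2 3
```

For larger collections a binary format or a vector database is the more usual choice; the model-name check carries over unchanged.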
Choosing an Embedding Model
- BAAI/bge-small-en-v1.5: Free, 33M params, 384 dims. Fast and good quality. Best for local/offline use.
- text-embedding-3-small: OpenAI. $0.02/1M tokens. 1536 dims. Excellent quality. Best balance of cost and performance.
- text-embedding-3-large: OpenAI. $0.13/1M tokens. 3072 dims. Highest quality from OpenAI.
- voyage-3: Voyage AI (the embeddings provider that Anthropic's documentation points to). Strong results on retrieval tasks; check the current MTEB leaderboard before committing, since rankings change often.
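Dimension count also drives storage and search cost. As a back-of-the-envelope sketch, assuming vectors are stored as float32 (4 bytes per value) with no compression or index overhead, and using a hypothetical helper `index_size_mb`:

```python
BYTES_PER_FLOAT32 = 4

def index_size_mb(num_vectors: int, dims: int) -> float:
    """Raw size in MB (10**6 bytes) of num_vectors float32 vectors of length dims."""
    return num_vectors * dims * BYTES_PER_FLOAT32 / 1_000_000

for dims in (384, 1536, 3072):
    print(f"{dims:>4} dims: {index_size_mb(1_000_000, dims):,.0f} MB per million vectors")
```

For one million documents that works out to roughly 1.5 GB, 6.1 GB, and 12.3 GB respectively, so a smaller model can matter as much for memory as for price.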
Embeddings Beyond Text
- Images: CLIP embeddings allow text and image search in the same vector space
- Code: CodeBERT, StarCoder embeddings for semantic code search
- Audio: Transcribe with Whisper, then embed the transcript with a sentence embedding model for audio search
- Multi-modal: Dedicated multimodal embedding models can embed text+image pairs together (chat models like GPT-4V accept images but do not expose embeddings)
Key Takeaways
- Embeddings are fixed-size vectors where similar content = similar vectors
- Cosine similarity measures semantic similarity between any two embeddings
- Use normalize_embeddings=True: with unit-length vectors, a plain dot product gives the cosine similarity and is faster to compute
- BAAI/bge-small-en-v1.5 for free/local; OpenAI text-embedding-3-small for production quality
- Always embed queries with the same model you indexed with: vectors from different models are not interchangeable
- Embeddings are the engine behind semantic search, RAG, recommendations, and clustering