Embeddings and Semantic Search

Here is a question that sounds philosophical but is actually an engineering problem: how do you teach a computer what words mean?

Not what they look like — computers have handled character encoding since the 1960s. Not how they are spelled — spell checkers have been around since the 1970s. What they mean. That "king" relates to "queen" the way "man" relates to "woman." That "bank" near "river" means something different from "bank" near "deposit." That a document about "canine nutrition" is relevant to a search for "what to feed my dog."

The answer, it turns out, involves converting language into geometry. You represent words, sentences, and documents as points in high-dimensional space, arranged so that things with similar meanings are near each other and things with different meanings are far apart. These representations are called embeddings, and they are the foundation of modern semantic search.

This chapter explains what embeddings are, how they work, and how to use them to build search systems that understand meaning rather than merely matching keywords.

A Brief History of Word Representations

One-Hot Encoding: The Naive Approach

The simplest way to represent words numerically: assign each word in your vocabulary a unique index, and represent it as a vector with a 1 at that index and 0s everywhere else. If your vocabulary has 50,000 words, each word is a 50,000-dimensional vector with exactly one non-zero entry.

This works for some purposes, but it encodes zero semantic information. The vectors for "cat" and "dog" are exactly as far apart as the vectors for "cat" and "democracy." Every word is equally different from every other word. For knowledge management, where the entire point is understanding relationships between concepts, this is useless.

Word2Vec: The Revolution

In 2013, Tomas Mikolov and colleagues at Google published a paper that changed natural language processing. Word2Vec trains a shallow neural network to predict either a word from its context (CBOW) or the context from a word (Skip-gram). The hidden layer weights, once trained, serve as dense vector representations of words — typically 100 to 300 dimensions rather than 50,000.

The magic of Word2Vec is that these learned representations capture semantic relationships as geometric relationships. The famous example:

vector("king") - vector("man") + vector("woman") ≈ vector("queen")

This is not a parlor trick. It reflects the fact that the model has learned, purely from co-occurrence patterns in text, that "king" and "queen" have the same relationship as "man" and "woman." Gender, tense, plurality, geography — all of these semantic relationships map to directions in the vector space.
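The analogy arithmetic can be made concrete with a toy example. The vectors below are hand-picked rather than learned (real Word2Vec vectors are 100 to 300 dimensional and come from training), but they show the mechanics:

```python
import numpy as np

# Hand-picked toy vectors (dims roughly: royalty, maleness, femaleness,
# commonness). Real Word2Vec vectors are learned, not designed.
vecs = {
    "king":  np.array([0.9, 0.9, 0.1, 0.2]),
    "queen": np.array([0.9, 0.1, 0.9, 0.2]),
    "man":   np.array([0.1, 0.9, 0.1, 0.8]),
    "woman": np.array([0.1, 0.1, 0.9, 0.8]),
}

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman should land nearest to queen
target = vecs["king"] - vecs["man"] + vecs["woman"]
nearest = max(vecs, key=lambda w: cosine(vecs[w], target))
print(nearest)  # queen
```

With learned vectors the result is approximate rather than exact, but the nearest-neighbor lookup works the same way.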

Word2Vec has limitations. Each word gets exactly one vector, regardless of context. The word "bank" has the same representation whether it appears in "river bank" or "bank account." This is a significant problem for polysemous words and for knowledge bases that cover multiple domains.

GloVe and FastText

GloVe (Global Vectors for Word Representation), developed at Stanford in 2014, takes a different training approach — factorizing the word co-occurrence matrix — but produces similar results. FastText, from Facebook in 2016, extends Word2Vec by representing words as bags of character n-grams, which allows it to generate embeddings for words it has never seen (by composing embeddings of their subword components).

These are historically important, but for practical knowledge management applications today, they have been superseded by transformer-based models.

Transformer-Based Embeddings: The Modern Era

The transformer architecture, introduced in 2017's "Attention Is All You Need," changed everything. Models like BERT (2018) produce contextualized embeddings — the same word gets different vectors depending on its surrounding context. "Bank" in "river bank" and "bank" in "bank account" now have different representations. This is an enormous improvement for semantic understanding.

But BERT and its siblings were designed for classification and token-level tasks, not for generating sentence or document embeddings. Naively using the average of BERT's token embeddings as a sentence embedding produces results that are often worse than simpler methods. This gap was addressed by Sentence-BERT (SBERT) in 2019, which fine-tunes BERT using siamese and triplet networks to produce semantically meaningful sentence embeddings.

Modern embedding models — the ones you will actually use in production — build on this foundation with further architectural improvements, larger training sets, and optimization specifically for retrieval tasks.

The Geometry of Meaning

Understanding embeddings geometrically is not just an academic exercise. It directly informs how you design and debug semantic search systems.

Directions Encode Relationships

In a well-trained embedding space, semantic relationships correspond to directions. The direction from "Paris" to "France" is approximately the same as the direction from "Berlin" to "Germany." The direction from "walk" to "walked" is approximately the same as "swim" to "swam."

This means you can discover relationships by doing vector arithmetic. More practically, it means that a search for "French cuisine" will naturally find documents about "Parisian restaurants" because they occupy nearby regions of the embedding space.

Clusters Encode Categories

Words and documents with similar topics or themes cluster together. Medical terminology occupies one region, legal terminology another, cooking vocabulary a third. Within the cooking cluster, baking terms cluster separately from grilling terms.

This clustering behavior is what makes semantic search work. When you search for "chocolate cake recipe," the query embedding lands in the baking sub-cluster, and the nearest documents are other baking-related content — even if they use different specific words.

Distance Encodes Similarity

The distance between two points in embedding space reflects their semantic similarity. Nearby points are semantically related; distant points are unrelated. This is a continuous measure — unlike keyword search, which gives you a binary match/no-match, embeddings give you a gradient of relevance.

This continuous similarity has practical implications. You can set a similarity threshold below which results are considered irrelevant. You can rank results by similarity score. You can detect near-duplicates by finding documents with very high similarity.
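As a sketch of the near-duplicate case: given a matrix of L2-normalized embeddings (toy vectors below, not real model output), pairwise cosine similarities come from a single matrix product, and a threshold picks out the duplicates:

```python
import numpy as np

def find_near_duplicates(embeddings, threshold=0.95):
    """Return (i, j, similarity) pairs above `threshold`.

    Assumes rows are already L2-normalized, so embeddings @ embeddings.T
    gives pairwise cosine similarities directly.
    """
    sims = embeddings @ embeddings.T
    return [
        (i, j, float(sims[i, j]))
        for i in range(len(embeddings))
        for j in range(i + 1, len(embeddings))
        if sims[i, j] >= threshold
    ]

def normalize(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

# Toy vectors: rows 0 and 1 nearly identical, row 2 unrelated
emb = np.stack([
    normalize([1.0, 0.01, 0.0]),
    normalize([1.0, 0.02, 0.0]),
    normalize([0.0, 1.0, 0.0]),
])

dupes = find_near_duplicates(emb)
print(dupes)  # one pair: rows 0 and 1
```

The quadratic loop is fine for small corpora; at scale you would use an approximate nearest neighbor index instead.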

Sentence Embeddings vs. Word Embeddings

For knowledge management and search, you almost always want sentence or document embeddings, not word embeddings. The distinction matters.

Word embeddings represent individual tokens. To get a representation of a sentence or paragraph, you need to somehow combine the word embeddings — averaging them, using a weighted sum, or applying some other aggregation. These approaches lose word order information ("dog bites man" and "man bites dog" produce similar aggregated embeddings) and handle negation poorly ("this is good" and "this is not good" are very close in averaged embedding space).
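The word-order problem is easy to demonstrate with toy word vectors (hand-picked here for clarity):

```python
import numpy as np

# Toy word vectors; real ones would come from Word2Vec or GloVe
words = {
    "dog":   np.array([1.0, 0.0, 0.0]),
    "bites": np.array([0.0, 1.0, 0.0]),
    "man":   np.array([0.0, 0.0, 1.0]),
}

def average_embedding(sentence):
    return np.mean([words[w] for w in sentence.split()], axis=0)

a = average_embedding("dog bites man")
b = average_embedding("man bites dog")
print(np.allclose(a, b))  # True: the average discards word order entirely
```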

Sentence embeddings are produced by models trained specifically to embed entire text spans. They capture word order, negation, and compositional meaning. The embedding for "this product is not what I expected" correctly differs from "this product is exactly what I expected" in ways that word embedding averages cannot capture.

Modern embedding models operate at the sentence or passage level. You feed them a text span (typically up to 512 tokens, though some models handle longer inputs) and receive a single dense vector. This is what you want for RAG, semantic search, and knowledge base retrieval.

The Embedding Model Landscape

The embedding model landscape evolves quickly, but as of early 2026, these are the models worth knowing about.

OpenAI Embedding Models

OpenAI's text-embedding-3-small (1536 dimensions) and text-embedding-3-large (3072 dimensions) are the default choice for many production systems. They are accessible via API, perform well across domains, and support Matryoshka representation learning — you can truncate the embeddings to fewer dimensions with graceful degradation rather than catastrophic loss.

from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="How do I configure the database connection?"
)

embedding = response.data[0].embedding  # 1536-dimensional vector

Pros: easy to use, good general performance, well-documented. Cons: requires API calls (latency, cost, data privacy concerns), not open-source.

Cohere Embed

Cohere's embedding models support multiple languages and offer a distinction between search_document and search_query input types, allowing the model to optimize embeddings differently depending on whether the text is a document being indexed or a query being searched.

import cohere

co = cohere.Client("your-api-key")

response = co.embed(
    texts=["How do I configure the database?"],
    model="embed-english-v3.0",
    input_type="search_query"
)

embedding = response.embeddings[0]

The separate input types are a meaningful improvement for retrieval quality, as documents and queries have different linguistic characteristics.

BGE (BAAI General Embedding)

The BGE family from the Beijing Academy of Artificial Intelligence represents the state of the art in open-source embeddings. bge-large-en-v1.5 (1024 dimensions) offers near-commercial quality without API dependencies.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# BGE models benefit from a query prefix
query_embedding = model.encode(
    "Represent this sentence for searching relevant passages: "
    "How do I configure the database?"
)

doc_embedding = model.encode(
    "To configure the database, edit the config.yaml file..."
)

Note the query prefix — BGE models are trained with specific prefixes for queries versus documents, similar to Cohere's approach but embedded in the input text rather than the API.

E5 (EmbEddings from bidirEctional Encoder rEpresentations)

Microsoft's E5 models are another strong open-source option. The e5-large-v2 model performs competitively with commercial offerings.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-large-v2")

# E5 uses "query: " and "passage: " prefixes
query_embedding = model.encode("query: How do I configure the database?")
doc_embedding = model.encode("passage: Edit config.yaml to set database parameters...")

Nomic Embed

Nomic's nomic-embed-text-v1.5 deserves attention for its long context support (up to 8192 tokens) and its strong performance at modest dimensionality (768 dimensions). It is fully open-source with open training data and code.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "nomic-ai/nomic-embed-text-v1.5",
    trust_remote_code=True
)

query_embedding = model.encode(
    "search_query: How do I configure the database?"
)

Choosing an Embedding Model

The choice depends on your constraints:

| Constraint | Recommended |
| --- | --- |
| Need simplicity, budget available | OpenAI text-embedding-3-small |
| Need maximum quality, budget available | OpenAI text-embedding-3-large or Cohere |
| Need to run locally / data privacy | BGE-large or E5-large |
| Need long context support | Nomic-embed-text |
| Need multilingual | Cohere embed-multilingual |
| Need to minimize storage / latency | Any model with Matryoshka support, truncated |

Always benchmark on your own data. General benchmarks (MTEB) are useful for shortlisting, but your specific domain and query distribution will determine which model actually performs best for your use case.

Dimensionality and Similarity Metrics

Dimensionality

Embedding dimensions range from 384 (MiniLM models) to 3072 (OpenAI text-embedding-3-large). Higher dimensions capture more nuance but require more storage, more compute for similarity calculations, and can suffer from the curse of dimensionality at extreme scales.

For most knowledge management applications, 768-1536 dimensions is the sweet spot. Going below 768 sacrifices meaningful quality. Going above 1536 provides diminishing returns unless your corpus is unusually large or your queries require fine-grained discrimination.

Models with Matryoshka representation learning (including OpenAI's v3 models and Nomic) can be truncated to lower dimensions with controlled quality loss. This is useful for trading quality against storage and speed:

import numpy as np

# get_embedding is a placeholder for whichever embedding client you use;
# np.asarray handles clients that return plain Python lists
full_embedding = np.asarray(get_embedding(text))  # e.g. 1536 dimensions

# Keep only the first 512 dimensions
truncated = full_embedding[:512]

# Re-normalize after truncation so similarity scores remain valid
truncated = truncated / np.linalg.norm(truncated)

Cosine Similarity

The most commonly used similarity metric for embeddings. Cosine similarity measures the angle between two vectors, ignoring their magnitudes:

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

Cosine similarity ranges from -1 (opposite) to 1 (identical). For normalized vectors, cosine similarity equals the dot product.

Cosine similarity is the default choice for text embeddings because it is insensitive to vector magnitude. Two documents about the same topic will have high cosine similarity regardless of their length or the number of times each concept is mentioned.

Dot Product

For normalized vectors, the dot product is equivalent to cosine similarity. Many vector databases internally normalize vectors and use dot product for efficiency, since it avoids the division by norms.

If your vectors are not normalized, dot product incorporates magnitude, which can be useful when magnitude encodes information (such as confidence or document importance).
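A quick check with random vectors confirms the equivalence for normalized vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=384)
b = rng.normal(size=384)

# Normalize both vectors to unit length
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot = np.dot(a, b)
print(np.isclose(cos_sim, dot))  # True
```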

Euclidean Distance

The straight-line distance between two points in embedding space. Less commonly used for text embeddings because it is sensitive to magnitude — two vectors pointing in the same direction but with different magnitudes will have a large Euclidean distance despite representing similar meanings.

Euclidean distance is useful when magnitude is meaningful or when you need triangle inequality properties (the distance from A to C is at most the distance from A to B plus the distance from B to C). Some clustering algorithms require Euclidean distance.

import numpy as np

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)
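For unit vectors, the two metrics carry the same information: the identity ||a - b||^2 = 2 - 2·cos(a, b) means ranking by Euclidean distance and ranking by cosine similarity produce the same order once vectors are normalized. A quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=128)
b = rng.normal(size=128)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cos_sim = np.dot(a, b)          # cosine similarity of unit vectors
dist = np.linalg.norm(a - b)    # Euclidean distance

# ||a - b||^2 = ||a||^2 + ||b||^2 - 2(a . b) = 2 - 2 cos for unit vectors
print(np.isclose(dist ** 2, 2 - 2 * cos_sim))  # True
```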

Practical Recommendation

Use cosine similarity. If your vector database requires a different metric, normalize your vectors and use dot product (which becomes equivalent). Use Euclidean distance only if you have a specific reason.

Building a Semantic Search System from Scratch

Let us build a complete semantic search system, step by step, without leaning on a framework like LangChain. Understanding each component matters.

import numpy as np
from sentence_transformers import SentenceTransformer
from typing import List, Tuple
import json

class SemanticSearchEngine:
    def __init__(self, model_name: str = "BAAI/bge-large-en-v1.5"):
        self.model = SentenceTransformer(model_name)
        self.documents: List[str] = []
        self.embeddings: np.ndarray = np.array([])
        self.metadata: List[dict] = []

    def index_documents(
        self,
        documents: List[str],
        metadata: List[dict] = None,
        batch_size: int = 32
    ):
        """Embed and store documents."""
        self.documents = documents
        self.metadata = metadata or [{} for _ in documents]

        # Embed in batches to manage memory
        all_embeddings = []
        for i in range(0, len(documents), batch_size):
            batch = documents[i:i + batch_size]
            batch_embeddings = self.model.encode(
                batch,
                normalize_embeddings=True,
                show_progress_bar=True
            )
            all_embeddings.append(batch_embeddings)

        self.embeddings = np.vstack(all_embeddings)

    def search(
        self,
        query: str,
        top_k: int = 5,
        threshold: float = 0.0
    ) -> List[Tuple[str, float, dict]]:
        """Search for documents similar to the query."""
        # Encode query with retrieval prefix
        query_embedding = self.model.encode(
            "Represent this sentence for searching relevant passages: "
            + query,
            normalize_embeddings=True
        )

        # Compute cosine similarities (dot product of normalized vectors)
        similarities = self.embeddings @ query_embedding

        # Get top-k indices
        top_indices = np.argsort(similarities)[::-1][:top_k]

        # Filter by threshold and return results
        results = []
        for idx in top_indices:
            score = float(similarities[idx])
            if score >= threshold:
                results.append((
                    self.documents[idx],
                    score,
                    self.metadata[idx]
                ))

        return results

    def save(self, path: str):
        """Persist the index to disk."""
        np.save(f"{path}/embeddings.npy", self.embeddings)
        with open(f"{path}/documents.json", "w") as f:
            json.dump({
                "documents": self.documents,
                "metadata": self.metadata
            }, f)

    def load(self, path: str):
        """Load a persisted index."""
        self.embeddings = np.load(f"{path}/embeddings.npy")
        with open(f"{path}/documents.json", "r") as f:
            data = json.load(f)
            self.documents = data["documents"]
            self.metadata = data["metadata"]

Usage:

# Initialize
engine = SemanticSearchEngine()

# Index some documents
documents = [
    "PostgreSQL is a powerful open-source relational database system.",
    "To configure the database connection, edit the DATABASE_URL "
    "environment variable in your .env file.",
    "Redis is an in-memory data structure store used as a cache.",
    "Machine learning models require training data to learn patterns.",
    "The application uses connection pooling to manage database connections "
    "efficiently. Set POOL_SIZE in config.yaml to control the pool size.",
]

metadata = [
    {"source": "docs/overview.md", "section": "databases"},
    {"source": "docs/setup.md", "section": "configuration"},
    {"source": "docs/overview.md", "section": "caching"},
    {"source": "docs/ml.md", "section": "training"},
    {"source": "docs/performance.md", "section": "connections"},
]

engine.index_documents(documents, metadata)

# Search
results = engine.search("How do I set up the database?", top_k=3)
for text, score, meta in results:
    print(f"[{score:.3f}] ({meta['source']}) {text[:80]}...")

This is roughly 80 lines of code. It is missing many things you would want in production — persistence beyond numpy files, approximate nearest neighbor search for large corpora, metadata filtering, concurrent access — but it demonstrates the core mechanics clearly.

Evaluation Metrics

You have built a semantic search system. How do you know if it is any good? You need evaluation metrics, and you need a labeled evaluation set.

Building an Evaluation Set

An evaluation set consists of queries paired with their relevant documents. For each query, you identify which documents in your corpus are relevant (and ideally, how relevant they are on a graded scale).

There is no shortcut here. Someone with domain expertise needs to create these query-document relevance pairs. Fifty to a hundred well-chosen queries with labeled relevant documents is a reasonable starting point. AI can help generate candidate queries, but a human must validate the relevance judgments.
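A minimal evaluation set can be as simple as a list of query records; the queries and document ids below are hypothetical:

```python
# Hypothetical evaluation set: each query maps to the ids of documents
# that a domain expert judged relevant.
eval_set = [
    {
        "query": "How do I set up the database?",
        "relevant": ["docs/setup.md#configuration",
                     "docs/performance.md#connections"],
    },
    {
        "query": "What caching options are available?",
        "relevant": ["docs/overview.md#caching"],
    },
]
```

To evaluate, run each query through your search system and compare the returned document ids against the `relevant` list using the metrics that follow.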

Recall@k

Of all the relevant documents in the corpus, what fraction appears in the top k results?

def recall_at_k(relevant_docs, retrieved_docs, k):
    retrieved_set = set(retrieved_docs[:k])
    relevant_set = set(relevant_docs)
    return len(retrieved_set & relevant_set) / len(relevant_set)

Recall@k tells you whether the system finds the relevant documents. A recall@5 of 0.8 means that 80% of relevant documents appear in the top 5 results. For RAG applications, recall is arguably the most important metric — if the relevant document is not retrieved, the model cannot use it.
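Using the function on a toy ranking (the function is repeated so the snippet runs standalone):

```python
def recall_at_k(relevant_docs, retrieved_docs, k):
    retrieved_set = set(retrieved_docs[:k])
    relevant_set = set(relevant_docs)
    return len(retrieved_set & relevant_set) / len(relevant_set)

relevant = ["doc_a", "doc_b"]
retrieved = ["doc_c", "doc_a", "doc_d", "doc_b", "doc_e"]
print(recall_at_k(relevant, retrieved, k=3))  # 0.5: only doc_a in the top 3
print(recall_at_k(relevant, retrieved, k=5))  # 1.0: both found by k=5
```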

Mean Reciprocal Rank (MRR)

The reciprocal of the rank at which the first relevant document appears, averaged across queries:

def reciprocal_rank(relevant_docs, retrieved_docs):
    relevant_set = set(relevant_docs)
    for i, doc in enumerate(retrieved_docs):
        if doc in relevant_set:
            return 1.0 / (i + 1)
    return 0.0

def mrr(queries_results):
    return np.mean([
        reciprocal_rank(relevant, retrieved)
        for relevant, retrieved in queries_results
    ])

MRR tells you how quickly the system surfaces a relevant result. An MRR of 0.5 means that, on average, the first relevant document appears at position 2. For user-facing search, MRR is critical — users rarely scroll past the first few results.

Normalized Discounted Cumulative Gain (NDCG)

NDCG accounts for both the relevance grade of each result and its position in the ranking. Results at the top of the list contribute more to the score than results further down, and highly relevant results contribute more than marginally relevant ones:

def dcg_at_k(relevance_scores, k):
    relevance_scores = relevance_scores[:k]
    return sum(
        rel / np.log2(i + 2)  # i+2 because log2(1) = 0
        for i, rel in enumerate(relevance_scores)
    )

def ndcg_at_k(relevance_scores, k):
    actual_dcg = dcg_at_k(relevance_scores, k)
    ideal_dcg = dcg_at_k(
        sorted(relevance_scores, reverse=True), k
    )
    return actual_dcg / ideal_dcg if ideal_dcg > 0 else 0.0

NDCG ranges from 0 to 1, with 1 being a perfect ranking. It is the most informative single metric for search quality, but it requires graded relevance judgments (not just binary relevant/not-relevant), which are more expensive to produce.
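A toy ranking makes the behavior concrete (the functions are repeated so the snippet runs standalone):

```python
import numpy as np

def dcg_at_k(relevance_scores, k):
    relevance_scores = relevance_scores[:k]
    return sum(
        rel / np.log2(i + 2)  # i+2 because log2(1) = 0
        for i, rel in enumerate(relevance_scores)
    )

def ndcg_at_k(relevance_scores, k):
    actual_dcg = dcg_at_k(relevance_scores, k)
    ideal_dcg = dcg_at_k(sorted(relevance_scores, reverse=True), k)
    return actual_dcg / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded relevance (3 = highly relevant) of the top-5 results, in rank order
print(ndcg_at_k([3, 2, 1, 0, 0], k=5))  # 1.0: already the ideal ordering
print(ndcg_at_k([0, 0, 1, 2, 3], k=5))  # < 1.0: best result ranked last
```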

Practical Evaluation

For a knowledge management semantic search system, track at minimum:

  • Recall@5: Are the relevant documents being found?
  • MRR: Is the most relevant document near the top?
  • Latency: How fast is the search? (Users expect sub-second responses.)

Evaluate whenever you change the embedding model, chunking strategy, or retrieval parameters. Small changes to any of these can have outsized effects on search quality.

Common Pitfalls

Mixing embedding models. If you indexed with model A, you must search with model A. Embeddings from different models live in different vector spaces and are not compatible. This sounds obvious, but it is a surprisingly common source of bugs when upgrading models.

Ignoring normalization. Some models return normalized vectors; others do not. If you use cosine similarity, it does not matter (normalization is built into the formula). If you use dot product for efficiency, you must normalize explicitly or your similarity scores will be meaningless.

Embedding long documents whole. Most embedding models have a maximum input length (typically 512 tokens). Text beyond this limit is silently truncated. If you embed a 5,000-word document without chunking, you are embedding only the first 512 tokens and ignoring the rest.
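A minimal chunking sketch (word-based for brevity; production code should count tokens with the embedding model's own tokenizer and prefer paragraph or section boundaries):

```python
def chunk_text(text, max_words=200, overlap=40):
    """Split text into overlapping word-window chunks.

    A crude sketch: word counts only approximate token counts, and
    hard word boundaries ignore document structure.
    """
    words = text.split()
    step = max(1, max_words - overlap)
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks

doc = "word " * 500  # a 500-word stand-in document
print(len(chunk_text(doc.strip())))  # 3 overlapping chunks
```

The overlap ensures that content near a chunk boundary appears whole in at least one chunk, at the cost of some index redundancy.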

Over-indexing. Embedding every sentence individually creates a noisy, high-cardinality index where retrieval returns many marginally relevant fragments. Embedding at the paragraph or section level usually produces better results.

Ignoring domain mismatch. General-purpose embedding models perform well on general-purpose queries. If your knowledge base is highly specialized (medical literature, legal documents, code), a domain-specific or fine-tuned model may dramatically outperform general models. At minimum, benchmark on your actual data before committing to a model.

Embeddings transform the problem of understanding meaning into the problem of computing distances between points. It is a profound reduction — from the vast complexity of human language to the clean geometry of vector spaces. The reduction is lossy, imperfect, and occasionally misleading. But it works well enough to be useful, and in engineering, "well enough to be useful" is the only standard that matters.