Local AI for Knowledge Retrieval

Full-text search, which we built in the previous chapter, is powerful. It finds exactly what you asked for. But knowledge retrieval has a harder problem: finding what you meant but did not know how to ask for.

You write a note about "the difficulty of transferring tacit expertise between team members." A month later, you search for "organizational learning barriers." Full-text search returns nothing — those words do not appear in your note. But the concepts are deeply related. A system that understood meaning, not just keywords, would surface that connection instantly.

This is what embedding models and semantic search provide. And thanks to remarkable progress in model compression and open-source tooling, you can run the entire pipeline — embedding model, vector database, and large language model — on your own hardware, with no data leaving your machine.

This chapter shows you how.

The Architecture: Local RAG

RAG — Retrieval-Augmented Generation — is the pattern of retrieving relevant documents and feeding them to a language model as context for answering questions. The cloud-based version sends your data to OpenAI or Anthropic. The local version keeps everything on your hardware:

┌─────────────┐     ┌──────────────────┐     ┌──────────────┐
│  Your Notes │────▶│ Embedding Model  │────▶│ Vector Store │
│  (Markdown) │     │  (local, ~100MB) │     │ (ChromaDB /  │
└─────────────┘     └──────────────────┘     │  SQLite-vec) │
                                             └──────┬───────┘
                                                    │
                    ┌──────────────────┐            │
                    │    Local LLM     │◀── query ──┘
                    │  (Ollama, ~4GB)  │
                    └──────────────────┘

The components:

  1. Embedding model — Converts text into dense vectors (arrays of floating-point numbers) that encode meaning. Similar texts produce similar vectors.
  2. Vector store — Stores the vectors and supports fast similarity search. When you query, it finds the vectors (and thus the notes) most similar to your query.
  3. Local LLM — Reads the retrieved notes and generates a coherent answer to your question, grounded in your own knowledge base.

Each of these runs entirely on your machine. Let us set them up.

Setting Up Ollama

Ollama is the easiest way to run large language models locally. It handles model downloading, quantization, GPU acceleration, and API serving with minimal configuration.

Installation

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.ai/install.sh | sh

# Or download directly from https://ollama.com/download

Start the Ollama server:

ollama serve

This starts the server in the foreground (leave it running in a terminal, or install it as a system service) and exposes an HTTP API at http://localhost:11434.

Pulling Models

You need two types of models: an embedding model for converting text to vectors, and a language model for generating answers.

# Embedding model — small, fast, excellent quality
ollama pull nomic-embed-text

# Language model — good balance of quality and speed
ollama pull llama3.2:3b

# If you have more RAM/VRAM (16GB+), use the larger model
ollama pull llama3.1:8b

# For machines with 32GB+ RAM
ollama pull llama3.1:70b-q4_0

A note on model sizes and hardware requirements:

Model               Size on Disk   RAM Required   Quality                    Speed
llama3.2:3b         ~2 GB          4 GB           Good for simple queries    Fast, even on CPU
llama3.1:8b         ~4.7 GB        8 GB           Very good                  Fast on GPU, usable on CPU
llama3.1:70b-q4_0   ~40 GB         48 GB          Excellent                  Needs serious hardware
nomic-embed-text    ~274 MB        1 GB           Excellent for embeddings   Very fast

For a personal knowledge base, the 8B parameter model is the sweet spot. It is smart enough to synthesize information from multiple notes and generate coherent answers, while running comfortably on a modern laptop with 16GB of RAM.

Testing the Setup

# Test the language model
ollama run llama3.2:3b "What is knowledge management? Answer in two sentences."

# Test the embedding model via API
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "knowledge management systems"
}'

The embedding endpoint returns a JSON object with an embedding field containing a vector of 768 floating-point numbers. These numbers encode the semantic meaning of the input text. Two texts with similar meaning produce vectors that are close together in this 768-dimensional space.
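
To make "close together" concrete, here is a minimal cosine-similarity sketch in pure Python. The three-dimensional toy vectors are purely illustrative; real embeddings from the endpoint above have 768 dimensions, but the arithmetic is identical:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (real ones have 768 dimensions)
note_a = [0.9, 0.1, 0.2]   # e.g. "knowledge management systems"
note_b = [0.8, 0.2, 0.3]   # e.g. "organizational learning"
note_c = [0.1, 0.9, 0.1]   # e.g. "sourdough baking"

print(cosine_similarity(note_a, note_b))  # high: related topics
print(cosine_similarity(note_a, note_c))  # low: unrelated
```

Semantic search is exactly this comparison, performed between your query's vector and every stored chunk's vector (with an index to avoid checking them all one by one).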

Local Embedding Models: Your Options

Ollama's nomic-embed-text is an excellent default, but you have choices:

nomic-embed-text — 768 dimensions, 137M parameters. Excellent quality for its size, trained with a contrastive objective that makes it particularly good at retrieval tasks. Supports 8,192 token context. This is the one to start with.

all-MiniLM-L6-v2 — 384 dimensions, 22M parameters. The classic lightweight embedding model. Smaller vectors mean faster search and less storage, at the cost of some accuracy. Available through sentence-transformers in Python.

BGE (BAAI General Embedding) — Available in small (33M), base (109M), and large (335M) variants. The large variant is competitive with commercial embedding APIs. Available through Ollama or sentence-transformers.

mxbai-embed-large — 1,024 dimensions, 335M parameters. High quality, available through Ollama. Good choice if you want maximum retrieval accuracy and have the hardware to support slightly larger vectors.

For most personal knowledge bases (under 100,000 notes), the difference in retrieval quality between these models is marginal. Pick nomic-embed-text and move on. Optimization is for later, if ever.
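
Dimensionality does affect storage, though the numbers stay small at personal scale. A back-of-the-envelope calculation, assuming float32 vectors and ignoring index overhead (`embedding_storage_mb` is a hypothetical helper for illustration):

```python
def embedding_storage_mb(n_chunks: int, dimensions: int,
                         bytes_per_float: int = 4) -> float:
    """Raw vector storage in megabytes, excluding index overhead."""
    return n_chunks * dimensions * bytes_per_float / 1_000_000

# ~15,000 chunks (roughly a 5,000-note vault) at various dimensionalities
print(embedding_storage_mb(15_000, 384))    # all-MiniLM-L6-v2:  ~23 MB
print(embedding_storage_mb(15_000, 768))    # nomic-embed-text:  ~46 MB
print(embedding_storage_mb(15_000, 1024))   # mxbai-embed-large: ~61 MB
```

Even the largest option costs tens of megabytes, which is another reason not to over-optimize the model choice.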

Building the Embedding Pipeline

Here is a complete Python script that embeds your vault and stores the vectors in ChromaDB:

#!/usr/bin/env python3
"""embed_vault.py — Embed your markdown vault into a local vector database."""

import os
import re
import hashlib
from pathlib import Path
import chromadb
import requests
from chromadb.config import Settings

VAULT_PATH = os.environ.get("VAULT_PATH", os.path.expanduser("~/vault"))
CHROMA_PATH = os.environ.get("CHROMA_PATH", os.path.expanduser("~/vault/.chroma"))
OLLAMA_URL = "http://localhost:11434"
EMBED_MODEL = "nomic-embed-text"

def get_embedding(text: str) -> list[float]:
    """Get embedding vector from Ollama."""
    response = requests.post(
        f"{OLLAMA_URL}/api/embeddings",
        json={"model": EMBED_MODEL, "prompt": text}
    )
    response.raise_for_status()
    return response.json()["embedding"]

def chunk_document(content: str, filepath: str,
                   max_chunk_size: int = 1000,
                   overlap: int = 200) -> list[dict]:
    """Split a document into overlapping chunks for embedding.

    Uses heading-aware splitting: tries to break at heading boundaries
    first, then falls back to paragraph boundaries, then sentence
    boundaries.
    """
    # Remove YAML frontmatter
    if content.startswith('---'):
        parts = content.split('---', 2)
        if len(parts) >= 3:
            content = parts[2]

    chunks = []
    # Split by headings first
    sections = re.split(r'(^#{1,6}\s+.+$)', content, flags=re.MULTILINE)

    current_chunk = ""
    current_heading = ""

    for section in sections:
        if re.match(r'^#{1,6}\s+', section):
            # Flush the chunk accumulated under the previous heading so
            # a chunk never spans a heading boundary (and never carries
            # the wrong heading metadata).
            if current_chunk.strip():
                chunks.append({
                    "text": current_chunk.strip(),
                    "heading": current_heading,
                    "source": filepath
                })
                current_chunk = ""
            current_heading = section.strip()
            continue

        paragraphs = section.split('\n\n')
        for para in paragraphs:
            para = para.strip()
            if not para:
                continue

            if len(current_chunk) + len(para) > max_chunk_size:
                if current_chunk:
                    chunks.append({
                        "text": current_chunk.strip(),
                        "heading": current_heading,
                        "source": filepath
                    })
                # Start the new chunk with the tail of the previous one.
                # (overlap is in characters; ~5 characters per word is
                # used to convert it to a word count)
                words = current_chunk.split()
                overlap_words = words[-overlap // 5:] if len(words) > overlap // 5 else []
                current_chunk = " ".join(overlap_words) + "\n\n" + para
            else:
                current_chunk += "\n\n" + para

    if current_chunk.strip():
        chunks.append({
            "text": current_chunk.strip(),
            "heading": current_heading,
            "source": filepath
        })

    # If the entire document is short enough, return it as a single chunk
    if not chunks and content.strip():
        chunks.append({
            "text": content.strip(),
            "heading": "",
            "source": filepath
        })

    return chunks

def content_hash(text: str) -> str:
    """Create a hash of content for deduplication."""
    return hashlib.sha256(text.encode()).hexdigest()[:16]

def embed_vault():
    """Process the vault and create embeddings."""
    client = chromadb.PersistentClient(
        path=CHROMA_PATH,
        settings=Settings(anonymized_telemetry=False)
    )

    collection = client.get_or_create_collection(
        name="vault_notes",
        metadata={"hnsw:space": "cosine"}
    )

    vault = Path(VAULT_PATH)
    all_chunks = []
    file_count = 0

    print("Scanning vault for markdown files...")

    for md_file in vault.rglob("*.md"):
        if any(part.startswith('.') for part in md_file.parts):
            continue

        rel_path = str(md_file.relative_to(vault))
        content = md_file.read_text(encoding="utf-8", errors="replace")

        if len(content.strip()) < 50:  # Skip near-empty files
            continue

        chunks = chunk_document(content, rel_path)
        all_chunks.extend(chunks)
        file_count += 1

    print(f"Found {file_count} files, {len(all_chunks)} chunks to embed.")

    # Check which chunks already exist (by content hash)
    existing_ids = set(collection.get()["ids"]) if collection.count() > 0 else set()
    new_chunks = []

    for chunk in all_chunks:
        chunk_id = f"{chunk['source']}::{content_hash(chunk['text'])}"
        if chunk_id not in existing_ids:
            new_chunks.append((chunk_id, chunk))

    print(f"Skipping {len(all_chunks) - len(new_chunks)} already-embedded chunks.")
    print(f"Embedding {len(new_chunks)} new chunks...")

    # Batch embedding
    batch_size = 50
    for i in range(0, len(new_chunks), batch_size):
        batch = new_chunks[i:i + batch_size]

        ids = [c[0] for c in batch]
        documents = [c[1]["text"] for c in batch]
        metadatas = [
            {"source": c[1]["source"], "heading": c[1]["heading"]}
            for c in batch
        ]

        # Get embeddings for the batch (Ollama's /api/embeddings endpoint
        # takes one prompt per request, so we loop over the documents)
        embeddings = [get_embedding(doc) for doc in documents]

        collection.add(
            ids=ids,
            documents=documents,
            metadatas=metadatas,
            embeddings=embeddings
        )

        done = min(i + batch_size, len(new_chunks))
        print(f"  Embedded {done}/{len(new_chunks)} chunks")

    total = collection.count()
    print(f"\nDone. Total chunks in database: {total}")

if __name__ == "__main__":
    embed_vault()

Install the dependencies:

pip install chromadb requests

Chunking Strategy

The chunking function above deserves some discussion. Embedding models have a limited context window (typically 512 to 8,192 tokens), and even within that window, shorter texts tend to produce better embeddings — the meaning is more concentrated, less diluted by tangential content.

The strategy is:

  1. Split at heading boundaries first. A section under an H2 heading is a natural semantic unit.
  2. Respect paragraph boundaries within sections. Do not split mid-paragraph if possible.
  3. Overlap chunks slightly (200 characters by default). This prevents information at chunk boundaries from being lost — if a key sentence straddles two chunks, the overlap ensures it appears fully in at least one.
  4. Keep chunks around 1,000 characters (~150-200 words). This is long enough to capture a complete thought but short enough to produce focused embeddings.
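
The overlap idea in step 3 can be sketched with a much simpler fixed-size chunker. This is illustrative only; the heading-aware version above is what the pipeline actually uses:

```python
def fixed_chunks(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows, each repeating the last
    `overlap` characters of the previous window."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

text = "".join(str(i % 10) for i in range(2500))
chunks = fixed_chunks(text)
# Consecutive chunks share a 200-character overlap region, so a sentence
# that straddles a boundary appears whole in at least one chunk.
assert chunks[0][-200:] == chunks[1][:200]
```

Heading-aware chunking adds semantic boundaries on top of this, but the overlap mechanism is the same.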

If you are using a Zettelkasten-style vault with atomic notes, your notes may already be the right size for embedding. In that case, you can skip chunking entirely and embed each note as a single unit. This is actually the ideal scenario — we will explore it further in the next chapter.

Querying Your Knowledge Base

With embeddings stored, you can now search by meaning:

#!/usr/bin/env python3
"""query_vault.py — Semantic search over your vault with local AI."""

import sys
import os
import requests
import chromadb
from chromadb.config import Settings
import textwrap

VAULT_PATH = os.environ.get("VAULT_PATH", os.path.expanduser("~/vault"))
CHROMA_PATH = os.environ.get("CHROMA_PATH", os.path.expanduser("~/vault/.chroma"))
OLLAMA_URL = "http://localhost:11434"
EMBED_MODEL = "nomic-embed-text"
CHAT_MODEL = "llama3.1:8b"

def get_embedding(text: str) -> list[float]:
    response = requests.post(
        f"{OLLAMA_URL}/api/embeddings",
        json={"model": EMBED_MODEL, "prompt": text}
    )
    response.raise_for_status()
    return response.json()["embedding"]

def search_similar(query: str, n_results: int = 5) -> list[dict]:
    """Find the most semantically similar chunks to the query."""
    client = chromadb.PersistentClient(
        path=CHROMA_PATH,
        settings=Settings(anonymized_telemetry=False)
    )
    collection = client.get_collection("vault_notes")

    query_embedding = get_embedding(query)

    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
        include=["documents", "metadatas", "distances"]
    )

    formatted = []
    for i in range(len(results["ids"][0])):
        formatted.append({
            "id": results["ids"][0][i],
            "text": results["documents"][0][i],
            "source": results["metadatas"][0][i]["source"],
            "heading": results["metadatas"][0][i].get("heading", ""),
            "distance": results["distances"][0][i]
        })

    return formatted

def ask_with_context(question: str, context_chunks: list[dict]) -> str:
    """Send the question and retrieved context to the local LLM."""
    context = "\n\n---\n\n".join([
        f"[Source: {c['source']}]\n{c['text']}"
        for c in context_chunks
    ])

    prompt = f"""You are a helpful assistant answering questions based on the
user's personal notes. Use ONLY the provided context to answer. If the
context does not contain enough information to answer fully, say so.
Be specific and reference which notes contain the relevant information.

Context from notes:
{context}

Question: {question}

Answer:"""

    response = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={
            "model": CHAT_MODEL,
            "prompt": prompt,
            "stream": False,
            "options": {
                "temperature": 0.3,
                "num_ctx": 4096
            }
        }
    )
    response.raise_for_status()
    return response.json()["response"]

def main():
    if len(sys.argv) < 2:
        print("Usage: query_vault.py <question>")
        print('Example: query_vault.py "What have I written about embedding models?"')
        sys.exit(1)

    question = " ".join(sys.argv[1:])

    print(f"\nSearching for: {question}\n")

    # Retrieve relevant chunks
    results = search_similar(question, n_results=5)

    print("Relevant notes found:")
    print("-" * 60)
    for r in results:
        similarity = 1 - r["distance"]  # Convert distance to similarity
        print(f"  [{similarity:.2f}] {r['source']}")
        if r["heading"]:
            print(f"         Section: {r['heading']}")

    print(f"\n{'='*60}")
    print("Generating answer...\n")

    answer = ask_with_context(question, results)
    print(textwrap.fill(answer, width=72))

if __name__ == "__main__":
    main()

What This Gets You

Run it:

python3 query_vault.py "What are the main differences between tacit and explicit knowledge?"

The system:

  1. Embeds your question using the same model that embedded your notes.
  2. Finds the 5 most semantically similar chunks in your vault.
  3. Passes those chunks as context to the local LLM.
  4. The LLM generates an answer grounded in your notes — not in its training data, but in what you have written and collected.

This is qualitatively different from keyword search. The query "main differences between tacit and explicit knowledge" will find notes about "knowledge that cannot be easily articulated" and "codified information in documents" even if they never use the words "tacit" or "explicit."

Alternative Vector Store: SQLite-vec

ChromaDB is convenient but adds a dependency. If you prefer to keep everything in SQLite (and there are good reasons to — simplicity, single-file storage, no separate process), you can use sqlite-vec, a SQLite extension for vector search:

#!/usr/bin/env python3
"""sqlite_vec_store.py — Vector storage using sqlite-vec."""

import sqlite3
import struct
import sqlite_vec

def create_vec_db(db_path: str, dimensions: int = 768) -> sqlite3.Connection:
    """Create a SQLite database with vector search capability."""
    conn = sqlite3.connect(db_path)
    conn.enable_load_extension(True)
    sqlite_vec.load(conn)
    conn.enable_load_extension(False)

    conn.executescript(f"""
        CREATE TABLE IF NOT EXISTS chunks (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            source TEXT NOT NULL,
            heading TEXT,
            content TEXT NOT NULL,
            content_hash TEXT UNIQUE
        );

        CREATE VIRTUAL TABLE IF NOT EXISTS chunks_vec USING vec0(
            embedding float[{dimensions}]
        );
    """)
    return conn

def serialize_vector(vec: list[float]) -> bytes:
    """Convert a list of floats to bytes for sqlite-vec."""
    return struct.pack(f'{len(vec)}f', *vec)

def insert_chunk(conn: sqlite3.Connection, source: str, heading: str,
                 content: str, content_hash: str,
                 embedding: list[float]) -> int:
    """Insert a chunk with its embedding."""
    cursor = conn.execute(
        """INSERT OR IGNORE INTO chunks (source, heading, content, content_hash)
           VALUES (?, ?, ?, ?)""",
        (source, heading, content, content_hash)
    )
    if cursor.rowcount == 0:
        return -1  # Already exists

    chunk_id = cursor.lastrowid
    conn.execute(
        "INSERT INTO chunks_vec (rowid, embedding) VALUES (?, ?)",
        (chunk_id, serialize_vector(embedding))
    )
    conn.commit()
    return chunk_id

def search_similar(conn: sqlite3.Connection, query_vec: list[float],
                   limit: int = 5) -> list[dict]:
    """Find the most similar chunks to the query vector."""
    results = conn.execute("""
        SELECT chunks.source, chunks.heading, chunks.content,
               chunks_vec.distance
        FROM chunks_vec
        JOIN chunks ON chunks.id = chunks_vec.rowid
        WHERE embedding MATCH ?
          AND k = ?
        ORDER BY distance
    """, (serialize_vector(query_vec), limit)).fetchall()

    return [
        {"source": r[0], "heading": r[1], "text": r[2], "distance": r[3]}
        for r in results
    ]

Install with:

pip install sqlite-vec

The sqlite-vec approach stores everything — your FTS5 full-text index and your vector embeddings — in a single SQLite database file. One file to back up, one file to copy, one file to version. There is an austere beauty to that.
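
With both indexes in one file, you can also combine keyword and semantic results for a single query. One common technique (not shown in the scripts above) is reciprocal rank fusion, which merges two ranked lists without needing their scores to be comparable — a minimal sketch with made-up result lists:

```python
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists: each item scores the sum of
    1 / (k + rank) over every list it appears in."""
    scores: dict[str, float] = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fts_results = ["note-a.md", "note-b.md", "note-c.md"]   # from the FTS5 index
vec_results = ["note-b.md", "note-d.md", "note-a.md"]   # from the vector index
print(reciprocal_rank_fusion([fts_results, vec_results]))
# → ['note-b.md', 'note-a.md', 'note-d.md', 'note-c.md']
```

Notes ranked highly by both methods (note-b here) rise to the top, which tends to beat either method alone.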

Performance Considerations

GPU vs. CPU

Ollama automatically uses your GPU if one is available (NVIDIA CUDA, Apple Metal, AMD ROCm). The difference is significant:

  • Embedding with nomic-embed-text: ~100 chunks/second on CPU, ~500+ chunks/second on GPU. For a 5,000-note vault producing ~15,000 chunks, this is the difference between 2.5 minutes and 30 seconds.
  • LLM inference with llama3.1:8b: ~10 tokens/second on CPU (M1 MacBook), ~40 tokens/second on Apple Metal, ~80+ tokens/second on a decent NVIDIA GPU. CPU is usable but noticeably slow for long responses.

For embedding (which you do once and then incrementally), CPU is fine — patience suffices. For interactive LLM queries, GPU acceleration makes the experience dramatically better.

Quantization

Ollama models are already quantized (typically Q4_0 or Q4_K_M), reducing model size by 4x with minimal quality loss. You generally do not need to worry about quantization yourself — Ollama handles it.

If you want maximum quality and have the hardware, look for Q8_0 or F16 quantizations. If you are on constrained hardware, Q3_K_S or Q2_K trade more quality for smaller size.
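
The size reduction is straightforward arithmetic. A rough estimate, ignoring per-layer overhead and the fact that some tensors are kept at higher precision (`model_size_gb` is a hypothetical helper for illustration):

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate on-disk size of a model's weights in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

params_8b = 8.0e9
print(model_size_gb(params_8b, 16))    # F16: 16.0 GB
print(model_size_gb(params_8b, 4.5))   # ~Q4_K_M (about 4.5 bits/weight): 4.5 GB
```

This is why the 8B model in the table earlier occupies roughly 4.7 GB on disk rather than 16 GB.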

Context Window

The context window determines how much text you can feed to the LLM in a single query. For RAG, this matters because you need room for both the retrieved context and the model's response:

# In Ollama, set the context window in the options
response = requests.post(
    f"{OLLAMA_URL}/api/generate",
    json={
        "model": "llama3.1:8b",
        "prompt": prompt,
        "options": {
            "num_ctx": 8192  # 8K context window
        }
    }
)

With a 4,096-token context window, you can comfortably fit 3-5 chunks of ~200 words each, plus the system prompt and question. With 8,192 tokens, you can include more context. The trade-off is speed — larger context windows require more computation.
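
A quick way to sanity-check that a prompt fits is the rough heuristic of ~4 characters per English token (a hedged approximation; actual tokenization varies by model, and both helper functions here are illustrative):

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate: ~4 characters per English token."""
    return len(text) // 4

def fits_context(chunks: list[str], question: str,
                 num_ctx: int = 4096, reply_budget: int = 512) -> bool:
    """True if the retrieved chunks plus the question leave at least
    reply_budget tokens free for the model's answer."""
    prompt_tokens = sum(estimate_tokens(c) for c in chunks) + estimate_tokens(question)
    return prompt_tokens + reply_budget <= num_ctx

ten_chunks = ["x" * 1000] * 10   # ten ~1,000-character chunks (~250 tokens each)
print(fits_context(ten_chunks, "What is tacit knowledge?"))                 # True
print(fits_context(ten_chunks, "What is tacit knowledge?", num_ctx=2048))   # False
```

If a prompt does not fit, reduce n_results in the retrieval step or raise num_ctx and accept the slower generation.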

Memory Management

Running both an embedding model and a language model simultaneously consumes RAM. Ollama keeps loaded models in memory and unloads them when not in use (after a configurable timeout). If you are tight on RAM:

# Set Ollama to keep models loaded for only 60 seconds
OLLAMA_KEEP_ALIVE=60s ollama serve

This ensures that after you finish querying, the models are unloaded and the memory is reclaimed.

Privacy Advantages

Everything described in this chapter runs on your machine. Your notes never leave your filesystem. Your queries never leave your network. No API keys, no usage logs, no terms of service that grant a company the right to train on your data.

This is not a theoretical advantage. Consider what a personal knowledge base might contain:

  • Journal entries and personal reflections.
  • Client work and business strategies.
  • Medical notes and health tracking.
  • Financial information and investment research.
  • Half-formed ideas you would never share publicly.

Sending this to a cloud API — even one with a strong privacy policy — involves trust. Running locally involves physics: data that never leaves your machine cannot be intercepted, subpoenaed, or leaked by a third party.

For many professionals, the privacy advantage alone justifies the modest effort of setting up a local system. For some — lawyers, therapists, journalists working with sources — it is a professional obligation.

The Complete Local Stack

To summarize, here is the complete local AI knowledge retrieval stack:

Component          Tool                          Purpose
Note-taking        Obsidian                      Write and organize notes
Full-text search   SQLite FTS5                   Keyword search with ranking
Embedding model    nomic-embed-text via Ollama   Convert text to semantic vectors
Vector store       ChromaDB or sqlite-vec        Store and search vectors
Language model     llama3.1:8b via Ollama        Generate answers from context
CLI search         Python scripts                Command-line interface
Web UI             FastAPI                       Browser-based interface
Total disk space: approximately 6-8 GB (mostly the language model). Total cost: free. Total data sent to third parties: zero bytes.

In the next chapter, we will take this infrastructure and apply it to something genuinely exciting: combining Luhmann's Zettelkasten method with vector search to build a system that discovers connections in your notes that you did not know existed.