Haystack Tutorial (Python): caching embeddings for beginners

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows you how to cache document embeddings in Haystack so you do not recompute them every time your pipeline runs. That matters when you are ingesting the same documents repeatedly, because embedding calls are slow and expensive compared to loading vectors from disk.

What You'll Need

  • Python 3.10+
  • haystack-ai
  • sentence-transformers
  • numpy
  • A local machine with enough disk space to store cached vectors
  • Basic familiarity with Haystack Document objects and pipelines

Install the packages:

pip install haystack-ai sentence-transformers numpy

Step-by-Step

  1. Start by creating a small set of documents and an embedder. For beginners, the easiest cache is a local .npz file per document that stores the text and its embedding vector together.
from pathlib import Path
import hashlib
import json
import numpy as np

from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder

docs = [
    Document(content="Haystack is a framework for building LLM applications."),
    Document(content="Embedding caching avoids recomputing vectors for unchanged documents."),
]

embedder = SentenceTransformersDocumentEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2"
)
embedder.warm_up()
  2. Next, define a stable cache key for each document. In production, build the key from the content plus any metadata that affects the embedding output, because a change to either should invalidate the cache.
def doc_cache_key(doc: Document) -> str:
    # Hash the content and metadata together so that changing either one
    # produces a new key and therefore a cache miss.
    payload = {
        "content": doc.content,
        "meta": doc.meta or {},
    }
    raw = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()

cache_dir = Path("./embedding_cache")
cache_dir.mkdir(exist_ok=True)
  3. Now write the cache helpers: one to compute each document's cache path, one to save embeddings after generation, and one to load them if they already exist. This keeps the caching logic outside your Haystack pipeline, which is easier to reason about for beginners.
def cache_path_for(doc: Document) -> Path:
    # One .npz file per document, named after its cache key.
    return cache_dir / f"{doc_cache_key(doc)}.npz"

def save_embedding(doc: Document) -> None:
    # Persist the text, metadata, and vector together so the document
    # can be fully reconstructed on the next run.
    path = cache_path_for(doc)
    np.savez_compressed(
        path,
        content=doc.content,
        meta=json.dumps(doc.meta or {}),
        embedding=np.array(doc.embedding, dtype=np.float32),
    )

def load_embedding(doc: Document):
    # Return a Document with its embedding restored, or None on a cache miss.
    path = cache_path_for(doc)
    if not path.exists():
        return None

    data = np.load(path, allow_pickle=False)
    cached_doc = Document(
        content=str(data["content"]),
        meta=json.loads(str(data["meta"])),
    )
    cached_doc.embedding = data["embedding"].tolist()
    return cached_doc
  4. Use the cache before calling the embedder. If the embedding exists on disk, reuse it; otherwise compute it once and persist it for later runs.
embedded_docs = []

for doc in docs:
    cached = load_embedding(doc)
    if cached is not None:
        embedded_docs.append(cached)
        continue

    result = embedder.run([doc])
    embedded_doc = result["documents"][0]
    save_embedding(embedded_doc)
    embedded_docs.append(embedded_doc)

for doc in embedded_docs:
    print(doc.content)
    print(len(doc.embedding), "dimensions")
  5. Finally, verify that the second run does not recompute embeddings. You should see the same output, but the code should finish faster because it loads vectors from disk instead of calling the model again.
import time

start = time.time()

for doc in docs:
    cached = load_embedding(doc)
    if cached is None:
        result = embedder.run([doc])
        save_embedding(result["documents"][0])

elapsed = time.time() - start
print(f"Run completed in {elapsed:.3f} seconds")

Testing It

Run the script twice in a row. The first run should create .npz files under ./embedding_cache, and the second run should hit the cache for both documents.

To confirm it is working correctly, delete one cached file and run again. Only that document should be re-embedded while the others are loaded from disk.
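For example, you can drop a single document's cache entry with the cache_path_for helper from step 3 and then re-run the loop from step 4. This is just a small sketch of that check:

# Remove the cached vector for the first document only. On the next run,
# that document should be re-embedded while the second one loads from disk.
stale = cache_path_for(docs[0])
if stale.exists():
    stale.unlink()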

If you want a stronger check, log whether each document came from cache or from the embedder. In a real ingestion job, that metric is useful for spotting accidental reprocessing after content changes.
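Here is a minimal sketch of that logging, reusing the helpers defined above; the counter names are illustrative, not part of any Haystack API:

cache_hits = 0
cache_misses = 0
embedded_docs = []

for doc in docs:
    cached = load_embedding(doc)
    if cached is not None:
        cache_hits += 1
        embedded_docs.append(cached)
        continue

    cache_misses += 1
    embedded_doc = embedder.run([doc])["documents"][0]
    save_embedding(embedded_doc)
    embedded_docs.append(embedded_doc)

print(f"cache hits: {cache_hits}, cache misses: {cache_misses}")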

Next Steps

  • Move this cache into Redis or PostgreSQL if you need shared storage across workers.
  • Add invalidation rules based on model name and embedding dimension (see the sketch after this list).
  • Wire this pattern into a Haystack indexing pipeline so embeddings are reused during batch ingestion too.
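As a sketch of the invalidation idea, one option is to fold the model name into the cache key so that switching embedding models produces fresh cache entries. The model_name parameter below is an assumption you would pass in from your own configuration, not something Haystack supplies:

def doc_cache_key_v2(doc: Document, model_name: str) -> str:
    # Including the model name means cached vectors from an old model
    # are ignored instead of silently reused.
    payload = {
        "model": model_name,
        "content": doc.content,
        "meta": doc.meta or {},
    }
    raw = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()

You could extend the payload with the embedding dimension in the same way.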

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.

