Haystack Tutorial (Python): caching embeddings for intermediate developers
This tutorial shows how to cache document embeddings in Haystack so you stop recomputing vectors every time your pipeline runs. You need this when your corpus is stable, your embedding model is expensive, and repeated indexing is burning time and API calls.
What You'll Need
- Python 3.10+
- haystack-ai
- sentence-transformers
- numpy
- A local machine with enough RAM to hold your document set
- Optional: an OpenAI API key if you want to swap in a hosted embedder later
Install the packages:
pip install haystack-ai sentence-transformers numpy
Step-by-Step
1. Start with a small document store and a deterministic embedding model. The key idea is simple: compute embeddings once, persist them, and reuse them until the source text changes.
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.document_stores.in_memory import InMemoryDocumentStore

documents = [
    Document(content="Haystack is a framework for building LLM applications."),
    Document(content="Caching embeddings avoids repeated computation during indexing."),
    Document(content="Intermediate pipelines benefit from stable document IDs."),
]

document_store = InMemoryDocumentStore(embedding_similarity_function="cosine")

embedder = SentenceTransformersDocumentEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2"
)
embedder.warm_up()
2. Embed documents once and write them into the store. If you rerun this step with the same content, treat it as a cache-hit path rather than blindly recomputing everything; one way to detect that case is sketched after the code below.
embedded_documents = embedder.run(documents)["documents"]
document_store.write_documents(embedded_documents)
print(f"Stored {len(embedded_documents)} embedded documents")
for doc in embedded_documents:
    print(doc.id, len(doc.embedding))
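One simple way to implement that cache-hit path is to check the store itself before embedding: filter_documents() returns everything currently stored, and Haystack derives document IDs from content by default, so unchanged text maps to an ID that is already present. A minimal sketch under that assumption:

# Skip re-embedding documents whose IDs are already in the store.
# Assumes stable, content-derived document IDs (Haystack's default).
existing_ids = {doc.id for doc in document_store.filter_documents()}
new_docs = [d for d in documents if d.id not in existing_ids]
if new_docs:
    document_store.write_documents(embedder.run(new_docs)["documents"])
print(f"{len(documents) - len(new_docs)} already stored, {len(new_docs)} newly embedded")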
3. Add a lightweight cache layer keyed by a hash of the document content. This gives you control over when embeddings are reused, which matters when your source files are reprocessed often but only a few lines change.
import hashlib

embedding_cache = {}

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def get_or_embed(document: Document):
    key = content_hash(document.content)
    if key in embedding_cache:
        return embedding_cache[key], True
    embedded = embedder.run([document])["documents"][0]
    embedding_cache[key] = embedded.embedding
    return embedded.embedding, False

for doc in documents:
    embedding, cached = get_or_embed(doc)
    print(doc.content[:30], "cached=" + str(cached), "dims=" + str(len(embedding)))
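To see the cache doing its job within a single process, run the same loop a second time: every document should now come back as a hit, and the embedder should not be called at all.

# Second pass over identical content: every lookup should now be a cache hit.
for doc in documents:
    _, cached = get_or_embed(doc)
    assert cached, "expected a cache hit on the second pass"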
4. Reuse cached embeddings during incremental indexing. In production, this is where you avoid recomputing unchanged records and only embed new or modified ones.
from haystack.document_stores.types import DuplicatePolicy

updated_docs = [
    Document(content="Haystack is a framework for building LLM applications."),
    Document(content="Caching embeddings avoids repeated computation during indexing."),
    Document(content="A new document was added to the corpus."),
]

to_write = []
for doc in updated_docs:
    key = content_hash(doc.content)
    if key in embedding_cache:
        doc.embedding = embedding_cache[key]
    else:
        doc = embedder.run([doc])["documents"][0]
        embedding_cache[key] = doc.embedding
    to_write.append(doc)

# Overwrite on duplicate IDs: two of these documents already exist in the store,
# and the default policy raises a DuplicateDocumentError for existing IDs.
document_store.write_documents(to_write, policy=DuplicatePolicy.OVERWRITE)
print("Incremental indexing complete")
5. Query the store to confirm the cached embeddings still support retrieval correctly. If the cache is wired properly, search results should behave the same as with freshly embedded documents.
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

query_embedder = SentenceTransformersTextEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2"
)
query_embedder.warm_up()

retriever = InMemoryEmbeddingRetriever(document_store=document_store)

query_embedding = query_embedder.run(text="How do I avoid recomputing embeddings?")["embedding"]
retrieved_docs = retriever.run(query_embedding=query_embedding, top_k=2)["documents"]
for doc in retrieved_docs:
    print(doc.content)
Testing It
Within a single run, a second pass over unchanged documents should report cache hits and make no further calls to embedder.run(). Keep in mind that embedding_cache is a plain in-process dict: if you run the script twice, the second process starts with a cold cache unless you persist the dict between runs (a minimal sketch follows). If you change one document string and rerun, only that document should miss the cache and get a new embedding.
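Because the cache is a plain dict, persisting it is a one-liner in each direction. A minimal sketch using pickle; the file name embedding_cache.pkl is an arbitrary choice for this example:

import pickle
from pathlib import Path

CACHE_PATH = Path("embedding_cache.pkl")  # arbitrary file name for this sketch

# At startup, merge in any cache saved by a previous run.
if CACHE_PATH.exists():
    embedding_cache.update(pickle.loads(CACHE_PATH.read_bytes()))

# After indexing, save the cache for the next run.
CACHE_PATH.write_bytes(pickle.dumps(embedding_cache))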
Also check that retrieval still returns sensible matches after cached writes. A good sanity test is querying for “embedding caching” and confirming that the relevant document ranks near the top.
If you want stricter verification, log the number of cache hits versus misses and compare the runtime of a cold pass against a warm one. On any non-trivial corpus, the warm pass should be noticeably faster.
Next Steps
- Move the in-memory cache into Redis or SQLite so it survives process restarts (see the sketch after this list).
- Store both the content hash and the model name in your cache key so model upgrades invalidate old vectors cleanly.
- Wrap this pattern inside a Haystack pipeline with separate indexing and query paths.
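As a starting point for the first two items, here is a minimal sketch of a SQLite-backed cache keyed by both the content hash and the model name. The file and table names are arbitrary choices for this sketch, and the vector is stored as a pickled blob for simplicity:

import pickle
import sqlite3

conn = sqlite3.connect("embedding_cache.db")  # arbitrary file name
conn.execute(
    "CREATE TABLE IF NOT EXISTS cache (key TEXT, model TEXT, vector BLOB, "
    "PRIMARY KEY (key, model))"
)

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"

def cache_get(key: str):
    # Returns the cached vector for this (hash, model) pair, or None on a miss.
    row = conn.execute(
        "SELECT vector FROM cache WHERE key = ? AND model = ?", (key, MODEL_NAME)
    ).fetchone()
    return pickle.loads(row[0]) if row else None

def cache_put(key: str, vector) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO cache VALUES (?, ?, ?)",
        (key, MODEL_NAME, pickle.dumps(vector)),
    )
    conn.commit()

Swapping get_or_embed's dict lookups for cache_get and cache_put gives you the same hit/miss behavior across process restarts, and changing MODEL_NAME automatically misses on vectors produced by an older model.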
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.