LangChain Tutorial (Python): caching embeddings for intermediate developers

By Cyprian Aarons · Updated 2026-04-21
langchain · caching-embeddings-for-intermediate-developers · python

This tutorial shows you how to cache embeddings in a LangChain Python app so repeated documents do not get re-embedded on every run. You need this when your pipeline processes the same files often, because embedding calls are one of the most expensive and slowest parts of retrieval workflows.

What You'll Need

  • Python 3.10+
  • langchain
  • langchain-openai
  • langchain-community
  • faiss-cpu
  • An OpenAI API key set as OPENAI_API_KEY
  • A local folder where you can persist cached embeddings and vector indexes

Step-by-Step

  1. Install the dependencies and set up your environment.
    We are using LangChain’s built-in cache wrapper around an embedding model, plus a local FAISS index to make the example realistic.
pip install langchain langchain-openai langchain-community faiss-cpu
export OPENAI_API_KEY="your-api-key"
  2. Create a cached embedding model.
    CacheBackedEmbeddings stores embeddings in a byte store so repeated texts reuse prior results instead of calling the provider again.
from langchain_openai import OpenAIEmbeddings
from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore

underlying_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
store = LocalFileStore("./embedding_cache")

cached_embeddings = CacheBackedEmbeddings.from_bytes_store(
    underlying_embeddings=underlying_embeddings,
    document_embedding_cache=store,
    # Namespace the cache by model name so entries from different embedding models don't collide
    namespace=underlying_embeddings.model,
)
  3. Build a small document set and create a vector store from it.
    The first run will compute embeddings and write them to disk. Later runs with the same text will hit the cache.
from langchain_core.documents import Document
from langchain_community.vectorstores import FAISS

docs = [
    Document(page_content="Claims processing requires accurate document classification."),
    Document(page_content="Fraud detection often uses retrieval over historical cases."),
    Document(page_content="Policy wording should be searchable by semantic meaning."),
]

vectorstore = FAISS.from_documents(docs, cached_embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})
  4. Query the retriever and inspect results.
    This proves the vector store is working, and it also gives you a repeatable path for testing whether cached embeddings are being reused across runs.
query = "How do I search policy language by meaning?"
results = retriever.invoke(query)

for i, doc in enumerate(results, start=1):
    print(f"{i}. {doc.page_content}")
  5. Reuse the same cache in another run or process.
    If you run the same script again with the same texts, LangChain will load embeddings from ./embedding_cache instead of recomputing them.
from langchain_openai import OpenAIEmbeddings
from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore

underlying_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
store = LocalFileStore("./embedding_cache")

cached_embeddings = CacheBackedEmbeddings.from_bytes_store(
    underlying_embeddings=underlying_embeddings,
    document_embedding_cache=store,
    # Must match the namespace used when the cache was written, or nothing will be reused
    namespace=underlying_embeddings.model,
)

texts = [
    "Claims processing requires accurate document classification.",
    "Fraud detection often uses retrieval over historical cases.",
]

vectors = cached_embeddings.embed_documents(texts)
print(len(vectors), "embeddings loaded or computed")

Testing It

Run the script once and watch it complete normally while creating files under ./embedding_cache. Then run it again with the same inputs; you should see faster execution because those embeddings are now coming from disk instead of the API.
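If you want to verify the cache contents directly, the byte store exposes its keys. A minimal check, reusing the store from the earlier steps (or recreating it against the same folder), might look like this:

from langchain.storage import LocalFileStore

store = LocalFileStore("./embedding_cache")

# Each key corresponds to one cached document embedding
cache_keys = list(store.yield_keys())
print(len(cache_keys), "cached embeddings on disk")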

If you want a more explicit test, add timing around embed_documents() on the first and second run and compare durations. For production work, this matters most when you re-index unchanged source documents during scheduled jobs or app restarts.
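A rough timing sketch, using only the standard library and the cached_embeddings object defined above, could look like the following. The example texts here are placeholders; use texts that are not already in the cache so the first call actually hits the API, then the second call should be dramatically faster:

import time

# Texts that have not been embedded before (delete ./embedding_cache first if unsure)
texts = [
    "Subrogation recovery depends on clear liability documentation.",
    "Underwriting guidelines benefit from semantic search over prior policies.",
]

start = time.perf_counter()
cached_embeddings.embed_documents(texts)
print(f"first call:  {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
cached_embeddings.embed_documents(texts)
print(f"second call: {time.perf_counter() - start:.2f}s")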

You can also change one document slightly and rerun it. Only the modified text should require a fresh embedding; everything else should still come from cache.
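One way to observe this, assuming the store and cached_embeddings objects from earlier, is to count cache entries before and after embedding a slightly edited sentence; only the new wording should add an entry:

before = len(list(store.yield_keys()))

cached_embeddings.embed_documents(
    ["Policy wording should be searchable by semantic meaning and intent."]
)

after = len(list(store.yield_keys()))
print(after - before, "new embedding(s) computed")  # expect 1: only the edited text is new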

Next Steps

  • Add a persistent vector database like Chroma or pgvector if your retrieval layer needs multi-process access.
  • Learn how to combine embedding caching with document hashing so you can invalidate stale entries cleanly (a minimal sketch follows this list).
  • Wrap ingestion in a background job so indexing new content never blocks user-facing requests.
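As a starting point for the document-hashing idea, here is a minimal sketch. It assumes each Document carries a source field in its metadata and uses a hypothetical doc_hashes.json file to remember what was indexed last time; adapt both to your own pipeline:

import hashlib
import json
from pathlib import Path

# Hypothetical index file; adjust the path to your project layout
HASH_INDEX = Path("./embedding_cache/doc_hashes.json")

def select_changed(docs):
    """Keep only documents whose content hash differs from the previous run."""
    previous = json.loads(HASH_INDEX.read_text()) if HASH_INDEX.exists() else {}
    current, changed = {}, []
    for doc in docs:
        source = doc.metadata["source"]  # assumes a "source" metadata field
        digest = hashlib.sha256(doc.page_content.encode("utf-8")).hexdigest()
        current[source] = digest
        if previous.get(source) != digest:
            changed.append(doc)
    HASH_INDEX.write_text(json.dumps(current))
    return changed

Feed only select_changed(docs) into your indexing step so unchanged documents are never re-embedded or re-written to the vector store.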

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
