LangChain Tutorial (Python): caching embeddings for advanced developers

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows how to cache embeddings in a LangChain Python pipeline so repeated document chunks do not get re-embedded on every run. You need this when your ingestion job is expensive, your vector store rebuilds often, or you want deterministic local development without burning API calls.

What You'll Need

  • Python 3.10+
  • langchain
  • langchain-openai
  • langchain-community
  • faiss-cpu
  • An OpenAI API key set as OPENAI_API_KEY
  • A writable local directory for the embedding cache

Install the packages:

pip install langchain langchain-openai langchain-community faiss-cpu

Step-by-Step

  1. Start with a persistent embedding cache backed by a local file store.
    LangChain's built-in cache wrapper for embeddings is CacheBackedEmbeddings, which memoizes vectors in any key-value byte store; a LocalFileStore is the simplest durable option for production-like local testing. Do not confuse this with set_llm_cache, which caches LLM responses rather than embeddings.
import os

from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore
from langchain_openai import OpenAIEmbeddings

# Requires OPENAI_API_KEY to be set in your environment
assert os.environ.get("OPENAI_API_KEY"), "Set OPENAI_API_KEY before running"

# Persist cached embeddings as files under a local directory
store = LocalFileStore("./embedding_cache/")

underlying_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Namespacing by model name keeps vectors from different models separate
embeddings = CacheBackedEmbeddings.from_bytes_store(
    underlying_embeddings,
    store,
    namespace=underlying_embeddings.model,
)
print("Cache initialized")
  2. Build a small document set and split it into chunks.
    Caching only helps if repeated text hits the same embedding request, so chunking needs to be stable across runs.
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

docs = [
    Document(page_content="Policy A: Claims must be submitted within 30 days."),
    Document(page_content="Policy B: Claims must include supporting invoices."),
]

splitter = RecursiveCharacterTextSplitter(chunk_size=40, chunk_overlap=0)
chunks = splitter.split_documents(docs)

for i, chunk in enumerate(chunks):
    print(i, chunk.page_content)
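
Because cache keys are derived from chunk text, a quick way to verify stability is to fingerprint each chunk with a content hash; identical hashes across runs mean identical cache keys. This sketch uses Python's standard hashlib, not a LangChain API.

import hashlib

# Identical digests across runs mean the splitter is deterministic
for chunk in chunks:
    digest = hashlib.sha256(chunk.page_content.encode("utf-8")).hexdigest()
    print(digest[:12], chunk.page_content)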
  3. Wrap the embedding function with a local memoization layer for repeat calls inside the same process.
    The file-backed cache helps across runs, but an in-memory layer removes duplicate work during one ingestion job.
from functools import lru_cache

# Return a tuple: lru_cache needs hashable values, and tuples are immutable
@lru_cache(maxsize=1024)
def embed_text(text: str):
    return tuple(embeddings.embed_query(text))

sample = "Policy A: Claims must be submitted within 30 days."
vec1 = embed_text(sample)
vec2 = embed_text(sample)

print(len(vec1), vec1 == vec2)
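
functools.lru_cache exposes hit statistics, so you can confirm the memoization worked without any external tooling:

# After two identical calls, expect one miss (first call) and one hit
print(embed_text.cache_info())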
  4. Create a FAISS vector store from the chunks using the cached embeddings.
    This is where you get the actual savings: repeated chunks across re-indexing runs reuse cached vectors instead of calling the model again.
from langchain_community.vectorstores import FAISS

texts = [chunk.page_content for chunk in chunks]

vectorstore = FAISS.from_texts(
    texts=texts,
    embedding=embeddings,
)

print("Vector store created with", len(texts), "chunks")
  5. Rebuild the same index again and compare timings to confirm cached behavior at the application level.
    Note that if step 4 already ran in this process, both rebuilds below will be warm; for the clearest contrast, delete the cache directory and restart so the first rebuild pays for real embedding calls while the second is served from the local cache and in-process memoization.
import time

start = time.time()
_ = FAISS.from_texts(texts=texts, embedding=embeddings)
first_run = time.time() - start

start = time.time()
_ = FAISS.from_texts(texts=texts, embedding=embeddings)
second_run = time.time() - start

print(f"First run: {first_run:.4f}s")
print(f"Second run: {second_run:.4f}s")

Testing It

Run the script twice without deleting the embedding_cache/ directory. On the second execution, you should see less API activity and typically lower latency when rebuilding the vector store.

If you want stronger proof, add logging around your OpenAI requests or inspect your provider dashboard for reduced embedding calls. For a more direct check, change one chunk by a single character and confirm only that new text triggers fresh embedding work.
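
Here is a sketch of that last check, counting store keys before and after embedding a one-character variant of an existing chunk (the edited text is illustrative):

# Exactly one new key means only the changed text was re-embedded
before = len(list(store.yield_keys()))
embeddings.embed_documents(["Policy A: Claims must be submitted within 31 days."])
after = len(list(store.yield_keys()))
print("new cache entries:", after - before)  # expect 1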

The important test is consistency: same input text should produce the same cached vector lookup every time. If your chunking changes between runs, your cache hit rate will drop fast.

Next Steps

  • Add Redis-backed caching if you need shared cache access across multiple workers (see the sketch after this list).
  • Persist and version your chunking strategy so cache keys stay stable across deployments.
  • Move from FAISS to your production vector database and keep the same cached embedding layer in front of it.
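
For the Redis option, the change is confined to the store layer. A minimal sketch, assuming langchain_community's RedisStore and a Redis instance on the default local port; everything downstream stays unchanged:

from langchain.embeddings import CacheBackedEmbeddings
from langchain_community.storage import RedisStore
from langchain_openai import OpenAIEmbeddings

# Every worker pointing at this Redis shares the same cached vectors
redis_store = RedisStore(redis_url="redis://localhost:6379")

underlying = OpenAIEmbeddings(model="text-embedding-3-small")
shared_embeddings = CacheBackedEmbeddings.from_bytes_store(
    underlying,
    redis_store,
    namespace=underlying.model,
)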

By Cyprian Aarons, AI Consultant at Topiax.