LlamaIndex Tutorial (Python): caching embeddings for intermediate developers
This tutorial shows you how to cache LlamaIndex embeddings in Python so repeated runs stop paying the embedding API cost for the same text. You need this when you’re iterating on retrieval pipelines, rebuilding indexes often, or processing documents in batches where duplicate chunks show up across runs.
What You'll Need
- Python 3.10+
- llama-index
- An embedding provider package, for example:
  - llama-index-embeddings-openai
  - openai
- An OpenAI API key set in OPENAI_API_KEY
- A local project directory where you can write cache files
- Basic familiarity with VectorStoreIndex, Document, and Settings
Step-by-Step
- Start by installing the packages you need. The cache is just a local file-backed store, so there’s no extra infrastructure to set up.
pip install llama-index llama-index-embeddings-openai openai
- Configure your embedding model and create a cache wrapper around it. The important part is wrapping the real embed model with CacheEmbedding, which stores vectors keyed by the input text and model configuration; persisting that cache to disk comes in a later step.
import os
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.embeddings import CacheEmbedding
from llama_index.core.storage.kvstore.simple_kvstore import SimpleKVStore
# The OpenAI client reads OPENAI_API_KEY from the environment; fail fast if it's missing.
if "OPENAI_API_KEY" not in os.environ:
    raise RuntimeError("Set OPENAI_API_KEY before running this script")
base_embedding = OpenAIEmbedding(model="text-embedding-3-small")
kvstore = SimpleKVStore()
cached_embedding = CacheEmbedding.from_defaults(
    base_embed_model=base_embedding,
    kvstore=kvstore,
    collection="embeddings_cache",
)
Settings.embed_model = cached_embedding
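As a quick sanity check on this setup, you can embed the same string twice and count what landed in the store. This reuses the cached_embedding and kvstore objects defined above and assumes CacheEmbedding behaves as described, storing one entry per unique text; get_text_embedding and get_all are the standard embedding and SimpleKVStore read methods.
# Embed the same text twice; if the wrapper works as described, only the
# first call should reach the embedding API and only one entry gets stored.
vec_a = cached_embedding.get_text_embedding("hello cache")
vec_b = cached_embedding.get_text_embedding("hello cache")
entries = kvstore.get_all(collection="embeddings_cache")
print(f"Cached entries: {len(entries)}")       # expect 1, not 2
print(f"Identical vectors: {vec_a == vec_b}")  # expect True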
- Build an index from documents using that cached embed model. On the first run, LlamaIndex will call the embedding API; on later runs with the same text and cache store, it will reuse cached vectors.
from llama_index.core import Document, VectorStoreIndex
docs = [
Document(text="LlamaIndex makes retrieval pipelines easier to compose."),
Document(text="Caching embeddings reduces repeated API calls during development."),
]
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()
response = query_engine.query("Why cache embeddings?")
print(response)
- Persist the cache to disk so it survives process restarts. Without persistence, your cache only lives in memory and disappears when the script exits.
import pickle
from pathlib import Path
cache_path = Path("embedding_cache.pkl")
with cache_path.open("wb") as f:
pickle.dump(kvstore.data, f)
print(f"Saved cache entries: {len(kvstore.data)}")
- Load the cache on startup before building new indexes. This is what makes caching useful across separate script runs or CI jobs.
import pickle
from pathlib import Path
from llama_index.core import Settings
from llama_index.core.embeddings import CacheEmbedding
from llama_index.core.storage.kvstore.simple_kvstore import SimpleKVStore
from llama_index.embeddings.openai import OpenAIEmbedding
cache_path = Path("embedding_cache.pkl")
if cache_path.exists():
    with cache_path.open("rb") as f:
        kvstore2 = SimpleKVStore(data=pickle.load(f))
else:
    kvstore2 = SimpleKVStore()
cached_embedding2 = CacheEmbedding.from_defaults(
    base_embed_model=OpenAIEmbedding(model="text-embedding-3-small"),
    kvstore=kvstore2,
    collection="embeddings_cache",
)
Settings.embed_model = cached_embedding2
- Rebuild the index with the loaded cache and confirm it still works. If your documents are unchanged, this second run should hit the local cache instead of recomputing embeddings for every chunk.
from llama_index.core import Document, VectorStoreIndex
docs = [
Document(text="LlamaIndex makes retrieval pipelines easier to compose."),
Document(text="Caching embeddings reduces repeated API calls during development."),
]
index = VectorStoreIndex.from_documents(docs)
print(index.as_query_engine().query("What does caching reduce?"))
Testing It
Run the script twice, either with logging enabled or while watching your OpenAI usage dashboard. On the first run, you should see normal embedding activity; on the second run, repeated texts should come from the local cache.
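If you go the logging route, standard Python logging is usually enough to surface the outgoing requests; at DEBUG level you should see the HTTP calls the OpenAI client makes (or doesn't make) on each run. This is ordinary stdlib logging, not a LlamaIndex-specific switch, and the output is verbose.
import logging
import sys

# DEBUG-level logging prints each outgoing HTTP request, which makes it easy
# to see when an embedding call is skipped in favor of the cache.
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)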
A practical check is to add more duplicate documents and confirm that only new or changed chunks trigger fresh embedding calls. If you want stronger proof, print the number of entries stored in kvstore.data before and after indexing.
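A minimal version of that entry-count check, reusing the docs, kvstore, and collection name from the steps above and reading through get_all rather than touching kvstore.data directly, might look like this:
# Compare cache size before and after indexing; on a warm cache the delta is 0.
before = len(kvstore.get_all(collection="embeddings_cache"))
index = VectorStoreIndex.from_documents(docs)
after = len(kvstore.get_all(collection="embeddings_cache"))
print(f"New embeddings computed this run: {after - before}")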
If you’re using this in a service, restart the process between runs and verify that loading embedding_cache.pkl restores reuse behavior. That tells you your caching is actually persistent, not just memoized in memory.
Next Steps
- Add a real disk-backed KV store instead of pickling SimpleKVStore.data for better durability (a sketch using SimpleKVStore's built-in persistence follows this list).
- Learn how chunking strategy affects cache hit rate when document text changes slightly.
- Combine embedding caching with vector store persistence so both indexing stages survive restarts.
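For the first item above, one low-effort option before reaching for an external KV store is SimpleKVStore's own JSON persistence via persist and from_persist_path. A minimal sketch, using a JSON cache file instead of the pickle file from earlier and leaving the index-building steps as a placeholder, could look like this:
from pathlib import Path
from llama_index.core.storage.kvstore.simple_kvstore import SimpleKVStore

cache_path = "embedding_cache.json"

# Reload the cache as JSON if it exists, otherwise start empty.
if Path(cache_path).exists():
    kvstore = SimpleKVStore.from_persist_path(cache_path)
else:
    kvstore = SimpleKVStore()

# ... wrap your embed model and build indexes as in the steps above ...

# Write the whole store back out as JSON; no pickle and no private attributes.
kvstore.persist(persist_path=cache_path)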
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.