LlamaIndex Tutorial (Python): caching embeddings for advanced developers
This tutorial shows you how to cache LlamaIndex embeddings in Python so repeated index builds stop hammering your embedding provider. You need this when your documents change slowly, your dev loop is expensive, or you want predictable latency and lower API spend.
What You'll Need
- Python 3.10+
- `llama-index`
- An embedding provider package: `llama-index-embeddings-openai` for OpenAI
- An API key set in your environment: `OPENAI_API_KEY`
- A local project directory where you can write cache files
- Basic familiarity with `VectorStoreIndex`, `Document`, and `Settings`
Step-by-Step
- First, install the packages and set up a clean environment. The cache pattern below uses a file-backed store, so you can reuse embeddings across process restarts.

```bash
pip install llama-index llama-index-embeddings-openai
export OPENAI_API_KEY="your-key-here"
```
- Create a persistent cache for embeddings backed by `SimpleKVStore`. The key idea is to keep a file-backed key-value store alongside the embedding model, register the model through `Settings.embed_model`, and have a helper (next step) consult the store before calling the model.

```python
from pathlib import Path

from llama_index.core import Settings
from llama_index.core.embeddings import resolve_embed_model
from llama_index.core.storage.kvstore.simple_kvstore import SimpleKVStore
from llama_index.core.storage.kvstore.types import BaseKVStore

cache_dir = Path("./embedding_cache")
cache_dir.mkdir(exist_ok=True)
cache_path = cache_dir / "embeddings.json"

# Load the existing cache if present; from_persist_path raises on a
# missing file, so fall back to an empty store on the first run.
kvstore: BaseKVStore = (
    SimpleKVStore.from_persist_path(str(cache_path))
    if cache_path.exists()
    else SimpleKVStore()
)

# "default" resolves to OpenAI embeddings when llama-index-embeddings-openai
# is installed and OPENAI_API_KEY is set.
embed_model = resolve_embed_model("default")
Settings.embed_model = embed_model
```
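`SimpleKVStore` is essentially an in-memory dict that serializes to JSON on `persist()`. That makes it a good fit for a single-process dev loop, but it is not built for concurrent writers, which is the main reason to graduate to the Redis-backed store mentioned in Next Steps.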
- Build a small helper that hashes text and checks the cache before calling the model. This keeps the implementation explicit and makes it easy to swap in Redis or another backend later.
```python
import hashlib

def embed_text_with_cache(text: str) -> list[float]:
    # Key on a content hash so identical text always maps to the same entry.
    cache_key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    cached = kvstore.get(cache_key)
    if cached is not None:
        return cached["vector"]
    vector = Settings.embed_model.get_text_embedding(text)
    # SimpleKVStore values are dicts, so wrap the raw vector before storing.
    kvstore.put(cache_key, {"vector": vector})
    # Persisting on every miss is simple but rewrites the whole JSON file;
    # for large batches, persist once after the loop instead.
    kvstore.persist(str(cache_path))
    return vector
```
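A quick sanity check of the helper, using a hypothetical string and assuming the setup above:

```python
v1 = embed_text_with_cache("hello world")  # first call: cache miss, calls the API
v2 = embed_text_with_cache("hello world")  # second call: served from the local store
assert v1 == v2
```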
- Warm the cache before building your index. One caveat worth knowing: `VectorStoreIndex.from_documents()` embeds through `Settings.embed_model` directly, so the loop below populates the standalone cache for later runs rather than intercepting the build itself; the sketch after the next step shows how to feed cached vectors straight into the index. For advanced workflows, this is useful when you want deterministic reuse across re-indexing jobs instead of recomputing every chunk.
```python
from llama_index.core import Document, VectorStoreIndex

docs = [
    Document(text="LlamaIndex can build retrieval indexes from documents."),
    Document(text="Embedding caching reduces repeated API calls."),
]

# Warm the file-backed cache for each document's text.
for doc in docs:
    _ = embed_text_with_cache(doc.text)

index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()
response = query_engine.query("What does embedding caching reduce?")
print(response)
```
- If you want stronger control, cache chunk-level embeddings instead of whole documents. That matters when documents are edited frequently but most chunks stay unchanged. The sketch after this step shows how to hand those cached vectors directly to the index build.
```python
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=128, chunk_overlap=20)
nodes = splitter.get_nodes_from_documents(docs)

# Warm the cache at chunk granularity: unchanged chunks stay cached even
# when sibling chunks in the same document are edited.
for node in nodes:
    _ = embed_text_with_cache(node.text)

print(f"Cached {len(nodes)} chunk embeddings.")
```
Testing It
Run the script twice with the same input documents. On the first run, each unique text should populate the JSON cache file; on the second run, those texts should hit the local store instead of calling the embedding model again.
If you want to verify it more aggressively, add print statements inside `embed_text_with_cache()` for "cache hit" and "cache miss", as in the instrumented version below. You should see misses on the first pass and hits on subsequent runs.
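For example, the same helper as above with logging added:

```python
def embed_text_with_cache(text: str) -> list[float]:
    cache_key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    cached = kvstore.get(cache_key)
    if cached is not None:
        print(f"cache hit:  {cache_key[:12]}")
        return cached["vector"]
    print(f"cache miss: {cache_key[:12]}")
    vector = Settings.embed_model.get_text_embedding(text)
    kvstore.put(cache_key, {"vector": vector})
    kvstore.persist(str(cache_path))
    return vector
```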
Check that `./embedding_cache/embeddings.json` exists after execution and contains hashed keys with serialized vectors. If that file is empty or missing, your persist step is not firing.
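One way to inspect the file directly, assuming `SimpleKVStore`'s on-disk layout of `{collection: {key: value}}` with a default collection named `"data"`:

```python
import json
from pathlib import Path

raw = json.loads(Path("./embedding_cache/embeddings.json").read_text())
entries = raw.get("data", {})  # default collection name, per the assumption above
print(f"{len(entries)} cached embeddings")
```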
Next Steps
- Move from `SimpleKVStore` to Redis if you need shared caching across workers.
- Add document versioning to your cache key so stale embeddings are invalidated when content changes; a hypothetical key scheme is sketched below.
- Cache retrieval results too, especially if your query patterns are repetitive in production.
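For the versioning idea, one hypothetical scheme folds the embedding model name and a version tag into the hashed key, so switching models or bumping the version forces a recompute:

```python
import hashlib

def versioned_cache_key(text: str, model_name: str, doc_version: str = "v1") -> str:
    # Including the model name ensures a model swap never serves mismatched
    # vectors; bumping doc_version invalidates entries derived from old content.
    raw = f"{model_name}:{doc_version}:{text}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()
```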
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.