LlamaIndex Tutorial (Python): caching embeddings for beginners

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows you how to cache embedding calls in a LlamaIndex Python app so repeated document indexing does not hit your embedding provider every time. You need this when you are re-running ingestion during development, rebuilding indexes from the same documents, or trying to cut embedding costs in production.

What You'll Need

  • Python 3.10+
  • llama-index
  • llama-index-embeddings-openai
  • openai API key
  • A local folder with a few text files to index
  • Basic familiarity with VectorStoreIndex and SimpleDirectoryReader

Install the packages:

pip install llama-index llama-index-embeddings-openai openai

Set your OpenAI key:

export OPENAI_API_KEY="your-key-here"

Step-by-Step

  1. Start with a normal LlamaIndex ingestion script.

This first version loads documents, creates embeddings, and builds an index. It works, but every run recomputes embeddings for the same chunks.

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding

documents = SimpleDirectoryReader("./data").load_data()

embed_model = OpenAIEmbedding(model="text-embedding-3-small")

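# Note: every run of this script re-embeds every chunk, even when nothing changed.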
index = VectorStoreIndex.from_documents(
    documents,
    embed_model=embed_model,
)
print("Index built")

  2. Add a persistent cache for embedding results.

LlamaIndex ships a dedicated embedding cache (IngestionCache, used with IngestionPipeline), but the simplest practical pattern for beginners is to persist the full index and reuse it across runs, so unchanged embeddings are never recomputed.

import os
from llama_index.core import SimpleDirectoryReader, StorageContext, load_index_from_storage, VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding

persist_dir = "./storage"
embed_model = OpenAIEmbedding(model="text-embedding-3-small")

if os.path.exists(persist_dir):
    # Cache hit: load the persisted index; nothing is re-read or re-embedded.
    storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
    index = load_index_from_storage(storage_context, embed_model=embed_model)
else:
    # Cache miss: read the documents, embed them, and persist to disk.
    documents = SimpleDirectoryReader("./data").load_data()
    index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
    index.storage_context.persist(persist_dir=persist_dir)

print("Ready")

  3. Make the cache behavior explicit with a reusable ingestion function.

In real projects, you want one function that checks whether storage exists and only embeds new content when needed. This keeps your ingestion code clean and makes it obvious what gets reused.

import os
from llama_index.core import SimpleDirectoryReader, StorageContext, load_index_from_storage, VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding

def build_or_load_index(data_dir: str = "./data", persist_dir: str = "./storage"):
    embed_model = OpenAIEmbedding(model="text-embedding-3-small")

    if os.path.exists(persist_dir):
        # Cache hit: reuse the persisted index instead of re-embedding.
        storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
        return load_index_from_storage(storage_context, embed_model=embed_model)

    # Cache miss: load documents only when we actually need to embed them.
    documents = SimpleDirectoryReader(data_dir).load_data()
    index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
    index.storage_context.persist(persist_dir=persist_dir)
    return index

index = build_or_load_index()
print("Index ready")

  4. Query the loaded index to confirm it behaves like a cached build.

If the storage directory is present, loading should be much faster than re-indexing from scratch. The query path stays the same either way.

query_engine = index.as_query_engine()
response = query_engine.query("What is in these documents?")
print(response)

  5. Add a simple change test so you can see caching in action.

Edit one file in ./data, rerun the script, and compare runtimes. Because this pattern loads the persisted index whenever ./storage exists, the second run stays fast, but note that your edit will not appear in results until you delete ./storage and rebuild.

import time

start = time.time()
index = build_or_load_index()
elapsed = time.time() - start

print(f"Load/build took {elapsed:.2f} seconds")
print(index.as_query_engine().query("Summarize the documents"))

Testing It

Run the script once with an empty ./storage directory. That first run should take longer because embeddings are generated and written to disk.

Run it again without changing any files. The second run should be noticeably faster because LlamaIndex loads the stored index instead of rebuilding it.
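
A quick shell comparison makes the difference obvious (assuming the script is saved as ingest.py, which is just an example name):

# First run: ./storage is empty, so embeddings are generated and persisted
time python ingest.py

# Second run: the index loads from ./storage with no embedding calls
time python ingest.py

# Delete the storage directory to force a full rebuild
rm -rf ./storage
time python ingest.py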

Then modify one document in ./data and rerun. If the app still loads from storage without picking up the change, the basic persistence pattern is working as designed; for true per-document cache invalidation you need a more advanced incremental ingestion setup.
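
When you reach that point, LlamaIndex's IngestionPipeline can attach an IngestionCache, which hashes each node plus its transformations and reuses results it has already computed. A minimal sketch of the idea (the storage path is just an example):

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.ingestion import IngestionCache, IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding

documents = SimpleDirectoryReader("./data").load_data()
embed_model = OpenAIEmbedding(model="text-embedding-3-small")

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),  # chunk documents into nodes
        embed_model,         # embed each node
    ],
    cache=IngestionCache(),  # hashes node + transformation, reuses prior results
)

nodes = pipeline.run(documents=documents)
pipeline.persist("./pipeline_storage")  # save the cache for the next run

index = VectorStoreIndex(nodes=nodes, embed_model=embed_model)

On later runs, calling pipeline.load("./pipeline_storage") before pipeline.run(...) restores the cache, so only changed documents are re-embedded.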

Next Steps

  • Learn IngestionPipeline so you can do incremental updates instead of full rebuilds.
  • Add a vector store backend like Chroma or Pinecone when your dataset outgrows local storage (a Chroma sketch follows this list).
  • Look into embedding model batching and chunking strategy to reduce total API calls further.
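
For the vector store route, here is a minimal Chroma sketch (assumes pip install llama-index-vector-stores-chroma chromadb; the path and collection name are just examples):

import chromadb
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore

# Chroma persists vectors itself, so embeddings survive across runs
db = chromadb.PersistentClient(path="./chroma_db")
collection = db.get_or_create_collection("my_docs")
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model=OpenAIEmbedding(model="text-embedding-3-small"),
)

For the batching bullet, OpenAIEmbedding also accepts an embed_batch_size argument if you want to tune how many chunks are sent per API request.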

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.

