AutoGen Tutorial (Python): caching embeddings for advanced developers

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows you how to cache embeddings in a Python AutoGen workflow so repeated retrieval calls stop burning tokens and adding latency. You need this when your agent keeps re-embedding the same documents, chat history, or knowledge base across runs and you want predictable performance.

What You'll Need

  • Python 3.10+
  • pyautogen
  • chromadb
  • openai
  • An OpenAI API key in OPENAI_API_KEY
  • A local project folder with write access for the embedding cache
  • Basic familiarity with AutoGen agents and retrieval

Step-by-Step

  1. Start by installing the packages and setting up your environment. We’ll use Chroma as the persistent vector store because it gives you durable caching without writing custom serialization code.
pip install pyautogen chromadb openai
export OPENAI_API_KEY="your-api-key"
  2. Define a persistent embedding client and a cache-backed collection. The important part is PersistentClient, which keeps embeddings on disk across process restarts instead of recomputing them every time.
import os
import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

persist_dir = "./chroma_cache"
client = chromadb.PersistentClient(path=persist_dir)

embedding_fn = OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-small",
)

collection = client.get_or_create_collection(
    name="policy_docs",
    embedding_function=embedding_fn,
)
  3. Add documents once, then query them many times. Chroma stores the embeddings alongside the text, so repeated queries reuse the persisted vectors instead of rebuilding them.
docs = [
    "Claims must be filed within 30 days of the incident.",
    "Policyholders can request a coverage review once per quarter.",
    "Fraud investigations require manager approval before escalation.",
]

ids = ["doc_1", "doc_2", "doc_3"]

existing = collection.count()
if existing == 0:
    collection.add(
        ids=ids,
        documents=docs,
        metadatas=[{"source": "handbook"}] * len(docs),
    )

results = collection.query(
    query_texts=["How long do I have to file a claim?"],
    n_results=2,
)
print(results["documents"][0])
  4. Wire the cached retrieval into an AutoGen assistant. Here we expose a simple function that searches the cached collection; the registration sketch after the code block shows one way to let the agent call it when needed.
from autogen import AssistantAgent, UserProxyAgent

def retrieve_policy(query: str) -> str:
    result = collection.query(query_texts=[query], n_results=2)
    chunks = result["documents"][0]
    return "\n".join(chunks)

assistant = AssistantAgent(
    name="assistant",
    llm_config={
        "model": "gpt-4o-mini",
        "api_key": os.environ["OPENAI_API_KEY"],
    },
)

user_proxy = UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
)

print(retrieve_policy("When do claims need to be filed?"))
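
The block above defines the retrieval function but never registers it, so the assistant cannot call it on its own yet. A minimal sketch of that wiring, assuming a recent pyautogen release that exports register_function (the tool name and description are illustrative):
from autogen import register_function

# The assistant decides when to call the tool; the user proxy executes it
# and feeds the cached retrieval results back into the conversation.
register_function(
    retrieve_policy,
    caller=assistant,
    executor=user_proxy,
    name="retrieve_policy",
    description="Search the cached policy collection for relevant passages.",
)

user_proxy.initiate_chat(
    assistant,
    message="How long do I have to file a claim?",
    max_turns=2,
)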
  5. Make the cache explicit in your workflow so you can reuse it across jobs, workers, or notebook sessions. In production, this is what stops every new run from rebuilding the same embedding index from scratch.
def warm_cache() -> None:
    seed_docs = [
        "Coverage disputes go through legal review.",
        "Appeals must include supporting documentation.",
        "Escalations are tracked in the case management system.",
    ]
    seed_ids = ["seed_1", "seed_2", "seed_3"]

    # Three handbook docs from step 3 plus these three seeds
    if collection.count() < 6:
        collection.add(ids=seed_ids, documents=seed_docs)

warm_cache()

for question in [
    "What happens during an appeal?",
    "Where are escalations tracked?",
]:
    print(f"\nQ: {question}")
    print(retrieve_policy(question))

Testing It

Run the script twice. On the first run, Chroma creates the persistent store and writes embeddings to disk; on the second run, it should reuse that data without needing to rebuild your corpus.
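
If you want to see the effect directly, one rough option is to time the guarded add from step 3 with the standard library; the first run pays for the embedding calls, while the second run should be near-instant:
import time

start = time.perf_counter()
if collection.count() == 0:
    collection.add(ids=ids, documents=docs)
print(f"Guarded add took {time.perf_counter() - start:.2f}s")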

Check that ./chroma_cache exists after execution and that collection.count() stays stable between runs. If you want a stronger signal, add logging around collection.add() and confirm it only executes on an empty store.
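
A minimal sketch of that logging, using only the standard library (the logger name is arbitrary):
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("embedding_cache")

if collection.count() == 0:
    log.info("Cache empty, embedding %d documents", len(docs))
    collection.add(ids=ids, documents=docs)
else:
    log.info("Cache hit, reusing %d stored embeddings", collection.count())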

You should also verify retrieval quality by asking semantically similar questions with different wording. If "How long do I have to file a claim?" returns the document about filing windows, your cache-backed embedding path is working.
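
A quick sanity check along those lines, reusing the phrasing from the step 3 documents (a smoke test, not a real evaluation):
answer = retrieve_policy("How long do I have to file a claim?")
assert "30 days" in answer, "expected the filing-window document in the top results"
print("Retrieval sanity check passed")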

Next Steps

  • Add metadata filters for tenant ID, product line, or jurisdiction before querying (see the sketch after this list).
  • Swap Chroma for a managed vector database if you need multi-node persistence.
  • Wrap retrieval in an AutoGen tool/function call so multiple agents can share the same cached knowledge base.
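
For the first bullet, Chroma's query call accepts a where filter over document metadata. A minimal sketch, assuming the documents were added with a tenant_id metadata field (the field name and value are illustrative):
results = collection.query(
    query_texts=["How long do I have to file a claim?"],
    n_results=2,
    where={"tenant_id": "acme-insurance"},
)
print(results["documents"][0])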


By Cyprian Aarons, AI Consultant at Topiax.
