AutoGen Tutorial (Python): caching embeddings for intermediate developers

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows you how to add a persistent embedding cache to an AutoGen Python workflow so repeated document lookups stop paying the embedding cost every run. You need this when your agent keeps re-processing the same chunks, because embeddings are one of the easiest places to waste time and API spend.

What You'll Need

  • Python 3.10+
  • autogen-agentchat
  • autogen-ext
  • chromadb
  • openai
  • An OpenAI API key in OPENAI_API_KEY
  • A local project folder with write access for the cache directory

Install the packages:

pip install autogen-agentchat autogen-ext chromadb openai

Step-by-Step

  1. First, create a small embedding client and a persistent Chroma collection. The key detail is PersistentClient, which stores vectors on disk instead of rebuilding them every time your script runs.
import os
import chromadb
from autogen_ext.models.openai import OpenAIChatCompletionClient

client = chromadb.PersistentClient(path="./chroma_cache")
collection = client.get_or_create_collection(name="policy_chunks")

llm_client = OpenAIChatCompletionClient(
    model="gpt-4o-mini",
    api_key=os.environ["OPENAI_API_KEY"],
)
  2. Next, define a helper that checks whether a chunk is already cached before creating a new embedding record. This avoids duplicate inserts and gives you deterministic reuse across runs.
from openai import OpenAI

embedder = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def get_embedding(text: str) -> list[float]:
    response = embedder.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding

def cache_chunk(chunk_id: str, text: str) -> None:
    existing = collection.get(ids=[chunk_id])
    if existing["ids"]:
        return

    vector = get_embedding(text)
    collection.add(
        ids=[chunk_id],
        documents=[text],
        embeddings=[vector],
        metadatas=[{"source": "policy"}],
    )
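The skip-if-cached pattern above is easy to verify in isolation. The sketch below is not the Chroma API: a plain dict stands in for the collection, and fake_embedding is a hypothetical stub that only counts how many times it is called, which is the quantity the cache is supposed to minimize.

```python
# In-memory sketch of the skip-if-cached pattern. A dict stands in for the
# Chroma collection; the stub embedder counts calls so you can confirm that
# re-inserting the same chunk ID never triggers a second embedding.

call_count = 0
store: dict[str, list[float]] = {}

def fake_embedding(text: str) -> list[float]:
    """Stub for the real embedding call; only the call count matters here."""
    global call_count
    call_count += 1
    return [float(len(text))]  # placeholder vector

def cache_chunk_sketch(chunk_id: str, text: str) -> None:
    if chunk_id in store:      # same check as collection.get(ids=[chunk_id])
        return                 # cache hit: no embedding call
    store[chunk_id] = fake_embedding(text)

cache_chunk_sketch("chunk_001", "Claims must be filed within 30 days.")
cache_chunk_sketch("chunk_001", "Claims must be filed within 30 days.")  # skipped
print(call_count)  # 1
```

The real cache_chunk does the same thing, just with a round trip to the persistent store instead of a dict lookup.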
  3. Then, add a retrieval function that uses the cached vectors instead of recomputing them. In production, this is where you turn raw text into reusable context for an agent.
def search_chunks(query: str, k: int = 3):
    query_vector = get_embedding(query)
    results = collection.query(
        query_embeddings=[query_vector],
        n_results=k,
        include=["documents", "metadatas", "distances"],
    )
    return results

sample_chunks = {
    "chunk_001": "Claims must be filed within 30 days of the incident.",
    "chunk_002": "High-value claims require manager approval.",
    "chunk_003": "Fraud indicators should be escalated immediately.",
}

for chunk_id, text in sample_chunks.items():
    cache_chunk(chunk_id, text)
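One detail worth knowing before you consume the results: Chroma's query() returns each field as a list of lists, one inner list per query embedding, so even a single-query call is nested one level deep. The helper below flattens that shape for one query; sample_results is a hand-built dict that mimics the returned structure, not output from a real call.

```python
# Chroma returns one inner list per query, so a single-query call still
# nests everything one level deep. This helper pairs each document with
# its distance for the first (and here, only) query.

def flatten_hits(results: dict) -> list[tuple[str, float]]:
    docs = results["documents"][0]
    dists = results["distances"][0]
    return list(zip(docs, dists))

# Hand-built dict mimicking the shape collection.query() returns.
sample_results = {
    "documents": [["High-value claims require manager approval.",
                   "Claims must be filed within 30 days of the incident."]],
    "distances": [[0.21, 0.47]],
}

for doc, dist in flatten_hits(sample_results):
    print(f"{dist:.2f}  {doc}")
```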
  4. Now wire the retrieved context into an AutoGen agent. The agent itself does not manage embedding storage; your app layer does, and that separation keeps things predictable.
from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.messages import TextMessage

agent = AssistantAgent(
    name="policy_assistant",
    model_client=llm_client,
    system_message="Answer using only the provided policy context.",
)

query = "What happens if a claim is large?"
hits = search_chunks(query)

context_lines = []
for doc in hits["documents"][0]:
    context_lines.append(f"- {doc}")

prompt = "Policy context:\n" + "\n".join(context_lines) + f"\n\nQuestion: {query}"

message = TextMessage(content=prompt, source="user")
  5. Finally, run the agent against the cached context and inspect the result. If you rerun the script, Chroma will reuse the same stored vectors instead of starting from zero.
import asyncio

from autogen_core import CancellationToken

async def main():
    result = await agent.on_messages([message], cancellation_token=CancellationToken())
    print(result.chat_message.content)

if __name__ == "__main__":
    asyncio.run(main())

Testing It

Run the script twice with the same inputs. On the first run, Chroma creates the persistent store and writes embeddings; on the second run, it should skip re-adding existing chunk IDs and reuse what is already on disk.

Check that ./chroma_cache exists after execution and contains persisted data files. If you want to verify behavior more directly, add a print inside cache_chunk() when it inserts versus when it returns early.
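The hit/miss logging suggested above can be prototyped without touching the real store. In this sketch, FakeCollection is a hypothetical stand-in exposing only the two methods cache_chunk uses (get and add), and the embedding is a placeholder vector, so the log lines are the whole point.

```python
# Sketch of hit/miss logging for cache verification. FakeCollection is a
# hypothetical stand-in for the Chroma collection; the embedding is a
# placeholder so this runs offline.

class FakeCollection:
    def __init__(self):
        self._rows: dict[str, str] = {}

    def get(self, ids):
        return {"ids": [i for i in ids if i in self._rows]}

    def add(self, ids, documents, embeddings, metadatas):
        for i, doc in zip(ids, documents):
            self._rows[i] = doc

fake_collection = FakeCollection()

def cache_chunk_logged(chunk_id: str, text: str) -> None:
    if fake_collection.get(ids=[chunk_id])["ids"]:
        print(f"HIT  {chunk_id}: reusing stored embedding")
        return
    print(f"MISS {chunk_id}: embedding and inserting")
    fake_collection.add(ids=[chunk_id], documents=[text],
                        embeddings=[[0.0]], metadatas=[{"source": "policy"}])

cache_chunk_logged("chunk_001", "Claims must be filed within 30 days.")
cache_chunk_logged("chunk_001", "Claims must be filed within 30 days.")
# prints MISS on the first call, HIT on the second
```

Dropping the same prints into the real cache_chunk gives you the per-run evidence described above.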

Also test with a new chunk ID and confirm only that chunk gets embedded and added. That tells you your cache keying strategy is working, which matters more than raw vector search speed.

Next Steps

  • Add TTL or versioning to your cache keys so updated documents invalidate old embeddings cleanly.
  • Move from exact chunk IDs to content hashes so duplicate text across sources collapses into one cached vector.
  • Add hybrid retrieval with metadata filters for department, product line, or document version.
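The content-hash idea from the list above can be sketched with the standard library alone. content_key is a hypothetical helper, and the whitespace normalization is one possible policy, not a requirement; the point is that identical text always maps to the same cache key, so duplicates across sources collapse into one stored vector.

```python
import hashlib

# Sketch of content-hash cache keying: identical text always produces the
# same key, regardless of which source document it came from.

def content_key(text: str) -> str:
    normalized = " ".join(text.split())  # collapse whitespace before hashing
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]

a = content_key("Claims must be filed within 30 days.")
b = content_key("Claims  must be filed\nwithin 30 days.")  # same text, messy spacing
print(a == b)  # True
```

Truncating the digest to 16 hex characters keeps IDs readable; use the full digest if you are worried about collisions at scale.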

By Cyprian Aarons, AI Consultant at Topiax.