AutoGen Tutorial (Python): caching embeddings for beginners

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows how to cache embeddings in a Python AutoGen workflow so repeated document lookups stop paying the embedding cost every time. You need this when your agent keeps re-processing the same files, prompts, or chunks and you want lower latency, fewer API calls, and more predictable costs.

What You'll Need

  • Python 3.10+
  • autogen-agentchat
  • autogen-ext
  • openai
  • diskcache
  • An OpenAI API key set as OPENAI_API_KEY
  • A small text corpus to embed, such as product docs or policy snippets

Install the packages:

pip install autogen-agentchat autogen-ext openai diskcache

Step-by-Step

  1. Start by creating a local embedding cache.
    We’ll store vectors on disk keyed by the exact text input, so the same chunk never gets embedded twice unless it changes.
import hashlib
from diskcache import Cache

# Persistent on-disk cache; the directory is created if it doesn't exist.
cache = Cache("./embedding_cache")

def cache_key(text: str) -> str:
    # Hash the exact input text so keys stay a fixed, filesystem-safe length.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def get_cached_embedding(text: str):
    return cache.get(cache_key(text))

def set_cached_embedding(text: str, embedding: list[float]):
    cache.set(cache_key(text), embedding)
  2. Create a small embedding client with the OpenAI SDK.
    AutoGen’s OpenAIChatCompletionClient covers chat models (we import it here because the assistant in step 4 uses it); for embeddings we call the OpenAI async client directly rather than raw HTTP.
import os
import asyncio
from autogen_ext.models.openai import OpenAIChatCompletionClient
from openai import AsyncOpenAI

client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])

async def embed_text(text: str) -> list[float]:
    response = await client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding
  3. Wrap the embed call with a cache lookup.
    The first request computes and stores the vector; later requests hit disk and return immediately.
async def get_embedding(text: str) -> list[float]:
    cached = get_cached_embedding(text)
    if cached is not None:
        return cached

    embedding = await embed_text(text)
    set_cached_embedding(text, embedding)
    return embedding
  4. Use cached embeddings inside an AutoGen assistant workflow.
    Here we keep it simple: generate embeddings for the chunks before sending them into your retrieval or ranking logic. The assistant is defined now but not yet invoked; it comes into play once retrieval is wired in.
from autogen_agentchat.agents import AssistantAgent

assistant = AssistantAgent(
    name="doc_helper",
    model_client=OpenAIChatCompletionClient(
        model="gpt-4o-mini",
        api_key=os.environ["OPENAI_API_KEY"],
    ),
)

documents = [
    "Claims must be filed within 30 days.",
    # Duplicate on purpose: the second lookup should be a cache hit.
    "Claims must be filed within 30 days.",
    "Policy renewal happens annually.",
]

async def main():
    vectors = []
    for doc in documents:
        vectors.append(await get_embedding(doc))

    print(f"Embedded {len(vectors)} documents")
    print(f"First vector length: {len(vectors[0])}")

if __name__ == "__main__":
    asyncio.run(main())
  5. Add a simple invalidation rule when content changes.
    In production, cache by normalized text plus a version string and the embedding model name, so model swaps, schema changes, or prompt rewrites don’t reuse stale vectors. A sketch wiring these helpers into the lookup follows the code.
CACHE_VERSION = "v1"
EMBEDDING_MODEL = "text-embedding-3-small"

def cache_key_v2(text: str) -> str:
    # Bump CACHE_VERSION (or switch models) to invalidate every stored vector.
    payload = f"{CACHE_VERSION}:{EMBEDDING_MODEL}:{text.strip()}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def get_cached_embedding_v2(text: str):
    return cache.get(cache_key_v2(text))

def set_cached_embedding_v2(text: str, embedding: list[float]):
    cache.set(cache_key_v2(text), embedding)
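
If you adopt the versioned keys, swap the v2 helpers into the read-through wrapper from step 3. A minimal sketch; get_embedding_v2 is a name introduced here for illustration, not an AutoGen API:

async def get_embedding_v2(text: str) -> list[float]:
    # Same read-through pattern as get_embedding, but with versioned keys.
    cached = get_cached_embedding_v2(text)
    if cached is not None:
        return cached

    embedding = await embed_text(text)
    set_cached_embedding_v2(text, embedding)
    return embedding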

Testing It

Run the script twice with the same input texts. On the first run, you should see normal latency from the embedding API; on the second run, most calls should come straight from diskcache.

A quick sanity check is to log whether each text was a cache hit or miss before returning the vector; a minimal logging variant is sketched below. For stronger verification, time each call or count outbound embedding requests with a proxy like mitmproxy.
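
Here is that hit/miss logger, reusing the helpers from steps 1–3; the print calls are only for eyeballing the second run:

async def get_embedding_logged(text: str) -> list[float]:
    cached = get_cached_embedding(text)
    if cached is not None:
        print(f"cache HIT:  {text[:40]!r}")
        return cached

    print(f"cache MISS: {text[:40]!r}")
    embedding = await embed_text(text)
    set_cached_embedding(text, embedding)
    return embedding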

Also test one modified string: change a period, or add leading or trailing whitespace. With the v2 key, whitespace-only edits should still hit the cache because .strip() normalizes them away, while any real content change should miss.
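
You can verify the keying directly, without any API call; this assumes cache_key_v2 from step 5 is in scope:

# Whitespace-only changes collapse to the same v2 key...
assert cache_key_v2("Claims must be filed within 30 days.") == \
    cache_key_v2("  Claims must be filed within 30 days.  ")

# ...while a genuine content change produces a different key.
assert cache_key_v2("Claims must be filed within 30 days.") != \
    cache_key_v2("Claims must be filed within 31 days.")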

Next Steps

  • Add batch embedding support so multiple uncached texts are sent in one API call (a sketch follows this list).
  • Store metadata alongside vectors, such as source document ID, chunk index, and embedding model version.
  • Plug this into a retrieval pipeline with cosine similarity and top-k chunk selection before handing context to an AutoGen agent (see the cosine/top-k sketch below).
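
For the batching item: the OpenAI embeddings endpoint accepts a list of inputs, so you can send only the cache misses in one call. A sketch building on the helpers above; get_embeddings_batched is a name introduced here:

async def get_embeddings_batched(texts: list[str]) -> list[list[float]]:
    # Resolve cache hits first and note which positions still need a vector.
    results = [get_cached_embedding(t) for t in texts]
    missing = [i for i, vec in enumerate(results) if vec is None]

    if missing:
        # One API call covers every uncached text; response.data preserves
        # the order of the input list.
        response = await client.embeddings.create(
            model="text-embedding-3-small",
            input=[texts[i] for i in missing],
        )
        for i, item in zip(missing, response.data):
            set_cached_embedding(texts[i], item.embedding)
            results[i] = item.embedding

    return results

And for the retrieval item, cosine similarity plus a top-k cut needs only the standard library. A rough sketch, not tuned for large corpora:

import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k_chunks(query_vec: list[float], docs: list[str],
                 doc_vecs: list[list[float]], k: int = 3) -> list[str]:
    # Score every chunk against the query and keep the k best.
    scored = sorted(zip(docs, doc_vecs),
                    key=lambda pair: cosine(query_vec, pair[1]),
                    reverse=True)
    return [doc for doc, _ in scored[:k]]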

By Cyprian Aarons, AI Consultant at Topiax.