How to Fix 'vector search returning irrelevant results in production' in AutoGen (Python)
When vector search starts returning irrelevant results in production, the issue is usually not that the model got worse: your retrieval pipeline is feeding the LLM the wrong chunks, the wrong embeddings, or both. In AutoGen Python setups, this typically shows up after you move from a local notebook to a real service with persistent data, larger corpora, and mixed document formats.
The failure mode is predictable: queries look semantically close, but the retrieved context is off-topic, stale, or too broad. In AutoGen terms, you’ll often see the agent answer with confidence while your retrieve_docs() output clearly contains unrelated chunks.
The Most Common Cause
The #1 cause is embedding mismatch: you indexed documents with one embedding model and queried with another, or you changed chunking/tokenization after ingestion without reindexing.
This happens a lot when people use autogen.retrieve_utils or a custom VectorDB wrapper and later swap models in production. The index still “works,” but similarity scores become garbage.
Broken vs fixed pattern
| Broken pattern | Fixed pattern |
|---|---|
| Index built with one embedding model, query uses another | Same embedding model for indexing and querying |
| Chunking changed after ingestion | Rebuild index after changing chunking |
| No metadata/versioning on vectors | Store embedding model + chunk schema version |
# BROKEN: index and query embeddings don't match
from autogen import AssistantAgent
from autogen.agentchat.contrib.retrieve_user_proxy_agent import RetrieveUserProxyAgent

assistant = AssistantAgent(name="rag_assistant", llm_config={"model": "gpt-4o"})

# Collection was indexed earlier with text-embedding-3-small
rag_proxy = RetrieveUserProxyAgent(
    name="rag_proxy",
    human_input_mode="NEVER",
    retrieve_config={
        "task": "qa",
        "vector_db": "chroma",
        "collection_name": "policies_v1",
        "embedding_model": "text-embedding-3-large",  # changed later
    },
)

# Query now hits vectors created with a different model
rag_proxy.initiate_chat(
    assistant,
    message=rag_proxy.message_generator,
    problem="What is the claims SLA?",
)
# FIXED: keep ingestion/query embedding config identical
from autogen.agentchat.contrib.retrieve_user_proxy_agent import RetrieveUserProxyAgent

EMBEDDING_MODEL = "text-embedding-3-small"
COLLECTION_NAME = "policies_v2"
CHUNK_SIZE = 800
CHUNK_OVERLAP = 120

rag_proxy = RetrieveUserProxyAgent(
    name="rag_proxy",
    human_input_mode="NEVER",
    retrieve_config={
        "task": "qa",
        "vector_db": "chroma",
        "collection_name": COLLECTION_NAME,
        "embedding_model": EMBEDDING_MODEL,
        "chunk_token_size": CHUNK_SIZE,
        "chunk_overlap_size": CHUNK_OVERLAP,
    },
)
If you changed any of these after the initial ingest, delete and rebuild the collection. Don’t keep old vectors around and hope similarity search will recover.
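If you're on Chroma, the rebuild is only a few lines. Here's a minimal sketch, assuming chromadb's PersistentClient and an illustrative local path; I also stash the ingest config in the collection metadata so query-time code can verify it later:

# Minimal rebuild sketch (chromadb assumed; path and names illustrative)
import chromadb

EMBEDDING_MODEL = "text-embedding-3-small"
CHUNK_SIZE = 800
CHUNK_OVERLAP = 120

client = chromadb.PersistentClient(path="./chroma_store")

# Drop the stale collection instead of mixing old and new vectors
try:
    client.delete_collection("policies_v1")
except Exception:
    pass  # collection didn't exist

# Record the ingest config so query-time code can check for drift
collection = client.create_collection(
    name="policies_v2",
    metadata={
        "embedding_model": EMBEDDING_MODEL,
        "chunk_token_size": CHUNK_SIZE,
        "chunk_overlap_size": CHUNK_OVERLAP,
    },
)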
Other Possible Causes
1) Bad chunking strategy
If chunks are too large, retrieval pulls in noisy context. If they’re too small, you lose semantic meaning and get random matches.
# Too large: one chunk per long document
retrieve_config = {
    "chunk_token_size": 4000,
    "chunk_overlap_size": 0,
}

# Better for policy docs and FAQs
retrieve_config = {
    "chunk_token_size": 600,
    "chunk_overlap_size": 100,
}
For insurance or banking docs, I usually start around 500–900 tokens with overlap. Then I validate retrieval quality on a small labeled set before shipping.
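To validate, I keep a handful of labeled (query, expected source) pairs and measure top-k hit rate after every chunking change. A minimal sketch; vector_db.search stands in for whatever query call your backend exposes, and the gold pairs are illustrative:

# Quick retrieval sanity check on a small labeled set
GOLD_QUERIES = [
    ("What is the claims SLA?", "claims_policy.md"),
    ("How do I dispute a charge?", "disputes_faq.md"),
]

def hit_rate(vector_db, top_k=5):
    hits = 0
    for query, expected_source in GOLD_QUERIES:
        results = vector_db.search(query, top_k=top_k)
        sources = [r.get("metadata", {}).get("source") for r in results]
        hits += expected_source in sources
    return hits / len(GOLD_QUERIES)

Rerun it for each candidate chunk size; a sudden drop after a chunking change tells you the index and your queries no longer agree.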
2) Stale or mixed collections
A production collection that contains old embeddings from previous schemas will return junk even if your code is correct.
# BAD: reusing the same collection across schema changes
retrieve_config = {
    "vector_db": "chroma",
    "collection_name": "customer_support_docs",  # reused forever
}
Use versioned collections:
SCHEMA_VERSION = 3

retrieve_config = {
    "vector_db": "chroma",
    "collection_name": f"customer_support_docs_v{SCHEMA_VERSION}",
}
If you must keep one collection name, add metadata filters for schema version:
query_filter = {"schema_version": {"$eq": 3}}
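With Chroma, that filter plugs straight into the query call. A sketch, assuming collection is your chromadb collection handle and that every chunk was ingested with a schema_version metadata field:

# Only retrieve vectors written under the current schema
results = collection.query(
    query_texts=["What is the claims SLA?"],
    n_results=5,
    where={"schema_version": {"$eq": 3}},
)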
3) Query rewriting is damaging intent
AutoGen agents can rewrite user questions before retrieval. If that rewrite becomes too verbose or generic, semantic search drifts.
# Example: overly aggressive query transformation upstream
query = f"Answer the user's request comprehensively: {user_message}"
Use the raw user intent for retrieval and let generation happen later:
query = user_message.strip()
retrieved_docs = vector_db.search(query)
In AutoGen workflows, keep retrieval queries short and literal. Save elaboration for the assistant response stage.
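One pattern that works well: retrieve with the raw intent, then elaborate only when building the generation prompt. A sketch with vector_db and llm_call as stand-ins for your stack:

def answer(user_message: str, vector_db, llm_call) -> str:
    query = user_message.strip()            # raw intent for retrieval
    docs = vector_db.search(query, top_k=5)
    context = "\n\n".join(d["text"] for d in docs)
    prompt = (
        "Answer the user's question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {user_message}"
    )
    return llm_call(prompt)                 # elaboration happens here, not in the query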
4) Wrong distance metric or index settings
If your backend uses cosine similarity but your vectors weren’t normalized correctly, ranking gets unstable.
# Example issue in a custom FAISS setup
import faiss

index = faiss.IndexFlatIP(dim)  # inner product assumes unit-norm vectors

# BROKEN: adding raw embeddings -> inner product is not cosine -> bad ranking
# index.add(embeddings)

# FIXED: L2-normalize in place before adding (and before every query)
faiss.normalize_L2(embeddings)
index.add(embeddings)
For Chroma/Pinecone/Qdrant-style setups, verify:
- the distance metric matches your embedding model's assumptions
- normalization is consistent between ingest and query
- top_k isn't too high or too low
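A quick way to check the normalization assumption is to look at the norms of the vectors you actually indexed. A sketch reusing the embeddings array from the FAISS example above:

import numpy as np

# With inner-product/cosine indexes, stored vectors should be unit-norm
norms = np.linalg.norm(embeddings, axis=1)
print("min/max vector norm:", norms.min(), norms.max())
# Both should be ~1.0; anything else means normalization was skipped
# or applied inconsistently between ingest and query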
How to Debug It
1. Print the top-k retrieved chunks. Don't trust the agent response; inspect the raw retrieval output directly:

results = vector_db.search("claims SLA", top_k=5)
for r in results:
    print(r["score"], r["text"][:300], r.get("metadata"))

2. Compare embedding versions. Confirm that ingest-time and query-time configs match: model name, vector dimension, tokenizer and chunk settings. If they differ, rebuild the index (a query-time guard sketch follows this list).

3. Test with a known-good query. Use a question whose source document is obvious. If retrieval still fails, the problem is indexing or configuration; if only some queries fail, it's likely chunking or query rewriting.

4. Inspect AutoGen logs. Turn on verbose logging around retrieval and look for RetrieveUserProxyAgent/RetrieveAssistantAgent, retrieve_config, and backend warnings. Common signals include empty results, low similarity scores, or repeated fallback answers like:
- "No relevant documents found"
- "Retrieved context is empty"
- "The assistant could not find supporting information in the knowledge base"
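For step 2, a cheap guard is to verify the ingest config at query time. A sketch, assuming you stored the config in the collection metadata as in the rebuild example earlier; the expected values are illustrative:

# Fail fast on config drift instead of serving junk retrievals
QUERY_CONFIG = {
    "embedding_model": "text-embedding-3-small",
    "chunk_token_size": 800,
    "chunk_overlap_size": 120,
}

ingest_config = collection.metadata or {}
for key, expected in QUERY_CONFIG.items():
    actual = ingest_config.get(key)
    if actual != expected:
        raise RuntimeError(
            f"Retrieval config drift on {key!r}: index has {actual!r}, "
            f"query expects {expected!r}. Rebuild the collection."
        )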
Prevention
- Version your embeddings like code: store embedding_model, chunk_token_size, overlap, and corpus version alongside every collection.
- Rebuild indexes whenever you change the embedding model, the chunking strategy, or preprocessing rules.
- Add retrieval regression tests: maintain 10–20 gold queries with expected source docs and assert top-k hits before deploy (a sketch follows below).
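The regression test can be a few lines of pytest. A sketch, assuming a vector_db fixture that wraps your store; the gold pairs are illustrative:

import pytest

GOLD = [
    ("What is the claims SLA?", "claims_policy.md"),
    ("What documents do I need for a home claim?", "home_claims_faq.md"),
]

@pytest.mark.parametrize("query,expected_source", GOLD)
def test_gold_query_hits_expected_doc(vector_db, query, expected_source):
    results = vector_db.search(query, top_k=5)
    sources = {r.get("metadata", {}).get("source") for r in results}
    assert expected_source in sources, f"top-5 miss for {query!r}"

Wire it into CI so an embedding or chunking change can't ship without passing the gold set.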
If you’re using AutoGen in production, treat vector search as an indexed data pipeline, not a magical memory layer. Most “irrelevant results” bugs come from stale vectors or mismatched embeddings, and both are fixable once you inspect what’s actually being retrieved instead of what the agent says it retrieved.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.