How to Fix 'vector search returning irrelevant results in production' in AutoGen (Python)

By Cyprian Aarons · Updated 2026-04-22

When vector search starts returning irrelevant results in production, the issue is usually not that “the model got worse.” It means your retrieval pipeline is feeding the LLM the wrong chunks, the wrong embeddings, or both. In AutoGen Python setups, this typically shows up after you move from a local notebook to a real service with persistent data, larger corpora, and mixed document formats.

The failure mode is predictable: queries look semantically close, but the retrieved context is off-topic, stale, or too broad. In AutoGen terms, you’ll often see the agent answer with confidence while your retrieve_docs() output clearly contains unrelated chunks.

The Most Common Cause

The #1 cause is embedding mismatch: you indexed documents with one embedding model and queried with another, or you changed chunking/tokenization after ingestion without reindexing.

This happens a lot when people use autogen.retrieve_utils or a custom VectorDB wrapper and later swap models in production. The index still “works,” but similarity scores become garbage.

Broken vs fixed pattern

  • Broken: index built with one embedding model, query uses another. Fixed: use the same embedding model for indexing and querying.
  • Broken: chunking changed after ingestion. Fixed: rebuild the index after changing chunking.
  • Broken: no metadata/versioning on vectors. Fixed: store the embedding model and chunk schema version alongside the collection.
# BROKEN: index and query embeddings don't match
from autogen.agentchat.contrib.retrieve_assistant_agent import RetrieveAssistantAgent

# Indexed earlier with text-embedding-3-small
assistant = RetrieveAssistantAgent(
    name="rag_assistant",
    retrieve_config={
        "task": "qa",
        "vector_db": "chromadb",
        "collection_name": "policies_v1",
        "embedding_model": "text-embedding-3-large",  # changed later
    },
)

# Query now hits vectors created with a different model
answer = assistant.generate_reply(messages=[{"role": "user", "content": "What is the claims SLA?"}])
# FIXED: keep ingestion/query embedding config identical
from autogen.agentchat.contrib.retrieve_assistant_agent import RetrieveAssistantAgent

EMBEDDING_MODEL = "text-embedding-3-small"
COLLECTION_NAME = "policies_v2"
CHUNK_SIZE = 800
CHUNK_OVERLAP = 120

assistant = RetrieveAssistantAgent(
    name="rag_assistant",
    retrieve_config={
        "task": "qa",
        "vector_db": "chromadb",
        "collection_name": COLLECTION_NAME,
        "embedding_model": EMBEDDING_MODEL,
        "chunk_token_size": CHUNK_SIZE,
        "chunk_overlap_size": CHUNK_OVERLAP,
    },
)

If you changed any of these after the initial ingest, delete and rebuild the collection. Don’t keep old vectors around and hope similarity search will recover.

Other Possible Causes

1) Bad chunking strategy

If chunks are too large, retrieval pulls in noisy context. If they’re too small, you lose semantic meaning and get random matches.

# Too large: one chunk per long document
retrieve_config = {
    "chunk_token_size": 4000,
    "chunk_overlap_size": 0,
}

# Better for policy/docs/FAQs
retrieve_config = {
    "chunk_token_size": 600,
    "chunk_overlap_size": 100,
}

For insurance or banking docs, I usually start around 500–900 tokens with overlap. Then I validate retrieval quality on a small labeled set before shipping.
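That validation step can be as simple as a hit-rate check. This is a hedged sketch: `search_fn` stands in for whatever query call your backend exposes, and the gold queries/doc names are made up for illustration.

```python
# Each gold example maps a query to the source document its answer lives in.
GOLD_QUERIES = {
    "What is the claims SLA?": "claims_policy.pdf",
    "How do I dispute a charge?": "disputes_faq.md",
}

def hit_rate_at_k(search_fn, gold, k=5):
    """Fraction of gold queries whose expected source appears in the top-k results."""
    hits = 0
    for query, expected_doc in gold.items():
        results = search_fn(query, top_k=k)
        if any(r["metadata"].get("source") == expected_doc for r in results):
            hits += 1
    return hits / len(gold)
```

If the hit rate drops when you change chunk sizes, the new chunking is hurting retrieval regardless of how the generated answers read.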

2) Stale or mixed collections

A production collection that contains old embeddings from previous schemas will return junk even if your code is correct.

# BAD: reusing the same collection across schema changes
retrieve_config = {
    "vector_db": "chromadb",
    "collection_name": "customer_support_docs",  # reused forever
}

Use versioned collections:

SCHEMA_VERSION = 3

retrieve_config = {
    "vector_db": "chromadb",
    "collection_name": f"customer_support_docs_v{SCHEMA_VERSION}",
}

If you must keep one collection name, add metadata filters for schema version:

query_filter = {"schema_version": {"$eq": 3}}

3) Query rewriting is damaging intent

AutoGen agents can rewrite user questions before retrieval. If that rewrite becomes too verbose or generic, semantic search drifts.

# Example: overly aggressive query transformation upstream
query = f"Answer the user's request comprehensively: {user_message}"

Use the raw user intent for retrieval and let generation happen later:

query = user_message.strip()
retrieved_docs = vector_db.search(query)

In AutoGen workflows, keep retrieval queries short and literal. Save elaboration for the assistant response stage.

4) Wrong distance metric or index settings

If your backend uses cosine similarity but your vectors weren’t normalized correctly, ranking gets unstable.

# Example config issue in a custom FAISS setup
import faiss
import numpy as np

dim = 1536  # must match your embedding model's output dimension
embeddings = np.random.rand(1000, dim).astype("float32")  # placeholder vectors

index = faiss.IndexFlatIP(dim)   # inner product; only matches cosine for unit vectors
# Adding raw, unnormalized embeddings here -> bad ranking. Normalize first:
faiss.normalize_L2(embeddings)
index.add(embeddings)

For Chroma/Pinecone/Qdrant-style setups, verify:

  • metric matches embedding assumptions
  • normalization is consistent
  • top_k is tuned sensibly (too high drags in noise; too low drops the chunk you need)
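A framework-agnostic sanity check for the first two points: inner-product ranking only behaves like cosine similarity when vectors are unit-length, so verify norms before adding vectors to an IP index. Shapes and sizes here are illustrative.

```python
import numpy as np

def assert_unit_norm(vectors, tol=1e-3):
    """Raise if any vector is not (approximately) unit-normalized."""
    norms = np.linalg.norm(vectors, axis=1)
    bad = int(np.sum(np.abs(norms - 1.0) > tol))
    if bad:
        raise ValueError(
            f"{bad} of {len(vectors)} vectors are not unit-normalized; "
            "inner-product ranking will not behave like cosine similarity"
        )

embs = np.random.randn(100, 384).astype("float32")
embs /= np.linalg.norm(embs, axis=1, keepdims=True)  # normalize before indexing
assert_unit_norm(embs)
```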

How to Debug It

  1. Print the top-k retrieved chunks

    • Don’t trust the agent response.
    • Inspect raw retrieval output:
    results = vector_db.search("claims SLA", top_k=5)
    for r in results:
        print(r["score"], r["text"][:300], r.get("metadata"))
    
  2. Compare embedding versions

    • Confirm ingest-time and query-time configs match.
    • Check model name, dimension, tokenizer/chunk settings.
    • If they differ, rebuild the index.
  3. Test with a known-good query

    • Use a question whose source document is obvious.
    • If retrieval still fails, the problem is indexing/configuration.
    • If only some queries fail, it’s likely chunking or query rewriting.
  4. Inspect AutoGen logs

    • Turn on verbose logging around retrieval.
    • Look for RetrieveAssistantAgent, retrieve_config, and backend warnings.
    • Common signals include empty results, low similarity scores, or repeated fallback answers like:
      • No relevant documents found
      • Retrieved context is empty
      • The assistant could not find supporting information in the knowledge base
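The config comparison in step 2 can be mechanized instead of eyeballed. A sketch, with illustrative config dicts standing in for whatever you persist at ingest time:

```python
# Record the ingest-time config once, then diff it against the query-time
# config before every deploy.
INGEST_CONFIG = {
    "embedding_model": "text-embedding-3-small",
    "chunk_token_size": 800,
    "chunk_overlap_size": 120,
}

QUERY_CONFIG = {
    "embedding_model": "text-embedding-3-large",  # drifted after a model swap
    "chunk_token_size": 800,
    "chunk_overlap_size": 120,
}

drift = {k: (INGEST_CONFIG[k], QUERY_CONFIG[k])
         for k in INGEST_CONFIG
         if INGEST_CONFIG[k] != QUERY_CONFIG[k]}
if drift:
    print(f"Embedding config drift detected, rebuild the index: {drift}")
```

Any non-empty drift means the collection must be rebuilt, not patched.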

Prevention

  • Version your embeddings like code:
    • store embedding_model, chunk_token_size, overlap, and corpus version alongside every collection.
  • Rebuild indexes whenever you change:
    • embedding model
    • chunking strategy
    • preprocessing rules
  • Add retrieval regression tests:
    • maintain 10–20 gold queries with expected source docs and assert top-k hits before deploy.
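One way to make "version your embeddings like code" concrete is a manifest written next to the collection at ingest time and checked at query time. A stdlib-only sketch; the file name and config keys are illustrative.

```python
import hashlib
import json
from pathlib import Path

EMBED_CONFIG = {
    "embedding_model": "text-embedding-3-small",
    "chunk_token_size": 800,
    "chunk_overlap_size": 120,
    "corpus_version": "2026-04",
}

def config_fingerprint(cfg):
    """Stable hash of the config; any change means the index must be rebuilt."""
    return hashlib.sha256(json.dumps(cfg, sort_keys=True).encode()).hexdigest()[:12]

# Write the manifest alongside the collection at ingest time.
manifest = Path("policies_v2.manifest.json")
manifest.write_text(
    json.dumps({**EMBED_CONFIG, "fingerprint": config_fingerprint(EMBED_CONFIG)}, indent=2)
)
```

At query time, recompute the fingerprint from the live config and refuse to serve if it doesn't match the manifest.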

If you’re using AutoGen in production, treat vector search as an indexed data pipeline, not a magical memory layer. Most “irrelevant results” bugs come from stale vectors or mismatched embeddings, and both are fixable once you inspect what’s actually being retrieved instead of what the agent says it retrieved.



By Cyprian Aarons, AI Consultant at Topiax.
