How to Fix 'vector search returning irrelevant results in production' in CrewAI (Python)

By Cyprian Aarons · Updated 2026-04-22

What this error usually means

When CrewAI’s vector search starts returning irrelevant results in production, the retrieval layer is doing exactly what you asked — just not what you meant. In practice, this shows up when your embeddings, chunking, metadata filters, or query text are misaligned with the actual knowledge base.

You’ll usually see it after moving from local tests to real traffic: answers drift, the agent cites the wrong policy, or knowledge.search() returns content that looks semantically close but operationally useless.

The Most Common Cause

The #1 cause is bad chunking plus weak metadata. In CrewAI, if you dump large documents into a KnowledgeSource without preserving section boundaries and source metadata, your embeddings become noisy and retrieval gets fuzzy fast.

A common broken pattern is embedding huge blobs and querying with vague prompts:

| Broken pattern | Fixed pattern |
| --- | --- |
| One giant document chunk | Smaller chunks with structure |
| No metadata filters | Source/type/version metadata |
| Vague user query | Query rewritten for retrieval |

# BROKEN
from crewai import Agent, Task, Crew
from crewai.knowledge.source.string_knowledge_source import StringKnowledgeSource

policy_text = open("policies.txt").read()  # huge unstructured blob

source = StringKnowledgeSource(content=policy_text)

agent = Agent(
    role="Claims Assistant",
    goal="Answer policy questions",
    backstory="Works with insurance policy docs",
    knowledge_sources=[source],
)

task = Task(
    description="What does the policy say about cancellations?",
    expected_output="A precise answer",
    agent=agent,
)

crew = Crew(agents=[agent], tasks=[task])
crew.kickoff()

# FIXED
from crewai import Agent, Task, Crew
from crewai.knowledge.source.string_knowledge_source import StringKnowledgeSource

chunks = [
    {
        "content": "Cancellation terms: customer may cancel within 14 days...",
        "metadata": {"doc_type": "policy", "section": "cancellation", "version": "2024-10"},
    },
    {
        "content": "Refund terms: refunds are prorated after day 14...",
        "metadata": {"doc_type": "policy", "section": "refunds", "version": "2024-10"},
    },
]

sources = [
    StringKnowledgeSource(content=c["content"], metadata=c["metadata"])
    for c in chunks
]

agent = Agent(
    role="Claims Assistant",
    goal="Answer policy questions using retrieved policy sections only",
    backstory="Works with insurance policy docs",
    knowledge_sources=sources,
)

task = Task(
    description="Retrieve the cancellation section for a 14-day cancellation question.",
    expected_output="A precise answer with cited section",
    agent=agent,
)

crew = Crew(agents=[agent], tasks=[task])
crew.kickoff()

If your source text is long and unstructured, embeddings collapse multiple topics into one vector neighborhood. That’s why “cancellation” can return “refunds,” “billing,” or even unrelated underwriting language.

Other Possible Causes

1) You’re querying with the wrong wording

Vector search is semantic, not magical. If production users ask short or ambiguous questions like “Can I cancel?” while your docs say “termination within cooling-off period,” retrieval quality drops.

# Better: rewrite the query before retrieval
retrieval_query = (
    "Find policy language about customer cancellation rights "
    "and any cooling-off period within the insurance contract."
)
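
In production you would generate that rewrite automatically rather than hardcoding it. A sketch using a plain LLM call (the client, model, and prompt here are assumptions, not CrewAI-specific API):

# Rewrite terse user questions into retrieval-friendly queries
from openai import OpenAI  # any LLM client works; OpenAI is only the example

client = OpenAI()

def rewrite_for_retrieval(user_query: str) -> str:
    # Expand short, ambiguous phrasing into vocabulary closer to the source docs
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # model choice is an assumption
        messages=[
            {
                "role": "system",
                "content": "Rewrite the user's question as a search query using "
                           "formal insurance-policy vocabulary. Return only the query.",
            },
            {"role": "user", "content": user_query},
        ],
    )
    return resp.choices[0].message.content.strip()

retrieval_query = rewrite_for_retrieval("Can I cancel?")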

2) Your embedding model changed between indexing and querying

This one breaks relevance quietly. If you indexed with one embedding model and query with another, similarity scores become unstable.

# Make sure index-time and query-time embeddings match exactly
EMBEDDING_MODEL = "text-embedding-3-small"

# bad: index uses one model, runtime uses another
# good: pin both to the same model/version in config
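
In CrewAI you can pin the embedder explicitly on the crew, then reindex whenever you change it. A minimal sketch, assuming the OpenAI provider (swap in whatever provider and config your stack actually uses):

# Pin the embedder so index-time and query-time vectors match
from crewai import Crew

crew = Crew(
    agents=[agent],
    tasks=[task],
    embedder={
        "provider": "openai",
        "config": {"model": EMBEDDING_MODEL},  # same constant used at index time
    },
)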

3) Metadata filters are missing or too broad

If your knowledge base contains policies, claims notes, FAQs, and legal docs together, unfiltered search will happily mix them.

# Example metadata filter (conceptual)
filters = {
    "doc_type": "policy",
    "version": "2024-10"
}

Use filters when your corpus contains multiple document families. Otherwise the top hit may be technically relevant but operationally wrong.
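
CrewAI's default knowledge storage is Chroma-backed, so one way to verify a filter is to query the vector store directly. A sketch against the raw chromadb client (the storage path and collection name are assumptions; check what your deployment actually uses):

import chromadb

client = chromadb.PersistentClient(path=".crewai_storage")  # path is an assumption
collection = client.get_collection("knowledge")  # collection name is hypothetical

results = collection.query(
    query_texts=["cancellation rights and cooling-off period"],
    n_results=5,
    where={"doc_type": "policy"},  # keeps FAQs and claims notes out of the hits
)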

4) Chunk size is too large or overlap is wrong

Large chunks dilute meaning. Tiny chunks lose context. Both hurt relevance.

# Pseudocode for ingestion settings
chunk_size = 500      # good starting point
chunk_overlap = 80    # enough to preserve context across boundaries

# avoid:
# chunk_size = 5000
# chunk_overlap = 0
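
CrewAI's built-in knowledge sources expose these knobs directly, so you can set them at ingestion time. A sketch using StringKnowledgeSource (the values are starting points to tune against your corpus, not fixed rules):

from crewai.knowledge.source.string_knowledge_source import StringKnowledgeSource

source = StringKnowledgeSource(
    content=policy_text,
    chunk_size=500,    # smaller chunks keep one topic per vector
    chunk_overlap=80,  # overlap preserves context across boundaries
)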

How to Debug It

  1. Inspect the top-k raw results

    • Don’t debug through the agent first.
    • Print retrieved chunks and scores before generation (see the sketch after this list).
    • If top results are already wrong, this is a retrieval problem, not an LLM problem.
  2. Compare index-time and query-time embedding config

    • Check model name, dimensions, normalization settings.
    • Make sure your production container didn’t pick up a different default than local dev.
  3. Test with a known-good query

    • Use a doc phrase copied verbatim from source text.
    • If exact phrasing works but user phrasing fails, your issue is query rewriting or semantic mismatch.
  4. Narrow the corpus

    • Temporarily search only one document type.
    • If relevance improves immediately, add metadata filters and separate indexes by domain.
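
For step 1, here is what bypassing the agent can look like, reusing the raw Chroma collection from earlier (same hypothetical names). If the top hits are already wrong at this layer, no amount of prompt tuning will fix it:

# Inspect raw top-k hits with distances before any generation happens
results = collection.query(
    query_texts=["What does the policy say about cancellations?"],
    n_results=5,
    include=["documents", "distances", "metadatas"],
)

for doc, dist, meta in zip(
    results["documents"][0], results["distances"][0], results["metadatas"][0]
):
    # Lower distance = closer match; eyeball whether sections line up
    print(f"{dist:.3f}  {meta.get('section')}  {doc[:80]}")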

A useful production check is logging both the user prompt and the rewritten retrieval query:

print({
    "user_query": user_query,
    "retrieval_query": retrieval_query,
})

If those two differ too much, your retriever may be optimizing for fluency instead of precision.

Prevention

  • Separate indexes by domain

    • Keep policies, claims notes, underwriting rules, and FAQs in different collections when possible.
    • Mixed corpora are where irrelevant results start showing up first.
  • Pin embedding versions

    • Treat embedding models like schema migrations.
    • Changing them without reindexing is how production relevance silently degrades.
  • Add retrieval tests

    • Build a small eval set with known questions and expected source sections.
    • Run it in CI so you catch relevance regressions before deployment (a minimal sketch follows).
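
A minimal retrieval test can be as small as this, assuming a search() helper that wraps your store and returns hits with metadata (both the helper and the eval cases are illustrative):

# Tiny retrieval eval: each known question must surface its expected section in top-k
EVAL_CASES = [
    ("Can I cancel within two weeks?", "cancellation"),
    ("How are refunds calculated after day 14?", "refunds"),
]

def test_retrieval_hits_expected_section():
    for question, expected_section in EVAL_CASES:
        hits = search(question, k=5)  # search() is a hypothetical wrapper over your store
        sections = [hit["metadata"]["section"] for hit in hits]
        assert expected_section in sections, (
            f"{question!r} missed {expected_section!r}; got {sections}"
        )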

If you’re seeing CrewAI answers that look “kind of related” but consistently miss the mark, start with chunking and metadata first. In most production systems I’ve debugged, that’s where the real bug lives.

