How to Fix 'vector search returning irrelevant results in production' in CrewAI (Python)
What this error usually means
When CrewAI’s vector search starts returning irrelevant results in production, the retrieval layer is doing exactly what you asked — just not what you meant. In practice, this shows up when your embeddings, chunking, metadata filters, or query text are misaligned with the actual knowledge base.
You’ll usually see it after moving from local tests to real traffic: answers drift, the agent cites the wrong policy, or knowledge.search() returns content that looks semantically close but operationally useless.
The Most Common Cause
The #1 cause is bad chunking plus weak metadata. In CrewAI, if you dump large documents into a KnowledgeSource without preserving section boundaries and source metadata, your embeddings become noisy and retrieval gets fuzzy fast.
A common broken pattern is embedding huge blobs and querying with vague prompts:
| Broken pattern | Fixed pattern |
|---|---|
| One giant document chunk | Smaller chunks with structure |
| No metadata filters | Source/type/version metadata |
| Vague user query | Query rewritten for retrieval |
```python
# BROKEN
from crewai import Agent, Task, Crew
from crewai.knowledge.source.string_knowledge_source import StringKnowledgeSource

policy_text = open("policies.txt").read()  # huge unstructured blob

source = StringKnowledgeSource(content=policy_text)

agent = Agent(
    role="Claims Assistant",
    goal="Answer policy questions",
    backstory="Works with insurance policy docs",
    knowledge_sources=[source],
)

task = Task(
    description="What does the policy say about cancellations?",
    expected_output="A precise answer",
    agent=agent,
)

crew = Crew(agents=[agent], tasks=[task])
crew.kickoff()
```
```python
# FIXED
from crewai import Agent, Task, Crew
from crewai.knowledge.source.string_knowledge_source import StringKnowledgeSource

chunks = [
    {
        "content": "Cancellation terms: customer may cancel within 14 days...",
        "metadata": {"doc_type": "policy", "section": "cancellation", "version": "2024-10"},
    },
    {
        "content": "Refund terms: refunds are prorated after day 14...",
        "metadata": {"doc_type": "policy", "section": "refunds", "version": "2024-10"},
    },
]

sources = [
    StringKnowledgeSource(content=c["content"], metadata=c["metadata"])
    for c in chunks
]

agent = Agent(
    role="Claims Assistant",
    goal="Answer policy questions using retrieved policy sections only",
    backstory="Works with insurance policy docs",
    knowledge_sources=sources,
)

task = Task(
    description="Retrieve the cancellation section for a 14-day cancellation question.",
    expected_output="A precise answer with cited section",
    agent=agent,
)

crew = Crew(agents=[agent], tasks=[task])
crew.kickoff()
```
If your source text is long and unstructured, embeddings collapse multiple topics into one vector neighborhood. That’s why “cancellation” can return “refunds,” “billing,” or even unrelated underwriting language.
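One way to avoid that collapse is to split on the document's own section boundaries before embedding. Here is a minimal sketch; the `split_by_sections` helper and the `##` heading pattern are illustrative assumptions about your documents, not a CrewAI API:

```python
import re

def split_by_sections(text: str, doc_type: str, version: str) -> list[dict]:
    # Assumes sections are introduced by lines like "## Cancellation".
    # Adjust the pattern to your documents' real structure.
    parts = re.split(r"(?m)^##\s+(.+)$", text)
    chunks = []
    # re.split with a capture group yields [preamble, title1, body1, title2, body2, ...]
    for title, body in zip(parts[1::2], parts[2::2]):
        chunks.append({
            "content": body.strip(),
            "metadata": {
                "doc_type": doc_type,
                "section": title.strip().lower(),
                "version": version,
            },
        })
    return chunks

doc = (
    "## Cancellation\nCustomer may cancel within 14 days.\n"
    "## Refunds\nRefunds are prorated after day 14."
)
chunks = split_by_sections(doc, doc_type="policy", version="2024-10")
# Each chunk now carries its own section metadata.
```

Each resulting chunk maps cleanly onto one `StringKnowledgeSource`, so every embedded vector represents a single topic.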
Other Possible Causes
1) You’re querying with the wrong wording
Vector search is semantic, not magical. If production users ask short or ambiguous questions like “Can I cancel?” while your docs say “termination within cooling-off period,” retrieval quality drops.
```python
# Better: rewrite the query before retrieval
retrieval_query = (
    "Find policy language about customer cancellation rights "
    "and any cooling-off period within the insurance contract."
)
```
2) Your embedding model changed between indexing and querying
This one breaks relevance quietly. If you indexed with one embedding model and query with another, similarity scores become unstable.
```python
# Make sure index-time and query-time embeddings match exactly
EMBEDDING_MODEL = "text-embedding-3-small"
# bad: index uses one model, runtime uses another
# good: pin both to the same model/version in config
```
3) Metadata filters are missing or too broad
If your knowledge base contains policies, claims notes, FAQs, and legal docs together, unfiltered search will happily mix them.
```python
# Example filter, conceptually
filters = {
    "doc_type": "policy",
    "version": "2024-10",
}
```
Use filters when your corpus contains multiple document families. Otherwise the top hit may be technically relevant but operationally wrong.
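If your vector store does not support native metadata filtering, you can apply the same filter as a post-retrieval step. This sketch assumes hits are dicts with a `metadata` key, which is an assumption about your retriever's return type:

```python
def apply_filters(hits: list[dict], filters: dict) -> list[dict]:
    # Keep only hits whose metadata matches every filter key/value pair.
    return [
        h for h in hits
        if all(h.get("metadata", {}).get(k) == v for k, v in filters.items())
    ]

hits = [
    {"content": "Cancellation terms...", "metadata": {"doc_type": "policy", "version": "2024-10"}},
    {"content": "Claim note from 2019...", "metadata": {"doc_type": "claims_note", "version": "2019-03"}},
]
policy_hits = apply_filters(hits, {"doc_type": "policy", "version": "2024-10"})
# Only the policy chunk survives; the claims note is filtered out.
```

Post-filtering works, but remember to over-fetch (a larger top-k) before filtering, or a strict filter can leave you with zero results.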
4) Chunk size is too large or overlap is wrong
Large chunks dilute meaning. Tiny chunks lose context. Both hurt relevance.
```python
# Pseudocode for ingestion settings
chunk_size = 500   # good starting point
chunk_overlap = 80 # enough to preserve context across boundaries

# avoid:
# chunk_size = 5000
# chunk_overlap = 0
```
How to Debug It
1) Inspect the top-k raw results
- Don't debug through the agent first.
- Print retrieved chunks and scores before generation.
- If the top results are already wrong, this is a retrieval problem, not an LLM problem.

2) Compare index-time and query-time embedding config
- Check model name, dimensions, and normalization settings.
- Make sure your production container didn't pick up a different default than local dev.

3) Test with a known-good query
- Use a doc phrase copied verbatim from the source text.
- If exact phrasing works but user phrasing fails, your issue is query rewriting or semantic mismatch.

4) Narrow the corpus
- Temporarily search only one document type.
- If relevance improves immediately, add metadata filters and separate indexes by domain.
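The first step above, inspecting raw results before generation, can be a small helper. The `(chunk_text, score)` hit format here is an assumption about your vector store; adapt it to whatever your retriever actually returns:

```python
def format_top_k(query: str, hits: list[tuple[str, float]], k: int = 5) -> list[str]:
    # `hits` is assumed to be (chunk_text, score) pairs from your vector store.
    lines = [f"query: {query!r}"]
    for rank, (chunk, score) in enumerate(hits[:k], start=1):
        lines.append(f"#{rank} score={score:.3f} text={chunk[:80]!r}")
    return lines

for line in format_top_k(
    "Can I cancel?",
    [("Refund terms: refunds are prorated...", 0.81),
     ("Cancellation terms: customer may cancel...", 0.78)],
):
    print(line)
```

If a dump like this shows the refunds chunk outranking the cancellation chunk, you know the problem is retrieval, before any LLM is involved.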
A useful production check is logging both the user prompt and the rewritten retrieval query:
```python
print({
    "user_query": user_query,
    "retrieval_query": retrieval_query,
})
```
If those two differ too much, your retriever may be optimizing for fluency instead of precision.
Prevention
1) Separate indexes by domain
- Keep policies, claims notes, underwriting rules, and FAQs in different collections when possible.
- Mixed corpora are where irrelevant results start showing up first.

2) Pin embedding versions
- Treat embedding models like schema migrations.
- Changing them without reindexing is how production relevance silently degrades.

3) Add retrieval tests
- Build a small eval set with known questions and expected source sections.
- Run it in CI so you catch relevance regressions before deployment.
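A retrieval test can be as small as a list of question-to-section pairs and a pass-rate check. The hit format (dicts with a `metadata` key) and the `search_fn` signature are assumptions to adapt to your retriever:

```python
# Tiny eval set: question -> section the top hit should come from.
# Grow this from real production queries.
EVAL_SET = [
    ("Can I cancel?", "cancellation"),
    ("How are refunds calculated?", "refunds"),
]

def run_retrieval_eval(search_fn) -> float:
    # search_fn(query) returns a list of hits with a "metadata" dict;
    # adapt the accessor to your retriever's real return type.
    passed = 0
    for query, expected_section in EVAL_SET:
        hits = search_fn(query)
        top = hits[0] if hits else {}
        if top.get("metadata", {}).get("section") == expected_section:
            passed += 1
    return passed / len(EVAL_SET)
```

Wire this into CI with a threshold (for example, fail the build below 0.9) so a re-chunk or embedding change cannot silently ship a relevance regression.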
If you’re seeing CrewAI answers that look “kind of related” but consistently miss the mark, start with chunking and metadata first. In most production systems I’ve debugged, that’s where the real bug lives.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.