How to Fix 'vector search returning irrelevant results when scaling' in CrewAI (Python)

By Cyprian Aarons · Updated 2026-04-22

When CrewAI starts returning irrelevant vector search results as your corpus grows, it usually means one thing: your retrieval layer is no longer matching the right chunks to the right queries. This shows up after you move from a few documents to hundreds or thousands, or when you switch embedding models, chunking rules, or vector stores without retuning retrieval.

In practice, the issue is almost never CrewAI itself. It’s usually bad chunking, stale embeddings, weak metadata filtering, or a retriever configured for small datasets.

The Most Common Cause

The #1 cause is chunking that was fine for a small dataset but breaks down at scale. If your chunks are too large, too small, or split without overlap, similarity search starts pulling in semantically noisy matches.

Here’s the broken pattern I see most often with CrewAI + RagTool + a vector store:

Broken pattern                                    Fixed pattern
Large chunks with no overlap                      Smaller chunks with overlap
Embeddings generated once, then content changes   Re-embed after every document update
Generic retrieval settings                        Tuned k, metadata filters, and chunk size
# BROKEN
from crewai import Agent, Task
from crewai_tools import RagTool
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Large chunks with no overlap: fine for a handful of docs, noisy at scale
splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=0
)

docs = splitter.split_text(open("policy.txt").read())

rag = RagTool(
    knowledge_sources=docs,
    # default retrieval settings
)

agent = Agent(
    role="Insurance Analyst",
    goal="Answer policy questions",
    backstory="You analyze insurance policies for a living.",
    tools=[rag],
)

task = Task(
    description="What does the cancellation clause say?",
    expected_output="The policy's cancellation terms",
    agent=agent,
)
# FIXED
from pathlib import Path

from crewai import Agent, Task
from crewai_tools import RagTool
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Smaller chunks with overlap keep each chunk on a single topic
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=120
)

docs = splitter.split_text(Path("policy.txt").read_text())

rag = RagTool(
    knowledge_sources=docs,
    # tune retrieval if your tool/vector store supports it
    top_k=5,
)

agent = Agent(
    role="Insurance Analyst",
    goal="Answer policy questions using only retrieved context",
    backstory="You analyze insurance policies for a living.",
    tools=[rag],
)

task = Task(
    description="What does the cancellation clause say?",
    expected_output="The cancellation clause terms, quoted from the retrieved context",
    agent=agent,
)

Why this fails at scale:

  • A 2,000-character chunk often contains multiple topics.
  • Similarity search grabs the “closest” topic inside the blob, not the exact answer.
  • With more documents, false positives increase because everything looks vaguely similar.

If you’re using RecursiveCharacterTextSplitter, start around 600–1000 characters and add overlap. Then measure retrieval quality before changing anything else.
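
A quick way to measure before you tune further: split a document and eyeball the first few chunks. A minimal sketch, assuming the same policy.txt file as above:

# Sanity check: inspect chunk sizes and boundaries after splitting
from pathlib import Path

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=120)
chunks = splitter.split_text(Path("policy.txt").read_text())

print(f"{len(chunks)} chunks")
for i, chunk in enumerate(chunks[:5]):
    # Each chunk should read as one coherent topic, not a grab bag
    print(f"--- chunk {i} ({len(chunk)} chars) ---")
    print(chunk[:120])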

Other Possible Causes

1. Stale embeddings after document updates

If you update source files but don’t re-index them, CrewAI will retrieve old vectors.

# BROKEN: source changed but index was not rebuilt
vector_store.add_documents(new_docs)  # old docs still dominate retrieval
# FIXED: rebuild or upsert consistently
vector_store.delete_collection()
vector_store.add_documents(all_current_docs)

If you’re using Chroma, Pinecone, Weaviate, or Qdrant, make sure updates are idempotent and versioned.
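
One way to make updates idempotent is to derive each chunk's ID from its source and position, so re-ingesting a changed document overwrites its stale vectors instead of piling up duplicates (if a document shrinks, delete its old IDs first). A minimal sketch, assuming a LangChain vector store whose add_documents accepts explicit ids, as the Chroma and Pinecone integrations do:

import hashlib

from langchain_core.documents import Document

def stable_id(source: str, chunk_index: int) -> str:
    # Same source + same position => same ID, so re-adding acts as an upsert
    return hashlib.sha256(f"{source}:{chunk_index}".encode("utf-8")).hexdigest()

docs = [
    Document(
        page_content="Cancellation is permitted within 14 days of purchase.",
        metadata={"source": "policy.txt", "chunk": 0, "version": 2},
    )
]
ids = [stable_id(d.metadata["source"], d.metadata["chunk"]) for d in docs]
# vector_store: any store that accepts explicit IDs on add_documents
vector_store.add_documents(docs, ids=ids)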

2. Wrong embedding model for the domain

A general-purpose embedding model can be weak on insurance policy language, banking product terms, or internal jargon.

# BROKEN: generic embeddings on domain-specific text
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# FIXED: use a stronger model and keep it consistent
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

Don’t switch models without rebuilding the index. Vector spaces are not interchangeable.
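
One cheap guard is to record the embedding model name next to the index when you build it, and refuse to query on a mismatch. A sketch; the sidecar file is a hypothetical convention, not a library feature:

import json
from pathlib import Path

INDEX_META = Path("index_meta.json")  # hypothetical sidecar written at build time
QUERY_MODEL = "text-embedding-3-large"

# At index build time:
# INDEX_META.write_text(json.dumps({"embedding_model": QUERY_MODEL}))

# At query time:
meta = json.loads(INDEX_META.read_text())
if meta["embedding_model"] != QUERY_MODEL:
    raise RuntimeError(
        f"Index was built with {meta['embedding_model']} but queries use "
        f"{QUERY_MODEL}; rebuild the index before querying."
    )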

3. Missing metadata filters

If you mix product docs, legal docs, and support articles in one index without filters, retrieval gets noisy fast.

# BROKEN: searching across everything
results = retriever.get_relevant_documents("premium refund terms")
# FIXED: scope the retriever by document type / tenant / product line
# (filter syntax varies by vector store; this is the Chroma/Pinecone style)
retriever = vector_store.as_retriever(
    search_kwargs={
        "k": 4,
        "filter": {"doc_type": "policy", "product": "life_insurance"},
    }
)
results = retriever.get_relevant_documents("premium refund terms")

In multi-tenant systems this is mandatory. Without filters, one customer’s content can pollute another customer’s results.
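
One pattern that makes the tenant filter impossible to forget: construct the retriever per request with the filter baked in. A sketch; build_tenant_retriever is a hypothetical helper, not a CrewAI or LangChain API:

def build_tenant_retriever(vector_store, tenant_id: str):
    # Every query through this retriever is scoped to a single tenant
    return vector_store.as_retriever(
        search_kwargs={
            "k": 4,
            "filter": {"tenant_id": tenant_id},
        }
    )

retriever = build_tenant_retriever(vector_store, tenant_id="acme-corp")
results = retriever.get_relevant_documents("premium refund terms")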

4. Retriever settings tuned too aggressively

A high k can flood the agent with marginal context; a k that is too low can miss the answer entirely.

# BROKEN: too many noisy chunks returned
retriever.search_kwargs = {"k": 20}
# FIXED: start small and test recall/precision tradeoff
retriever.search_kwargs = {"k": 4}

If your vector store supports MMR (maximal marginal relevance), it can help reduce duplicate-looking chunks:

retriever.search_type = "mmr"
retriever.search_kwargs = {"k": 4, "fetch_k": 20}

How to Debug It

  1. Inspect what actually got retrieved. Print raw chunks before they reach the agent. If the top result is vaguely related instead of directly relevant, this is a retrieval problem, not an LLM problem.

    docs = retriever.get_relevant_documents("cancellation clause")
    for d in docs:
        print(d.metadata)
        print(d.page_content[:300])
        print("---")
    
  2. Check whether embeddings were rebuilt. If documents changed recently and results got worse immediately after scale-up, verify indexing timestamps and document hashes.

  3. Test with one known query per document. Build a tiny evaluation set:

    • query: “What is the cancellation period?”
    • expected doc id: policy_17

    If retrieval fails on known examples, tune chunking and filters before touching prompts. (A runnable version of this check appears after this list.)

  4. Compare results across vector stores or models. If FAISS works but Pinecone doesn't (or vice versa), look at normalization settings, metadata handling, and distance metrics.

    Also check whether your store uses cosine similarity while your embeddings expect dot product behavior.
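
Here is what that evaluation set can look like as runnable code. A sketch; the queries, doc IDs, doc_id metadata field, and recall threshold are placeholders for your own data:

# Tiny retrieval test suite: one known query per document
EVAL_SET = [
    {"query": "What is the cancellation period?", "expected_doc_id": "policy_17"},
    {"query": "How are premium refunds calculated?", "expected_doc_id": "policy_08"},
]

def retrieval_recall(retriever) -> float:
    # Fraction of known queries whose expected document was retrieved
    hits = 0
    for case in EVAL_SET:
        docs = retriever.get_relevant_documents(case["query"])
        retrieved = [d.metadata.get("doc_id") for d in docs]
        if case["expected_doc_id"] in retrieved:
            hits += 1
        else:
            print(f"MISS: {case['query']!r} -> {retrieved}")
    return hits / len(EVAL_SET)

recall = retrieval_recall(retriever)
assert recall >= 0.9, f"retrieval recall dropped to {recall:.0%}"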

Prevention

  • Use chunk sizes that match the content type:

    • policies/contracts: smaller chunks with overlap
    • FAQs/short answers: slightly larger chunks are fine
  • Rebuild or upsert embeddings on every content change.

    • stale vectors are one of the fastest ways to get irrelevant matches
  • Add metadata from day one.

    • tenant_id
    • doc_type
    • product
    • version
  • Keep a small retrieval test suite.

    • run it before deploying any CrewAI change that touches ingestion or embeddings
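
Tagging that metadata at ingestion is a one-time cost. A minimal sketch using LangChain Document objects; the field values are examples, and vector_store is your existing store:

from langchain_core.documents import Document

chunks = ["Cancellation is permitted within 14 days of purchase."]

docs = [
    Document(
        page_content=chunk,
        metadata={
            "tenant_id": "acme-corp",     # example values: set these per document
            "doc_type": "policy",
            "product": "life_insurance",
            "version": "2026-04-01",
        },
    )
    for chunk in chunks
]

vector_store.add_documents(docs)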

If you want this to stay stable in production, treat retrieval like code: version it, test it, and monitor it. The moment your corpus grows past a few hundred documents without evaluation gates, irrelevant results stop being occasional noise and become your default failure mode.


By Cyprian Aarons, AI Consultant at Topiax.