How to Fix 'vector search returning irrelevant results when scaling' in LangChain (Python)
When vector search starts returning irrelevant results as your LangChain app scales, the problem is usually not “the model got worse.” It means your retrieval pipeline changed under load: embeddings drifted, chunking got inconsistent, filters stopped matching, or the index was built with one configuration and queried with another.
This shows up after you move from a small local corpus to thousands or millions of chunks, add metadata filters, switch embedding models, or start using a different vector store backend in production.
The Most Common Cause
The #1 cause is embedding mismatch: you indexed documents with one embedding model and queried with another. In LangChain this often looks fine in code because both code paths accept any Embeddings implementation and nothing fails loudly, but the resulting vectors live in different semantic spaces.
A classic symptom is this kind of retrieval behavior:
- `similarity_search()` returns technically valid documents
- results are obviously off-topic
- similarity scores flatten out as the dataset grows
Broken vs fixed pattern
| Broken pattern | Fixed pattern |
|---|---|
| Index built with text-embedding-3-small and queries run with all-MiniLM-L6-v2 | Use the exact same embedding class/version for both indexing and querying |
| Embedding config defined in two different files | Centralize embedding creation in one factory |
| Rebuilt query service without reindexing existing vectors | Re-embed and rebuild the index when embeddings change |
```python
# BROKEN
from langchain_openai import OpenAIEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# Indexing pipeline
index_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
db = Chroma.from_documents(docs, index_embeddings, persist_directory="./chroma")

# Query service accidentally uses a different embedding model
query_embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
db = Chroma(persist_directory="./chroma", embedding_function=query_embeddings)
results = db.similarity_search("How do I file a claim?", k=5)
```
```python
# FIXED
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

def get_embeddings():
    return OpenAIEmbeddings(model="text-embedding-3-small")

embeddings = get_embeddings()

# Indexing
db = Chroma.from_documents(docs, embeddings, persist_directory="./chroma")

# Querying uses the exact same embeddings
db = Chroma(persist_directory="./chroma", embedding_function=embeddings)
results = db.similarity_search("How do I file a claim?", k=5)
```
If you changed embedding providers or model versions, assume the old index is invalid until reindexed.
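One low-effort safeguard is to record which embedding model built the index and verify it before serving queries. This is a minimal sketch, assuming you control both the ingestion job and the query service; the manifest path and field names are illustrative, not a LangChain feature.

```python
import json
from pathlib import Path

from langchain_openai import OpenAIEmbeddings

EMBEDDING_MODEL = "text-embedding-3-small"
MANIFEST = Path("./chroma/embedding_manifest.json")  # illustrative location

def get_embeddings():
    return OpenAIEmbeddings(model=EMBEDDING_MODEL)

def write_manifest():
    # Called once by the indexing job, right after the index is built.
    MANIFEST.write_text(json.dumps({"embedding_model": EMBEDDING_MODEL}))

def check_manifest():
    # Called at query-service startup: refuse to serve if the index was
    # built with a different embedding model than the one used to query.
    recorded = json.loads(MANIFEST.read_text())["embedding_model"]
    if recorded != EMBEDDING_MODEL:
        raise RuntimeError(
            f"Index was built with {recorded!r} but this service queries "
            f"with {EMBEDDING_MODEL!r}; reindex before serving traffic."
        )
```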
Other Possible Causes
1. Chunking is too large or too small
Bad chunk boundaries destroy semantic locality. If chunks are huge, each vector becomes noisy. If chunks are tiny, you lose context and retrieve fragments that look similar but answer nothing.
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=0,
)
```
A safer default for many enterprise docs:
```python
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=120,
)
```
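Before committing to a full reindex, it can help to check what the splitter actually produces on your corpus. A minimal sketch, assuming docs is the list of loaded Document objects used at ingestion time.

```python
# Quick sanity check on chunk sizes before reindexing.
chunks = splitter.split_documents(docs)
lengths = sorted(len(c.page_content) for c in chunks)

print(f"chunks: {len(chunks)}")
print(f"min/median/max length: {lengths[0]} / {lengths[len(lengths) // 2]} / {lengths[-1]}")

# Very short fragments usually come from bad boundaries (headers, tables, code)
# rather than real content; inspect a few before indexing them.
tiny = [c for c in chunks if len(c.page_content) < 100]
print(f"chunks under 100 chars: {len(tiny)}")
```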
2. Metadata filters are too broad or wrong type
This happens a lot with vectorstore.similarity_search(..., filter=...). If your filter values don’t match stored metadata types exactly, you get partial matches or empty retrieval paths that fall back to irrelevant docs.
```python
# BROKEN: stored as int, queried as string
results = db.similarity_search(
    "policy renewal",
    k=5,
    filter={"tenant_id": "42"}
)
```
```python
# FIXED: match the stored type exactly
results = db.similarity_search(
    "policy renewal",
    k=5,
    filter={"tenant_id": 42}
)
```
Also verify whether your backend expects Mongo-style filters, simple equality filters, or nested operators.
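As a concrete example of that difference, Chroma accepts both flat equality filters and Mongo-style operators, while other integrations may support only one form, so check your backend's documentation. The snippet below sketches both forms against the Chroma store from earlier; the doc_type field is illustrative.

```python
# Flat equality form (supported by several backends, including Chroma)
results = db.similarity_search("policy renewal", k=5, filter={"tenant_id": 42})

# Chroma also accepts Mongo-style operators; combine conditions with $and / $or.
results = db.similarity_search(
    "policy renewal",
    k=5,
    filter={
        "$and": [
            {"tenant_id": {"$eq": 42}},
            {"doc_type": {"$eq": "policy"}},  # illustrative metadata field
        ]
    },
)
```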
3. You are using plain similarity search when you need MMR or reranking
At scale, nearest-neighbor search often returns near-duplicates. That feels like “irrelevant results” because top-k is dominated by repeated chunks from the same source.
```python
# BROKEN for repetitive corpora
docs = retriever.get_relevant_documents(query)

# BETTER: diversify results
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 5, "fetch_k": 20}
)
docs = retriever.get_relevant_documents(query)
```
If you need precision on long-tail queries, add a reranker after retrieval.
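One way to add that in LangChain is a cross-encoder reranker wrapped in a ContextualCompressionRetriever. This is a minimal sketch, assuming the langchain, langchain-community, and sentence-transformers packages are installed and that a local cross-encoder such as BAAI/bge-reranker-base is acceptable for your data.

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# A cross-encoder scores (query, document) pairs jointly, which is slower but
# more precise than the bi-encoder embeddings used for the initial search.
reranker = CrossEncoderReranker(
    model=HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base"),
    top_n=5,
)

# Over-fetch candidates from the vector store, then keep only the
# top_n documents the reranker considers most relevant.
rerank_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20}),
)

docs = rerank_retriever.get_relevant_documents("How do I file a claim?")
```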
4. Your index is stale after document updates
If documents changed but embeddings were not refreshed, retrieval quality degrades quietly. This is common when using background ingestion jobs that append new docs without deleting old chunks.
```python
# BROKEN: new content added but old vectors still remain
vectorstore.add_documents(new_docs)

# FIXED: upsert or rebuild consistently depending on backend support
vectorstore.delete(ids=old_ids)
vectorstore.add_documents(new_docs)
```
For many teams, nightly full reindexing is safer than trying to patch stale chunks incrementally.
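If you want to keep incremental ingestion, LangChain's indexing API pairs a record manager with the vector store so that re-running ingestion updates changed chunks and removes stale ones. A minimal sketch, assuming a SQLite-backed record manager and documents that carry a source metadata key; the namespace and database path are illustrative.

```python
from langchain.indexes import SQLRecordManager, index

# The record manager tracks which document hashes are already in the store.
record_manager = SQLRecordManager(
    namespace="chroma/claims-docs",              # illustrative namespace
    db_url="sqlite:///ingestion_records.db",
)
record_manager.create_schema()

# "incremental" cleanup deletes old versions of chunks whose source changed;
# unchanged chunks are skipped instead of being re-embedded.
result = index(
    new_docs,
    record_manager,
    vectorstore,
    cleanup="incremental",
    source_id_key="source",
)
print(result)  # counts of added / updated / skipped / deleted chunks
```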
How to Debug It
- Check whether indexing and querying use the same embedding class
  - Print the model name in both code paths.
  - Confirm you did not swap providers during a deploy.
  - If the model changed, reindex immediately.
- Inspect raw retrieved documents
  - Log `doc.page_content[:200]` and `doc.metadata`.
  - If top results share keywords but miss intent, chunking is likely wrong.
  - If metadata looks missing or malformed, your filters are suspect.
- Compare similarity scores across known-good queries
  - Run a small test set of 10 queries against expected docs (see the sketch after this list).
  - Watch for flat scores like 0.71, 0.70, 0.69 across unrelated content.
  - That usually points to bad embeddings or noisy chunks.
- Test without filters and with a larger `fetch_k`
  - Remove metadata filters temporarily.
  - Increase the candidate pool, for example from `retriever = vectorstore.as_retriever(search_kwargs={"k": 5})` to `retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 5, "fetch_k": 50})`.
  - If quality improves without filters, your issue is filter logic or backend query translation.
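That small test set is worth automating. Below is a minimal sketch of a retrieval regression check, assuming db is the vector store from earlier; the expected_sources mapping and the source metadata key are illustrative.

```python
# Hypothetical golden set: query -> metadata "source" values that must appear in top-k.
expected_sources = {
    "How do I file a claim?": {"claims_handbook.pdf"},
    "When does my policy renew?": {"policy_terms.pdf"},
}

def retrieval_regression(db, k=5):
    failures = []
    for query, expected in expected_sources.items():
        docs = db.similarity_search(query, k=k)
        retrieved = {d.metadata.get("source") for d in docs}
        if not expected & retrieved:
            failures.append((query, retrieved))
    return failures

for query, retrieved in retrieval_regression(db):
    print(f"MISS: {query!r} retrieved {retrieved}")
```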
Prevention
- Use one embedding factory shared by ingestion and query services.
- Store the embedding model name/version alongside every index build.
- Add regression tests with known query → expected document pairs before every deployment.
- Rebuild indexes whenever you change chunk size, overlap, or embedding model.
- Prefer MMR or reranking for large corpora with repetitive content.
If you want one rule to keep in mind: irrelevant retrieval at scale is usually an indexing consistency problem first, not a LangChain bug. Start by checking embeddings and chunking before touching prompts or LLM parameters.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.