# How to Fix 'vector search returning irrelevant results during development' in LangChain (Python)
If your LangChain vector search is returning irrelevant results during development, the retriever is usually doing exactly what you told it to do — just not what you intended. In practice, this shows up when chunking, embeddings, or metadata filters are misconfigured, and the top-k matches look semantically close but useless.
The most common symptom: you ask a question, and LangChain returns documents that share a few keywords with it but not the actual answer. You’ll often see this with a VectorStoreRetriever backed by FAISS, Chroma, or Pinecone.
## The Most Common Cause
The #1 cause is bad chunking plus weak embedding input. If you split documents too aggressively, strip structure, or embed noisy text, the vector store has nothing meaningful to work with.
A classic broken pattern is embedding raw text chunks without preserving context:
| Broken | Fixed |
|---|---|
| Split on arbitrary character counts | Split on semantic boundaries with overlap |
| Embed tiny fragments like headers alone | Keep enough surrounding context in each chunk |
| Query against low-signal chunks | Query against chunks that contain full meaning |
```python
# BROKEN
from langchain_text_splitters import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

text_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=0)
chunks = text_splitter.split_text(raw_policy_text)

embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_texts(chunks, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
docs = retriever.get_relevant_documents("What is the claims waiting period?")
```
```python
# FIXED
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=150,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = text_splitter.split_text(raw_policy_text)

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.from_texts(chunks, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 6})
# invoke() replaces the deprecated get_relevant_documents()
docs = retriever.invoke("What is the claims waiting period?")
```
Why this matters:

- Too-small chunks lose context.
- Zero overlap breaks references across chunk boundaries.
- Plain character splitting often cuts tables, clauses, and definitions in half.
- If you’re using `RetrievalQA` or `create_retrieval_chain`, garbage chunks still flow downstream and look like a retrieval failure.
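The overlap point is easy to see without LangChain at all. The sketch below uses a naive character splitter (a stand-in for `CharacterTextSplitter`, not its actual implementation) on a made-up policy sentence: with zero overlap the key clause is cut across a chunk boundary, so no single chunk contains it; with overlap, at least one chunk keeps it whole.

```python
def split_chars(text: str, size: int, overlap: int) -> list[str]:
    """Naive character splitter with optional overlap (illustrative only)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

text = (
    "Section 4.2: The claims waiting period is 30 days from the policy "
    "start date, except for accidental damage claims."
)
phrase = "waiting period is 30 days"

# Zero overlap: the clause straddles a chunk boundary, so no chunk has it.
print(any(phrase in c for c in split_chars(text, size=40, overlap=0)))   # -> False
# With overlap: one chunk contains the whole clause.
print(any(phrase in c for c in split_chars(text, size=40, overlap=20)))  # -> True
```

An embedding of a chunk that only contains half the clause will never score well against the query "What is the claims waiting period?" — which is exactly how zero-overlap splitting turns into "irrelevant results."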
## Other Possible Causes
1) Wrong embedding model for your corpus
If you indexed with one embedding model and queried with another, similarity search becomes meaningless. This happens when developers rebuild part of the pipeline and forget to reindex.
```python
# BAD: indexed with one model, queried with another
index_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
query_embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
```

Fix:

```python
# Use the same embeddings object/config across the whole retrieval lifecycle
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.from_texts(chunks, embeddings)
```
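One way to make this mistake impossible is to record the embedding model next to the index and fail fast on mismatch. This is a sketch, not a LangChain feature — `save_index_manifest`, `check_index_manifest`, and the `manifest.json` filename are all hypothetical conventions you'd pick yourself.

```python
import json
from pathlib import Path

def save_index_manifest(index_dir: str, embedding_model: str) -> None:
    """Record which embedding model built this index (hypothetical helper)."""
    Path(index_dir).mkdir(parents=True, exist_ok=True)
    manifest = {"embedding_model": embedding_model}
    (Path(index_dir) / "manifest.json").write_text(json.dumps(manifest))

def check_index_manifest(index_dir: str, embedding_model: str) -> None:
    """Raise if the query-time model differs from the indexing model."""
    manifest = json.loads((Path(index_dir) / "manifest.json").read_text())
    if manifest["embedding_model"] != embedding_model:
        raise ValueError(
            f"Index built with {manifest['embedding_model']!r}, "
            f"queried with {embedding_model!r} -- rebuild the index."
        )

save_index_manifest("faiss_index", "text-embedding-3-small")
check_index_manifest("faiss_index", "text-embedding-3-small")  # OK
# check_index_manifest("faiss_index", "text-embedding-3-large") would raise
```

Call the check once at startup, before the first query, and silent model drift becomes a loud error.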
2) You forgot to normalize metadata filters
A bad filter can silently exclude the right documents and leave only irrelevant ones. This shows up a lot with Chroma.as_retriever(search_kwargs={"filter": ...}).
```python
# BAD: filter key/value mismatch
retriever = vectorstore.as_retriever(
    search_kwargs={"k": 5, "filter": {"tenantId": "123"}}
)

# FIXED: match the stored metadata schema exactly
retriever = vectorstore.as_retriever(
    search_kwargs={"k": 5, "filter": {"tenant_id": "123"}}
)
```
If your metadata schema is inconsistent, retrieval will look random even though the index is fine.
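A cheap defense is to normalize metadata once, at index time and at query time, so both sides agree by construction. A minimal sketch, assuming you want snake_case keys and string values (`normalize_metadata` is an illustrative helper, not a LangChain API):

```python
import re

def normalize_metadata(meta: dict) -> dict:
    """Normalize keys to snake_case and values to strings so index-time
    and query-time filters always agree (illustrative helper)."""
    out = {}
    for key, value in meta.items():
        snake = re.sub(r"(?<!^)(?=[A-Z])", "_", key).lower()
        out[snake] = str(value)
    return out

print(normalize_metadata({"tenantId": 123, "DocType": "policy"}))
# -> {'tenant_id': '123', 'doc_type': 'policy'}
```

Run every document's metadata and every filter dict through the same function, and "filter silently excludes the right documents" stops being a failure mode.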
3) Your documents are not deduplicated
Duplicate chunks dominate nearest-neighbor results. You’ll think retrieval is “irrelevant,” but it’s actually returning near-identical copies of boilerplate.
```python
# Example: exact-match dedupe before indexing
seen = set()
unique_chunks = []
for chunk in chunks:
    key = chunk.strip()
    if key not in seen:
        seen.add(key)
        unique_chunks.append(chunk)
```
4) k is too small or score thresholding is too aggressive
If you only retrieve 2 docs from a noisy index, you may miss the right one entirely. The same happens when using similarity score thresholds that are too strict.
```python
# Too strict: a high threshold can filter out every relevant document
retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.85},
)
```
Try this first:
```python
retriever = vectorstore.as_retriever(search_kwargs={"k": 8})
```
Then tune thresholds after inspecting actual scores.
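"Inspecting actual scores" can be as simple as running a few queries you know the answers to, noting the scores of the correct hits (for example via `vectorstore.similarity_search_with_score`), and setting the threshold just below the weakest of them. One caveat: some stores return distances where lower is better, so check which direction your scores run before thresholding. The `pick_threshold` helper below is a hypothetical sketch for the higher-is-better case:

```python
def pick_threshold(correct_hit_scores: list[float], margin: float = 0.05) -> float:
    """Return a threshold slightly below the weakest known-correct score,
    assuming higher scores mean more similar (hypothetical helper)."""
    return min(correct_hit_scores) - margin

# Scores you observed for documents verified to be correct answers:
observed = [0.78, 0.83, 0.71, 0.80]
print(round(pick_threshold(observed), 2))  # -> 0.66
```

A threshold derived from measured scores beats a number copied from a tutorial, because score distributions vary by embedding model and corpus.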
## How to Debug It
1) Inspect raw retrieved chunks

- Print the top 5 docs before they go into your chain.
- Check whether the issue is retrieval or generation.

```python
docs = retriever.invoke("What is the claims waiting period?")
for i, doc in enumerate(docs):
    print(i, doc.page_content[:300], doc.metadata)
```

2) Test with a known-answer query

- Use a question whose answer exists in one exact document.
- If retrieval still fails, your index setup is wrong.

3) Check embedding consistency

- Confirm the same embedding model was used for indexing and querying.
- Rebuild the index after changing models.

4) Verify chunk size and overlap

- Print chunk lengths.
- Look for tiny fragments or chopped sentences.

```python
print(min(len(c) for c in chunks), max(len(c) for c in chunks))
print(chunks[:3])
```
If you see lots of sub-200 character chunks from policy PDFs or legal docs, that’s usually your problem.
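When you do find those fragments, you can repair them before indexing instead of re-tuning the splitter blindly. A minimal sketch of one approach, merging undersized chunks forward into their neighbors (`merge_small_chunks` and the 200-character floor are illustrative choices, not LangChain defaults):

```python
def merge_small_chunks(chunks: list[str], min_len: int = 200) -> list[str]:
    """Merge chunks shorter than min_len into the following chunk so tiny
    fragments (lone headers, chopped sentences) never get embedded alone."""
    merged: list[str] = []
    carry = ""
    for chunk in chunks:
        candidate = (carry + "\n" + chunk).strip() if carry else chunk
        if len(candidate) < min_len:
            carry = candidate                      # still too small, keep accumulating
        else:
            merged.append(candidate)
            carry = ""
    if carry:                                      # trailing fragment: attach to last chunk
        if merged:
            merged[-1] = merged[-1] + "\n" + carry
        else:
            merged.append(carry)
    return merged

chunks = ["SECTION 3: CLAIMS", "The claims waiting period is 30 days. " * 6]
print(len(merge_small_chunks(chunks)))  # -> 1 (header merged into its section body)
```

Merging a lone header into its section body also means the header's keywords now travel with the text they label, which helps the embedding, not just the chunk count.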
## Prevention
- Use `RecursiveCharacterTextSplitter` for most production text corpora.
- Rebuild the entire vector index whenever embeddings change.
- Add retrieval tests with fixed queries and expected source documents.
- Log retrieved document IDs, scores, and metadata during development so bad config shows up immediately.
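A retrieval test can be one small assertion helper run against fixed queries. The sketch below keeps it self-contained with a stub retriever so the shape is clear; in a real suite you'd pass your actual retriever, and names like `assert_retrieves` and the `source` metadata key are illustrative conventions:

```python
def assert_retrieves(retriever, query: str, expected_source: str, k: int = 8) -> None:
    """Fail if none of the top-k retrieved docs comes from the expected source."""
    docs = retriever.invoke(query)[:k]
    sources = [d.metadata.get("source") for d in docs]
    assert expected_source in sources, f"{expected_source!r} not in {sources!r}"

# Stub stand-ins so this sketch runs without an index:
class StubDoc:
    def __init__(self, content: str, source: str):
        self.page_content = content
        self.metadata = {"source": source}

class StubRetriever:
    def invoke(self, query: str):
        return [StubDoc("The claims waiting period is 30 days.", "policy.pdf")]

assert_retrieves(StubRetriever(), "What is the claims waiting period?", "policy.pdf")
print("retrieval test passed")
```

Run a handful of these with known-answer queries on every index rebuild, and chunking or embedding regressions fail CI instead of reaching users.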
If you want reliable LangChain retrieval, treat indexing as part of your application logic, not a preprocessing script you run once and forget. Most “irrelevant results” bugs are really data-shaping bugs hiding behind a vector database.
## Keep learning

- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.