How to Fix 'vector search returning irrelevant results when scaling' in LlamaIndex (Python)
When “vector search returning irrelevant results when scaling” shows up in a LlamaIndex app, it usually means your retrieval quality was fine in local tests and then fell apart once the corpus, chunk count, or ingestion volume grew. The common pattern is simple: the index was built one way but queried another, so similarity search starts returning weak matches instead of the chunks you expect.
In LlamaIndex Python apps, this usually surfaces as “my top-k results look random”, “the retriever returns unrelated chunks”, or hallucination-heavy downstream answers after you add more documents. The root cause is almost always chunking, embeddings, metadata filters, or index persistence.
The Most Common Cause
The #1 cause is inconsistent ingestion and retrieval settings, especially chunking. If you indexed with one chunk_size and later changed it, or if your documents are getting split into tiny fragments at scale, vector similarity gets noisy fast.
Here’s the broken pattern:
# WRONG: indexing with tiny chunks and default settings that drift over time
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex

Settings.chunk_size = 128  # Too small for most enterprise docs

docs = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("What is our claims escalation policy?")
print(response)
And here’s the fixed version:
# RIGHT: control chunking and reuse the same ingestion pipeline consistently
from llama_index.core import VectorStoreIndex, Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import SimpleDirectoryReader
Settings.chunk_size = 1024
Settings.chunk_overlap = 128
Settings.node_parser = SentenceSplitter(
    chunk_size=Settings.chunk_size,
    chunk_overlap=Settings.chunk_overlap,
)
docs = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine(similarity_top_k=10)
response = query_engine.query("What is our claims escalation policy?")
print(response)
The important part is not the exact numbers. It’s that your document segmentation stays stable as you scale. If you ingest 10 files today and 100k files next month with different splitting behavior, retrieval quality will drift.
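One way to keep segmentation stable is to pin the ingestion settings in a single place in code and fingerprint them, so any drift between index builds is detectable instead of silent. A minimal sketch; `IngestConfig` and `fingerprint` are illustrative helpers invented here, not LlamaIndex APIs:

```python
# Sketch: pin chunking/embedding settings and fingerprint them so a
# changed config is caught before it silently degrades retrieval.
# IngestConfig and fingerprint() are illustrative, not LlamaIndex APIs.
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class IngestConfig:
    chunk_size: int = 1024
    chunk_overlap: int = 128
    embed_model: str = "text-embedding-3-small"


def fingerprint(cfg: IngestConfig) -> str:
    # Stable hash of the config; store it next to the index at build time
    # and compare it at query time before serving traffic.
    payload = json.dumps(asdict(cfg), sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()[:12]


cfg = IngestConfig()
print(fingerprint(cfg))
```

If the stored fingerprint and the current one differ, you know the corpus needs a rebuild rather than guessing from retrieval quality.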
Other Possible Causes
1) Embedding model mismatch between ingestion and query time
If you re-embed queries with a different model than the one used to build the index, similarity scores become meaningless.
# WRONG: index built with one embedding model, query path changed later
from llama_index.core import Settings, VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
index = VectorStoreIndex.from_documents(docs)

# Later, in another process:
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-large")
query_engine = index.as_query_engine()
Fix: keep the same embedding model/version for both indexing and querying.
embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.embed_model = embed_model
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()
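To enforce this across processes, one option is to record the embedding model name next to the persisted index and refuse to query when it differs. A stdlib-only sketch; the `manifest.json` sidecar is a convention invented here, not a file LlamaIndex manages for you:

```python
# Sketch: record the embedding model name at index-build time and verify
# it at query time. The manifest.json sidecar is our own convention,
# not something LlamaIndex writes.
import json
from pathlib import Path


def write_manifest(persist_dir: str, model_name: str) -> None:
    Path(persist_dir).mkdir(parents=True, exist_ok=True)
    (Path(persist_dir) / "manifest.json").write_text(
        json.dumps({"embed_model": model_name})
    )


def check_manifest(persist_dir: str, model_name: str) -> None:
    stored = json.loads((Path(persist_dir) / "manifest.json").read_text())
    if stored["embed_model"] != model_name:
        raise RuntimeError(
            f"Index built with {stored['embed_model']!r}, "
            f"queried with {model_name!r}"
        )
```

Call `write_manifest` right after persisting the index and `check_manifest` before constructing the query-time embed model; a mismatch then fails loudly at startup instead of degrading retrieval quietly.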
2) You are not persisting and reloading the same index correctly
A common scaling bug is rebuilding from raw docs on every deploy instead of loading the persisted vector store.
# WRONG: rebuilds everything every time; easy to drift across environments
index = VectorStoreIndex.from_documents(docs)
# RIGHT: persist once, reload consistently
from llama_index.core import StorageContext, load_index_from_storage

# One-time (or per-ingest) build-and-persist step:
index = VectorStoreIndex.from_documents(docs)
index.storage_context.persist(persist_dir="./storage")

# Every other process loads the persisted index instead of rebuilding:
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
If your production app uses a different data snapshot than staging, irrelevant results are expected.
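A small guard that keeps all environments honest is deciding build-vs-load from the persisted files themselves. Minimal sketch, assuming the default persist layout in which `StorageContext.persist()` writes a `docstore.json` into the persist directory:

```python
# Sketch: decide build-vs-load from what is on disk instead of
# rebuilding on every deploy. Assumes the default persist layout,
# where persisting the index writes docstore.json into persist_dir.
from pathlib import Path


def needs_build(persist_dir: str = "./storage") -> bool:
    return not (Path(persist_dir) / "docstore.json").exists()
```

Then every entry point runs the same branch: build-and-persist when `needs_build()` is true, otherwise `load_index_from_storage` as shown above.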
3) Top-k is too low for larger corpora
At small scale, similarity_top_k=3 might work. At larger scale it often misses the right neighborhood entirely.
query_engine = index.as_query_engine(similarity_top_k=3)
Try increasing it:
query_engine = index.as_query_engine(similarity_top_k=10)
If recall improves but answer quality drops, add reranking instead of blindly increasing top_k.
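Reranking here means retrieving a wider candidate pool and reordering it with a stronger relevance signal before passing only the best few to the LLM. LlamaIndex ships node postprocessors for this; the snippet below is a toy stand-in that scores by query-term overlap purely so the flow is visible, and a real setup would swap in a cross-encoder or LLM reranker:

```python
# Toy rerank: fetch a wide candidate pool, reorder by query-term overlap,
# keep the best few. A real pipeline would replace the overlap score with
# a cross-encoder or LLM reranker; the overall flow stays the same.
def rerank(query: str, candidates: list[tuple[float, str]], keep: int = 3):
    terms = set(query.lower().split())

    def overlap(text: str) -> int:
        return len(terms & set(text.lower().split()))

    # Sort by overlap first, falling back to the original vector score.
    ranked = sorted(candidates, key=lambda c: (overlap(c[1]), c[0]), reverse=True)
    return ranked[:keep]


pool = [
    (0.82, "holiday rota and shift swaps"),
    (0.79, "claims escalation policy for disputed claims"),
    (0.78, "office parking escalation"),
]
print(rerank("claims escalation policy", pool, keep=1))
```

Note how the highest vector score (0.82) is off-topic; the reranker demotes it because it shares no terms with the query.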
4) Metadata filters are too broad or too strict
Bad filters can silently exclude the right nodes or flood retrieval with unrelated ones.
# Example of a tenant filter; note that as_retriever takes a
# MetadataFilters object, not a plain dict
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

filters = MetadataFilters(
    filters=[ExactMatchFilter(key="tenant_id", value="prod")]
)
retriever = index.as_retriever(filters=filters)
Use explicit metadata fields and verify they exist on every node before querying. In multi-tenant systems, missing tenant metadata is a classic source of cross-document noise.
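A cheap guard is to validate that every node carries the key you filter on before anything is queried. The helper below is illustrative and operates on plain metadata dicts (e.g. `[node.metadata for node in nodes]` after parsing):

```python
# Sketch: fail fast if any node is missing the metadata key used for
# filtering. Nodes without tenant_id either escape the tenant filter or
# get silently excluded by it, depending on the vector store backend.
def assert_metadata_complete(metadatas: list[dict], key: str) -> None:
    missing = [i for i, m in enumerate(metadatas) if key not in m]
    if missing:
        raise ValueError(
            f"{len(missing)} nodes missing {key!r}, e.g. indices {missing[:5]}"
        )
```

Run it as part of ingestion so a bad batch never reaches the index.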
How to Debug It
- Inspect retrieved nodes directly. Don't trust the final answer: print node text, score, and metadata from RetrieverQueryEngine or VectorIndexRetriever.
- Compare ingestion settings to query settings. Check chunk_size, chunk_overlap, embedding model name, and vector store backend. If any of these changed after indexing, rebuild the index.
- Test recall with a known document. Query for a phrase that exists verbatim in one source document. If retrieval misses it at similarity_top_k=10, your issue is usually chunking or embeddings.
- Check persistence boundaries. Confirm production loads from the same StorageContext / vector store namespace, and verify you are not mixing old embeddings with newly indexed documents.
A useful debugging snippet:
retriever = index.as_retriever(similarity_top_k=5)
nodes = retriever.retrieve("claims escalation policy")
for i, node in enumerate(nodes):
    print(i, node.score)
    print(node.node.metadata)
    print(node.node.get_content()[:300])
    print("---")
If top results look semantically close but still wrong, tune chunking and reranking. If they look completely unrelated, focus on embedding mismatch or bad persistence first.
Prevention
- Keep ingestion config in code, not in someone's notebook.
- Pin chunk_size, chunk_overlap, embedding model name, and vector store namespace.
- Add a retrieval regression test: store a few golden queries and assert that expected document IDs appear in top-k.
- Rebuild indexes when you change embedding models or chunking strategy, and treat those changes as schema migrations.
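A retrieval regression test can be as small as a table of golden queries and expected document IDs, run against whatever retriever production uses. Sketch below, with `stub_retrieve` standing in for the real retriever (in production it would call `index.as_retriever(...).retrieve(query)` and read document IDs from node metadata; all names here are illustrative):

```python
# Sketch: golden-query regression check. retrieve() is any callable that
# maps a query to [(doc_id, score), ...]; wire it to your real retriever.
GOLDEN = {
    "claims escalation policy": {"doc-claims-007"},
    "data retention schedule": {"doc-legal-012"},
}


def check_retrieval(retrieve) -> list[str]:
    failures = []
    for query, expected_ids in GOLDEN.items():
        got = {doc_id for doc_id, _score in retrieve(query)}
        if not expected_ids & got:
            failures.append(query)
    return failures


# Stub retriever that only "finds" the right document for one query.
def stub_retrieve(query):
    if "claims" in query:
        return [("doc-claims-007", 0.91)]
    return [("doc-misc-001", 0.40)]


print(check_retrieval(stub_retrieve))  # -> ['data retention schedule']
```

Run this in CI after every re-index; a non-empty failure list blocks the deploy before users ever see the regression.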
If you’re seeing irrelevant results only after scale-up, don’t start by changing prompts. Fix retrieval first. In LlamaIndex apps, bad context in means bad answers out.
Keep learning
- The complete AI Agents Roadmap: my full 8-step breakdown
- Free: The AI Agent Starter Kit (PDF checklist + starter code)
- Work with me: I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.