How to Fix Intermittent 500 Errors in Production in LlamaIndex (Python)

By Cyprian Aarons · Updated 2026-04-21

Intermittent 500 Internal Server Error responses in a LlamaIndex app usually mean your request path is failing only under certain inputs, load, or runtime conditions. In practice, this shows up when your index query works locally, then starts failing in production because of bad chunking, missing API retries, unhandled exceptions in a tool call, or shared client state.

The annoying part is that the stack trace often points at generic FastAPI/Starlette middleware, while the real failure is deeper in llama_index.core or your LLM/vector store client. The fix is usually not “restart the service”; it’s isolating the exact failure mode and making the pipeline deterministic.

The Most Common Cause

The #1 cause I see is unhandled exceptions inside retrieval or LLM calls, usually from bad input shape, rate limits, or a null/empty index path that only happens for some requests.

Typical production symptoms:

  • One request works
  • Another returns 500
  • Logs show something like:
    • ValueError: No nodes found for query
    • openai.RateLimitError: Error code: 429
    • httpx.ReadTimeout
    • a traceback mentioning llama_index.core.indices.query.schema.QueryBundle

Broken vs fixed pattern

  • Broken: calls LlamaIndex directly from the route and lets exceptions bubble up. Fixed: wraps query execution and returns controlled errors.
  • Broken: reuses mutable global state without checks. Fixed: validates index readiness before querying.
  • Broken: no retry/backoff for transient LLM failures. Fixed: retries transient failures and fails cleanly.

# BROKEN
from fastapi import FastAPI
from llama_index.core import VectorStoreIndex

app = FastAPI()
index = VectorStoreIndex.from_documents(docs)  # docs loaded elsewhere; may be empty or partially built

@app.get("/ask")
def ask(q: str):
    # Any exception here becomes a 500
    response = index.as_query_engine().query(q)
    return {"answer": str(response)}

# FIXED
from fastapi import FastAPI, HTTPException
from llama_index.core import VectorStoreIndex
from openai import RateLimitError
import httpx

app = FastAPI()
# docs comes from your ingestion step; guard against an empty corpus
index = VectorStoreIndex.from_documents(docs) if docs else None

@app.get("/ask")
def ask(q: str):
    if index is None:
        raise HTTPException(status_code=503, detail="Index not ready")

    try:
        engine = index.as_query_engine()
        response = engine.query(q)
        return {"answer": str(response)}
    except (RateLimitError, httpx.TimeoutException) as e:
        raise HTTPException(status_code=503, detail=f"Transient upstream failure: {e}") from e
    except ValueError as e:
        raise HTTPException(status_code=400, detail=f"Bad query input: {e}") from e

The key change is simple: don’t let every exception become an opaque 500. In production you want to classify failures into:

  • 400 for bad user input
  • 503 for transient dependency issues
  • 500 only for true server bugs
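
If you also want to retry transient failures before surfacing a 503, as the broken/fixed list above suggests, here is a minimal sketch using the tenacity library (an assumption; any backoff helper works):

from openai import RateLimitError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential
import httpx

@retry(
    retry=retry_if_exception_type((RateLimitError, httpx.TimeoutException)),
    wait=wait_exponential(multiplier=1, max=10),
    stop=stop_after_attempt(3),
    reraise=True,  # after the last attempt, re-raise so the route maps it to a 503
)
def run_query(engine, q: str):
    return engine.query(q)

The route's except block stays unchanged; it only fires once retries are exhausted.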

Other Possible Causes

1) Empty or malformed documents during ingestion

If your ingestion pipeline sometimes produces empty text nodes, the retriever can fail later with errors like:

  • ValueError: No nodes found
  • ValueError: embedding dimension mismatch

Broken:

from llama_index.core import Document, VectorStoreIndex

docs = [Document(text=""), Document(text=None)]  # empty/None text poisons ingestion
index = VectorStoreIndex.from_documents(docs)

Fixed:

clean_docs = [d for d in docs if getattr(d, "text", None) and d.text.strip()]
index = VectorStoreIndex.from_documents(clean_docs)
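
If you chunk explicitly, apply the same guard at node level, since a non-empty document can still split into empty chunks. A sketch using SentenceSplitter (the chunk size is an arbitrary example):

from llama_index.core.node_parser import SentenceSplitter

nodes = SentenceSplitter(chunk_size=512).get_nodes_from_documents(clean_docs)
nodes = [n for n in nodes if n.get_content().strip()]  # drop empty chunks
index = VectorStoreIndex(nodes)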

2) Embedding model mismatch with existing vector store

This happens when you change embedding models but keep the same persisted index. You’ll see errors such as:

  • ValueError: embedding dimension mismatch
  • PineconeApiException
  • Chroma/FAISS dimension errors

Config example:

# BROKEN: switching embed models without reindexing
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
# the persisted index was built with a different embedding dimension

Fix:

# Rebuild the index after changing embed models
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
index = VectorStoreIndex.from_documents(clean_docs)
index.storage_context.persist(persist_dir="./storage_v2")
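
A cheap startup check catches dimension drift before it turns into intermittent 500s. A sketch, assuming you record the dimension your persisted index was built with (1536 is just an example value):

from llama_index.core import Settings

EXPECTED_DIM = 1536  # dimension the persisted index was built with (assumption)
probe = Settings.embed_model.get_text_embedding("dimension probe")
if len(probe) != EXPECTED_DIM:
    raise RuntimeError(f"Embed model emits dim {len(probe)}, index expects {EXPECTED_DIM}; rebuild the index")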

3) Shared client/session state across requests

If you keep a mutable retriever or LLM client in a global object and mutate it per request, you can get nondeterministic failures under concurrency.

Broken:

retriever.filters = {"tenant_id": tenant_id}
result = retriever.retrieve(query)

Fixed:

from copy import deepcopy

base_retriever = index.as_retriever()

def get_retriever_for_tenant(tenant_id: str):
    retriever = deepcopy(base_retriever)
    retriever.filters = {"tenant_id": tenant_id}
    return retriever
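
In the route, build the retriever per request (a sketch; the tenant_id query parameter stands in for however you resolve tenancy):

@app.get("/search")
def search(q: str, tenant_id: str):
    retriever = get_retriever_for_tenant(tenant_id)
    nodes = retriever.retrieve(q)
    return {"matches": [n.get_content() for n in nodes]}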

4) Timeouts from slow tools or long context windows

If you use agents/tools on top of LlamaIndex, long-running tool calls can trigger gateway or app timeouts.

Snippet:

llm_kwargs = {"timeout": 10}  # too low for large retrieval + generation paths

Fix:

llm_kwargs = {"timeout": 60}
# also cap retrieved context size and top_k aggressively
query_engine = index.as_query_engine(similarity_top_k=5)
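
You can also enforce an application-level deadline below your gateway’s timeout, so slow requests fail with a controlled error instead of a gateway 500. A sketch using the async query path (the 45-second budget is an assumption):

import asyncio

async def ask_with_deadline(query_engine, q: str, deadline_s: float = 45.0):
    # Raises asyncio.TimeoutError, which you can map to a 503/504
    return await asyncio.wait_for(query_engine.aquery(q), timeout=deadline_s)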

How to Debug It

  1. Check whether the failure is deterministic

    • Replay the same query against staging (a replay sketch follows this list).
    • If only some prompts fail, inspect input length, empty strings, and tenant-specific filters.
  2. Log the exact exception class

    • Don’t log only "500 error".
    • Capture the real class name:
      • openai.RateLimitError
      • httpx.TimeoutException
      • ValueError
      • tracebacks that mention llama_index.core.indices.query.schema.QueryBundle (QueryBundle is a query container, not an exception; capture the real error class around it)
  3. Isolate ingestion from querying

    • Run ingestion separately.
    • Then query a known-good document set.
    • If ingestion fails first, look for malformed docs or embedding mismatches.
  4. Disable concurrency and retry with one request

    • Send one request at a time.
    • If the bug disappears under low load, suspect shared mutable state or upstream rate limits.
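
For steps 1 and 4, a sequential replay loop is enough (the staging URL and queries below are placeholders):

import httpx

queries = ["a known-good query", "the query that 500s"]  # placeholders

with httpx.Client(base_url="https://staging.example.com", timeout=30) as client:
    for q in queries:
        r = client.get("/ask", params={"q": q})
        print(q, "->", r.status_code)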

A practical logging wrapper helps a lot:

import logging

logger = logging.getLogger(__name__)

try:
    result = query_engine.query(user_query)
except Exception as e:
    logger.exception("LlamaIndex query failed", extra={
        "exception_type": type(e).__name__,
        "query": user_query,
    })
    raise

Prevention

  • Build a thin service layer around LlamaIndex and map exceptions to proper HTTP status codes.
  • Rebuild indexes whenever you change chunking strategy or embedding model dimensions.
  • Add validation before indexing (see the sketch after this list):
    • non-empty text
    • stable metadata schema
    • consistent tenant filters
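
As a concrete starting point, a minimal validation gate (a sketch; the required metadata keys are an assumption about your schema):

def validate_docs(docs):
    REQUIRED_KEYS = {"tenant_id"}  # adjust to your metadata schema
    clean = []
    for d in docs:
        text = (getattr(d, "text", None) or "").strip()
        if not text:
            continue  # drop empty documents
        if not REQUIRED_KEYS.issubset((d.metadata or {}).keys()):
            continue  # drop documents missing required metadata
        clean.append(d)
    return clean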

If you’re seeing intermittent 500s, treat it like an observability problem first and a code problem second. Once you know whether it’s bad input, transient upstream failure, or shared state corruption, the fix is usually small and boring.

