LlamaIndex Tutorial (Python): implementing retry logic for advanced developers

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows how to wrap LlamaIndex calls with production-grade retry logic in Python, so transient failures from LLM APIs, vector stores, or retrieval pipelines do not break your agent flow. You need this when you are running LlamaIndex in a real system where rate limits, timeouts, and flaky upstream services are normal, not edge cases.

What You'll Need

  • Python 3.10+
  • llama-index
  • tenacity
  • An OpenAI API key set as OPENAI_API_KEY
  • A basic LlamaIndex setup with an index or document store
  • Optional: access to a vector database if your app uses one in production

Install the packages:

pip install llama-index tenacity

Step-by-Step

  1. Start by isolating the operation that can fail. In LlamaIndex apps, that is usually query execution or response synthesis, not the whole application.
import os
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.openai import OpenAI

os.environ["OPENAI_API_KEY"] = os.environ["OPENAI_API_KEY"]

llm = OpenAI(model="gpt-4o-mini", temperature=0)
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(llm=llm)
  2. Add retry logic around the exact call that talks to the model or retriever. Use exponential backoff and cap the number of attempts so you do not hammer a failing dependency.
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
from openai import RateLimitError, APITimeoutError

@retry(
    reraise=True,
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=16),
    retry=retry_if_exception_type((RateLimitError, APITimeoutError)),
)
def run_query(query: str) -> str:
    response = query_engine.query(query)
    return str(response)

answer = run_query("Summarize the main risks in these documents.")
print(answer)
  3. If you need retries on more than one LlamaIndex path, keep them separate. Querying, ingestion, and embedding generation fail for different reasons, so one blanket retry wrapper is usually too coarse.
from tenacity import before_sleep_log
import logging

logger = logging.getLogger("retry")
logging.basicConfig(level=logging.INFO)

@retry(
    reraise=True,
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=2, min=2, max=20),
    retry=retry_if_exception_type((RateLimitError, APITimeoutError, OSError)),  # embedding and file I/O failures
    before_sleep=before_sleep_log(logger, logging.INFO),
)
def build_index_with_retry(path: str):
    docs = SimpleDirectoryReader(path).load_data()
    return VectorStoreIndex.from_documents(docs)

index = build_index_with_retry("./data")
query_engine = index.as_query_engine(llm=llm)
  4. For advanced setups, add a fallback path after retries are exhausted. In banking and insurance workflows, that usually means returning a safe error response or switching to a lower-cost model (a model-fallback sketch follows the example below).
def safe_query(query: str) -> dict:
    try:
        result = run_query(query)
        return {"ok": True, "answer": result}
    except Exception as exc:
        return {
            "ok": False,
            "error": type(exc).__name__,
            "message": "The request could not be completed after retries.",
        }

print(safe_query("What is the policy summary?"))
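
If a degraded answer beats no answer, you can extend the fallback to try a cheaper model once before returning the error object. The sketch below reuses index and run_query from the earlier steps; the fallback model name is an illustrative assumption, not a recommendation.

# Hypothetical fallback path: one attempt on a cheaper model, no retries.
# The model name is an assumption; substitute whatever your provider offers.
fallback_llm = OpenAI(model="gpt-3.5-turbo", temperature=0)
fallback_engine = index.as_query_engine(llm=fallback_llm)

def safe_query_with_fallback(query: str) -> dict:
    try:
        return {"ok": True, "answer": run_query(query)}
    except Exception:
        try:
            # Fail fast on the fallback path: a single attempt, no backoff.
            answer = str(fallback_engine.query(query))
            return {"ok": True, "answer": answer, "fallback": True}
        except Exception as exc:
            return {
                "ok": False,
                "error": type(exc).__name__,
                "message": "The request could not be completed after retries.",
            }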
  5. If you want retries at the application boundary instead of per-function decorators, wrap your service method and keep observability close to it. That makes it easier to log attempt counts and failure types in production.
class RagService:
    def __init__(self, engine):
        self.engine = engine

    @retry(
        reraise=True,
        stop=stop_after_attempt(4),
        wait=wait_exponential(multiplier=1, min=1, max=8),
        retry=retry_if_exception_type((RateLimitError, APITimeoutError)),
    )
    def answer(self, question: str) -> str:
        return str(self.engine.query(question))

service = RagService(query_engine)
print(service.answer("Extract the key obligations from this contract."))

Testing It

Run the script against a real OpenAI-backed query engine and confirm normal requests succeed on the first attempt. Then simulate failure by temporarily lowering rate limits at the provider level or pointing the code at an invalid network route to verify retries are actually happening.
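
An easier way to exercise the retry path locally is to mock the failing dependency instead of breaking the network. This sketch assumes the RagService class from step 5; the RateLimitError constructor shown matches openai 1.x, so adjust it if your version differs.

from unittest.mock import MagicMock
import httpx
from openai import RateLimitError

# Build a RateLimitError the way openai 1.x expects: from an httpx response.
fake_response = httpx.Response(
    429, request=httpx.Request("POST", "https://api.openai.com/v1")
)
rate_limit = RateLimitError("simulated rate limit", response=fake_response, body=None)

# Fail twice, then succeed, so the retry decorator has visible work to do.
mock_engine = MagicMock()
mock_engine.query.side_effect = [rate_limit, rate_limit, "All clear"]

service = RagService(mock_engine)
assert service.answer("test question") == "All clear"
assert mock_engine.query.call_count == 3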

Watch logs for repeated attempts with increasing delays. Also confirm that after the final attempt fails, your fallback path returns a controlled error object instead of crashing the process.

If you use this in an API service, check latency under failure conditions. Retry logic should improve resilience without turning every outage into a long blocking request.
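
One way to bound that latency is to give tenacity a wall-clock budget alongside the attempt cap. Stop conditions combine with |, firing when either is met; the 10-second budget below is an illustrative number, not a recommendation.

from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    stop_after_delay,
    wait_exponential,
)
from openai import APITimeoutError, RateLimitError

@retry(
    reraise=True,
    # Stop when either budget runs out: 5 attempts or 10 seconds total.
    stop=stop_after_attempt(5) | stop_after_delay(10),
    wait=wait_exponential(multiplier=1, min=1, max=4),
    retry=retry_if_exception_type((RateLimitError, APITimeoutError)),
)
def run_query_bounded(query: str) -> str:
    return str(query_engine.query(query))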

Next Steps

  • Add structured logging for retry count, exception type, and total elapsed time (see the sketch after this list).
  • Extend retry policies per failure class: rate limits vs timeouts vs retriable 5xx errors.
  • Move this pattern into a shared service layer so all LlamaIndex agents use the same failure handling rules.
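
For the structured-logging item, tenacity passes every callback a retry_state object that already carries the attempt number, the last exception, and the elapsed time. A minimal sketch; the log format is an assumption, not a fixed schema.

import logging
from tenacity import retry, stop_after_attempt, wait_exponential

logger = logging.getLogger("retry")

def log_attempt(retry_state):
    # retry_state exposes the attempt number, last outcome, and elapsed seconds.
    exc = retry_state.outcome.exception() if retry_state.outcome else None
    logger.info(
        "retry attempt=%d exception=%s elapsed=%.2fs",
        retry_state.attempt_number,
        type(exc).__name__ if exc else "none",
        retry_state.seconds_since_start,
    )

@retry(
    reraise=True,
    stop=stop_after_attempt(4),
    wait=wait_exponential(multiplier=1, min=1, max=8),
    before_sleep=log_attempt,  # runs before each backoff sleep
)
def run_query_logged(query: str) -> str:
    return str(query_engine.query(query))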

By Cyprian Aarons, AI Consultant at Topiax.
