LlamaIndex Tutorial (Python): implementing retry logic for beginners

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows you how to add retry logic around LlamaIndex calls in Python so your agent can recover from transient failures like rate limits, timeouts, and flaky upstream APIs. You need this when your LLM provider or retrieval layer occasionally fails and you want your app to keep working instead of crashing on the first error.

What You'll Need

  • Python 3.10+
  • llama-index
  • openai
  • An OpenAI API key in OPENAI_API_KEY
  • A small text file for your local documents, or any other source you want to index
  • Optional: tenacity, for a cleaner retry decorator; this tutorial sticks to plain Python to keep things simple

Install the packages:

pip install llama-index openai

Step-by-Step

  1. Start with a basic LlamaIndex setup that loads a document and creates an index. Keep this part boring and predictable; retries should wrap the risky calls, not the whole application.
import os
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY", "")

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

response = query_engine.query("What is in these documents?")
print(response)
  2. Add a small retry helper that catches transient exceptions and tries again with exponential backoff. This is the core pattern: fail fast on permanent errors, retry only on network-style failures.
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def retry(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    exceptions: tuple[type[Exception], ...] = (Exception,),
) -> T:
    """Call fn, retrying with exponential backoff on the given exceptions."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except exceptions as e:
            if attempt == attempts:
                # Out of attempts: surface the last error to the caller.
                raise
            sleep_for = base_delay * (2 ** (attempt - 1))  # 1s, 2s, 4s, ...
            print(f"Attempt {attempt} failed: {e}. Retrying in {sleep_for:.1f}s...")
            time.sleep(sleep_for)
    raise ValueError("attempts must be >= 1")
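Plain exponential backoff can make many clients retry in lockstep after a shared outage. A common refinement is to add jitter. The sketch below is a hypothetical `backoff_with_jitter` helper (not part of the helper above) using the "full jitter" strategy:

```python
import random


def backoff_with_jitter(attempt: int, base_delay: float = 1.0, cap: float = 30.0) -> float:
    # "Full jitter": pick a random delay between 0 and the capped exponential delay,
    # so clients recovering at the same moment do not all retry at the same instant.
    exp_delay = base_delay * (2 ** (attempt - 1))
    return random.uniform(0, min(cap, exp_delay))


# Example: the third attempt with base_delay=1.0 sleeps somewhere in [0, 4] seconds.
delay = backoff_with_jitter(3)
```

If you adopt this, you can swap it into the helper above by replacing the `sleep_for` calculation with a call to `backoff_with_jitter(attempt, base_delay)`.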
  3. Wrap the query call with the retry helper. This keeps your index creation separate from your query resilience, which is usually what you want in production.
import os
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY", "")

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

def ask_question():
    return query_engine.query("Summarize the main topic of these documents.")

result = retry(ask_question, attempts=4, base_delay=1.5)
print(result)
  4. If you are calling an LLM directly through LlamaIndex, wrap that call too. This is useful when you want retries around a custom response synthesizer or a direct model invocation.
import os
from llama_index.core.llms import ChatMessage
from llama_index.llms.openai import OpenAI

os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY", "")

llm = OpenAI(model="gpt-4o-mini")

def generate_reply():
    messages = [
        ChatMessage(role="system", content="You are a concise assistant."),
        ChatMessage(role="user", content="Write one sentence about retries in APIs."),
    ]
    return llm.chat(messages)

reply = retry(generate_reply, attempts=3, base_delay=1.0)
print(reply.message.content)
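LlamaIndex LLMs also expose async methods such as `achat`. If your app is async, the same pattern translates directly; the sketch below uses an `aretry` helper (a name invented here for illustration) and a local stub instead of a real model call:

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def aretry(
    fn: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 1.0,
    exceptions: tuple[type[Exception], ...] = (Exception,),
) -> T:
    # Async twin of the retry helper: await the coroutine factory, back off on failure.
    for attempt in range(1, attempts + 1):
        try:
            return await fn()
        except exceptions:
            if attempt == attempts:
                raise
            await asyncio.sleep(base_delay * (2 ** (attempt - 1)))
    raise ValueError("attempts must be >= 1")


# Demo with a coroutine that fails once, then succeeds.
state = {"failed": False}


async def flaky():
    if not state["failed"]:
        state["failed"] = True
        raise TimeoutError("simulated timeout")
    return "ok"


result = asyncio.run(aretry(flaky, attempts=3, base_delay=0.0))
```

In real code, `flaky` would be a wrapper around `llm.achat(messages)`.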
  5. Make the retry policy more selective by catching only temporary failures. In real systems, don’t blindly retry everything; invalid prompts and missing files should fail immediately.
import time

class TransientError(Exception):
    pass

def retry_transient(fn, attempts=3):
    for i in range(attempts):
        try:
            return fn()
        except TransientError as e:
            if i == attempts - 1:
                raise
            delay = 2 ** i
            print(f"Transient failure: {e}. Sleeping {delay}s.")
            time.sleep(delay)

def flaky_call():
    # Simulates a provider that is always rate limited, so every retry fails.
    raise TransientError("rate limited by upstream provider")

try:
    retry_transient(flaky_call)
except TransientError as e:
    print(f"Gave up after retries: {e}")

Testing It

Run the script against a real document folder under data/ and confirm that normal queries still return answers. Then simulate a failure by temporarily raising a TransientError inside your wrapped function and watch the backoff messages appear before the final success or failure.

If you are testing against OpenAI rate limits or network instability, reduce attempts to avoid long waits during development. In production, log every failed attempt with enough context to debug which model call or query caused it.
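One way to exercise the backoff path without hitting a real provider is a stub that fails a fixed number of times before succeeding. This sketch copies the Step 2 helper with `base_delay=0.0` so it runs instantly; `flaky_query` is a stand-in for your wrapped query function:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    exceptions: tuple[type[Exception], ...] = (Exception,),
) -> T:
    # Same shape as the Step 2 helper.
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except exceptions:
            if attempt == attempts:
                raise
            time.sleep(base_delay * (2 ** (attempt - 1)))
    raise ValueError("attempts must be >= 1")


failures = {"left": 2}


def flaky_query() -> str:
    # Fails twice with a simulated timeout, then returns an answer.
    if failures["left"] > 0:
        failures["left"] -= 1
        raise TimeoutError("simulated network timeout")
    return "an answer"


result = retry(flaky_query, attempts=4, base_delay=0.0, exceptions=(TimeoutError,))
```

If `attempts` is lower than the number of simulated failures, the final `TimeoutError` propagates to the caller, which is exactly the behavior you want to verify.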

Next Steps

  • Add tenacity for cleaner decorators and better exception filtering.
  • Separate retries for document loading, indexing, retrieval, and generation.
  • Add observability: structured logs, metrics, and trace IDs around each retry attempt.

By Cyprian Aarons, AI Consultant at Topiax.
