LlamaIndex Tutorial (Python): Implementing Retry Logic for Beginners
This tutorial shows you how to add retry logic around LlamaIndex calls in Python so your agent can recover from transient failures like rate limits, timeouts, and flaky upstream APIs. You need this when your LLM provider or retrieval layer occasionally fails and you want your app to keep working instead of crashing on the first error.
What You'll Need
- Python 3.10+
- `llama-index`
- `openai`
- An OpenAI API key in `OPENAI_API_KEY`
- A small text file for your local documents, or any other source you want to index
- Optional: `tenacity` if you want a cleaner retry decorator, but this tutorial uses plain Python so it stays simple
Install the packages:
```bash
pip install llama-index openai
```
Step-by-Step
- Start with a basic LlamaIndex setup that loads a document and creates an index. Keep this part boring and predictable; retries should wrap the risky calls, not the whole application.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Expects OPENAI_API_KEY to be set in your environment.
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

response = query_engine.query("What is in these documents?")
print(response)
```
- Add a small retry helper that catches transient exceptions and tries again with exponential backoff. This is the core pattern: fail fast on permanent errors, retry only on network-style failures.
```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def retry(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    exceptions: tuple[type[Exception], ...] = (Exception,),
) -> T:
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except exceptions as e:
            last_error = e
            if attempt == attempts:
                raise
            sleep_for = base_delay * (2 ** (attempt - 1))
            print(f"Attempt {attempt} failed: {e}. Retrying in {sleep_for:.1f}s...")
            time.sleep(sleep_for)
    raise last_error  # type: ignore[misc]
```
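Before wiring this into LlamaIndex, it helps to confirm the helper behaves as expected. A self-contained sketch (the `flaky` function and its failure count are invented for the demo):

```python
import time

# Minimal copy of the retry helper above so this snippet runs standalone.
def retry(fn, attempts=3, base_delay=1.0, exceptions=(Exception,)):
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except exceptions as e:
            if attempt == attempts:
                raise
            sleep_for = base_delay * (2 ** (attempt - 1))
            print(f"Attempt {attempt} failed: {e}. Retrying in {sleep_for:.1f}s...")
            time.sleep(sleep_for)

calls = {"n": 0}

def flaky():
    # Fails twice, then succeeds -- a stand-in for a rate-limited API.
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated network timeout")
    return "ok"

result = retry(flaky, attempts=5, base_delay=0.01)
print(result)  # "ok" after two simulated failures
```

Keep `base_delay` tiny in tests like this so the demo finishes instantly; the exponential schedule still exercises the same code path as production delays.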
- Wrap the query call with the retry helper. This keeps your index creation separate from your query resilience, which is usually what you want in production.
```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Expects OPENAI_API_KEY to be set in your environment.
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

def ask_question():
    return query_engine.query("Summarize the main topic of these documents.")

result = retry(ask_question, attempts=4, base_delay=1.5)
print(result)
```
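The `exceptions` parameter exists so you can be selective right away: pass only network-style exception types, and permanent errors escape on the first attempt. A stdlib-only sketch (the `bad_query` function is a made-up stand-in for a query with an invalid prompt):

```python
import time

# Minimal copy of the retry helper above so this snippet runs standalone.
def retry(fn, attempts=3, base_delay=1.0, exceptions=(Exception,)):
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except exceptions:
            if attempt == attempts:
                raise
            time.sleep(base_delay * (2 ** (attempt - 1)))

attempts_made = {"n": 0}

def bad_query():
    attempts_made["n"] += 1
    raise ValueError("malformed prompt")  # permanent, not transient

try:
    # Only timeouts and connection errors are retried; ValueError escapes at once.
    retry(bad_query, attempts=3, base_delay=0.01,
          exceptions=(TimeoutError, ConnectionError))
except ValueError as e:
    print(f"failed fast after {attempts_made['n']} attempt(s): {e}")
```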
- If you are calling an LLM directly through LlamaIndex, wrap that call too. This is useful when you want retries around a custom response synthesizer or a direct model invocation.
```python
from llama_index.core.llms import ChatMessage
from llama_index.llms.openai import OpenAI

# Expects OPENAI_API_KEY to be set in your environment.
llm = OpenAI(model="gpt-4o-mini")

def generate_reply():
    messages = [
        ChatMessage(role="system", content="You are a concise assistant."),
        ChatMessage(role="user", content="Write one sentence about retries in APIs."),
    ]
    return llm.chat(messages)

reply = retry(generate_reply, attempts=3, base_delay=1.0)
print(reply.message.content)
```
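Provider SDKs raise their own exception classes for rate limits and timeouts. If you prefer not to import provider-specific types at every call site, one option is to classify exceptions by class name. This is a heuristic sketch, not an official LlamaIndex or OpenAI API; the name set below is an assumption you should adapt to your SDK version:

```python
def is_transient(exc: Exception) -> bool:
    # Heuristic: match common transient error class names across provider SDKs,
    # plus the stdlib network-style exceptions.
    transient_names = {
        "RateLimitError",
        "APITimeoutError",
        "APIConnectionError",
        "ServiceUnavailableError",
    }
    return (type(exc).__name__ in transient_names
            or isinstance(exc, (TimeoutError, ConnectionError)))

class RateLimitError(Exception):
    """Stand-in for a provider exception of the same name."""

print(is_transient(RateLimitError("429 Too Many Requests")))  # True
print(is_transient(ValueError("bad input")))                  # False
```

A retry loop can then catch `Exception` broadly but re-raise immediately whenever `is_transient(e)` is false, keeping the fail-fast behavior for permanent errors.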
- Make the retry policy more selective by catching only temporary failures. In real systems, don't blindly retry everything; invalid prompts and missing files should fail immediately.
```python
import time

class TransientError(Exception):
    pass

def retry_transient(fn, attempts=3):
    for i in range(attempts):
        try:
            return fn()
        except TransientError as e:
            if i == attempts - 1:
                raise
            delay = 2 ** i
            print(f"Transient failure: {e}. Sleeping {delay}s.")
            time.sleep(delay)

def flaky_call():
    raise TransientError("rate limited by upstream provider")

# This call always fails, so it re-raises after the final attempt.
retry_transient(flaky_call)
```
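When many clients retry on the same schedule, their retries collide and hammer the provider in waves. Adding jitter spreads them out. A stdlib-only variation on the pattern above, using full jitter capped at `max_delay` (the `retry_with_jitter` name and parameters are mine, not from tenacity or LlamaIndex):

```python
import random
import time

class TransientError(Exception):
    pass

def retry_with_jitter(fn, attempts=3, base_delay=1.0, max_delay=30.0):
    for i in range(attempts):
        try:
            return fn()
        except TransientError as e:
            if i == attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the exponential cap.
            cap = min(max_delay, base_delay * (2 ** i))
            delay = random.uniform(0, cap)
            print(f"Transient failure: {e}. Sleeping {delay:.2f}s.")
            time.sleep(delay)

state = {"n": 0}

def sometimes_flaky():
    # Fails once, then succeeds -- enough to show one jittered backoff.
    state["n"] += 1
    if state["n"] < 2:
        raise TransientError("rate limited")
    return "done"

print(retry_with_jitter(sometimes_flaky, base_delay=0.01))
```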
Testing It
Run the script against a real document folder under `data/` and confirm that normal queries still return answers. Then simulate a failure by temporarily raising a `TransientError` inside your wrapped function and watch the backoff messages appear before the final success or failure.
If you are testing against OpenAI rate limits or network instability, reduce attempts to avoid long waits during development. In production, log every failed attempt with enough context to debug which model call or query caused it.
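Logging each failed attempt with context is straightforward with the stdlib `logging` module. A hedged sketch (the `retry_logged` name and the `context` tag are invented for illustration; swap `print` for structured logging in your own stack):

```python
import logging
import time

logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("retry")

def retry_logged(fn, attempts=3, base_delay=1.0, context=""):
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as e:
            if attempt == attempts:
                log.error("giving up after %d attempts (%s): %s",
                          attempts, context, e)
                raise
            delay = base_delay * (2 ** (attempt - 1))
            log.warning("attempt %d/%d failed (%s): %s; retrying in %.1fs",
                        attempt, attempts, context, e, delay)
            time.sleep(delay)

# Tag each call site so the logs say which model call or query failed,
# e.g. retry_logged(ask_question, context="query_engine.query summarize").
value = retry_logged(lambda: "fine", context="smoke test")
print(value)  # "fine" -- succeeded on the first attempt, nothing logged
```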
Next Steps
- Add `tenacity` for cleaner decorators and better exception filtering.
- Separate retries for document loading, indexing, retrieval, and generation.
- Add observability: structured logs, metrics, and trace IDs around each retry attempt.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
Get the Starter Kit