LlamaIndex Tutorial (Python): implementing retry logic for intermediate developers
This tutorial shows you how to add retry logic around LlamaIndex calls so transient failures don’t break your pipeline. You need this when you’re calling hosted LLMs, embedding APIs, or retrieval tools that occasionally fail with rate limits, timeouts, or temporary network errors.
What You'll Need
- Python 3.10+
- llama-index
- openai
- An OpenAI API key in OPENAI_API_KEY
- Basic familiarity with VectorStoreIndex, QueryEngine, and Settings
- A terminal and a virtual environment
Install the packages:
pip install llama-index openai
Set your API key:
export OPENAI_API_KEY="your-key-here"
Step-by-Step
- Start with a minimal LlamaIndex setup.
We’ll build a small index from in-memory documents so the retry logic is easy to see without extra infrastructure.
from llama_index.core import Document, VectorStoreIndex, Settings
from llama_index.llms.openai import OpenAI
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)
documents = [
    Document(text="LlamaIndex helps connect LLMs to your data."),
    Document(text="Retry logic is useful for transient API failures."),
]
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
- Add a retry wrapper for query calls.
The key idea is simple: keep the LlamaIndex code unchanged, and wrap the call at the boundary where failures happen. This keeps retries centralized and avoids scattering error handling across your app.
import time
from typing import Callable, TypeVar
T = TypeVar("T")
def retry(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
) -> T:
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            last_error = exc
            if attempt == attempts:
                raise
            sleep_for = base_delay * (2 ** (attempt - 1))
            print(f"Attempt {attempt} failed: {exc}. Retrying in {sleep_for}s")
            time.sleep(sleep_for)
    raise last_error  # pragma: no cover
- Use the wrapper around a real query engine call.
In production, this pattern works for .query(), .chat(), retrievers, or any custom function that touches an external dependency. Keep the wrapped function small so the retry boundary is obvious.
def ask_question(question: str):
    return retry(lambda: query_engine.query(question), attempts=4, base_delay=0.5)
response = ask_question("What does LlamaIndex help with?")
print(response)
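LlamaIndex query engines also expose async variants such as .aquery(), and the same boundary pattern translates directly to asyncio. Below is a minimal sketch of an async retry helper; flaky_call is a hypothetical stand-in for an async query so the snippet runs without an API key:

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry_async(
    fn: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 0.1,
) -> T:
    # Same exponential backoff as the sync helper, but non-blocking.
    for attempt in range(1, attempts + 1):
        try:
            return await fn()
        except Exception:
            if attempt == attempts:
                raise
            await asyncio.sleep(base_delay * (2 ** (attempt - 1)))
    raise RuntimeError("unreachable")  # pragma: no cover


# Hypothetical stand-in for query_engine.aquery: fails twice, then succeeds.
calls = {"n": 0}


async def flaky_call() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "answer"


result = asyncio.run(retry_async(flaky_call, attempts=4))
print(result)  # -> answer
```

In real code you would pass `lambda: query_engine.aquery(question)` instead of the stand-in.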
- Make retries smarter by catching only transient exceptions.
Retrying every exception is sloppy because it hides bugs like bad prompts or invalid inputs. In practice, you want to retry timeouts, rate limits, and connection issues while letting deterministic failures fail fast.
import httpx
TRANSIENT_ERRORS = (
    TimeoutError,
    httpx.TimeoutException,
    httpx.NetworkError,
)
def retry_transient(fn: Callable[[], T], attempts: int = 3) -> T:
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except TRANSIENT_ERRORS as exc:
            if attempt == attempts:
                raise
            print(f"Transient failure on attempt {attempt}: {exc}")
            time.sleep(2 ** (attempt - 1))
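One refinement worth considering: add random jitter to the backoff so that many clients failing at the same moment don't all retry in lockstep (the thundering-herd problem). A self-contained sketch using "full jitter" (sleep a random amount up to the exponential cap), with a generic transient tuple and a hypothetical flaky function for the demo:

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_jitter(
    fn: Callable[[], T],
    transient: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    attempts: int = 3,
    base_delay: float = 1.0,
) -> T:
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except transient as exc:
            if attempt == attempts:
                raise
            # Full jitter: sleep a random amount up to the exponential cap.
            cap = base_delay * (2 ** (attempt - 1))
            sleep_for = random.uniform(0, cap)
            print(f"Attempt {attempt} failed ({exc}); sleeping {sleep_for:.2f}s")
            time.sleep(sleep_for)
    raise RuntimeError("unreachable")  # pragma: no cover


# Demo with a hypothetical flaky function: fails once, then succeeds.
state = {"n": 0}


def flaky() -> int:
    state["n"] += 1
    if state["n"] < 2:
        raise TimeoutError("simulated")
    return 42


answer = retry_with_jitter(flaky, attempts=3, base_delay=0.01)
print(answer)  # -> 42
```

Full jitter is one common strategy; capped equal jitter is another reasonable choice if you want a guaranteed minimum wait.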
- Apply the same pattern to embeddings or ingestion jobs.
This matters when you’re indexing larger datasets because embedding calls are often where rate limits show up first. Wrapping the ingestion function gives you one place to handle retries instead of retrying each document manually.
from llama_index.core import SimpleDirectoryReader
def build_index_from_folder(folder_path: str):
    def _build():
        docs = SimpleDirectoryReader(folder_path).load_data()
        return VectorStoreIndex.from_documents(docs)
    return retry_transient(_build, attempts=5)
# Example usage:
# index = build_index_from_folder("./data")
Testing It
Run the script once with a valid API key and confirm you get a normal answer back from the query engine. Then temporarily break connectivity or use an invalid endpoint configuration to force a transient failure and watch the retry messages appear before the final exception.
If you want a deterministic test without depending on network conditions, replace query_engine.query with a function that raises an exception on the first two calls and succeeds on the third. That confirms your backoff and attempt counting are correct.
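That deterministic check can be written as a plain script. The retry helper from step 2 is repeated here (with a tiny base delay) so the snippet runs on its own; fake_query is a hypothetical stand-in for query_engine.query:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(fn: Callable[[], T], attempts: int = 3, base_delay: float = 0.01) -> T:
    # Same helper as in step 2; small base_delay keeps the test fast.
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(base_delay * (2 ** (attempt - 1)))
    raise RuntimeError("unreachable")  # pragma: no cover


calls = []


def fake_query() -> str:
    # Raises on the first two calls, succeeds on the third.
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError("simulated transient failure")
    return "ok"


result = retry(fake_query, attempts=4)
assert result == "ok"
assert len(calls) == 3  # two failures + one success
print("backoff and attempt counting verified")
```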
For production code, verify that non-transient errors still fail immediately when using retry_transient. That’s how you avoid hiding prompt bugs behind unnecessary retries.
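A companion check for fail-fast behavior: a non-transient exception should escape retry_transient on the first attempt. The helper is repeated here (with a simplified transient tuple and short delays) so the sketch runs standalone; bad_prompt is a hypothetical deterministic bug:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

TRANSIENT_ERRORS = (TimeoutError, ConnectionError)


def retry_transient(fn: Callable[[], T], attempts: int = 3) -> T:
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except TRANSIENT_ERRORS:
            if attempt == attempts:
                raise
            time.sleep(0.01 * (2 ** (attempt - 1)))
    raise RuntimeError("unreachable")  # pragma: no cover


attempts_made = []


def bad_prompt() -> str:
    # A deterministic bug: retrying will never fix this.
    attempts_made.append(1)
    raise ValueError("invalid prompt template")


try:
    retry_transient(bad_prompt, attempts=5)
except ValueError:
    pass

print(len(attempts_made))  # -> 1: no retries for a non-transient error
```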
Next Steps
- Add structured logging with attempt number, latency, and exception type.
- Replace the custom retry helper with tenacity if you need jitter, stop conditions, and richer policies.
- Wrap retries around tool calls in agent workflows, not just raw LLM queries.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.