LlamaIndex Tutorial (Python): parsing structured output for intermediate developers

By Cyprian Aarons. Updated 2026-04-21


This tutorial shows you how to make LlamaIndex return structured Python objects instead of messy free-form text. You need this when your app has to reliably extract fields like names, dates, amounts, or classifications from model output and pass them into downstream code.

What You'll Need

•Python 3.10+
•An OpenAI API key set as OPENAI_API_KEY
•llama-index
•pydantic
•A working internet connection for the LLM call
•Basic familiarity with LlamaIndex Settings, QueryEngine, and prompt-based querying

Install the packages:

pip install llama-index pydantic

Set your API key:

export OPENAI_API_KEY="your-key-here"

Step-by-Step

•Start by defining the structure you want back from the model. Pydantic is the cleanest way to do this because LlamaIndex can validate the response against a schema before your code touches it.

from pydantic import BaseModel, Field


class InvoiceSummary(BaseModel):
    vendor: str = Field(description="Vendor name")
    invoice_number: str = Field(description="Invoice identifier")
    total_amount: float = Field(description="Invoice total in USD")
    due_date: str = Field(description="Due date in YYYY-MM-DD format")
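
Before involving an LLM, it can help to see what the schema accepts on its own. The sketch below repeats the model from above and validates two hand-written dictionaries, which stand in for model output and are assumptions for illustration only. It uses Pydantic v2 calls (model_json_schema, model_validate).

```python
from pydantic import BaseModel, Field, ValidationError


class InvoiceSummary(BaseModel):
    vendor: str = Field(description="Vendor name")
    invoice_number: str = Field(description="Invoice identifier")
    total_amount: float = Field(description="Invoice total in USD")
    due_date: str = Field(description="Due date in YYYY-MM-DD format")


# The JSON schema lists the fields and descriptions a schema-aware prompt can expose.
schema = InvoiceSummary.model_json_schema()
field_names = sorted(schema["properties"])
print(field_names)

# Hand-written stand-in for a well-formed model reply (not real LLM output).
good = InvoiceSummary.model_validate(
    {
        "vendor": "Acme Supplies",
        "invoice_number": "INV-10482",
        "total_amount": "249.50",  # numeric strings are coerced in Pydantic's default mode
        "due_date": "2024-09-15",
    }
)
print(good.total_amount)

# A non-numeric amount cannot be coerced to float, so validation fails.
error_count = 0
try:
    InvoiceSummary.model_validate(
        {
            "vendor": "Acme Supplies",
            "invoice_number": "INV-10482",
            "total_amount": "not-a-number",
            "due_date": "2024-09-15",
        }
    )
except ValidationError as exc:
    error_count = exc.error_count()
    print(f"{error_count} validation error(s)")
```

Note that due_date is declared as str, so any string passes. The schema checks types, not whether the date really follows YYYY-MM-DD.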

•Create a structured prediction program with LlamaIndex. This is the key part: instead of asking for plain text, you tell the model what shape the answer must take.

from llama_index.llms.openai import OpenAI
from llama_index.core.program import LLMTextCompletionProgram

llm = OpenAI(model="gpt-4o-mini")

prompt_template = """
Extract invoice details from the text below.

Text:
{input_str}
"""

program = LLMTextCompletionProgram.from_defaults(
    output_cls=InvoiceSummary,
    llm=llm,
    prompt_template_str=prompt_template,
)
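
To make the "shape" idea concrete, here is a rough sketch of what happens after the model replies: find the JSON object in the text, then validate it against the schema. This is not LlamaIndex's implementation. The helper parse_reply and the hard-coded fake_reply are hypothetical, and no LLM call is made.

```python
import re

from pydantic import BaseModel, Field


class InvoiceSummary(BaseModel):
    vendor: str = Field(description="Vendor name")
    invoice_number: str = Field(description="Invoice identifier")
    total_amount: float = Field(description="Invoice total in USD")
    due_date: str = Field(description="Due date in YYYY-MM-DD format")


def parse_reply(reply: str) -> InvoiceSummary:
    """Hypothetical helper: pull the first-to-last brace span out of a reply and validate it."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in reply")
    # model_validate_json parses the JSON text and checks it against the schema.
    return InvoiceSummary.model_validate_json(match.group(0))


# Hard-coded stand-in for a model reply that wraps JSON in extra chatter.
fake_reply = (
    "Here are the invoice details:\n"
    '{"vendor": "Acme Supplies", "invoice_number": "INV-10482", '
    '"total_amount": 249.5, "due_date": "2024-09-15"}'
)

parsed = parse_reply(fake_reply)
print(parsed.vendor, parsed.total_amount)
```

The point of the sketch is the order of operations: text comes back first, and the typed object only exists after parsing and validation both succeed.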

•Feed in raw text and get back a validated Python object. In production, this is where you replace brittle regex parsing with schema-backed extraction.

invoice_text = """
Invoice from Acme Supplies
Invoice No: INV-10482
Total Due: $249.50
Due Date: 2024-09-15
"""

result = program(input_str=invoice_text)

print(type(result))
print(result.model_dump())
print(result.vendor)
print(result.total_amount)

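•If you want structured output inside a retrieval flow, run the query first and then pass the response text to the program. The pattern stays the same: get text back, then parse that text into a typed object. Note that query_engine.query() returns an answer synthesized from the retrieved chunks, so the program parses that answer rather than the raw source text.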

from llama_index.core import VectorStoreIndex, Document

docs = [
    Document(text="""
    Acme Supplies sent invoice INV-10482.
    Total due is $249.50 and payment is due on 2024-09-15.
    """)
]

index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()

retrieved_response = query_engine.query("What invoice details are mentioned?")
parsed = program(input_str=str(retrieved_response))

print(parsed)

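•Add basic validation so bad outputs fail fast. If the model returns something that cannot fit your schema, Pydantic raises a ValidationError instead of letting invalid data leak deeper into your system. Keep in mind that the model may substitute a placeholder such as 0 for an unreadable amount; that passes type validation, so check values as well as types.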

from pydantic import ValidationError

bad_text = """
Invoice from Acme Supplies
Invoice No: INV-10482
Total Due: not-a-number
Due Date: 2024-09-15
"""

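try:
    program(input_str=bad_text)
except ValidationError as e:
    # The reply was parsed but did not match the InvoiceSummary schema.
    print("Validation failed:")
    print(e)
except ValueError as e:
    # Other parse failures, such as a reply with no JSON in it, typically surface here.
    print("Could not parse model output:")
    print(e)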
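
Type checks alone will not catch a placeholder amount or a date in the wrong format. One option is a stricter schema, sketched below under the assumption that you are on Pydantic v2. StrictInvoiceSummary is a hypothetical name, and the input dictionary is hand-written rather than real model output.

```python
from datetime import date

from pydantic import BaseModel, Field, ValidationError


class StrictInvoiceSummary(BaseModel):
    """Hypothetical stricter variant of InvoiceSummary with value constraints."""

    vendor: str = Field(min_length=1, description="Vendor name")
    invoice_number: str = Field(min_length=1, description="Invoice identifier")
    total_amount: float = Field(gt=0, description="Invoice total in USD")
    due_date: date = Field(description="Due date in YYYY-MM-DD format")


# ISO date strings are converted to datetime.date objects.
strict = StrictInvoiceSummary.model_validate(
    {
        "vendor": "Acme Supplies",
        "invoice_number": "INV-10482",
        "total_amount": 249.50,
        "due_date": "2024-09-15",
    }
)
print(strict.due_date)

# A zero amount and a non-ISO date both violate the constraints.
failed_fields = []
try:
    StrictInvoiceSummary.model_validate(
        {
            "vendor": "Acme Supplies",
            "invoice_number": "INV-10482",
            "total_amount": 0,
            "due_date": "15/09/2024",
        }
    )
except ValidationError as exc:
    # Each error records the field it belongs to in its "loc" tuple.
    failed_fields = sorted(err["loc"][0] for err in exc.errors())
    print("Rejected fields:", failed_fields)
```

Whether to pass a schema like this directly as output_cls or to apply it as a second check after extraction is a design choice; the second approach keeps the prompt schema simple.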

Testing It

Run the script end to end and confirm that result is an InvoiceSummary instance, not a string. Then inspect model_dump() to make sure each field is populated with the expected type and value.

Test one clean input and one malformed input. The clean input should parse successfully, while malformed data should raise a validation error or produce a clearly incorrect field you can catch before persistence.

If you’re wiring this into an API, serialize the result with model_dump() and return JSON from there. That keeps your boundary explicit and avoids passing raw model text through your application layers.
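
As a small illustration of that boundary, the sketch below builds an InvoiceSummary by hand (standing in for a parsed result) and serializes it two ways. Both methods are Pydantic v2 calls; the web framework that would return the payload is left out.

```python
import json

from pydantic import BaseModel, Field


class InvoiceSummary(BaseModel):
    vendor: str = Field(description="Vendor name")
    invoice_number: str = Field(description="Invoice identifier")
    total_amount: float = Field(description="Invoice total in USD")
    due_date: str = Field(description="Due date in YYYY-MM-DD format")


# Hand-built stand-in for the object returned by the extraction program.
result = InvoiceSummary(
    vendor="Acme Supplies",
    invoice_number="INV-10482",
    total_amount=249.50,
    due_date="2024-09-15",
)

payload = result.model_dump()    # plain dict, ready for a framework to serialize
body = result.model_dump_json()  # JSON string, ready to send as a response body

print(payload)
print(body)
```

Either form contains only the validated fields, so nothing from the raw model reply crosses the API boundary.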

Next Steps

•Learn how to use StructuredLLM for direct schema-constrained generation in newer LlamaIndex flows.
•Add retry logic with prompt repair when validation fails on semi-structured documents.
•Combine structured parsing with tool calling so extracted fields can trigger downstream actions like ticket creation or payment review.
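
The retry idea from the list above can be sketched without any LlamaIndex-specific API. The helper extract_with_retry, its parameters, and the repair wording are all hypothetical; it wraps any callable that takes text and either returns a parsed object or raises ValueError (Pydantic's ValidationError is a ValueError subclass). A fake extractor stands in for the program so the example runs offline.

```python
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def extract_with_retry(extract: Callable[[str], T], text: str, max_attempts: int = 3) -> T:
    """Hypothetical helper: retry extraction, feeding the last error back as a repair hint."""
    last_error: Optional[Exception] = None
    attempt_text = text
    for _ in range(max_attempts):
        try:
            return extract(attempt_text)
        except ValueError as exc:
            last_error = exc
            # Prompt repair: append the failure so the next attempt can correct it.
            attempt_text = (
                f"{text}\n\nThe previous attempt failed validation with:\n{exc}\n"
                "Return corrected values."
            )
    raise RuntimeError(f"extraction failed after {max_attempts} attempts") from last_error


# Fake extractor that fails once, then succeeds (stands in for an LLM-backed program).
calls: List[str] = []


def flaky_extract(text: str) -> dict:
    calls.append(text)
    if len(calls) < 2:
        raise ValueError("total_amount: input should be a valid number")
    return {"vendor": "Acme Supplies", "total_amount": 249.5}


outcome = extract_with_retry(flaky_extract, "Invoice from Acme Supplies")
print(outcome)
print(len(calls), "attempts")
```

With the program from this tutorial, the callable could be something like lambda t: program(input_str=t). Appending the error to the input text is only one possible repair strategy, and a cap on attempts keeps cost bounded.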


By Cyprian Aarons, AI Consultant at Topiax.
