How to Build a Document Extraction Agent Using LlamaIndex in TypeScript for Fintech
A document extraction agent in fintech reads PDFs, scans, statements, invoices, KYC packs, and policy documents, then turns them into structured data your systems can trust. The point is not just OCR; it is extracting fields with traceability so underwriting, onboarding, reconciliation, and compliance workflows can run without a human retyping everything.
Architecture
- Document ingestion layer
  - Pulls files from S3, Azure Blob, GCS, or an internal upload service.
  - Normalizes PDFs, images, and text-heavy docs into a format the pipeline can process.
- Text extraction layer
  - Uses LlamaIndex readers/loaders to turn documents into Document objects.
  - Keeps metadata like documentId, tenantId, sourceUri, and jurisdiction.
- Chunking and indexing layer
  - Splits long documents into smaller nodes for retrieval and extraction.
  - Builds a vector index when you need cross-page lookup or field validation.
- Extraction agent layer
  - Uses an LLM-backed query engine or agent to extract specific fields.
  - Returns structured JSON for downstream systems.
- Validation and guardrail layer
  - Checks required fields, formats, confidence thresholds, and policy rules.
  - Blocks unsafe outputs before they hit core banking or compliance systems.
- Audit trail layer
  - Stores source document references, extracted spans, model version, and timestamps.
  - Gives compliance teams evidence for every field produced by the agent.
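These layers map cleanly onto a few TypeScript contracts before any LlamaIndex code enters the picture. A minimal sketch (all interface names here are illustrative, not LlamaIndex APIs):

// Illustrative pipeline contracts; none of these names come from LlamaIndex.
interface DocMetadata {
  documentId: string;
  tenantId: string;
  sourceUri: string;
  jurisdiction: string;
  docType: string;
}

interface ExtractionResult {
  fields: Record<string, string>;
  sourceChunkIds: string[]; // traceability back to retrieved chunks
  modelVersion: string;
}

interface ExtractionPipeline {
  ingest(raw: Buffer, meta: DocMetadata): Promise<void>;
  extract(documentId: string): Promise<ExtractionResult>;
  validate(result: ExtractionResult): ExtractionResult; // throws on policy violations
  audit(result: ExtractionResult): Promise<void>;
}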
Implementation
1) Install the packages and set up environment variables
For TypeScript, use the LlamaIndex TS packages plus a model provider. This example uses OpenAI because the TypeScript API is straightforward and production-proven.
npm install llamaindex zod dotenv
npm install @llamaindex/openai
Set your secrets:
OPENAI_API_KEY=your_key_here
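Since zod is installed anyway, you can fail fast at startup when a secret is missing instead of failing mid-extraction. A small sketch:

import "dotenv/config";
import { z } from "zod";

// Validate required environment variables once, at process start.
const Env = z.object({ OPENAI_API_KEY: z.string().min(1) });
export const env = Env.parse(process.env);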
2) Load a document with metadata
You want every document tied to a tenant and a source URI. That metadata matters later when you write audit logs or enforce residency rules.
import "dotenv/config";
import { Document } from "llamaindex";
const doc = new Document({
text: `
Customer Name: Acme Trading Ltd
Account Number: 12345678
IBAN: GB29NWBK60161331926819
Invoice Total: USD 42,500.00
Due Date: 2026-02-15
Tax ID: TIN-99887766
`,
metadata: {
documentId: "inv_2026_00091",
tenantId: "fintech-eu",
sourceUri: "s3://fintech-docs/eu/invoices/inv_2026_00091.pdf",
jurisdiction: "EU",
docType: "invoice",
},
});
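In a real pipeline the text comes from a PDF reader rather than an inline string, and ingestion is also the right moment to stamp a content hash for the audit trail. A sketch using Node's built-in crypto (the helper name is mine, not a LlamaIndex API):

import { createHash } from "node:crypto";
import { Document } from "llamaindex";

// Hypothetical ingestion helper: hashes the raw bytes so every downstream
// record can be tied back to the exact document that was processed.
function ingestDocument(rawBytes: Buffer, text: string, meta: Record<string, string>) {
  const contentHash = createHash("sha256").update(rawBytes).digest("hex");
  return new Document({ text, metadata: { ...meta, contentHash } });
}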
3) Build an index and query it for structured extraction
For extraction workloads, a vector index gives you retrieval over relevant chunks before you ask the model to produce structured output. The important part is to constrain the output shape with a schema so downstream code does not parse free-form text.
import { Settings, VectorStoreIndex } from "llamaindex";
import { OpenAI } from "@llamaindex/openai";
import { z } from "zod";

const ExtractionSchema = z.object({
  customerName: z.string(),
  accountNumber: z.string(),
  iban: z.string(),
  invoiceTotal: z.string(),
  dueDate: z.string(),
  taxId: z.string(),
});

async function run() {
  // Register the model globally; LlamaIndex.TS reads the LLM from Settings.
  Settings.llm = new OpenAI({
    model: "gpt-4o-mini",
    apiKey: process.env.OPENAI_API_KEY,
    temperature: 0,
  });

  const index = await VectorStoreIndex.fromDocuments([doc]);
  const queryEngine = index.asQueryEngine({ similarityTopK: 3 });

  const response = await queryEngine.query({
    query:
      "Extract customerName, accountNumber, iban, invoiceTotal, dueDate, and taxId " +
      "as a JSON object. Extract only fields present in the document. Do not guess. " +
      "If a field is missing, return an empty string.",
  });

  // This assumes the model returned bare JSON; step 4 shows how to enforce that.
  const parsed = ExtractionSchema.safeParse(JSON.parse(String(response)));
  console.log(parsed.success ? parsed.data : parsed.error.issues);
}

run().catch(console.error);
That gets you retrieval plus generation. In production I would not stop at plain text output; I would validate the result against a schema before storing it.
4) Validate output and reject bad payloads
The extraction step should feed into deterministic validation. Fintech systems need strict checks for missing values, invalid account formats, currency parsing errors, and jurisdiction-specific rules.
type ExtractedInvoice = {
  customerName: string;
  accountNumber: string;
  iban: string;
};

// Manual validation: reject any payload missing a required string field.
function validateInvoice(payload: unknown): ExtractedInvoice {
  const p = payload as Record<string, unknown>;
  for (const field of ["customerName", "accountNumber", "iban"] as const) {
    if (typeof p?.[field] !== "string" || p[field] === "") {
      throw new Error(`Missing or invalid field: ${field}`);
    }
  }
  return p as ExtractedInvoice;
}
A better pattern is to parse the model output after enforcing JSON mode at the prompt level or through your wrapper. Then validate with Zod:
import { z } from "zod";

const InvoiceSchema = z.object({
  customerName: z.string().min(1),
  accountNumber: z.string().min(1),
  iban: z.string().regex(/^[A-Z]{2}[0-9A-Z]{13,32}$/),
});

// parse() throws a ZodError describing every violated constraint.
function validatePayload(payload: unknown) {
  return InvoiceSchema.parse(payload);
}
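Enforcing JSON mode with the OpenAI wrapper usually comes down to passing the provider's response_format option through. In recent LlamaIndex.TS versions that is additionalChatOptions, but verify the option name against your installed version:

import { OpenAI } from "@llamaindex/openai";

// Forward response_format to the OpenAI chat completions API so the model
// must emit a bare JSON object rather than prose-wrapped JSON.
const jsonLlm = new OpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
  additionalChatOptions: { response_format: { type: "json_object" } },
});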
In practice you should also persist:
- raw document hash
- extracted JSON
- source spans or chunk IDs
- model name/version
- validation status
That gives you an audit trail regulators can inspect.
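A minimal shape for that record might look like this (field names are illustrative, not a standard):

// Illustrative audit record; adapt names to your own storage schema.
interface AuditRecord {
  documentHash: string;      // sha256 of the raw document bytes
  extractedJson: string;     // the validated payload, serialized
  sourceChunkIds: string[];  // chunk or span IDs the answer was grounded in
  modelVersion: string;
  promptVersion: string;
  validationStatus: "passed" | "failed";
  createdAt: string;         // ISO-8601 timestamp
}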
Production Considerations
- Deploy close to your data boundary
  Use regional infrastructure that matches your residency requirements. If EU customer docs must stay in-region, keep ingestion, indexing storage, logs, and model calls inside that boundary.
- Log every extraction decision
  Store document hash, tenant ID, prompt version, response payload, validation errors, and operator overrides. For AML/KYC or loan onboarding disputes, you need reproducible evidence of what the agent saw and returned.
- Add hard guardrails before persistence
  Reject outputs that fail schema checks or contain hallucinated fields not present in the source document. For finance docs I also block any value that was not directly supported by retrieved text spans (see the sketch after this list).
- Monitor extraction quality by document type
  Track precision/recall on invoices separately from bank statements or KYC forms. A single aggregate accuracy number hides failure modes like misread IBANs or swapped totals.
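The grounding check behind that guardrail can start as a blunt normalized substring test. A sketch (a hypothetical helper; exact-match grounding will miss reformatted values such as currency amounts, so treat it as a baseline):

// Reject any extracted value that does not literally appear in the source text.
function assertGrounded(fields: Record<string, string>, sourceText: string) {
  // Normalize whitespace and case so trivial formatting differences pass.
  const haystack = sourceText.replace(/\s+/g, " ").toLowerCase();
  for (const [name, value] of Object.entries(fields)) {
    if (value === "") continue; // empty string means "not present", which is allowed
    const needle = value.replace(/\s+/g, " ").toLowerCase();
    if (!haystack.includes(needle)) {
      throw new Error(`Field ${name} ("${value}") is not supported by the source text`);
    }
  }
}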
Common Pitfalls
- Treating OCR as extraction
  - OCR gives you text; it does not give you trustworthy fields.
  - Fix it by combining document parsing with schema validation and source traceability.
- Letting the model infer missing values
  - In fintech this becomes silent corruption fast.
  - Fix it with prompts that say “do not guess” and validators that reject non-empty defaults for missing fields.
- Skipping metadata
  - Without tenant ID, jurisdiction, source URI, and document hash you cannot audit anything properly.
  - Fix it by attaching metadata at ingestion time and carrying it through every stage of the pipeline.
- Using one prompt for all documents
  - Invoices, bank statements, insurance claims forms, and KYC packs have different field sets.
  - Fix it by maintaining per-document-type schemas and prompts so each extractor has one job (see the sketch after this list).
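A per-document-type registry can be as simple as a map from docType to schema and prompt. A sketch (the registry shape and field choices are illustrative):

import { z } from "zod";

// Illustrative registry: one schema and one prompt per document type,
// so each extractor has exactly one job.
const extractors = {
  invoice: {
    schema: z.object({
      customerName: z.string(),
      invoiceTotal: z.string(),
      dueDate: z.string(),
    }),
    prompt: "Extract customerName, invoiceTotal, and dueDate as JSON. Do not guess.",
  },
  bank_statement: {
    schema: z.object({
      accountNumber: z.string(),
      closingBalance: z.string(),
    }),
    prompt: "Extract accountNumber and closingBalance as JSON. Do not guess.",
  },
} as const;

type DocType = keyof typeof extractors;

function getExtractor(docType: DocType) {
  return extractors[docType];
}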
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.