How to Build a Document Extraction Agent Using LangChain in TypeScript for Lending
A document extraction agent for lending reads borrower documents, identifies the fields your underwriting workflow needs, and returns structured data you can trust downstream. In practice, that means pulling values from bank statements, pay stubs, tax returns, IDs, and business financials so your LOS can make faster decisions without a human rekeying everything.
Architecture
- Document intake layer
  - Accept PDFs, images, and scanned files from the lending portal or S3 bucket.
  - Normalize file metadata: borrower ID, application ID, document type, jurisdiction.
- Loader and text extraction
  - Use LangChain document loaders for PDFs and OCR-backed pipelines for scans.
  - Preserve page numbers and source references for auditability.
- Extraction chain
  - Send cleaned text to an LLM with a strict schema.
  - Extract only lending-required fields like employer name, income, account balances, SSN last4, and statement dates.
- Validation layer
  - Enforce types and business rules with Zod.
  - Reject incomplete or low-confidence outputs before they hit underwriting.
- Audit and trace store
  - Store raw input hashes, extracted JSON, model version, prompt version, and page citations.
  - Keep this separate from the operational database.
- Workflow integration
  - Push validated output into the LOS or decision engine.
  - Route exceptions to manual review when confidence is low or compliance checks fail.
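The intake layer's metadata normalization can be modeled explicitly before anything touches a loader. A minimal sketch, assuming a simple key-value payload from the portal; the field names and cleanup rules here are illustrative, not a standard:

```typescript
// Hypothetical intake metadata shape -- field names are illustrative.
interface IntakeMetadata {
  borrowerId: string;
  applicationId: string;
  documentType: string;
  jurisdiction: string;
}

// Normalize raw portal/S3 metadata into a consistent shape before loading.
function normalizeIntake(raw: Record<string, string | undefined>): IntakeMetadata {
  const required = ["borrowerId", "applicationId", "documentType", "jurisdiction"] as const;
  for (const key of required) {
    if (!raw[key] || raw[key]!.trim() === "") {
      throw new Error(`Missing intake field: ${key}`);
    }
  }
  return {
    borrowerId: raw.borrowerId!.trim(),
    applicationId: raw.applicationId!.trim(),
    documentType: raw.documentType!.trim().toLowerCase(), // canonical lowercase types
    jurisdiction: raw.jurisdiction!.trim().toUpperCase(), // canonical uppercase codes
  };
}
```

Rejecting incomplete metadata at intake keeps bad records out of the extraction chain entirely, which is cheaper than catching them at validation.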
Implementation
1) Install dependencies and define the schema
For lending workflows, schema design matters more than prompt cleverness. If you do not constrain output up front, you will spend time cleaning inconsistent JSON later.
npm install langchain @langchain/openai @langchain/core zod pdf-parse
import { z } from "zod";

export const LendingDocumentSchema = z.object({
  documentType: z.enum(["bank_statement", "pay_stub", "tax_return", "id_document", "other"]),
  borrowerName: z.string().optional(),
  employerName: z.string().optional(),
  statementDate: z.string().optional(),
  monthlyIncome: z.number().optional(),
  endingBalance: z.number().optional(),
  accountNumberLast4: z.string().optional(),
  taxYear: z.number().optional(),
  confidence: z.number().min(0).max(1),
  citations: z.array(
    z.object({
      page: z.number(),
      quote: z.string()
    })
  )
});

export type LendingDocument = z.infer<typeof LendingDocumentSchema>;
2) Load the document and prepare text with source metadata
For production lending systems, keep page-level provenance. Underwriters need to know where each field came from when they review exceptions or respond to auditors.
import fs from "fs";
import pdfParse from "pdf-parse";
import { Document } from "@langchain/core/documents";

export async function loadPdfAsDocuments(filePath: string): Promise<Document[]> {
  const buffer = fs.readFileSync(filePath);
  const parsed = await pdfParse(buffer);
  // Note: pdf-parse returns the whole file as one text blob; only the
  // page count survives here, not per-page boundaries.
  return [
    new Document({
      pageContent: parsed.text,
      metadata: {
        source: filePath,
        documentTypeHint: "bank_statement",
        pages: parsed.numpages
      }
    })
  ];
}
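The loader above flattens every page into one string, which weakens the page-level provenance this step argues for. If your pipeline can produce per-page text (for example from an OCR step, or pdf-parse's per-page render hook), a sketch of keeping page numbers attached follows; the `PageDocument` shape is illustrative, and in practice you would use LangChain's `Document` with the same metadata:

```typescript
// Illustrative per-page document shape; swap in LangChain's Document in practice.
interface PageDocument {
  pageContent: string;
  metadata: { source: string; page: number };
}

// Wrap each page's text in its own document so citations can point at a page.
function toPageDocuments(pages: string[], source: string): PageDocument[] {
  return pages
    .map((text, i) => ({
      pageContent: text.trim(),
      metadata: { source, page: i + 1 } // 1-indexed page numbers for citations
    }))
    .filter((d) => d.pageContent.length > 0); // drop blank pages
}
```

With one document per page, the `citations` array in the schema can carry real page numbers instead of a guess.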
3) Build the extraction chain with LangChain
Use ChatOpenAI plus StructuredOutputParser so the model is instructed to emit your schema and its output is parsed against it. This pattern holds up in regulated workflows because it is deterministic enough to validate and easy to audit.
import { ChatOpenAI } from "@langchain/openai";
import { StructuredOutputParser } from "@langchain/core/output_parsers";
import { PromptTemplate } from "@langchain/core/prompts";
import { RunnableSequence } from "@langchain/core/runnables";
import { LendingDocumentSchema } from "./schema";
import { loadPdfAsDocuments } from "./loader";
const parser = StructuredOutputParser.fromZodSchema(LendingDocumentSchema);
const prompt = PromptTemplate.fromTemplate(`
You are a lending document extraction agent.
Extract fields only if they are explicitly supported by the document text.
If a field is missing, omit it.
Return confidence based on evidence quality.
Document text:
{document_text}
{format_instructions}
`);
const llm = new ChatOpenAI({
  modelName: "gpt-4o-mini",
  temperature: 0
});

const chain = RunnableSequence.from([
  async (input: { filePath: string }) => {
    const docs = await loadPdfAsDocuments(input.filePath);
    return {
      document_text: docs.map((d) => d.pageContent).join("\n\n")
    };
  },
  prompt.partial({ format_instructions: parser.getFormatInstructions() }),
  llm,
  parser
]);
async function run() {
  const result = await chain.invoke({ filePath: "./samples/bank-statement.pdf" });
  console.log(result);
}

run().catch(console.error);
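One failure mode worth planning for: chain.invoke throws when the model returns output the parser rejects. Rather than failing the loan file on the first malformed response, a thin retry wrapper gives transient failures another attempt before routing to manual review. This is a generic sketch with no LangChain dependency; the helper name is mine:

```typescript
// Retry an async operation a fixed number of times before giving up.
// Useful around chain.invoke, which throws on unparseable model output.
async function withRetry<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err; // keep the last failure for the caller
    }
  }
  throw lastError;
}
```

Usage would look like `const result = await withRetry(() => chain.invoke({ filePath }));`, with the final thrown error routed to your exception queue.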
4) Add validation and exception routing
Do not let raw model output flow directly into underwriting. Validate again after parsing, then route uncertain cases to manual review with the original source attached.
import { LendingDocumentSchema } from "./schema";
export function validateExtraction(output: unknown) {
  const parsed = LendingDocumentSchema.safeParse(output);
  if (!parsed.success) {
    return {
      status: "manual_review",
      errors: parsed.error.flatten()
    };
  }
  if (parsed.data.confidence < 0.8) {
    return {
      status: "manual_review",
      reason: "low_confidence",
      data: parsed.data
    };
  }
  return {
    status: "approved",
    data: parsed.data
  };
}
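Schema validation catches shape errors, but not business rules: a response can be valid JSON with a plausible type for every field and still be unusable for underwriting. A sketch of the rule layer that runs after safeParse; the specific rules and thresholds here are illustrative examples, not credit policy:

```typescript
// Illustrative business-rule checks applied after schema validation.
// The input mirrors a subset of LendingDocument; rules are examples only.
interface RuleInput {
  documentType: string;
  accountNumberLast4?: string;
  monthlyIncome?: number;
  taxYear?: number;
}

function businessRuleErrors(doc: RuleInput): string[] {
  const errors: string[] = [];
  // Last-4 must be exactly four digits if present.
  if (doc.accountNumberLast4 !== undefined && !/^\d{4}$/.test(doc.accountNumberLast4)) {
    errors.push("accountNumberLast4 must be exactly 4 digits");
  }
  // Income must be positive and plausible; the upper bound is illustrative.
  if (doc.monthlyIncome !== undefined && (doc.monthlyIncome <= 0 || doc.monthlyIncome > 1_000_000)) {
    errors.push("monthlyIncome outside plausible range");
  }
  // Cross-field rule: tax returns should always carry a tax year.
  if (doc.documentType === "tax_return" && doc.taxYear === undefined) {
    errors.push("tax_return missing taxYear");
  }
  return errors;
}
```

Any non-empty error list joins the same manual-review path as low-confidence output, with the rule names logged for the reviewer.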
Production Considerations
- Compliance logging
  - Store prompt version, model version, input hash, output hash, timestamp, and reviewer override history.
  - Keep immutable logs for audit requests tied to loan file IDs.
- Data residency
  - Keep borrower documents in-region if your bank or lender has residency constraints.
  - If you use hosted LLM APIs, confirm where processing occurs and whether retention is disabled.
- Guardrails
  - Block extraction of unsupported fields like race, religion, health data, or other prohibited attributes.
  - Restrict prompts to lending-relevant data only.
  - Add schema validation plus allowlists for field names and document types.
- Monitoring
  - Track parse failure rate, manual-review rate, field-level accuracy by doc type, and latency per page.
  - Alert when confidence drops after a model upgrade or prompt change.
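The compliance-logging points can be made concrete with content hashes from Node's built-in crypto module. A minimal sketch; the record shape and field names are illustrative, not a regulatory standard:

```typescript
import { createHash } from "node:crypto";

// Illustrative audit record; field names are examples, not a standard.
interface AuditRecord {
  loanFileId: string;
  inputHash: string;   // SHA-256 of the raw document bytes
  outputHash: string;  // SHA-256 of the extracted JSON
  modelVersion: string;
  promptVersion: string;
  timestamp: string;
}

function sha256(data: string | Buffer): string {
  return createHash("sha256").update(data).digest("hex");
}

function buildAuditRecord(
  loanFileId: string,
  rawInput: Buffer,
  extractedJson: unknown,
  modelVersion: string,
  promptVersion: string
): AuditRecord {
  return {
    loanFileId,
    inputHash: sha256(rawInput),
    outputHash: sha256(JSON.stringify(extractedJson)),
    modelVersion,
    promptVersion,
    timestamp: new Date().toISOString()
  };
}
```

Because the hashes are deterministic, an auditor can later verify that the stored extraction really came from the stored document without keeping the model output and the raw file in the same system.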
Common Pitfalls
- Treating OCR text as ground truth
  - Scanned bank statements often contain broken lines and merged columns.
  - Fix this by preserving page references and sending low-quality scans to OCR cleanup before extraction.
- Letting the model infer missing values
  - In lending, inferred income or guessed balances create compliance risk.
  - Force the agent to omit unknown fields instead of filling them in.
- Skipping post-extraction validation
  - A valid-looking JSON object can still violate business rules.
  - Validate ranges, date formats, account number length, and cross-field consistency before writing to your LOS.
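Several of the guardrails and pitfalls above reduce to one mechanical step: drop any extracted key that is not on an approved list before the object goes anywhere downstream. A minimal sketch; the allowlist contents mirror the schema in this article but are otherwise illustrative:

```typescript
// Allowlist of lending-relevant field names; contents are illustrative.
const ALLOWED_FIELDS = new Set([
  "documentType", "borrowerName", "employerName", "statementDate",
  "monthlyIncome", "endingBalance", "accountNumberLast4", "taxYear",
  "confidence", "citations"
]);

// Strip any key the extraction layer is not approved to emit,
// including prohibited attributes the model should never return.
function applyFieldAllowlist(output: Record<string, unknown>): Record<string, unknown> {
  const filtered: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(output)) {
    if (ALLOWED_FIELDS.has(key)) {
      filtered[key] = value;
    }
  }
  return filtered;
}
```

Running this before schema validation means a prohibited attribute the model volunteered never reaches the validator, the audit log, or the LOS.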
A lending extraction agent is not just a parser. It is a controlled decision support component that has to be accurate enough for credit policy, traceable enough for audit teams, and constrained enough to avoid compliance mistakes. Build it with strict schemas, source citations, and manual review paths from day one.
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.
Want the complete 8-step roadmap?
Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
Get the Starter Kit