How to Build a Document Extraction Agent Using LangGraph in TypeScript for Investment Banking

By Cyprian Aarons · Updated 2026-04-21
document-extraction · langgraph · typescript · investment-banking

A document extraction agent for investment banking takes deal documents such as lender decks, CIMs, credit agreements, term sheets, and KYC packs and turns them into structured fields your downstream systems can trust. It matters because bankers and ops teams spend significant time re-keying the same data, and the cost of a bad extraction is not just inefficiency: it can break compliance checks, misstate deal terms, or send the wrong numbers into a model.

Architecture

  • Document intake layer

    • Accepts PDFs, DOCX exports, scans, and email attachments.
    • Stores the raw file in an immutable location for audit and replay.
  • Text extraction layer

    • Uses OCR for scanned docs and parser-based extraction for digital PDFs.
    • Normalizes page order, headers, footers, and tables before LLM processing (a normalization sketch follows this list).
  • Extraction state model

    • Holds document metadata, extracted text chunks, candidate fields, validation errors, and final structured output.
    • This is what LangGraph passes between nodes.
  • LLM extraction node

    • Pulls specific fields like borrower name, facility amount, maturity date, covenants, governing law, and fees.
    • Returns structured JSON that matches a strict schema.
  • Validation and reconciliation node

    • Checks date formats, currency consistency, numeric ranges, and required fields.
    • Flags conflicts between documents, such as the term sheet vs. the credit agreement.
  • Audit sink

    • Persists prompts, model outputs, validation results, and source page references.
    • Required for reviewability in regulated environments.
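
The intake and text extraction layers live upstream of the graph, but their output shapes what every later node sees. A minimal normalization sketch, assuming your OCR or PDF parser already gives you per-page text (the ExtractedPage shape and normalizePages helper are illustrative, not a specific library's API):

// Shape your OCR or PDF parser produces per page (illustrative).
type ExtractedPage = { pageNumber: number; text: string };

// Sort pages, strip repeated headers/footers, and keep page markers
// so extracted fields can cite their source pages later.
function normalizePages(
  pages: ExtractedPage[],
  headerFooterPatterns: RegExp[]
): string {
  return pages
    .slice()
    .sort((a, b) => a.pageNumber - b.pageNumber)
    .map((page) => {
      const cleaned = page.text
        .split("\n")
        .filter((line) => !headerFooterPatterns.some((p) => p.test(line)))
        .join("\n");
      return `[Page ${page.pageNumber}]\n${cleaned}`;
    })
    .join("\n\n");
}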

Implementation

1) Define the state and output schema

Use a typed state so every node knows exactly what it receives and returns. In banking workflows, loose any types become expensive very quickly.

import { z } from "zod";
import { StateGraph, START, END } from "@langchain/langgraph";

const ExtractedFieldsSchema = z.object({
  borrowerName: z.string().optional(),
  facilityAmount: z.string().optional(),
  currency: z.string().optional(),
  maturityDate: z.string().optional(),
  governingLaw: z.string().optional(),
  sourcePages: z.array(z.number()).default([]),
});

const StateSchema = z.object({
  fileName: z.string(),
  rawText: z.string(),
  extracted: ExtractedFieldsSchema.optional(),
  validationErrors: z.array(z.string()).optional(),
});

// Derive the TypeScript type from the schema so the two never drift apart.
type ExtractionState = z.infer<typeof StateSchema>;

2) Add an extraction node with a real LangGraph pattern

For production extraction you want structured output. In TypeScript with LangChain models inside LangGraph nodes, that usually means calling withStructuredOutput() on your chat model and keeping the prompt narrow.

import { ChatOpenAI } from "@langchain/openai";

const model = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
});

const extractNode = async (state: ExtractionState): Promise<Partial<ExtractionState>> => {
  const extractor = model.withStructuredOutput(ExtractedFieldsSchema);

  const result = await extractor.invoke([
    {
      role: "system",
      content:
        "Extract key terms from investment banking documents. Return only fields supported by the text. If a field is missing, omit it.",
    },
    {
      role: "user",
      content: `File: ${state.fileName}\n\nDocument text:\n${state.rawText}`,
    },
  ]);

  return { extracted: result };
};
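
Full CIMs and credit agreements can exceed what you want to send in a single call. The state model above mentions text chunks; a hedged sketch of per-chunk extraction, assuming the [Page N] markers from the intake layer and a simple first-non-empty merge rule:

// Illustrative: split the normalized text on page markers, extract per
// chunk, and keep the first non-empty value seen for each field.
const extractLongDocument = async (
  state: ExtractionState
): Promise<Partial<ExtractionState>> => {
  const extractor = model.withStructuredOutput(ExtractedFieldsSchema);
  const chunks = state.rawText.split(/\n(?=\[Page \d+\])/);
  const merged: Record<string, unknown> = { sourcePages: [] };

  for (const chunk of chunks) {
    const partial = await extractor.invoke([
      {
        role: "system",
        content: "Extract key terms. Omit fields the text does not support.",
      },
      { role: "user", content: chunk },
    ]);

    for (const [key, value] of Object.entries(partial)) {
      if (key === "sourcePages") {
        merged.sourcePages = [
          ...(merged.sourcePages as number[]),
          ...(value as number[]),
        ];
      } else if (value && !merged[key]) {
        merged[key] = value;
      }
    }
  }

  return { extracted: ExtractedFieldsSchema.parse(merged) };
};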

3) Validate the extracted fields before you accept them

This is where most toy demos stop. In banking you need deterministic checks for things like malformed dates or missing mandatory terms.

const validateNode = async (
  state: ExtractionState
): Promise<Partial<ExtractionState>> => {
  const errors: string[] = [];
  const extracted = state.extracted;

  if (!extracted) {
    errors.push("No structured extraction returned.");
    return { validationErrors: errors };
  }

  if (!extracted.borrowerName) errors.push("Missing borrowerName.");

  if (extracted.facilityAmount && !/^[A-Z]{3}\s?[\d,.]+$/.test(extracted.facilityAmount)) {
    errors.push("facilityAmount must include a valid currency prefix.");
  }

  if (extracted.maturityDate && !/^\d{4}-\d{2}-\d{2}$/.test(extracted.maturityDate)) {
    errors.push("maturityDate must be ISO-8601 formatted as YYYY-MM-DD.");
  }

  return { validationErrors: errors };
};
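
Because these checks are deterministic, you can exercise them without calling the model at all. A quick illustrative check that a non-ISO date gets flagged:

const sample: ExtractionState = {
  fileName: "term-sheet.pdf",
  rawText: "",
  extracted: {
    borrowerName: "Northwind Capital Ltd.",
    facilityAmount: "USD 250,000,000",
    maturityDate: "30 June 2028", // not ISO-8601, should be rejected
    sourcePages: [1],
  },
};

validateNode(sample).then((update) => {
  console.log(update.validationErrors);
  // ["maturityDate must be ISO-8601 formatted as YYYY-MM-DD."]
});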

4) Wire the graph together and run it

This is the actual LangGraph orchestration layer. The graph gives you explicit control over how documents move through extraction and validation.

const graph = new StateGraph(StateSchema)
  .addNode("extract", extractNode)
  .addNode("validate", validateNode)
  .addEdge(START, "extract")
  .addEdge("extract", "validate")
  .addEdge("validate", END)
  .compile();

async function main() {
  const result = await graph.invoke({
    fileName: "credit-agreement.pdf",
    rawText:
      "Borrower: Northwind Capital Ltd.\nFacility Amount: USD 250,000,000\nMaturity Date: 2028-06-30\nGoverning Law: New York",
  });

  console.log(JSON.stringify(result, null, 2));
}

main();
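
The linear graph above accepts whatever validation produces. If you want failed validations to stop short of downstream systems, a hedged extension using a conditional edge (the humanReview node is a placeholder for your own review queue):

// Placeholder node: in practice, push the run to a review queue or case tool.
const humanReviewNode = async (
  state: ExtractionState
): Promise<Partial<ExtractionState>> => {
  console.warn(`Review needed for ${state.fileName}:`, state.validationErrors);
  return {};
};

const reviewedGraph = new StateGraph(StateSchema)
  .addNode("extract", extractNode)
  .addNode("validate", validateNode)
  .addNode("humanReview", humanReviewNode)
  .addEdge(START, "extract")
  .addEdge("extract", "validate")
  .addConditionalEdges("validate", (state) =>
    state.validationErrors && state.validationErrors.length > 0
      ? "humanReview"
      : END
  )
  .addEdge("humanReview", END)
  .compile();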

Production Considerations

  • Data residency

    • Keep raw documents and extracted outputs in-region if your desk or entity requires it.
    • For cross-border deals, make sure your model provider deployment matches legal review requirements.
  • Auditability

    • Persist source page references alongside every extracted field.
    • Store the prompt version and model version with each run (a sketch of an audit record follows this list).
    • You need to answer “where did this value come from?” during control reviews.
  • Guardrails

    • Reject outputs that fail schema checks or conflict with deterministic rules.
    • Route low-confidence extractions to human review instead of auto-posting to downstream systems.
    • Use allowlisted field sets so the agent cannot invent irrelevant attributes.
  • Monitoring

    • Track field-level accuracy by document type: term sheet vs. credit agreement vs. board memo.
    • Alert on spikes in missing required fields or validation failures.
    • Log latency per node so OCR bottlenecks do not hide behind LLM performance numbers.
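
A minimal sketch of the audit record mentioned above, assuming you hash the raw input with Node's built-in crypto module; where the record is persisted (database, object store, case tool) is up to your audit sink:

import { createHash } from "node:crypto";

// One record per extraction run, persisted to your audit store of choice.
type AuditRecord = {
  runId: string;
  fileName: string;
  rawTextSha256: string;   // hash of the input, not the input itself
  promptVersion: string;   // version prompts the same way you version code
  modelName: string;
  extracted?: z.infer<typeof ExtractedFieldsSchema>;
  validationErrors?: string[];
  sourcePages: number[];
  createdAt: string;
};

function buildAuditRecord(state: ExtractionState, runId: string): AuditRecord {
  return {
    runId,
    fileName: state.fileName,
    rawTextSha256: createHash("sha256").update(state.rawText).digest("hex"),
    promptVersion: "extract-v1", // illustrative version tag
    modelName: "gpt-4o-mini",
    extracted: state.extracted,
    validationErrors: state.validationErrors,
    sourcePages: state.extracted?.sourcePages ?? [],
    createdAt: new Date().toISOString(),
  };
}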

Common Pitfalls

  1. Treating OCR as optional

    • Scanned deal docs are common in legacy workflows.
    • If you only test on clean digital PDFs, your production accuracy will collapse on real inbox traffic.
  2. Letting the model free-form its output

    • Unstructured prose is not acceptable for downstream booking or compliance systems.
    • Always force a schema with withStructuredOutput() or equivalent strict parsing.
  3. Skipping reconciliation across document versions

    • A term sheet may say one thing while the final credit agreement says another.
    • Compare against document hierarchy and version timestamps before promoting values into systems of record (a reconciliation sketch follows this list).
  4. Ignoring audit trails

    • Investment banking teams need traceability on every extracted number.
    • Keep raw input hashes, page citations, prompt versions, and validation results attached to each run.
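
For pitfall 3, a minimal reconciliation sketch that diffs the same fields across two document versions; the precedence rule here (the credit agreement wins) is an assumption to confirm with your deal team:

type Extracted = z.infer<typeof ExtractedFieldsSchema>;

// Compare term sheet vs. credit agreement extractions field by field and
// surface conflicts for review instead of silently overwriting values.
function reconcile(termSheet: Extracted, creditAgreement: Extracted) {
  const fields = [
    "borrowerName",
    "facilityAmount",
    "currency",
    "maturityDate",
    "governingLaw",
  ] as const;

  const conflicts: string[] = [];
  for (const field of fields) {
    const a = termSheet[field];
    const b = creditAgreement[field];
    if (a && b && a !== b) {
      conflicts.push(`${field}: term sheet says "${a}", credit agreement says "${b}"`);
    }
  }

  // Assumption: the executed credit agreement takes precedence.
  return { resolved: { ...termSheet, ...creditAgreement }, conflicts };
}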

Keep learning

By Cyprian Aarons, AI Consultant at Topiax.

Want the complete 8-step roadmap?

Grab the free AI Agent Starter Kit — architecture templates, compliance checklists, and a 7-email deep-dive course.
