How to Build a Document Extraction Agent Using LangChain in TypeScript for Investment Banking
A document extraction agent for investment banking takes unstructured files like pitch books, CIMs, loan agreements, term sheets, and financial statements, then turns them into structured fields a downstream system can trust. That matters because bankers waste hours retyping data, and every manual pass adds risk around compliance, auditability, and version control.
Architecture
- Document ingestion
  - Accept PDFs, DOCX, scanned images, and email attachments from approved storage.
  - Normalize file metadata early: source system, deal ID, user ID, timestamp.
- Text extraction layer
  - Use loaders for digital documents.
  - Add OCR for scanned pages before the LLM sees anything.
- Chunking and field targeting
  - Split long documents into manageable chunks (see the sketch after this list).
  - Route chunks to the right extraction schema based on document type.
- LLM extraction chain
  - Use LangChain to prompt the model for structured JSON.
  - Force output into a typed schema with validation.
- Post-processing and validation
  - Validate dates, currencies, percentages, and entity names.
  - Reject or flag low-confidence outputs for human review.
- Audit and storage
  - Persist raw text, extracted JSON, model version, prompt version, and trace IDs.
  - Keep everything tied to the original source document for audit trails.
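The chunking step is worth a concrete sketch before we get into the build. The version below assumes the @langchain/textsplitters package (older LangChain versions export the same class from langchain/text_splitter); the chunk sizes are illustrative starting points, not tuned values.

```typescript
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

// Split a long document into overlapping chunks before extraction.
export async function chunkDocument(text: string): Promise<string[]> {
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 2000,   // roughly a page of text per chunk
    chunkOverlap: 200, // overlap so fields near chunk boundaries survive the split
  });
  return splitter.splitText(text);
}
```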
Implementation
1) Install the LangChain packages you actually need
For TypeScript, keep the stack simple: a chat model wrapper, a loader or parser layer, and schema validation. For this example I’m using OpenAI through LangChain plus Zod for strict output parsing.
```bash
npm install langchain @langchain/core @langchain/openai zod
```

The code below imports prompt and parser classes from @langchain/core, so install it explicitly rather than relying on it as a transitive dependency.
If you are handling PDFs locally in production, add your PDF loader of choice and OCR pipeline separately. The agent should not depend on one brittle parser path.
2) Define the extraction schema first
Investment banking extraction fails when you let the model free-form the response. Define exact fields up front so you can validate against deal workflows like credit memo intake or CIM normalization.
```typescript
import { z } from "zod";

export const DealSummarySchema = z.object({
  dealName: z.string().describe("Official transaction or company name"),
  sponsor: z.string().nullable().describe("Private equity sponsor if present"),
  targetCompany: z.string().describe("Target or issuer name"),
  currency: z.enum(["USD", "EUR", "GBP", "JPY", "CHF", "OTHER"]),
  enterpriseValue: z.number().nullable(),
  revenueLtm: z.number().nullable(),
  ebitdaLtm: z.number().nullable(),
  closingDate: z.string().nullable().describe("ISO-8601 date if available"),
  confidenceNotes: z.array(z.string()).default([]),
});

export type DealSummary = z.infer<typeof DealSummarySchema>;
```
This is where you encode business rules. If your desk only accepts USD deals for a specific workflow, make that explicit here instead of cleaning it later in SQL.
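As a minimal sketch of that USD-only rule, you can layer a Zod refine on top of the schema above so validation fails fast instead of leaving cleanup to downstream SQL:

```typescript
// Sketch: a USD-only variant of the schema for desks with that constraint.
export const UsdDealSummarySchema = DealSummarySchema.refine(
  (deal) => deal.currency === "USD",
  { message: "This workflow only accepts USD-denominated deals" }
);
```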
3) Build the LangChain extraction chain
Prompt ChatOpenAI for JSON and validate it against the schema before anything downstream sees it. The pattern below is production-friendly because it keeps prompting and validation close together.
```typescript
import { ChatOpenAI } from "@langchain/openai";
import { PromptTemplate } from "@langchain/core/prompts";
import { StringOutputParser } from "@langchain/core/output_parsers";
import { DealSummarySchema } from "./schema";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
});

const prompt = PromptTemplate.fromTemplate(`
You are extracting fields from an investment banking document.
Return only information supported by the text.
If a field is missing, use null.
Respond with a single JSON object containing exactly these keys:
dealName, sponsor, targetCompany, currency, enterpriseValue,
revenueLtm, ebitdaLtm, closingDate, confidenceNotes.
No prose, no markdown fences.

Text:
{documentText}
`);

export async function extractDealSummary(documentText: string) {
  const chain = prompt.pipe(llm).pipe(new StringOutputParser());
  const raw = await chain.invoke({ documentText });
  // Strip markdown fences in case the model wraps the JSON anyway.
  const cleaned = raw.replace(/^```(?:json)?\s*|\s*```$/g, "").trim();
  const parsed = DealSummarySchema.safeParse(JSON.parse(cleaned));
  if (!parsed.success) {
    throw new Error(`Validation failed: ${parsed.error.message}`);
  }
  return parsed.data;
}
```
In practice I prefer llm.withStructuredOutput(schema) when available in your LangChain version because it reduces parsing drift. The important part is not the exact wrapper; it is that you validate against a strict schema before any downstream system sees the output.
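For reference, here is a minimal sketch of that variant, assuming your @langchain/openai version exposes withStructuredOutput; the function name and prompt wording are illustrative:

```typescript
import { ChatOpenAI } from "@langchain/openai";
import { DealSummarySchema, type DealSummary } from "./schema";

// The model output is constrained to the Zod schema, so there is no
// manual JSON.parse step and less parsing drift.
const structuredLlm = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 })
  .withStructuredOutput(DealSummarySchema, { name: "deal_summary" });

export async function extractDealSummaryStructured(
  documentText: string
): Promise<DealSummary> {
  return structuredLlm.invoke(
    "Extract deal fields from this investment banking document. " +
      `Use null for anything not stated explicitly.\n\nText:\n${documentText}`
  );
}
```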
4) Add document loading and chunk-level routing
A real agent does not assume one clean text blob. It loads documents from controlled storage, extracts text per page or section, then routes by document type so a credit agreement does not get treated like a pitch book.
```typescript
import fs from "node:fs/promises";
import path from "node:path";
import { extractDealSummary } from "./extract"; // wherever the chain lives

async function loadDocumentText(filePath: string): Promise<string> {
  // Replace this with a real PDF/DOCX loader in your stack.
  return fs.readFile(filePath, "utf8");
}

export async function runExtraction(filePath: string) {
  const text = await loadDocumentText(filePath);
  const result = await extractDealSummary(text);
  await fs.writeFile(
    path.join(process.cwd(), "extracted.json"),
    JSON.stringify(
      {
        sourceFile: filePath,
        extractedAt: new Date().toISOString(),
        result,
      },
      null,
      2
    )
  );
  return result;
}
```
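Routing itself can start simple. Below is a naive sketch, assuming it lives alongside loadDocumentText and extractDealSummary above; a real system would classify with a model or rely on source-system metadata rather than keyword matching:

```typescript
type DocType = "cim" | "credit_agreement" | "unknown";

// Crude keyword heuristic over the first few pages of text.
function classifyDocument(text: string): DocType {
  const head = text.slice(0, 5000).toLowerCase();
  if (head.includes("confidential information memorandum")) return "cim";
  if (head.includes("credit agreement")) return "credit_agreement";
  return "unknown";
}

export async function routeAndExtract(filePath: string) {
  const text = await loadDocumentText(filePath);
  switch (classifyDocument(text)) {
    case "cim":
      return extractDealSummary(text);
    case "credit_agreement":
      // A credit agreement gets its own schema and chain, not DealSummary.
      throw new Error("Credit agreement extraction not implemented yet");
    default:
      throw new Error("Unknown document type; route to human review");
  }
}
```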
In a bank environment, the write step in runExtraction should go to an immutable audit store or governed object storage bucket, not local disk. Capture the prompt version and model version alongside the result so reviewers can reproduce outputs later.
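As a sketch of what "alongside the result" can mean in practice (field names here are illustrative, not a prescribed schema):

```typescript
import { createHash } from "node:crypto";

interface ExtractionAuditRecord {
  sourceFileHash: string; // ties the output to the exact input bytes
  promptVersion: string;  // e.g. "deal-summary-v3"
  modelName: string;      // e.g. "gpt-4o-mini"
  extractedAt: string;    // ISO-8601 timestamp
  result: unknown;        // the validated DealSummary payload
}

export function buildAuditRecord(
  fileBytes: Buffer,
  result: unknown,
  promptVersion: string,
  modelName: string
): ExtractionAuditRecord {
  return {
    sourceFileHash: createHash("sha256").update(fileBytes).digest("hex"),
    promptVersion,
    modelName,
    extractedAt: new Date().toISOString(),
    result,
  };
}
```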
Production Considerations
- Data residency
  - Keep documents and inference within approved regions.
  - Do not send client material to unapproved endpoints or consumer-grade SaaS tools.
- Auditability
  - Store the source document hash, extracted fields, prompt template version, model name, and timestamp.
  - Make every extraction traceable back to the original file page or section.
- Human review gates
  - Route low-confidence extractions to analysts before they hit CRM or deal systems (see the sketch after this list).
  - Require review for legal terms like change-of-control clauses, covenants, and closing conditions.
- Monitoring
  - Track field-level accuracy by document type.
  - Alert on schema failures, OCR degradation, token spikes, and sudden drops in extraction completeness.
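A review gate can be as simple as a predicate over the validated output. A minimal sketch, assuming confidenceNotes is populated whenever the source text was ambiguous:

```typescript
import type { DealSummary } from "./schema";

// Flag anything with stated uncertainty or missing core financials.
export function needsHumanReview(deal: DealSummary): boolean {
  return (
    deal.confidenceNotes.length > 0 ||
    deal.enterpriseValue === null ||
    deal.ebitdaLtm === null
  );
}
```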
Common Pitfalls
- Letting the model infer missing values
  If EBITDA is not stated explicitly, do not ask the model to estimate it. Force null and flag it for review; bankers need sourced facts, not guesses.
- Skipping OCR quality checks
  Scanned PDFs with bad OCR will produce confident nonsense. Run page-level quality checks and reject documents below your threshold before extraction starts (see the sketch after this list).
- Ignoring compliance metadata
  If you do not store who uploaded the file, where it came from, and which region processed it, you will fail basic audit requirements. Treat metadata as part of the payload, not an afterthought.
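For the OCR pitfall, a crude page-level quality heuristic is enough to start. A sketch; the character class and the 0.85 threshold are illustrative and should be tuned against your own scans:

```typescript
// Score a page by the ratio of "normal" characters to total characters.
// Garbled OCR output tends to be heavy on stray symbols.
function ocrQualityScore(pageText: string): number {
  if (pageText.length === 0) return 0;
  const clean = pageText.match(/[a-zA-Z0-9\s.,;:%$()\-]/g)?.length ?? 0;
  return clean / pageText.length;
}

// Drop pages below the quality threshold before extraction starts.
export function rejectBadPages(pages: string[], threshold = 0.85): string[] {
  return pages.filter((page) => ocrQualityScore(page) >= threshold);
}
```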
Keep learning
- The complete AI Agents Roadmap — my full 8-step breakdown
- Free: The AI Agent Starter Kit — PDF checklist + starter code
- Work with me — I build AI for banks and insurance companies
By Cyprian Aarons, AI Consultant at Topiax.