Haystack Tutorial (TypeScript): rate limiting API calls for intermediate developers

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows you how to put a hard rate limit in front of Haystack API calls in TypeScript, so your app stops hammering upstream services and starts behaving like a production system. You need this when you’re calling LLMs, search APIs, or internal model gateways that enforce quotas, charge per request, or return 429s under load.

What You'll Need

  • Node.js 18+ and npm
  • A TypeScript project with ts-node or a build step already set up
  • Haystack installed in your project:
    • @haystack-ai/core
    • @haystack-ai/agents
  • A rate limiter package:
    • bottleneck
  • An API key for the service your Haystack component will call
  • Basic familiarity with Haystack pipelines and components

Step-by-Step

  1. Start by installing the packages and setting up your environment variables. In real systems, keep the limiter separate from the model client so you can reuse it across multiple components. A fail-fast check for the API key follows the code.
npm install @haystack-ai/core @haystack-ai/agents bottleneck
npm install -D typescript ts-node @types/node
// .env.example
OPENAI_API_KEY=your_key_here
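It helps to fail fast when the key is missing rather than failing on the first upstream call. A minimal illustrative config module (the file name is hypothetical; load .env with dotenv, node --env-file, or your deployment environment, whichever your setup uses):
// config.ts (hypothetical file name)
// Fail fast at startup instead of on the first upstream call.
export const OPENAI_API_KEY = process.env.OPENAI_API_KEY ?? "";

if (!OPENAI_API_KEY) {
  throw new Error("OPENAI_API_KEY is not set; check your .env file");
}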
  2. Create a small rate-limited wrapper around the function that performs the upstream call. The limiter below allows 5 requests per second with a concurrency of 1, which is a sane starting point for most vendor APIs. If your quota is expressed per minute instead, see the reservoir sketch after this step's code.
import Bottleneck from "bottleneck";

export const limiter = new Bottleneck({
  maxConcurrent: 1,
  minTime: 200, // 5 req/sec
});

export async function rateLimited<T>(fn: () => Promise<T>): Promise<T> {
  return limiter.schedule(() => fn());
}
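Many vendor quotas are stated per minute rather than per second. Bottleneck can model that with its reservoir options: the limiter hands out a fixed number of slots and refills them on an interval. A minimal sketch, assuming a hypothetical quota of 100 requests per minute (swap in your provider's real number):
import Bottleneck from "bottleneck";

// 100 requests per minute: start with 100 slots, refill back to 100 every 60s.
export const perMinuteLimiter = new Bottleneck({
  reservoir: 100,
  reservoirRefreshAmount: 100,
  reservoirRefreshInterval: 60 * 1000,
  maxConcurrent: 1,
});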
  3. Build a Haystack component that uses the wrapper before calling the external API. This example uses an OpenAI chat model through Haystack’s TypeScript SDK, but the same pattern works for any HTTP-backed component. A short concurrent-usage sketch follows the code.
import { OpenAIChatGenerator } from "@haystack-ai/agents";
import { rateLimited } from "./rateLimiter";

const generator = new OpenAIChatGenerator({
  apiKey: process.env.OPENAI_API_KEY!,
  model: "gpt-4o-mini",
});

export async function generateAnswer(prompt: string): Promise<string> {
  const result = await rateLimited(() =>
    generator.run({
      messages: [{ role: "user", content: prompt }],
    })
  );

  return result.replies[0]?.content ?? "";
}
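Because every call funnels through the shared limiter, it is safe to fire several generateAnswer calls concurrently: Bottleneck queues them and spaces them out for you. A quick illustrative sketch (the prompts are placeholders):
import { generateAnswer } from "./generateAnswer";

async function demo() {
  // Both calls are queued by the shared limiter, so the provider still sees
  // at most one in-flight request and no more than ~5 requests per second.
  const [first, second] = await Promise.all([
    generateAnswer("What is a vector store?"),
    generateAnswer("What is retrieval-augmented generation?"),
  ]);
  console.log(first, second);
}

demo().catch(console.error);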
  4. If you’re using Haystack pipelines, wrap the component at the boundary where requests enter the external service. That keeps your pipeline code clean and makes it obvious where throttling happens.
import { Pipeline } from "@haystack-ai/core";
import { generateAnswer } from "./generateAnswer";

async function main() {
  const pipeline = new Pipeline();

  const prompts = [
    "Summarize Basel III in one paragraph.",
    "Explain idempotency in payment systems.",
    "List three causes of duplicate claims processing.",
  ];

  for (const prompt of prompts) {
    const answer = await generateAnswer(prompt);
    console.log("\nPROMPT:", prompt);
    console.log("ANSWER:", answer);
  }
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
  5. Add backoff for vendor throttling errors. Rate limiting on your side reduces pressure, but you still need retry logic when the provider returns transient failures like HTTP 429 or 503. A variant that retries only those transient errors follows the code.
function sleep(ms: number) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function withRetry<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  let lastError: unknown;

  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Back off exponentially, but skip the sleep after the final attempt.
      if (i < attempts - 1) {
        await sleep(250 * Math.pow(2, i));
      }
    }
  }

  throw lastError;
}
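The helper above retries every error, which also means retrying permanent failures such as a bad API key. In practice you usually want to retry only throttling and availability errors and surface everything else immediately. The sketch below assumes the thrown error exposes a numeric status field; that is an assumption about your SDK or HTTP client, not part of the Haystack API, so adjust the check to whatever your errors actually carry:
// Illustrative: assumes errors expose a numeric `status` property.
function isTransient(err: unknown): boolean {
  const status = (err as { status?: number })?.status;
  return status === 429 || status === 503;
}

async function withSelectiveRetry<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  let lastError: unknown;

  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Give up immediately on non-transient errors or on the final attempt.
      if (!isTransient(err) || i === attempts - 1) {
        throw err;
      }
      await sleep(250 * Math.pow(2, i)); // reuses sleep() from the snippet above
    }
  }

  throw lastError;
}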
  6. Combine retry and rate limiting in one entry point so every caller gets the same behavior. This is the version you should expose from your service layer, not the raw model client. A note on the wrapping order follows the code.
import { OpenAIChatGenerator } from "@haystack-ai/agents";
import { rateLimited } from "./rateLimiter";

const generator = new OpenAIChatGenerator({
  apiKey: process.env.OPENAI_API_KEY!,
  model: "gpt-4o-mini",
});

function sleep(ms: number) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function withRetry<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  let lastError: unknown;

  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Back off exponentially, but skip the sleep after the final attempt.
      if (i < attempts - 1) {
        await sleep(250 * Math.pow(2, i));
      }
    }
  }

  throw lastError;
}

export async function safeGenerate(prompt: string): Promise<string> {
  const result = await rateLimited(() =>
    withRetry(() =>
      generator.run({
        messages: [{ role: "user", content: prompt }],
      })
    )
  );

  return result.replies[0]?.content ?? "";
}
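One design note on the wrapping order: withRetry runs inside the limiter's slot, so with maxConcurrent: 1 a backoff sleep blocks every queued request behind it. If you would rather have each retry attempt re-queued through the limiter instead, invert the wrappers. A sketch using the same helpers from this file:
export async function safeGenerateRequeued(prompt: string): Promise<string> {
  // Each attempt is scheduled through the limiter separately, so backoff
  // sleeps between attempts do not hold a concurrency slot.
  const result = await withRetry(() =>
    rateLimited(() =>
      generator.run({
        messages: [{ role: "user", content: prompt }],
      })
    )
  );

  return result.replies[0]?.content ?? "";
}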

Testing It

Run several requests back-to-back and watch the timestamps. With minTime: 200, you should see roughly five requests per second instead of bursts hitting the provider all at once.
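One way to eyeball this is to fire a handful of calls at once and log when each one resolves; they should complete one at a time rather than in a single burst. A small sketch, assuming safeGenerate is exported from ./safeGenerate (adjust the path to your layout):
import { safeGenerate } from "./safeGenerate";

async function main() {
  await Promise.all(
    Array.from({ length: 10 }, (_, i) =>
      safeGenerate(`Test prompt ${i}`).then(() =>
        console.log(`request ${i} finished at ${new Date().toISOString()}`)
      )
    )
  );
}

main().catch(console.error);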

Also test failure paths by temporarily lowering your provider quota or sending many concurrent requests from a loop. You want to confirm two things: requests are queued instead of exploding immediately, and retries only happen after actual failures.

If you’re running this behind an API route or worker queue, check logs for latency spikes and rejected calls. The goal is predictable throughput, not maximum raw speed.

Next Steps

  • Add per-user or per-tenant limits instead of one global limiter (see the sketch after this list)
  • Move the limiter into Redis if you run multiple Node instances
  • Add circuit breaking so repeated provider failures stop traffic early
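For the per-tenant idea, Bottleneck ships a Group class that lazily creates one limiter per key, so each tenant gets its own budget instead of sharing the global queue. A minimal sketch (where the tenant ID comes from is up to your app):
import Bottleneck from "bottleneck";

// One limiter per tenant key, created on first use with these defaults.
const tenantLimiters = new Bottleneck.Group({
  maxConcurrent: 1,
  minTime: 200,
});

export async function rateLimitedForTenant<T>(
  tenantId: string,
  fn: () => Promise<T>
): Promise<T> {
  return tenantLimiters.key(tenantId).schedule(() => fn());
}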


By Cyprian Aarons, AI Consultant at Topiax.
