Haystack Tutorial (Python): rate limiting API calls for beginners

By Cyprian Aarons · Updated 2026-04-21

This tutorial shows you how to add rate limiting around Haystack-powered API calls in Python, so your app stops hammering a model or search backend when traffic spikes. You need this when you want predictable costs, fewer 429 errors, and a simple way to keep one user or one worker from exhausting your quota.

What You'll Need

  • Python 3.10+
  • haystack-ai
  • httpx
  • tenacity
  • An API key for the service you call through Haystack
  • Basic familiarity with Haystack Pipeline, Component, and run()

Install the packages:

pip install haystack-ai httpx tenacity

Step-by-Step

  1. Start with a normal Haystack component that makes an outbound API call.
    For this example, we’ll use a small HTTP client component so the rate limiting pattern is easy to see and reuse.
import httpx
from haystack import component

@component
class FetchJSON:
    @component.output_types(data=dict)
    def run(self, url: str):
        response = httpx.get(url, timeout=10)
        response.raise_for_status()
        return {"data": response.json()}
  2. Add a simple rate limiter using a token bucket.
    This keeps calls under a fixed requests-per-second limit without needing any external service.
import time
import threading

class TokenBucketRateLimiter:
    def __init__(self, rate_per_second: float, burst: int = 1):
        self.rate_per_second = rate_per_second
        self.capacity = burst
        self.tokens = burst
        self.updated_at = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self):
        while True:
            with self.lock:
                now = time.monotonic()
                elapsed = now - self.updated_at
                self.updated_at = now
                self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_second)

                if self.tokens >= 1:
                    self.tokens -= 1
                    return

                wait_time = (1 - self.tokens) / self.rate_per_second

            time.sleep(wait_time)
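Before wiring the bucket into Haystack, you can sanity-check it on its own. The standalone sketch below repeats the limiter class so the snippet runs by itself, then times three acquisitions at 2 requests per second; the numbers are illustrative:

```python
import time
import threading

class TokenBucketRateLimiter:
    # Same limiter as above, repeated so this snippet is self-contained.
    def __init__(self, rate_per_second: float, burst: int = 1):
        self.rate_per_second = rate_per_second
        self.capacity = burst
        self.tokens = burst
        self.updated_at = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self):
        while True:
            with self.lock:
                now = time.monotonic()
                elapsed = now - self.updated_at
                self.updated_at = now
                self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_second)
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                wait_time = (1 - self.tokens) / self.rate_per_second
            time.sleep(wait_time)

limiter = TokenBucketRateLimiter(rate_per_second=2.0, burst=1)
start = time.monotonic()
for _ in range(3):
    limiter.acquire()
elapsed = time.monotonic() - start
# The first acquire spends the burst token immediately;
# each of the next two waits roughly 0.5 s.
print(f"3 acquires took {elapsed:.2f}s")
```

With `burst=1`, only the first call is free; expect total wall time close to (calls − 1) / rate.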
  3. Wrap the Haystack component so every call passes through the limiter first.
    This is the part you’ll actually reuse in production: keep the business logic in one component and enforce throttling outside it.
from haystack import component

@component
class RateLimitedFetchJSON:
    def __init__(self, rate_limiter: TokenBucketRateLimiter):
        self.rate_limiter = rate_limiter
        self.fetcher = FetchJSON()

    @component.output_types(data=dict)
    def run(self, url: str):
        self.rate_limiter.acquire()
        return self.fetcher.run(url=url)
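The same "throttle outside the business logic" idea works for any callable, not only Haystack components. Here is a minimal sketch with hypothetical names; the simplified limiter just enforces a minimum gap between calls rather than a full token bucket:

```python
import time
import threading

class MinIntervalLimiter:
    """Simplified stand-in for the token bucket: enforces a minimum gap between calls."""
    def __init__(self, rate_per_second: float):
        self.min_interval = 1.0 / rate_per_second
        self.last_call = 0.0
        self.lock = threading.Lock()

    def acquire(self):
        # Sleeping while holding the lock intentionally serializes callers.
        with self.lock:
            now = time.monotonic()
            wait = self.last_call + self.min_interval - now
            if wait > 0:
                time.sleep(wait)
            self.last_call = time.monotonic()

def throttled(fn, limiter):
    """Wrap any callable so every invocation passes through the limiter first."""
    def wrapper(*args, **kwargs):
        limiter.acquire()
        return fn(*args, **kwargs)
    return wrapper

# The wrapped function keeps its behavior; only the call rate changes.
fetch = throttled(lambda url: {"url": url}, MinIntervalLimiter(rate_per_second=5.0))
result = fetch("https://example.com/data")
print(result)
```

Keeping the wrapper generic like this means you can throttle plain HTTP helpers, SDK clients, or Haystack components with the same limiter instance.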
  4. Put the component into a Haystack pipeline and run it a few times.
    If you set the limit to 1 request per second, repeated calls will pause instead of firing all at once.
from haystack import Pipeline

rate_limiter = TokenBucketRateLimiter(rate_per_second=1.0, burst=1)
limited_fetch = RateLimitedFetchJSON(rate_limiter=rate_limiter)

pipe = Pipeline()
pipe.add_component("fetch", limited_fetch)

for i in range(3):
    result = pipe.run({
        "fetch": {
            "url": "https://jsonplaceholder.typicode.com/todos/1"
        }
    })
    print(i + 1, result["fetch"]["data"]["title"])
  5. Handle real API failures separately from throttling.
    Rate limiting prevents overload, but you still want retries for transient 429s and network errors.
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=8))
def fetch_with_retry(url: str):
    response = httpx.get(url, timeout=10)
    response.raise_for_status()
    return response.json()

@component
class RetryableRateLimitedFetchJSON:
    def __init__(self, rate_limiter: TokenBucketRateLimiter):
        self.rate_limiter = rate_limiter

    @component.output_types(data=dict)
    def run(self, url: str):
        self.rate_limiter.acquire()
        return {"data": fetch_with_retry(url)}

Testing It

Run the pipeline three times in a row and watch the timestamps or total runtime. With rate_per_second=1.0 and burst=1, three requests should take roughly two seconds plus network time: the first call spends the initial burst token immediately, and each of the next two waits about a second.

If you want to verify throttling more clearly, print time.monotonic() before and after each pipe.run() call. You should see the second and third calls wait instead of completing immediately.

Also test an invalid URL or a temporary 429-style endpoint to confirm retries work independently of throttling. In production, those two concerns should stay separate so you can tune them independently.
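To exercise the retry path without a real flaky endpoint, you can simulate transient failures. This sketch reimplements the exponential-backoff idea from the tenacity example in plain stdlib code (names here are hypothetical) so you can assert exactly how many attempts were made:

```python
import time

class TransientError(Exception):
    """Stand-in for a 429 or a transient network error."""
    pass

def retry_with_backoff(fn, attempts=3, base_delay=0.01, max_delay=0.08):
    """Call fn, retrying on TransientError with exponential backoff."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: surface the error
            time.sleep(delay)
            delay = min(delay * 2, max_delay)

calls = {"n": 0}

def flaky_endpoint():
    """Fails twice (like a pair of 429s), then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("simulated 429")
    return {"status": "ok"}

result = retry_with_backoff(flaky_endpoint)
print(result, "after", calls["n"], "attempts")
```

In tests for the tenacity version, the same pattern applies: inject a callable that fails a known number of times and assert on the attempt count.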

Next Steps

  • Add per-user or per-tenant buckets so one customer cannot consume all shared capacity.
  • Move from an in-process token bucket to Redis if you run multiple workers.
  • Combine this with circuit breakers and exponential backoff for external model APIs.
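The per-tenant idea from the first bullet can be sketched with a small registry that lazily creates one bucket per tenant key. Names are illustrative, and the dummy limiter is a stand-in so the snippet runs on its own; in practice you would pass a factory that builds TokenBucketRateLimiter instances:

```python
import threading

class PerTenantLimiters:
    """Lazily creates and caches one limiter per tenant key."""
    def __init__(self, limiter_factory):
        self.limiter_factory = limiter_factory
        self.limiters = {}
        self.lock = threading.Lock()

    def get(self, tenant_id: str):
        with self.lock:
            if tenant_id not in self.limiters:
                self.limiters[tenant_id] = self.limiter_factory()
            return self.limiters[tenant_id]

class DummyLimiter:
    """Stand-in for a real rate limiter, so this snippet is self-contained."""
    def __init__(self):
        self.acquired = 0
    def acquire(self):
        self.acquired += 1

registry = PerTenantLimiters(DummyLimiter)
registry.get("tenant-a").acquire()
registry.get("tenant-a").acquire()
registry.get("tenant-b").acquire()
# Each tenant gets its own bucket, so one tenant cannot drain another's tokens.
print(registry.get("tenant-a").acquired, registry.get("tenant-b").acquired)
```

Each call site then does `registry.get(tenant_id).acquire()` before the outbound request, so a burst from one customer only exhausts that customer's bucket.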


By Cyprian Aarons, AI Consultant at Topiax.
