From Zero to AI

Lesson 5.4: Rate Limiting and Retry Logic

Duration: 60 minutes

Learning Objectives

By the end of this lesson, you will be able to:

  • Understand rate limits and why they exist
  • Implement exponential backoff for retries
  • Handle different types of API errors appropriately
  • Build a robust retry wrapper for AI API calls
  • Track and manage request quotas
  • Create a rate-limited request queue

Introduction

When working with AI APIs in production, you will encounter rate limits. Every API provider restricts how many requests you can make in a given time period. Handling these limits gracefully is essential for building reliable applications. In this lesson, you will learn strategies for dealing with rate limits and implementing retry logic.


Understanding Rate Limits

Rate limits protect API providers from being overwhelmed and ensure fair usage among all users.

Common Rate Limit Types

┌─────────────────────────────────────────────────────────────────┐
│                    Types of Rate Limits                          │
├─────────────────┬───────────────────────────────────────────────┤
│  Requests/min   │  Maximum requests allowed per minute          │
│                 │  Example: 60 requests per minute              │
├─────────────────┼───────────────────────────────────────────────┤
│  Tokens/min     │  Maximum tokens processed per minute          │
│                 │  Example: 90,000 tokens per minute            │
├─────────────────┼───────────────────────────────────────────────┤
│  Requests/day   │  Maximum requests allowed per day             │
│                 │  Example: 10,000 requests per day             │
├─────────────────┼───────────────────────────────────────────────┤
│  Concurrent     │  Maximum simultaneous requests                │
│                 │  Example: 5 concurrent requests               │
└─────────────────┴───────────────────────────────────────────────┘

OpenAI Rate Limits (Examples)

Tier     RPM (Requests)   TPM (Tokens)
Free     3                40,000
Tier 1   500              200,000
Tier 2   5,000            2,000,000
Tier 3   5,000            4,000,000

Anthropic Rate Limits (Examples)

Tier    RPM     Input TPM   Output TPM
Free    5       20,000      4,000
Build   50      40,000      8,000
Scale   1,000   400,000     80,000
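
Checking Rate Limit Headers

Providers also report your current limits and remaining quota in response headers, so you can see how close you are to the ceiling before you hit it. The sketch below is one way to read these headers with the OpenAI SDK; the withResponse() helper and the exact x-ratelimit-* header names are assumptions you should verify against the SDK version and provider documentation you are using.

import OpenAI from 'openai';

const openai = new OpenAI();

// Make a normal request, but keep the raw HTTP response so the
// rate limit headers sent back by the API can be inspected.
async function checkRateLimitHeaders(): Promise<void> {
  const { data, response } = await openai.chat.completions
    .create({
      model: 'gpt-4o-mini',
      messages: [{ role: 'user', content: 'ping' }],
    })
    .withResponse();

  console.log('Remaining requests:', response.headers.get('x-ratelimit-remaining-requests'));
  console.log('Remaining tokens:', response.headers.get('x-ratelimit-remaining-tokens'));
  console.log('Requests reset in:', response.headers.get('x-ratelimit-reset-requests'));
  console.log('Model replied:', data.choices[0].message.content);
}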

Detecting Rate Limit Errors

Both OpenAI and Anthropic return HTTP 429 when you hit rate limits.

OpenAI Rate Limit Detection

import OpenAI from 'openai';

const openai = new OpenAI();

async function makeRequest(message: string): Promise<string | null> {
  try {
    const response = await openai.chat.completions.create({
      model: 'gpt-4o-mini',
      messages: [{ role: 'user', content: message }],
    });
    return response.choices[0].message.content;
  } catch (error) {
    if (error instanceof OpenAI.RateLimitError) {
      console.log('Rate limit hit!');
      console.log('Status:', error.status); // 429
      console.log('Headers:', error.headers);
      return null;
    }
    throw error;
  }
}

Anthropic Rate Limit Detection

import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();

async function makeRequest(message: string): Promise<string | null> {
  try {
    const response = await anthropic.messages.create({
      model: 'claude-sonnet-4-20250514',
      max_tokens: 1024,
      messages: [{ role: 'user', content: message }],
    });
    const block = response.content[0];
    return block.type === 'text' ? block.text : null;
  } catch (error) {
    if (error instanceof Anthropic.RateLimitError) {
      console.log('Rate limit hit!');
      return null;
    }
    if (error instanceof Anthropic.APIError && error.status === 529) {
      console.log('API overloaded!');
      return null;
    }
    throw error;
  }
}
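
Note that both official SDKs already retry some failures (such as 429 and connection errors) with backoff before surfacing an error to your code. The number of attempts is configurable when you construct the client; the sketch below assumes the maxRetries client option exposed by current versions of both SDKs.

import Anthropic from '@anthropic-ai/sdk';
import OpenAI from 'openai';

// Setting maxRetries to 0 disables the built-in retries, which is useful
// when you want full control through wrappers like the ones in this lesson.
const openai = new OpenAI({ maxRetries: 0 });
const anthropic = new Anthropic({ maxRetries: 0 });

Building your own retry layer is still worthwhile when you want jitter, shared rate limiting, or identical behavior across providers, which is what the rest of this lesson covers.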

Basic Retry Logic

The simplest retry approach waits a fixed time between attempts.

async function simpleRetry<T>(
  fn: () => Promise<T>,
  maxRetries: number = 3,
  delayMs: number = 1000
): Promise<T> {
  let lastError: Error | null = null;

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error as Error;

      if (attempt < maxRetries) {
        console.log(`Attempt ${attempt} failed. Retrying in ${delayMs}ms...`);
        await sleep(delayMs);
      }
    }
  }

  throw lastError;
}

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Usage
const result = await simpleRetry(
  () =>
    openai.chat.completions.create({
      model: 'gpt-4o-mini',
      messages: [{ role: 'user', content: 'Hello!' }],
    }),
  3,
  2000
);

Exponential Backoff

A better approach increases the wait time after each failed attempt, giving the API more time to recover.

Attempt 1: Wait 1 second
Attempt 2: Wait 2 seconds
Attempt 3: Wait 4 seconds
Attempt 4: Wait 8 seconds
...
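
For attempt n, the delay is initialDelay * multiplier^(n - 1), capped at a maximum. A minimal sketch of that calculation (the parameter names and defaults here are illustrative):

// Closed-form backoff delay for a 1-based attempt number.
function backoffDelayMs(
  attempt: number,
  initialDelayMs: number = 1000,
  multiplier: number = 2,
  maxDelayMs: number = 60000
): number {
  return Math.min(initialDelayMs * Math.pow(multiplier, attempt - 1), maxDelayMs);
}

console.log(backoffDelayMs(1)); // 1000
console.log(backoffDelayMs(4)); // 8000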

Implementing Exponential Backoff

import Anthropic from '@anthropic-ai/sdk';
import OpenAI from 'openai';

interface RetryOptions {
  maxRetries: number;
  initialDelayMs: number;
  maxDelayMs: number;
  backoffMultiplier: number;
}

const DEFAULT_OPTIONS: RetryOptions = {
  maxRetries: 5,
  initialDelayMs: 1000,
  maxDelayMs: 60000,
  backoffMultiplier: 2,
};

async function exponentialBackoff<T>(
  fn: () => Promise<T>,
  options: Partial<RetryOptions> = {}
): Promise<T> {
  const opts = { ...DEFAULT_OPTIONS, ...options };
  let lastError: Error | null = null;
  let delay = opts.initialDelayMs;

  for (let attempt = 1; attempt <= opts.maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error as Error;

      // Check if error is retryable
      if (!isRetryableError(error)) {
        throw error;
      }

      if (attempt < opts.maxRetries) {
        console.log(`Attempt ${attempt} failed. Retrying in ${delay}ms...`);
        await sleep(delay);

        // Increase delay for next attempt
        delay = Math.min(delay * opts.backoffMultiplier, opts.maxDelayMs);
      }
    }
  }

  throw lastError;
}

function isRetryableError(error: unknown): boolean {
  // OpenAI errors
  if (error instanceof OpenAI.RateLimitError) return true;
  if (error instanceof OpenAI.APIConnectionError) return true;
  if (error instanceof OpenAI.APIError && (error.status ?? 0) >= 500) return true;

  // Anthropic errors
  if (error instanceof Anthropic.RateLimitError) return true;
  if (error instanceof Anthropic.APIConnectionError) return true;
  if (error instanceof Anthropic.APIError) {
    if ((error.status ?? 0) >= 500 || error.status === 529) return true;
  }

  return false;
}

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

Adding Jitter

To prevent many clients from retrying at the same time (thundering herd problem), add random jitter to delays.

function calculateDelayWithJitter(baseDelay: number, jitterFactor: number = 0.5): number {
  // Add random jitter: delay * (1 +/- jitterFactor)
  const jitter = baseDelay * jitterFactor * (Math.random() * 2 - 1);
  return Math.max(0, baseDelay + jitter);
}

async function exponentialBackoffWithJitter<T>(
  fn: () => Promise<T>,
  options: Partial<RetryOptions & { jitterFactor: number }> = {}
): Promise<T> {
  const opts = {
    maxRetries: 5,
    initialDelayMs: 1000,
    maxDelayMs: 60000,
    backoffMultiplier: 2,
    jitterFactor: 0.5,
    ...options,
  };

  let lastError: Error | null = null;
  let baseDelay = opts.initialDelayMs;

  for (let attempt = 1; attempt <= opts.maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error as Error;

      if (!isRetryableError(error)) {
        throw error;
      }

      if (attempt < opts.maxRetries) {
        const delay = calculateDelayWithJitter(baseDelay, opts.jitterFactor);
        console.log(`Attempt ${attempt} failed. Retrying in ${Math.round(delay)}ms...`);
        await sleep(delay);

        baseDelay = Math.min(baseDelay * opts.backoffMultiplier, opts.maxDelayMs);
      }
    }
  }

  throw lastError;
}

Building a Robust Retry Wrapper

Let us create a complete retry system. Create src/retry.ts:

import Anthropic from '@anthropic-ai/sdk';
import OpenAI from 'openai';

// Configuration types
interface RetryConfig {
  maxRetries: number;
  initialDelayMs: number;
  maxDelayMs: number;
  backoffMultiplier: number;
  jitterFactor: number;
  onRetry?: (attempt: number, error: Error, delayMs: number) => void;
}

// Result types
interface RetryResult<T> {
  success: boolean;
  data?: T;
  error?: Error;
  attempts: number;
  totalTimeMs: number;
}

// Default configuration
const DEFAULT_CONFIG: RetryConfig = {
  maxRetries: 5,
  initialDelayMs: 1000,
  maxDelayMs: 60000,
  backoffMultiplier: 2,
  jitterFactor: 0.3,
};

// Main retry class
class RetryHandler {
  private config: RetryConfig;

  constructor(config: Partial<RetryConfig> = {}) {
    this.config = { ...DEFAULT_CONFIG, ...config };
  }

  async execute<T>(fn: () => Promise<T>): Promise<RetryResult<T>> {
    const startTime = Date.now();
    let lastError: Error | null = null;
    let baseDelay = this.config.initialDelayMs;

    for (let attempt = 1; attempt <= this.config.maxRetries; attempt++) {
      try {
        const data = await fn();
        return {
          success: true,
          data,
          attempts: attempt,
          totalTimeMs: Date.now() - startTime,
        };
      } catch (error) {
        lastError = error as Error;

        // Check if we should retry
        if (!this.shouldRetry(error)) {
          return {
            success: false,
            error: lastError,
            attempts: attempt,
            totalTimeMs: Date.now() - startTime,
          };
        }

        // Check if we have retries left
        if (attempt >= this.config.maxRetries) {
          break;
        }

        // Calculate delay with jitter
        const delay = this.calculateDelay(baseDelay);

        // Notify about retry
        if (this.config.onRetry) {
          this.config.onRetry(attempt, lastError, delay);
        }

        await this.sleep(delay);

        // Increase base delay for next attempt
        baseDelay = Math.min(baseDelay * this.config.backoffMultiplier, this.config.maxDelayMs);
      }
    }

    return {
      success: false,
      error: lastError || new Error('Unknown error'),
      attempts: this.config.maxRetries,
      totalTimeMs: Date.now() - startTime,
    };
  }

  private shouldRetry(error: unknown): boolean {
    // OpenAI errors
    if (error instanceof OpenAI.RateLimitError) return true;
    if (error instanceof OpenAI.APIConnectionError) return true;
    if (error instanceof OpenAI.InternalServerError) return true;

    // Anthropic errors
    if (error instanceof Anthropic.RateLimitError) return true;
    if (error instanceof Anthropic.APIConnectionError) return true;
    if (error instanceof Anthropic.APIError) {
      // 529 = overloaded, 5xx = server errors
      return error.status === 529 || (error.status ?? 0) >= 500;
    }

    // Generic network errors
    if (error instanceof Error) {
      const message = error.message.toLowerCase();
      if (
        message.includes('network') ||
        message.includes('timeout') ||
        message.includes('econnreset')
      ) {
        return true;
      }
    }

    return false;
  }

  private calculateDelay(baseDelay: number): number {
    const jitter = baseDelay * this.config.jitterFactor;
    const randomJitter = (Math.random() * 2 - 1) * jitter;
    return Math.max(100, baseDelay + randomJitter);
  }

  private sleep(ms: number): Promise<void> {
    return new Promise((resolve) => setTimeout(resolve, ms));
  }
}

// Convenience function
async function withRetry<T>(fn: () => Promise<T>, config?: Partial<RetryConfig>): Promise<T> {
  const handler = new RetryHandler(config);
  const result = await handler.execute(fn);

  if (result.success) {
    return result.data as T;
  }

  throw result.error;
}

export { RetryHandler, withRetry };
export type { RetryConfig, RetryResult };

Usage:

import { RetryHandler, withRetry } from './retry';

// Simple usage
const response = await withRetry(() =>
  openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [{ role: 'user', content: 'Hello!' }],
  })
);

// With custom configuration
const handler = new RetryHandler({
  maxRetries: 3,
  initialDelayMs: 2000,
  onRetry: (attempt, error, delay) => {
    console.log(`Retry ${attempt}: ${error.message}. Waiting ${delay}ms`);
  },
});

const result = await handler.execute(() =>
  anthropic.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 1024,
    messages: [{ role: 'user', content: 'Hello!' }],
  })
);

if (result.success) {
  console.log('Success after', result.attempts, 'attempts');
} else {
  console.log('Failed:', result.error?.message);
}

Request Queue for Rate Limiting

When you need to make many requests, a queue helps manage rate limits proactively.

interface QueuedRequest<T> {
  fn: () => Promise<T>;
  resolve: (value: T) => void;
  reject: (error: Error) => void;
}

class RateLimitedQueue {
  private queue: QueuedRequest<unknown>[] = [];
  private processing: boolean = false;
  private requestsThisMinute: number = 0;
  private minuteStart: number = Date.now();
  private maxRequestsPerMinute: number;
  private minDelayMs: number;

  constructor(maxRequestsPerMinute: number = 60, minDelayMs: number = 100) {
    this.maxRequestsPerMinute = maxRequestsPerMinute;
    this.minDelayMs = minDelayMs;
  }

  async add<T>(fn: () => Promise<T>): Promise<T> {
    return new Promise((resolve, reject) => {
      this.queue.push({
        fn,
        resolve: resolve as (value: unknown) => void,
        reject,
      });
      this.processQueue();
    });
  }

  private async processQueue(): Promise<void> {
    if (this.processing || this.queue.length === 0) {
      return;
    }

    this.processing = true;

    while (this.queue.length > 0) {
      // Check if we need to reset the minute counter
      const now = Date.now();
      if (now - this.minuteStart >= 60000) {
        this.requestsThisMinute = 0;
        this.minuteStart = now;
      }

      // Check if we're at the rate limit
      if (this.requestsThisMinute >= this.maxRequestsPerMinute) {
        const waitTime = 60000 - (now - this.minuteStart);
        console.log(`Rate limit reached. Waiting ${waitTime}ms...`);
        await this.sleep(waitTime);
        this.requestsThisMinute = 0;
        this.minuteStart = Date.now();
      }

      // Process next request
      const request = this.queue.shift()!;
      this.requestsThisMinute++;

      try {
        const result = await request.fn();
        request.resolve(result);
      } catch (error) {
        request.reject(error as Error);
      }

      // Minimum delay between requests
      await this.sleep(this.minDelayMs);
    }

    this.processing = false;
  }

  private sleep(ms: number): Promise<void> {
    return new Promise((resolve) => setTimeout(resolve, ms));
  }

  getQueueLength(): number {
    return this.queue.length;
  }

  getRequestsThisMinute(): number {
    return this.requestsThisMinute;
  }
}

// Usage
const queue = new RateLimitedQueue(60); // 60 requests per minute

// Add many requests - they will be processed respecting rate limits
const promises = messages.map((msg) =>
  queue.add(() =>
    openai.chat.completions.create({
      model: 'gpt-4o-mini',
      messages: [{ role: 'user', content: msg }],
    })
  )
);

const results = await Promise.all(promises);

Token-Based Rate Limiting

Some providers limit by tokens, not just requests. Here is how to track token usage.

interface TokenBucket {
  tokens: number;
  lastRefill: number;
  maxTokens: number;
  refillRatePerSecond: number;
}

class TokenRateLimiter {
  private bucket: TokenBucket;

  constructor(maxTokensPerMinute: number) {
    this.bucket = {
      tokens: maxTokensPerMinute,
      lastRefill: Date.now(),
      maxTokens: maxTokensPerMinute,
      refillRatePerSecond: maxTokensPerMinute / 60,
    };
  }

  private refill(): void {
    const now = Date.now();
    const elapsed = (now - this.bucket.lastRefill) / 1000;
    const tokensToAdd = elapsed * this.bucket.refillRatePerSecond;

    this.bucket.tokens = Math.min(this.bucket.maxTokens, this.bucket.tokens + tokensToAdd);
    this.bucket.lastRefill = now;
  }

  async waitForTokens(needed: number): Promise<void> {
    this.refill();

    if (this.bucket.tokens >= needed) {
      this.bucket.tokens -= needed;
      return;
    }

    // Calculate wait time
    const deficit = needed - this.bucket.tokens;
    const waitSeconds = deficit / this.bucket.refillRatePerSecond;
    const waitMs = Math.ceil(waitSeconds * 1000);

    console.log(`Waiting ${waitMs}ms for ${needed} tokens...`);
    await new Promise((resolve) => setTimeout(resolve, waitMs));

    this.refill();
    this.bucket.tokens -= needed;
  }

  getAvailableTokens(): number {
    this.refill();
    return Math.floor(this.bucket.tokens);
  }
}

// Usage
const tokenLimiter = new TokenRateLimiter(90000); // 90K tokens per minute

async function makeRequest(message: string): Promise<string> {
  // Estimate tokens (rough: 1 token per 4 characters)
  const estimatedTokens = Math.ceil(message.length / 4) + 500; // +500 for response

  await tokenLimiter.waitForTokens(estimatedTokens);

  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [{ role: 'user', content: message }],
  });

  return response.choices[0].message.content || '';
}

Combining Retry and Rate Limiting

Here is a complete solution that combines both patterns.

import 'dotenv/config';
import OpenAI from 'openai';

const openai = new OpenAI();

interface RobustClientConfig {
  maxRetries: number;
  initialRetryDelayMs: number;
  maxRetryDelayMs: number;
  requestsPerMinute: number;
}

class RobustOpenAIClient {
  private config: RobustClientConfig;
  private requestTimestamps: number[] = [];

  constructor(config: Partial<RobustClientConfig> = {}) {
    this.config = {
      maxRetries: 5,
      initialRetryDelayMs: 1000,
      maxRetryDelayMs: 60000,
      requestsPerMinute: 50,
      ...config,
    };
  }

  async chat(
    messages: OpenAI.Chat.ChatCompletionMessageParam[],
    options: Partial<OpenAI.Chat.ChatCompletionCreateParams> = {}
  ): Promise<string> {
    // Wait for rate limit
    await this.waitForRateLimit();

    // Execute with retry
    let delay = this.config.initialRetryDelayMs;

    for (let attempt = 1; attempt <= this.config.maxRetries; attempt++) {
      try {
        this.recordRequest();

        const response = await openai.chat.completions.create({
          model: 'gpt-4o-mini',
          messages,
          ...options,
        });

        return response.choices[0].message.content || '';
      } catch (error) {
        if (!this.isRetryable(error) || attempt === this.config.maxRetries) {
          throw error;
        }

        // Extract retry-after header if available
        const retryAfter = this.getRetryAfter(error);
        const waitTime = retryAfter || this.addJitter(delay);

        console.log(
          `Attempt ${attempt} failed: ${(error as Error).message}. ` +
            `Retrying in ${Math.round(waitTime)}ms...`
        );

        await this.sleep(waitTime);
        delay = Math.min(delay * 2, this.config.maxRetryDelayMs);
      }
    }

    throw new Error('Max retries exceeded');
  }

  private async waitForRateLimit(): Promise<void> {
    const now = Date.now();
    const windowStart = now - 60000;

    // Remove old timestamps
    this.requestTimestamps = this.requestTimestamps.filter((ts) => ts > windowStart);

    // Check if we're at the limit
    if (this.requestTimestamps.length >= this.config.requestsPerMinute) {
      const oldestInWindow = this.requestTimestamps[0];
      const waitTime = oldestInWindow - windowStart + 100;

      console.log(`Rate limit: waiting ${waitTime}ms...`);
      await this.sleep(waitTime);
    }
  }

  private recordRequest(): void {
    this.requestTimestamps.push(Date.now());
  }

  private isRetryable(error: unknown): boolean {
    if (error instanceof OpenAI.RateLimitError) return true;
    if (error instanceof OpenAI.APIConnectionError) return true;
    if (error instanceof OpenAI.APIError && (error.status ?? 0) >= 500) return true;
    return false;
  }

  private getRetryAfter(error: unknown): number | null {
    if (error instanceof OpenAI.APIError) {
      const retryAfter = error.headers?.['retry-after'];
      if (retryAfter) {
        return parseInt(retryAfter, 10) * 1000;
      }
    }
    return null;
  }

  private addJitter(delay: number): number {
    const jitter = delay * 0.3 * (Math.random() * 2 - 1);
    return Math.max(100, delay + jitter);
  }

  private sleep(ms: number): Promise<void> {
    return new Promise((resolve) => setTimeout(resolve, ms));
  }
}

export { RobustOpenAIClient };

Usage:

import { RobustOpenAIClient } from './robust-client';

const client = new RobustOpenAIClient({
  maxRetries: 3,
  requestsPerMinute: 50,
});

// Single request
const response = await client.chat([{ role: 'user', content: 'Hello!' }]);

// Multiple requests - rate limiting handled automatically
const messages = ['Question 1', 'Question 2', 'Question 3'];
const responses = await Promise.all(
  messages.map((msg) => client.chat([{ role: 'user', content: msg }]))
);

Exercises

Exercise 1: Circuit Breaker

Implement a circuit breaker that stops making requests after too many failures:

// Your implementation here
class CircuitBreaker {
  private failures: number = 0;
  private lastFailure: number = 0;
  private state: 'closed' | 'open' | 'half-open' = 'closed';

  constructor(
    private failureThreshold: number,
    private resetTimeMs: number
  ) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    // TODO: Implement circuit breaker logic
    // - closed: normal operation
    // - open: reject all requests (after threshold failures)
    // - half-open: allow one request to test if service recovered
    throw new Error('Not implemented');
  }
}

Solution

type CircuitState = 'closed' | 'open' | 'half-open';

class CircuitBreaker {
  private failures: number = 0;
  private lastFailure: number = 0;
  private state: CircuitState = 'closed';

  constructor(
    private failureThreshold: number = 5,
    private resetTimeMs: number = 30000
  ) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    // Check if circuit should transition from open to half-open
    if (this.state === 'open') {
      const timeSinceLastFailure = Date.now() - this.lastFailure;

      if (timeSinceLastFailure >= this.resetTimeMs) {
        console.log('Circuit transitioning to half-open');
        this.state = 'half-open';
      } else {
        throw new Error(
          `Circuit breaker is open. Try again in ${Math.ceil(
            (this.resetTimeMs - timeSinceLastFailure) / 1000
          )}s`
        );
      }
    }

    try {
      const result = await fn();

      // Success - reset circuit
      if (this.state === 'half-open') {
        console.log('Circuit closing after successful test');
      }
      this.failures = 0;
      this.state = 'closed';

      return result;
    } catch (error) {
      this.failures++;
      this.lastFailure = Date.now();

      if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
        console.log(`Circuit opening after ${this.failures} failures`);
        this.state = 'open';
      }

      throw error;
    }
  }

  getState(): CircuitState {
    return this.state;
  }

  getFailures(): number {
    return this.failures;
  }

  reset(): void {
    this.failures = 0;
    this.state = 'closed';
  }
}

// Usage
const breaker = new CircuitBreaker(3, 10000);

async function makeRequest(): Promise<string> {
  return breaker.execute(async () => {
    const response = await openai.chat.completions.create({
      model: 'gpt-4o-mini',
      messages: [{ role: 'user', content: 'Hello!' }],
    });
    return response.choices[0].message.content || '';
  });
}

// Test
try {
  const result = await makeRequest();
  console.log('Result:', result);
} catch (error) {
  console.log('Error:', (error as Error).message);
  console.log('Circuit state:', breaker.getState());
}

Exercise 2: Adaptive Rate Limiter

Create a rate limiter that adjusts based on API responses:

// Your implementation here
class AdaptiveRateLimiter {
  private currentRPM: number;

  constructor(initialRPM: number) {
    this.currentRPM = initialRPM;
  }

  onSuccess(): void {
    // TODO: Gradually increase rate after successes
  }

  onRateLimit(): void {
    // TODO: Decrease rate when rate limited
  }

  async waitIfNeeded(): Promise<void> {
    // TODO: Wait based on current rate
  }
}

Solution

class AdaptiveRateLimiter {
  private currentRPM: number;
  private minRPM: number;
  private maxRPM: number;
  private successStreak: number = 0;
  private requestTimestamps: number[] = [];

  constructor(initialRPM: number = 30, minRPM: number = 5, maxRPM: number = 60) {
    this.currentRPM = initialRPM;
    this.minRPM = minRPM;
    this.maxRPM = maxRPM;
  }

  onSuccess(): void {
    this.successStreak++;

    // Increase rate after 10 consecutive successes
    if (this.successStreak >= 10 && this.currentRPM < this.maxRPM) {
      const increase = Math.ceil(this.currentRPM * 0.1);
      this.currentRPM = Math.min(this.currentRPM + increase, this.maxRPM);
      this.successStreak = 0;
      console.log(`Rate increased to ${this.currentRPM} RPM`);
    }
  }

  onRateLimit(): void {
    this.successStreak = 0;

    // Decrease rate by 25%
    const decrease = Math.ceil(this.currentRPM * 0.25);
    this.currentRPM = Math.max(this.currentRPM - decrease, this.minRPM);
    console.log(`Rate decreased to ${this.currentRPM} RPM`);
  }

  async waitIfNeeded(): Promise<void> {
    const now = Date.now();
    const windowStart = now - 60000;

    // Clean old timestamps
    this.requestTimestamps = this.requestTimestamps.filter((ts) => ts > windowStart);

    // Calculate required delay
    if (this.requestTimestamps.length >= this.currentRPM) {
      const oldestInWindow = this.requestTimestamps[0];
      const waitTime = oldestInWindow + 60000 - now + 100;

      if (waitTime > 0) {
        console.log(`Adaptive limiter: waiting ${waitTime}ms`);
        await new Promise((resolve) => setTimeout(resolve, waitTime));
      }
    }

    this.requestTimestamps.push(Date.now());
  }

  getCurrentRPM(): number {
    return this.currentRPM;
  }
}

// Usage with API calls
const limiter = new AdaptiveRateLimiter(30);

async function makeAdaptiveRequest(message: string): Promise<string> {
  await limiter.waitIfNeeded();

  try {
    const response = await openai.chat.completions.create({
      model: 'gpt-4o-mini',
      messages: [{ role: 'user', content: message }],
    });

    limiter.onSuccess();
    return response.choices[0].message.content || '';
  } catch (error) {
    if (error instanceof OpenAI.RateLimitError) {
      limiter.onRateLimit();
    }
    throw error;
  }
}

Exercise 3: Request Batcher

Create a batcher that groups requests to reduce API calls:

// Your implementation here
class RequestBatcher {
  private pending: Map<
    string,
    {
      resolve: (value: string) => void;
      reject: (error: Error) => void;
    }[]
  > = new Map();

  constructor(
    private batchDelayMs: number,
    private maxBatchSize: number
  ) {}

  async add(prompt: string): Promise<string> {
    // TODO: Queue prompts and batch them together
    // After batchDelayMs or maxBatchSize, send all prompts in one request
    throw new Error('Not implemented');
  }
}

Solution

import 'dotenv/config';
import OpenAI from 'openai';

const openai = new OpenAI();

interface PendingRequest {
  prompt: string;
  resolve: (value: string) => void;
  reject: (error: Error) => void;
}

class RequestBatcher {
  private pending: PendingRequest[] = [];
  private batchTimer: NodeJS.Timeout | null = null;

  constructor(
    private batchDelayMs: number = 100,
    private maxBatchSize: number = 10
  ) {}

  async add(prompt: string): Promise<string> {
    return new Promise((resolve, reject) => {
      this.pending.push({ prompt, resolve, reject });

      // If we hit max batch size, process immediately
      if (this.pending.length >= this.maxBatchSize) {
        this.processBatch();
      } else if (!this.batchTimer) {
        // Start timer for batch delay
        this.batchTimer = setTimeout(() => this.processBatch(), this.batchDelayMs);
      }
    });
  }

  private async processBatch(): Promise<void> {
    if (this.batchTimer) {
      clearTimeout(this.batchTimer);
      this.batchTimer = null;
    }

    if (this.pending.length === 0) return;

    // Take all pending requests
    const batch = [...this.pending];
    this.pending = [];

    console.log(`Processing batch of ${batch.length} requests`);

    try {
      // Create a single prompt that handles all requests
      const combinedPrompt = batch.map((req, i) => `[${i + 1}] ${req.prompt}`).join('\n\n');

      const response = await openai.chat.completions.create({
        model: 'gpt-4o-mini',
        messages: [
          {
            role: 'system',
            content: `You will receive multiple numbered prompts. 
Respond to each one with a numbered response matching the prompt number.
Format: [1] response\n[2] response\n etc.`,
          },
          {
            role: 'user',
            content: combinedPrompt,
          },
        ],
      });

      const content = response.choices[0].message.content || '';

      // Parse responses
      const responses = this.parseResponses(content, batch.length);

      // Resolve each pending request
      batch.forEach((req, i) => {
        req.resolve(responses[i] || 'No response generated');
      });
    } catch (error) {
      // Reject all pending requests
      batch.forEach((req) => req.reject(error as Error));
    }
  }

  private parseResponses(content: string, count: number): string[] {
    const responses: string[] = [];

    for (let i = 1; i <= count; i++) {
      const pattern = new RegExp(`\\[${i}\\]\\s*([\\s\\S]*?)(?=\\[${i + 1}\\]|$)`);
      const match = content.match(pattern);
      responses.push(match ? match[1].trim() : '');
    }

    return responses;
  }

  getPendingCount(): number {
    return this.pending.length;
  }
}

// Usage
const batcher = new RequestBatcher(200, 5);

// These requests will be batched together
const results = await Promise.all([
  batcher.add('What is 2+2?'),
  batcher.add('What is the capital of France?'),
  batcher.add('Who wrote Romeo and Juliet?'),
]);

console.log(results);

Key Takeaways

  1. Rate Limits Are Normal: Every API has them - plan for them from the start
  2. Exponential Backoff: Double wait time between retries to avoid overwhelming the API
  3. Jitter Prevents Thundering Herd: Add randomness to prevent synchronized retries
  4. Know Your Errors: Distinguish between retryable and non-retryable errors
  5. Proactive Rate Limiting: Use request queues to stay under limits
  6. Track Token Usage: Some limits are token-based, not just request-based
  7. Circuit Breakers: Stop trying when the service is clearly down
  8. Respect Retry-After Headers: APIs may tell you exactly how long to wait

Resources

Resource                  Type            Description
OpenAI Rate Limits        Documentation   Official rate limit guide
Anthropic Rate Limits     Documentation   Anthropic limits documentation
Exponential Backoff       Article         AWS best practices
Circuit Breaker Pattern   Article         Pattern explanation

Next Lesson

You have learned how to make your API integrations robust and reliable. In the next lesson, you will put everything together by building a complete AI assistant that uses all the techniques you have learned.

Continue to Lesson 5.5: Practice - Simple AI Assistant