From Zero to AI

Lesson 4.4: Chunking Strategies

Duration: 60 minutes

Learning Objectives

By the end of this lesson, you will be able to:

  1. Explain why chunking is essential for RAG systems
  2. Implement different chunking strategies
  3. Choose the right chunk size for your use case
  4. Handle overlap between chunks to preserve context
  5. Apply specialized chunking for different document types

Why Chunking Matters

Large documents cannot be embedded and searched effectively as single units:

  1. Embedding Quality: Embedding models work best with focused text. A 50-page document embedded as one vector loses specificity.
  2. Context Window Limits: LLMs have token limits. You cannot pass entire documents as context.
  3. Relevance Precision: Smaller chunks enable finding exactly the relevant part of a document.
  4. Cost Efficiency: Including irrelevant text wastes tokens and money.

The goal of chunking is to split documents into pieces that are:

  • Small enough to be specific and fit in context
  • Large enough to contain complete, meaningful information

The Chunking Challenge

Poor chunking leads to poor RAG performance:

PROBLEM: Sentence Split in Middle
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Original: "To reset your password, go to Settings. Then click
Security and select Reset Password."

Bad Chunking:
  Chunk 1: "To reset your password, go to Setti"
  Chunk 2: "ngs. Then click Security and select Reset Password."

Result: Neither chunk contains complete instructions.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

PROBLEM: Context Lost Across Chunks
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Original: "The refund policy varies by product type. For
electronics, returns must be within 30 days. For clothing,
the window is 60 days."

Bad Chunking:
  Chunk 1: "The refund policy varies by product type."
  Chunk 2: "For electronics, returns must be within 30 days."
  Chunk 3: "For clothing, the window is 60 days."

Result: Chunk 2 mentions "30 days" but without context about
what policy this relates to.
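The fix for this lost-context example can be sketched before the formal strategies below. `naiveSentenceChunks` is an illustrative helper (not one of the lesson's later utilities) that slides a window of sentences along the text, re-using a few trailing sentences as overlap:

```typescript
// Illustrative only: slide a window of N sentences over the text,
// carrying `overlap` trailing sentences into the next chunk.
// `overlap` must be smaller than `sentencesPerChunk`.
function naiveSentenceChunks(text: string, sentencesPerChunk: number, overlap: number): string[] {
  const sentences = text.match(/[^.!?]+[.!?]+/g)?.map((s) => s.trim()) ?? [text];
  const chunks: string[] = [];
  const step = sentencesPerChunk - overlap;

  for (let i = 0; i < sentences.length; i += step) {
    chunks.push(sentences.slice(i, i + sentencesPerChunk).join(' '));
    if (i + sentencesPerChunk >= sentences.length) break;
  }

  return chunks;
}

const policy =
  'The refund policy varies by product type. For electronics, returns must be ' +
  'within 30 days. For clothing, the window is 60 days.';

// One sentence per chunk reproduces the bad chunking above.
console.log(naiveSentenceChunks(policy, 1, 0));

// Two sentences per chunk with a one-sentence overlap keeps "30 days"
// attached to the sentence that explains what policy it belongs to.
console.log(naiveSentenceChunks(policy, 2, 1));
```

With the overlap, the chunk mentioning "30 days" also carries the sentence naming the refund policy, which is exactly what overlap buys in real pipelines.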

Chunking Strategies

1. Fixed-Size Chunking

Split text into chunks of a fixed character or token count:

function chunkByCharacters(text: string, chunkSize: number, overlap: number = 0): string[] {
  // Guard: an overlap >= chunkSize would make the loop below never advance
  if (overlap >= chunkSize) {
    throw new Error('overlap must be smaller than chunkSize');
  }

  const chunks: string[] = [];
  let start = 0;

  while (start < text.length) {
    const end = Math.min(start + chunkSize, text.length);
    chunks.push(text.slice(start, end));
    start += chunkSize - overlap;
  }

  return chunks;
}

// Usage
const text = 'This is a long document that needs to be split into chunks...';
const chunks = chunkByCharacters(text, 100, 20);

Pros: Simple, predictable chunk sizes
Cons: May split sentences, ignores document structure

2. Recursive Character Splitting

Split by progressively smaller separators until chunks are small enough:

function recursiveCharacterSplit(
  text: string,
  chunkSize: number,
  overlap: number,
  separators: string[] = ['\n\n', '\n', '. ', ' ', '']
): string[] {

  function splitText(text: string, separatorIndex: number): string[] {
    if (text.length <= chunkSize) {
      return [text];
    }

    const separator = separators[separatorIndex];

    if (separatorIndex >= separators.length - 1) {
      // Last resort: split by characters
      return chunkByCharacters(text, chunkSize, overlap);
    }

    const parts = text.split(separator);
    const result: string[] = [];
    let currentChunk = '';

    for (const part of parts) {
      const potentialChunk = currentChunk ? currentChunk + separator + part : part;

      if (potentialChunk.length <= chunkSize) {
        currentChunk = potentialChunk;
      } else {
        if (currentChunk) {
          result.push(currentChunk);
        }

        if (part.length > chunkSize) {
          // Recursively split with next separator
          result.push(...splitText(part, separatorIndex + 1));
          currentChunk = '';
        } else {
          currentChunk = part;
        }
      }
    }

    if (currentChunk) {
      result.push(currentChunk);
    }

    return result;
  }

  return splitText(text, 0);
}

Pros: Respects natural boundaries (paragraphs, sentences)
Cons: More complex, variable chunk sizes

3. Sentence-Based Chunking

Split text into sentences, then group sentences into chunks:

function chunkBySentences(
  text: string,
  maxChunkSize: number,
  overlap: number = 1 // Number of sentences to overlap
): string[] {
  // Simple sentence splitting (for production, use a proper NLP library)
  const sentences = text.match(/[^.!?]+[.!?]+/g) || [text];
  const chunks: string[] = [];
  let currentChunk: string[] = [];
  let currentLength = 0;

  for (let i = 0; i < sentences.length; i++) {
    const sentence = sentences[i].trim();

    if (currentLength + sentence.length > maxChunkSize && currentChunk.length > 0) {
      chunks.push(currentChunk.join(' '));

      // Keep overlap sentences
      const overlapSentences = currentChunk.slice(-overlap);
      currentChunk = overlapSentences;
      currentLength = overlapSentences.join(' ').length;
    }

    currentChunk.push(sentence);
    currentLength += sentence.length + 1;
  }

  if (currentChunk.length > 0) {
    chunks.push(currentChunk.join(' '));
  }

  return chunks;
}

Pros: Never splits mid-sentence, natural language boundaries
Cons: May produce very small or large chunks

4. Semantic Chunking

Group semantically similar content together using embeddings:

import OpenAI from 'openai';

const openai = new OpenAI();

async function semanticChunk(
  text: string,
  maxChunkSize: number,
  similarityThreshold: number = 0.8
): Promise<string[]> {
  // First, split into sentences
  const sentences = text.match(/[^.!?]+[.!?]+/g) || [text];

  // Get embeddings for all sentences
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: sentences,
  });

  const embeddings = response.data.map((d) => d.embedding);

  // Group sentences by semantic similarity
  const chunks: string[] = [];
  let currentChunk: string[] = [];
  let currentLength = 0;

  for (let i = 0; i < sentences.length; i++) {
    const sentence = sentences[i].trim();

    // Check if we should start a new chunk
    let shouldSplit = currentLength + sentence.length > maxChunkSize;

    // Also check semantic similarity with previous sentence
    if (i > 0 && !shouldSplit) {
      const similarity = cosineSimilarity(embeddings[i], embeddings[i - 1]);
      if (similarity < similarityThreshold) {
        shouldSplit = true;
      }
    }

    if (shouldSplit && currentChunk.length > 0) {
      chunks.push(currentChunk.join(' '));
      currentChunk = [];
      currentLength = 0;
    }

    currentChunk.push(sentence);
    currentLength += sentence.length + 1;
  }

  if (currentChunk.length > 0) {
    chunks.push(currentChunk.join(' '));
  }

  return chunks;
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dotProduct = 0;
  let magnitudeA = 0;
  let magnitudeB = 0;

  for (let i = 0; i < a.length; i++) {
    dotProduct += a[i] * b[i];
    magnitudeA += a[i] * a[i];
    magnitudeB += b[i] * b[i];
  }

  return dotProduct / (Math.sqrt(magnitudeA) * Math.sqrt(magnitudeB));
}

Pros: Keeps related content together
Cons: Expensive (requires embeddings), complex


Chunk Size Guidelines

Use Case           Recommended Chunk Size    Rationale
Q&A over docs      500-1000 chars            Focused answers
Summarization      2000-4000 chars           More context needed
Code search        1000-2000 chars           Functions/classes
Legal documents    1000-1500 chars           Preserve clauses
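These figures are starting points, not hard rules. Encoded as a lookup (the keys and ranges below simply mirror the table, under hypothetical names):

```typescript
// Starting-point chunk sizes in characters, taken from the table above.
const recommendedChunkSize: Record<string, { min: number; max: number }> = {
  qa: { min: 500, max: 1000 },             // Q&A over docs
  summarization: { min: 2000, max: 4000 }, // Summarization
  codeSearch: { min: 1000, max: 2000 },    // Code search
  legal: { min: 1000, max: 1500 },         // Legal documents
};

// Pick the midpoint of the recommended range, with a neutral default
// for use cases the table does not cover.
function defaultChunkSize(useCase: string): number {
  const range = recommendedChunkSize[useCase];
  if (!range) return 1000;
  return Math.round((range.min + range.max) / 2);
}

console.log(defaultChunkSize('qa')); // 750
```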

Factors to Consider

  1. Embedding Model Limits: Many embedding models cap input at roughly 8,000 tokens
  2. LLM Context Window: Leave room for system prompt and response
  3. Query Specificity: Shorter chunks = more precise matching
  4. Information Density: Technical docs may need larger chunks
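Chunk sizes in this lesson are measured in characters, while model limits are measured in tokens, so a conversion is needed. A common rule of thumb for English text is roughly 4 characters per token; the helpers below use that heuristic (the true ratio depends on the tokenizer, so treat the result as an estimate and use the model's actual tokenizer when precision matters):

```typescript
// Rough token estimate for English text: ~4 characters per token.
// This is a heuristic, not an exact count.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Check whether `chunksPerQuery` retrieved chunks of `chunkSizeChars`
// characters fit a context window, after reserving tokens for the
// system prompt, the question, and the model's response.
function fitsBudget(
  chunkSizeChars: number,
  chunksPerQuery: number,
  contextTokens: number,
  reservedTokens: number
): boolean {
  const chunkTokens = Math.ceil(chunkSizeChars / 4) * chunksPerQuery;
  return chunkTokens <= contextTokens - reservedTokens;
}

// Five 1000-char chunks easily fit an 8k context with 2k reserved.
console.log(fitsBudget(1000, 5, 8192, 2000)); // true
```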

Implementing Overlap

Overlap helps preserve context that might be split across chunks:

interface ChunkOptions {
  chunkSize: number;
  chunkOverlap: number;
}

function chunkWithOverlap(
  text: string,
  options: ChunkOptions
): { content: string; startIndex: number; endIndex: number }[] {
  const { chunkSize, chunkOverlap } = options;
  const chunks: { content: string; startIndex: number; endIndex: number }[] = [];

  // Split into sentences first
  const sentences = text.match(/[^.!?]+[.!?]+/g) || [text];

  let currentChunk: string[] = [];
  let currentLength = 0;
  let chunkStartIndex = 0;
  let sentenceStartIndex = 0;

  for (const sentence of sentences) {
    const trimmedSentence = sentence.trim();

    if (currentLength + trimmedSentence.length > chunkSize && currentChunk.length > 0) {
      const chunkContent = currentChunk.join(' ');
      chunks.push({
        content: chunkContent,
        startIndex: chunkStartIndex,
        endIndex: sentenceStartIndex,
      });

      // Walk backwards to find how many trailing sentences are needed
      // to cover the requested overlap (in characters)
      let overlapLength = 0;
      let overlapStart = currentChunk.length;

      for (let i = currentChunk.length - 1; i >= 0; i--) {
        overlapLength += currentChunk[i].length + 1;
        if (overlapLength >= chunkOverlap) {
          overlapStart = i;
          break;
        }
      }

      currentChunk = currentChunk.slice(overlapStart);
      currentLength = currentChunk.join(' ').length;
      // Approximate: sentences are trimmed, so indices can drift slightly
      chunkStartIndex = sentenceStartIndex - currentLength;
    }

    currentChunk.push(trimmedSentence);
    currentLength += trimmedSentence.length + 1;
    sentenceStartIndex += sentence.length;
  }

  if (currentChunk.length > 0) {
    chunks.push({
      content: currentChunk.join(' '),
      startIndex: chunkStartIndex,
      endIndex: text.length,
    });
  }

  return chunks;
}

Overlap Guidelines

Document Type     Recommended Overlap
Technical docs    10-15% of chunk size
Conversational    5-10%
Legal/formal      15-20%
Code              20-30% (preserve function context)
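These percentages still have to be turned into character counts before they can be passed to a splitter. A minimal helper (the clamp keeps the overlap strictly below the chunk size, since an overlap equal to or larger than the chunk makes a sliding-window splitter loop forever):

```typescript
// Convert an overlap percentage into characters for a given chunk size,
// clamped so the overlap is always strictly smaller than the chunk.
function overlapChars(chunkSize: number, percent: number): number {
  const overlap = Math.round(chunkSize * (percent / 100));
  return Math.min(overlap, chunkSize - 1);
}

// 1000-char chunks of technical documentation at 10% overlap:
console.log(overlapChars(1000, 10)); // 100
```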

Document-Specific Chunking

Markdown Documents

function chunkMarkdown(
  markdown: string,
  maxChunkSize: number
): { content: string; metadata: { heading?: string } }[] {
  const chunks: { content: string; metadata: { heading?: string } }[] = [];

  // Split by headers
  const sections = markdown.split(/(?=^#{1,6}\s)/m);

  for (const section of sections) {
    if (!section.trim()) continue;

    // Extract the heading if present, preserving its level (#, ##, ...)
    const headingMatch = section.match(/^(#{1,6})\s+(.+?)(?:\n|$)/);
    const hashes = headingMatch ? headingMatch[1] : '';
    const heading = headingMatch ? headingMatch[2].trim() : undefined;
    const content = headingMatch ? section.slice(headingMatch[0].length) : section;

    if (content.length <= maxChunkSize) {
      chunks.push({
        content: `${heading ? `${hashes} ${heading}\n\n` : ''}${content}`.trim(),
        metadata: { heading },
      });
    } else {
      // Split large sections further, repeating the heading in each sub-chunk
      const subChunks = recursiveCharacterSplit(content, maxChunkSize, 100);
      for (const subChunk of subChunks) {
        chunks.push({
          content: `${heading ? `${hashes} ${heading}\n\n` : ''}${subChunk}`.trim(),
          metadata: { heading },
        });
      }
    }
  }

  return chunks;
}

Code Files

function chunkCode(
  code: string,
  language: string,
  maxChunkSize: number
): { content: string; metadata: { type: string } }[] {
  const chunks: { content: string; metadata: { type: string } }[] = [];

  // Different patterns for different languages
  const patterns: Record<string, RegExp> = {
    typescript: /(?=(?:export\s+)?(?:async\s+)?(?:function|class|interface|type|const|let)\s+\w+)/g,
    javascript: /(?=(?:export\s+)?(?:async\s+)?(?:function|class|const|let)\s+\w+)/g,
    python: /(?=(?:def|class|async def)\s+\w+)/g,
  };

  // Fall back to the TypeScript pattern for unrecognized languages
  const pattern = patterns[language] || patterns.typescript;
  const blocks = code.split(pattern).filter((b) => b.trim());

  let currentChunk = '';

  for (const block of blocks) {
    if (currentChunk.length + block.length > maxChunkSize && currentChunk) {
      chunks.push({
        content: currentChunk.trim(),
        metadata: { type: 'code' },
      });
      currentChunk = '';
    }

    if (block.length > maxChunkSize) {
      // Large block - include as-is or split further
      if (currentChunk) {
        chunks.push({
          content: currentChunk.trim(),
          metadata: { type: 'code' },
        });
        currentChunk = '';
      }
      chunks.push({
        content: block.trim(),
        metadata: { type: 'code' },
      });
    } else {
      currentChunk += block;
    }
  }

  if (currentChunk) {
    chunks.push({
      content: currentChunk.trim(),
      metadata: { type: 'code' },
    });
  }

  return chunks;
}

Complete Chunking Implementation

Here is a more complete chunking utility that combines recursive splitting with configurable overlap:

interface ChunkingOptions {
  chunkSize: number;
  chunkOverlap: number;
  separators?: string[];
}

interface Chunk {
  content: string;
  index: number;
  metadata: {
    startChar: number;
    endChar: number;
    wordCount: number;
  };
}

class TextChunker {
  private options: Required<ChunkingOptions>;

  constructor(options: ChunkingOptions) {
    this.options = {
      chunkSize: options.chunkSize,
      chunkOverlap: options.chunkOverlap,
      separators: options.separators || ['\n\n', '\n', '. ', ', ', ' ', ''],
    };
  }

  chunk(text: string): Chunk[] {
    const chunks = this.recursiveSplit(text);
    return this.addOverlap(chunks, text);
  }

  private recursiveSplit(text: string, separatorIndex: number = 0): string[] {
    const { chunkSize, separators } = this.options;

    if (text.length <= chunkSize) {
      return [text];
    }

    if (separatorIndex >= separators.length) {
      // Last resort: hard split
      return this.hardSplit(text);
    }

    const separator = separators[separatorIndex];
    const parts = separator ? text.split(separator) : [text];

    if (parts.length === 1) {
      return this.recursiveSplit(text, separatorIndex + 1);
    }

    const chunks: string[] = [];
    let currentChunk = '';

    for (const part of parts) {
      const potentialChunk = currentChunk ? currentChunk + separator + part : part;

      if (potentialChunk.length <= chunkSize) {
        currentChunk = potentialChunk;
      } else {
        if (currentChunk) {
          chunks.push(currentChunk);
        }

        if (part.length > chunkSize) {
          chunks.push(...this.recursiveSplit(part, separatorIndex + 1));
          currentChunk = '';
        } else {
          currentChunk = part;
        }
      }
    }

    if (currentChunk) {
      chunks.push(currentChunk);
    }

    return chunks;
  }

  private hardSplit(text: string): string[] {
    const { chunkSize, chunkOverlap } = this.options;
    const chunks: string[] = [];
    let start = 0;

    while (start < text.length) {
      const end = Math.min(start + chunkSize, text.length);
      chunks.push(text.slice(start, end));
      start += chunkSize - chunkOverlap;
    }

    return chunks;
  }

  private addOverlap(rawChunks: string[], originalText: string): Chunk[] {
    const { chunkOverlap } = this.options;
    const chunks: Chunk[] = [];
    let position = 0;

    for (let i = 0; i < rawChunks.length; i++) {
      const content = rawChunks[i];
      // indexOf can miss if the splitter altered whitespace; fall back to the running position
      const foundAt = originalText.indexOf(content, position);
      const startChar = foundAt >= 0 ? foundAt : position;
      const endChar = startChar + content.length;

      // Prepend the tail of the previous chunk as overlap
      let finalContent = content;
      if (i > 0 && chunkOverlap > 0) {
        const prevChunk = rawChunks[i - 1];
        const overlapText = prevChunk.slice(-chunkOverlap);
        // Skip if the chunk already begins with that text (avoids duplication)
        if (!content.startsWith(overlapText)) {
          finalContent = overlapText + content;
        }
      }

      chunks.push({
        content: finalContent.trim(),
        index: i,
        metadata: {
          startChar,
          endChar,
          wordCount: finalContent.split(/\s+/).length,
        },
      });

      position = startChar + 1;
    }

    return chunks;
  }
}

// Usage
const chunker = new TextChunker({
  chunkSize: 1000,
  chunkOverlap: 100,
});

const document = `
Your long document text goes here...
It can span multiple paragraphs.

Each paragraph will be handled appropriately.
`;

const chunks = chunker.chunk(document);
console.log(`Created ${chunks.length} chunks`);

for (const chunk of chunks) {
  console.log(`\nChunk ${chunk.index}:`);
  console.log(`Words: ${chunk.metadata.wordCount}`);
  console.log(`Content: ${chunk.content.substring(0, 100)}...`);
}

Testing Your Chunking Strategy

Always test chunking with real documents:

function analyzeChunks(chunks: Chunk[]): void {
  const sizes = chunks.map((c) => c.content.length);
  const avgSize = sizes.reduce((a, b) => a + b, 0) / sizes.length;
  const minSize = Math.min(...sizes);
  const maxSize = Math.max(...sizes);

  console.log('Chunking Analysis:');
  console.log(`Total chunks: ${chunks.length}`);
  console.log(`Average size: ${avgSize.toFixed(0)} chars`);
  console.log(`Min size: ${minSize} chars`);
  console.log(`Max size: ${maxSize} chars`);

  // Check for problematic chunks
  const tooSmall = chunks.filter((c) => c.content.length < 100);
  const tooLarge = chunks.filter((c) => c.content.length > 2000);

  if (tooSmall.length > 0) {
    console.log(`Warning: ${tooSmall.length} chunks are very small`);
  }

  if (tooLarge.length > 0) {
    console.log(`Warning: ${tooLarge.length} chunks are very large`);
  }
}

Key Takeaways

  1. Chunking directly affects RAG quality - poor chunking leads to poor retrieval
  2. Recursive splitting preserves natural boundaries better than fixed-size chunking
  3. Overlap prevents context loss at chunk boundaries
  4. Different document types need different strategies - markdown, code, and prose require different approaches
  5. Test with real documents to find optimal parameters for your use case

Resources

Resource                         Type             Level
LangChain Text Splitters         Documentation    Intermediate
Chunking Strategies - Pinecone   Article          Beginner
Semantic Chunking                Research Paper   Advanced

Next Lesson

In the next lesson, you will put everything together and build a complete Q&A system over your own documents.

Continue to Lesson 4.5: Practice - Q&A on Documents