Lesson 4.4: Chunking Strategies
Duration: 60 minutes
Learning Objectives
By the end of this lesson, you will be able to:
- Explain why chunking is essential for RAG systems
- Implement different chunking strategies
- Choose the right chunk size for your use case
- Handle overlap between chunks to preserve context
- Apply specialized chunking for different document types
Why Chunking Matters
Large documents cannot be embedded and searched effectively as single units:
- Embedding Quality: Embedding models work best with focused text. A 50-page document embedded as one vector loses specificity.
- Context Window Limits: LLMs have token limits. You cannot pass entire documents as context.
- Relevance Precision: Smaller chunks enable finding exactly the relevant part of a document.
- Cost Efficiency: Including irrelevant text wastes tokens and money.
The goal of chunking is to split documents into pieces that are:
- Small enough to be specific and fit in context
- Large enough to contain complete, meaningful information
The Chunking Challenge
Poor chunking leads to poor RAG performance:
PROBLEM: Sentence Split in Middle
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Original: "To reset your password, go to Settings. Then click
Security and select Reset Password."
Bad Chunking:
Chunk 1: "To reset your password, go to Setti"
Chunk 2: "ngs. Then click Security and select Reset Password."
Result: Neither chunk contains complete instructions.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
PROBLEM: Context Lost Across Chunks
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Original: "The refund policy varies by product type. For
electronics, returns must be within 30 days. For clothing,
the window is 60 days."
Bad Chunking:
Chunk 1: "The refund policy varies by product type."
Chunk 2: "For electronics, returns must be within 30 days."
Chunk 3: "For clothing, the window is 60 days."
Result: Chunk 2 mentions "30 days" but without context about
what policy this relates to.
Chunking Strategies
1. Fixed-Size Chunking
Split text into chunks of a fixed character or token count:
function chunkByCharacters(text: string, chunkSize: number, overlap: number = 0): string[] {
  // Guard: a step of (chunkSize - overlap) <= 0 would loop forever
  if (overlap >= chunkSize) {
    throw new Error('overlap must be smaller than chunkSize');
  }
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    const end = Math.min(start + chunkSize, text.length);
    chunks.push(text.slice(start, end));
    if (end === text.length) break; // done; avoids a duplicate tail chunk when overlap > 0
    start += chunkSize - overlap;
  }
  return chunks;
}
// Usage
const text = 'This is a long document that needs to be split into chunks...';
const chunks = chunkByCharacters(text, 100, 20);
Pros: Simple and predictable chunk sizes
Cons: May split sentences or even words mid-way; ignores document structure
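Running fixed-size chunking on the password example from earlier shows the mid-word split in practice (the function is repeated here, minus edge-case guards, so the snippet runs on its own):

```typescript
function chunkByCharacters(text: string, chunkSize: number, overlap: number = 0): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    const end = Math.min(start + chunkSize, text.length);
    chunks.push(text.slice(start, end));
    start += chunkSize - overlap;
  }
  return chunks;
}

const instructions =
  'To reset your password, go to Settings. Then click Security and select Reset Password.';
const pieces = chunkByCharacters(instructions, 40);
// With a 40-character limit and no overlap, "Password" is cut in two:
// the final chunk is just "sword." -- useless on its own.
console.log(pieces);
```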
2. Recursive Character Splitting
Split by progressively smaller separators until chunks are small enough:
function recursiveCharacterSplit(
  text: string,
  chunkSize: number,
  overlap: number,
  separators: string[] = ['\n\n', '\n', '. ', ' ', '']
): string[] {
  function splitText(text: string, separatorIndex: number): string[] {
    if (text.length <= chunkSize) {
      return [text];
    }
    if (separatorIndex >= separators.length - 1) {
      // Last resort: split by characters
      return chunkByCharacters(text, chunkSize, overlap);
    }
    const separator = separators[separatorIndex];
    const parts = text.split(separator);
    const result: string[] = [];
    let currentChunk = '';
    for (const part of parts) {
      const potentialChunk = currentChunk ? currentChunk + separator + part : part;
      if (potentialChunk.length <= chunkSize) {
        currentChunk = potentialChunk;
      } else {
        if (currentChunk) {
          result.push(currentChunk);
        }
        if (part.length > chunkSize) {
          // Recursively split with the next, finer separator
          result.push(...splitText(part, separatorIndex + 1));
          currentChunk = '';
        } else {
          currentChunk = part;
        }
      }
    }
    if (currentChunk) {
      result.push(currentChunk);
    }
    return result;
  }
  return splitText(text, 0);
}
Pros: Respects natural boundaries (paragraphs, sentences)
Cons: More complex; variable chunk sizes
3. Sentence-Based Chunking
Split text into sentences, then group sentences into chunks:
function chunkBySentences(
  text: string,
  maxChunkSize: number,
  overlap: number = 1 // number of sentences to overlap
): string[] {
  // Simple sentence splitting (for production, use a proper NLP library)
  const sentences = text.match(/[^.!?]+[.!?]+/g) || [text];
  const chunks: string[] = [];
  let currentChunk: string[] = [];
  let currentLength = 0;
  for (let i = 0; i < sentences.length; i++) {
    const sentence = sentences[i].trim();
    if (currentLength + sentence.length > maxChunkSize && currentChunk.length > 0) {
      chunks.push(currentChunk.join(' '));
      // Keep the trailing `overlap` sentences for the next chunk
      const overlapSentences = currentChunk.slice(-overlap);
      currentChunk = overlapSentences;
      currentLength = overlapSentences.join(' ').length;
    }
    currentChunk.push(sentence);
    currentLength += sentence.length + 1;
  }
  if (currentChunk.length > 0) {
    chunks.push(currentChunk.join(' '));
  }
  return chunks;
}
Pros: Never splits mid-sentence; natural language boundaries
Cons: May produce very small or very large chunks
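A quick run on three short sentences shows how the one-sentence overlap carries context across chunk boundaries (the function is repeated so the example is self-contained):

```typescript
function chunkBySentences(text: string, maxChunkSize: number, overlap: number = 1): string[] {
  const sentences = text.match(/[^.!?]+[.!?]+/g) || [text];
  const chunks: string[] = [];
  let currentChunk: string[] = [];
  let currentLength = 0;
  for (const raw of sentences) {
    const sentence = raw.trim();
    if (currentLength + sentence.length > maxChunkSize && currentChunk.length > 0) {
      chunks.push(currentChunk.join(' '));
      const overlapSentences = currentChunk.slice(-overlap);
      currentChunk = overlapSentences;
      currentLength = overlapSentences.join(' ').length;
    }
    currentChunk.push(sentence);
    currentLength += sentence.length + 1;
  }
  if (currentChunk.length > 0) {
    chunks.push(currentChunk.join(' '));
  }
  return chunks;
}

const sample = 'First sentence. Second sentence. Third sentence.';
const result = chunkBySentences(sample, 35);
// Both chunks share "Second sentence.", so neither loses its neighbor's context.
console.log(result);
```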
4. Semantic Chunking
Group semantically similar content together using embeddings:
import OpenAI from 'openai';

const openai = new OpenAI();

async function semanticChunk(
  text: string,
  maxChunkSize: number,
  similarityThreshold: number = 0.8
): Promise<string[]> {
  // First, split into sentences
  const sentences = text.match(/[^.!?]+[.!?]+/g) || [text];
  // Get embeddings for all sentences in one batched request
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: sentences,
  });
  const embeddings = response.data.map((d) => d.embedding);
  // Group consecutive sentences, starting a new chunk on a topic shift
  const chunks: string[] = [];
  let currentChunk: string[] = [];
  let currentLength = 0;
  for (let i = 0; i < sentences.length; i++) {
    const sentence = sentences[i].trim();
    // Start a new chunk if the size limit would be exceeded...
    let shouldSplit = currentLength + sentence.length > maxChunkSize;
    // ...or if this sentence is semantically dissimilar from the previous one
    if (i > 0 && !shouldSplit) {
      const similarity = cosineSimilarity(embeddings[i], embeddings[i - 1]);
      if (similarity < similarityThreshold) {
        shouldSplit = true;
      }
    }
    if (shouldSplit && currentChunk.length > 0) {
      chunks.push(currentChunk.join(' '));
      currentChunk = [];
      currentLength = 0;
    }
    currentChunk.push(sentence);
    currentLength += sentence.length + 1;
  }
  if (currentChunk.length > 0) {
    chunks.push(currentChunk.join(' '));
  }
  return chunks;
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dotProduct = 0;
  let magnitudeA = 0;
  let magnitudeB = 0;
  for (let i = 0; i < a.length; i++) {
    dotProduct += a[i] * b[i];
    magnitudeA += a[i] * a[i];
    magnitudeB += b[i] * b[i];
  }
  return dotProduct / (Math.sqrt(magnitudeA) * Math.sqrt(magnitudeB));
}
Pros: Keeps related content together
Cons: Expensive (requires embedding calls) and more complex
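The similarity function itself can be sanity-checked without any API calls; the threshold comparison in semanticChunk relies on these basic properties:

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  let dotProduct = 0;
  let magnitudeA = 0;
  let magnitudeB = 0;
  for (let i = 0; i < a.length; i++) {
    dotProduct += a[i] * b[i];
    magnitudeA += a[i] * a[i];
    magnitudeB += b[i] * b[i];
  }
  return dotProduct / (Math.sqrt(magnitudeA) * Math.sqrt(magnitudeB));
}

// Identical direction -> 1, orthogonal -> 0 (allow for floating-point error)
const same = cosineSimilarity([1, 2, 3], [1, 2, 3]);
const orthogonal = cosineSimilarity([1, 0], [0, 1]);
console.log(same, orthogonal);
```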
Chunk Size Guidelines
| Use Case | Recommended Chunk Size | Rationale |
|---|---|---|
| Q&A over docs | 500-1000 chars | Focused answers |
| Summarization | 2000-4000 chars | More context needed |
| Code search | 1000-2000 chars | Functions/classes |
| Legal documents | 1000-1500 chars | Preserve clauses |
Factors to Consider
- Embedding Model Limits: Many embedding models accept at most roughly 8K tokens per input
- LLM Context Window: Leave room for system prompt and response
- Query Specificity: Shorter chunks = more precise matching
- Information Density: Technical docs may need larger chunks
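Note that the chunk sizes above are in characters, while model limits are in tokens. A common rough heuristic for English prose is about 4 characters per token; this sketch uses that approximation (for exact counts, use a real tokenizer library), and the 8192 default is an assumed limit, not a universal one:

```typescript
// Rough approximation: ~4 characters per token for English prose.
// For exact counts, use a proper tokenizer instead of this heuristic.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Check a chunk against an assumed embedding-model limit (default ~8K tokens)
function fitsTokenLimit(text: string, tokenLimit: number = 8192): boolean {
  return estimateTokens(text) <= tokenLimit;
}

console.log(estimateTokens('a'.repeat(1000))); // a 1000-char chunk is roughly 250 tokens
```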
Implementing Overlap
Overlap helps preserve context that might be split across chunks:
interface ChunkOptions {
  chunkSize: number;
  chunkOverlap: number;
}

function chunkWithOverlap(
  text: string,
  options: ChunkOptions
): { content: string; startIndex: number; endIndex: number }[] {
  const { chunkSize, chunkOverlap } = options;
  const chunks: { content: string; startIndex: number; endIndex: number }[] = [];
  // Split into sentences first
  const sentences = text.match(/[^.!?]+[.!?]+/g) || [text];
  let currentChunk: string[] = [];
  let currentLength = 0;
  let chunkStartIndex = 0;
  let sentenceStartIndex = 0;
  for (const sentence of sentences) {
    const trimmedSentence = sentence.trim();
    if (currentLength + trimmedSentence.length > chunkSize && currentChunk.length > 0) {
      chunks.push({
        content: currentChunk.join(' '),
        startIndex: chunkStartIndex,
        endIndex: sentenceStartIndex,
      });
      // Walk backwards until we have roughly `chunkOverlap` characters of overlap
      let overlapStart = currentChunk.length;
      if (chunkOverlap > 0) {
        let overlapLength = 0;
        for (let i = currentChunk.length - 1; i >= 0; i--) {
          overlapLength += currentChunk[i].length + 1;
          if (overlapLength >= chunkOverlap) {
            overlapStart = i;
            break;
          }
        }
      }
      currentChunk = currentChunk.slice(overlapStart);
      currentLength = currentChunk.join(' ').length;
      // Note: indices are approximate, since sentences are trimmed before joining
      chunkStartIndex = sentenceStartIndex - currentLength;
    }
    currentChunk.push(trimmedSentence);
    currentLength += trimmedSentence.length + 1;
    sentenceStartIndex += sentence.length;
  }
  if (currentChunk.length > 0) {
    chunks.push({
      content: currentChunk.join(' '),
      startIndex: chunkStartIndex,
      endIndex: text.length,
    });
  }
  return chunks;
}
Overlap Guidelines
| Document Type | Recommended Overlap |
|---|---|
| Technical docs | 10-15% of chunk size |
| Conversational | 5-10% |
| Legal/formal | 15-20% |
| Code | 20-30% (preserve function context) |
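The percentages in the table translate to character counts with a trivial helper (an illustrative function, not from any library):

```typescript
// Convert a percentage-of-chunk-size guideline into a character count
function overlapChars(chunkSize: number, overlapPercent: number): number {
  return Math.round(chunkSize * (overlapPercent / 100));
}

// For 1000-character chunks of a technical doc at 15% overlap:
console.log(overlapChars(1000, 15)); // 150
```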
Document-Specific Chunking
Markdown Documents
function chunkMarkdown(
  markdown: string,
  maxChunkSize: number
): { content: string; metadata: { heading?: string } }[] {
  const chunks: { content: string; metadata: { heading?: string } }[] = [];
  // Split by headers
  const sections = markdown.split(/(?=^#{1,6}\s)/m);
  for (const section of sections) {
    if (!section.trim()) continue;
    // Extract the heading (and its level) if present
    const headingMatch = section.match(/^(#{1,6})\s+(.+?)(?:\n|$)/);
    const heading = headingMatch ? headingMatch[2].trim() : undefined;
    // Preserve the original heading level rather than flattening everything to h1
    const headingPrefix = headingMatch ? `${headingMatch[1]} ${heading}\n\n` : '';
    const content = headingMatch ? section.slice(headingMatch[0].length) : section;
    if (content.length <= maxChunkSize) {
      chunks.push({
        content: `${headingPrefix}${content}`.trim(),
        metadata: { heading },
      });
    } else {
      // Split large sections further, repeating the heading on each sub-chunk
      const subChunks = recursiveCharacterSplit(content, maxChunkSize, 100);
      for (const subChunk of subChunks) {
        chunks.push({
          content: `${headingPrefix}${subChunk}`.trim(),
          metadata: { heading },
        });
      }
    }
  }
  return chunks;
}
Code Files
function chunkCode(
  code: string,
  language: string,
  maxChunkSize: number
): { content: string; metadata: { type: string } }[] {
  const chunks: { content: string; metadata: { type: string } }[] = [];
  // Split on top-level declarations; patterns differ per language
  const patterns: Record<string, RegExp> = {
    typescript: /(?=(?:export\s+)?(?:async\s+)?(?:function|class|interface|type|const|let)\s+\w+)/g,
    javascript: /(?=(?:export\s+)?(?:async\s+)?(?:function|class|const|let)\s+\w+)/g,
    python: /(?=(?:def|class|async def)\s+\w+)/g,
  };
  const pattern = patterns[language] || patterns.typescript;
  const blocks = code.split(pattern).filter((b) => b.trim());
  let currentChunk = '';
  for (const block of blocks) {
    if (currentChunk.length + block.length > maxChunkSize && currentChunk) {
      chunks.push({
        content: currentChunk.trim(),
        metadata: { type: 'code' },
      });
      currentChunk = '';
    }
    if (block.length > maxChunkSize) {
      // Oversized block: flush anything pending, then emit it as-is
      if (currentChunk) {
        chunks.push({
          content: currentChunk.trim(),
          metadata: { type: 'code' },
        });
        currentChunk = '';
      }
      chunks.push({
        content: block.trim(),
        metadata: { type: 'code' },
      });
    } else {
      currentChunk += block;
    }
  }
  if (currentChunk) {
    chunks.push({
      content: currentChunk.trim(),
      metadata: { type: 'code' },
    });
  }
  return chunks;
}
Complete Chunking Implementation
Here is a more complete chunking utility that combines recursive splitting with overlap and position metadata:
interface ChunkingOptions {
  chunkSize: number;
  chunkOverlap: number;
  separators?: string[];
}

interface Chunk {
  content: string;
  index: number;
  metadata: {
    startChar: number;
    endChar: number;
    wordCount: number;
  };
}

class TextChunker {
  private options: Required<ChunkingOptions>;

  constructor(options: ChunkingOptions) {
    this.options = {
      chunkSize: options.chunkSize,
      chunkOverlap: options.chunkOverlap,
      separators: options.separators || ['\n\n', '\n', '. ', ', ', ' ', ''],
    };
  }

  chunk(text: string): Chunk[] {
    const chunks = this.recursiveSplit(text);
    return this.addOverlap(chunks, text);
  }

  private recursiveSplit(text: string, separatorIndex: number = 0): string[] {
    const { chunkSize, separators } = this.options;
    if (text.length <= chunkSize) {
      return [text];
    }
    if (separatorIndex >= separators.length) {
      // Last resort: hard split by characters
      return this.hardSplit(text);
    }
    const separator = separators[separatorIndex];
    const parts = separator ? text.split(separator) : [text];
    if (parts.length === 1) {
      return this.recursiveSplit(text, separatorIndex + 1);
    }
    const chunks: string[] = [];
    let currentChunk = '';
    for (const part of parts) {
      const potentialChunk = currentChunk ? currentChunk + separator + part : part;
      if (potentialChunk.length <= chunkSize) {
        currentChunk = potentialChunk;
      } else {
        if (currentChunk) {
          chunks.push(currentChunk);
        }
        if (part.length > chunkSize) {
          chunks.push(...this.recursiveSplit(part, separatorIndex + 1));
          currentChunk = '';
        } else {
          currentChunk = part;
        }
      }
    }
    if (currentChunk) {
      chunks.push(currentChunk);
    }
    return chunks;
  }

  private hardSplit(text: string): string[] {
    const { chunkSize, chunkOverlap } = this.options;
    const chunks: string[] = [];
    let start = 0;
    while (start < text.length) {
      const end = Math.min(start + chunkSize, text.length);
      chunks.push(text.slice(start, end));
      if (end === text.length) break; // avoid a duplicate tail chunk
      start += chunkSize - chunkOverlap;
    }
    return chunks;
  }

  private addOverlap(rawChunks: string[], originalText: string): Chunk[] {
    const { chunkOverlap } = this.options;
    const chunks: Chunk[] = [];
    let position = 0;
    for (let i = 0; i < rawChunks.length; i++) {
      const content = rawChunks[i];
      // Locate this chunk in the original text (fall back to 0 if not found,
      // which can happen when separators were consumed during splitting)
      const startChar = Math.max(0, originalText.indexOf(content, position));
      const endChar = startChar + content.length;
      // Prepend overlap from the previous chunk, joined with a space so words don't fuse
      let finalContent = content;
      if (i > 0 && chunkOverlap > 0) {
        const overlapText = rawChunks[i - 1].slice(-chunkOverlap);
        if (!content.startsWith(overlapText)) {
          finalContent = overlapText + ' ' + content;
        }
      }
      chunks.push({
        content: finalContent.trim(),
        index: i,
        metadata: {
          startChar,
          endChar,
          // filter(Boolean) avoids counting empty strings from leading whitespace
          wordCount: finalContent.split(/\s+/).filter(Boolean).length,
        },
      });
      position = endChar;
    }
    return chunks;
  }
}
// Usage
const chunker = new TextChunker({
  chunkSize: 1000,
  chunkOverlap: 100,
});

const document = `
Your long document text goes here...
It can span multiple paragraphs.
Each paragraph will be handled appropriately.
`;

const chunks = chunker.chunk(document);
console.log(`Created ${chunks.length} chunks`);
for (const chunk of chunks) {
  console.log(`\nChunk ${chunk.index}:`);
  console.log(`Words: ${chunk.metadata.wordCount}`);
  console.log(`Content: ${chunk.content.substring(0, 100)}...`);
}
Testing Your Chunking Strategy
Always test chunking with real documents:
function analyzeChunks(chunks: Chunk[]): void {
  const sizes = chunks.map((c) => c.content.length);
  const avgSize = sizes.reduce((a, b) => a + b, 0) / sizes.length;
  const minSize = Math.min(...sizes);
  const maxSize = Math.max(...sizes);
  console.log('Chunking Analysis:');
  console.log(`Total chunks: ${chunks.length}`);
  console.log(`Average size: ${avgSize.toFixed(0)} chars`);
  console.log(`Min size: ${minSize} chars`);
  console.log(`Max size: ${maxSize} chars`);
  // Flag chunks likely to hurt retrieval quality
  const tooSmall = chunks.filter((c) => c.content.length < 100);
  const tooLarge = chunks.filter((c) => c.content.length > 2000);
  if (tooSmall.length > 0) {
    console.log(`Warning: ${tooSmall.length} chunks are very small (< 100 chars)`);
  }
  if (tooLarge.length > 0) {
    console.log(`Warning: ${tooLarge.length} chunks are very large (> 2000 chars)`);
  }
}
Key Takeaways
- Chunking directly affects RAG quality - poor chunking leads to poor retrieval
- Recursive splitting preserves natural boundaries better than fixed-size chunking
- Overlap prevents context loss at chunk boundaries
- Different document types need different strategies - markdown, code, and prose require different approaches
- Test with real documents to find optimal parameters for your use case
Resources
| Resource | Type | Level |
|---|---|---|
| LangChain Text Splitters | Documentation | Intermediate |
| Chunking Strategies - Pinecone | Article | Beginner |
| Semantic Chunking Research | Paper | Advanced |
Next Lesson
In the next lesson, you will put everything together and build a complete Q&A system over your own documents.