Lesson 4.4: Open Source Models
Duration: 45 minutes
Learning Objectives
By the end of this lesson, you will be able to:
- Understand the open-source AI model landscape
- Know the key players: Llama, Mistral, and others
- Run open-source models locally and via hosted services
- Use Ollama for local model deployment
- Access open-source models through hosted APIs
- Evaluate when to use open-source vs. proprietary models
Introduction
Open-source AI models offer an alternative to proprietary APIs like those from OpenAI and Anthropic. Models from Meta (Llama), Mistral AI, and others can be run locally or through hosted services, giving you more control over data privacy, costs, and customization. In this lesson, you will learn about the open-source model ecosystem and how to integrate these models into your applications.
The Open-Source Model Landscape
┌─────────────────────────────────────────────────────────────────┐
│ Open Source Model Families │
├─────────────────────────────────────────────────────────────────┤
│ Llama 3 │ Meta's flagship open model │
│ (Meta) │ 8B and 70B parameter versions │
│ │ Best for: General purpose, coding │
├─────────────────┼───────────────────────────────────────────────┤
│ Mistral │ Efficient, high-quality models │
│ (Mistral AI) │ Mistral 7B, Mixtral 8x7B (MoE) │
│ │ Best for: Cost-effective inference │
├─────────────────┼───────────────────────────────────────────────┤
│ Qwen │ Strong multilingual support │
│ (Alibaba) │ Various sizes from 1.8B to 72B │
│ │ Best for: Asian languages, coding │
├─────────────────┼───────────────────────────────────────────────┤
│ Gemma │ Google's open models │
│ (Google) │ 2B and 7B versions │
│ │ Best for: Lightweight applications │
├─────────────────┼───────────────────────────────────────────────┤
│ CodeLlama │ Specialized for code │
│ (Meta) │ Built on Llama, optimized for coding │
│ │ Best for: Code generation, completion │
└─────────────────────────────────────────────────────────────────┘
Why Consider Open-Source Models?
| Benefit | Description |
|---|---|
| Data Privacy | Your data never leaves your infrastructure |
| Cost Control | No per-token fees after initial setup |
| Customization | Fine-tune for your specific use case |
| No Vendor Lock-in | Switch models without changing providers |
| Offline Capability | Run without internet connection |
| Full Control | You decide on moderation, filtering, and deployment policies |
Trade-offs
| Challenge | Description |
|---|---|
| Infrastructure | Requires GPU hardware or cloud compute |
| Maintenance | You handle updates, scaling, monitoring |
| Quality | The best open models still trail the top proprietary models |
| Expertise | Requires ML ops knowledge |
Running Models Locally with Ollama
Ollama makes it easy to run open-source models locally. It handles downloading, running, and exposing models through a simple API.
Installing Ollama
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows - download from https://ollama.com
Running Your First Model
# Start the Ollama service
ollama serve
# In another terminal, pull and run a model
ollama pull llama3.2
ollama run llama3.2
# Interactive chat starts
>>> Hello, what can you do?
Using Ollama API from TypeScript
Ollama exposes an OpenAI-compatible API:
import OpenAI from 'openai';
// Point to local Ollama server
const ollama = new OpenAI({
baseURL: 'http://localhost:11434/v1',
apiKey: 'ollama', // Ollama doesn't require a real API key
});
async function chat(prompt: string): Promise<string> {
const response = await ollama.chat.completions.create({
model: 'llama3.2',
messages: [{ role: 'user', content: prompt }],
});
return response.choices[0].message.content || '';
}
// Usage
const answer = await chat('Explain TypeScript generics in simple terms');
console.log(answer);
Using the Native Ollama Library
npm install ollama
import { Ollama } from 'ollama';
const ollama = new Ollama({ host: 'http://localhost:11434' });
async function chat(prompt: string): Promise<string> {
const response = await ollama.chat({
model: 'llama3.2',
messages: [{ role: 'user', content: prompt }],
});
return response.message.content;
}
// With streaming
async function chatStream(prompt: string): Promise<void> {
const stream = await ollama.chat({
model: 'llama3.2',
messages: [{ role: 'user', content: prompt }],
stream: true,
});
for await (const chunk of stream) {
process.stdout.write(chunk.message.content);
}
}
Managing Models with Ollama
import { Ollama } from 'ollama';
const ollama = new Ollama();
// List available models
async function listModels() {
const models = await ollama.list();
console.log('Available models:');
for (const model of models.models) {
console.log(`- ${model.name} (${model.size} bytes)`);
}
}
// Pull a new model
async function pullModel(modelName: string) {
console.log(`Pulling ${modelName}...`);
const progress = await ollama.pull({ model: modelName, stream: true });
for await (const status of progress) {
if (status.total && status.completed) {
const percent = Math.round((status.completed / status.total) * 100);
console.log(`Progress: ${percent}%`);
}
}
console.log('Done!');
}
// Delete a model
async function deleteModel(modelName: string) {
await ollama.delete({ model: modelName });
console.log(`Deleted ${modelName}`);
}
Popular Open-Source Models
Llama 3 (Meta)
Meta's Llama 3 is one of the most capable open-source models:
# Pull Llama 3.2 (3B parameters, runs on most hardware)
ollama pull llama3.2
# Pull Llama 3.1 (8B parameters, needs more RAM)
ollama pull llama3.1
# Pull Llama 3.1 70B (largest, needs powerful GPU)
ollama pull llama3.1:70b
const response = await ollama.chat({
model: 'llama3.2',
messages: [
{
role: 'system',
content: 'You are a helpful TypeScript expert.',
},
{
role: 'user',
content: 'How do I create a generic function in TypeScript?',
},
],
});
Mistral Models
Mistral AI produces efficient, high-quality models:
# Mistral 7B - excellent quality for its size
ollama pull mistral
# Mixtral 8x7B - Mixture of Experts, very capable
ollama pull mixtral
// Mistral works well for coding tasks
const response = await ollama.chat({
model: 'mistral',
messages: [
{
role: 'user',
content: `Review this TypeScript code for bugs:
\`\`\`typescript
function divide(a: number, b: number) {
return a / b;
}
\`\`\``,
},
],
});
CodeLlama (Meta)
Specialized for code generation and understanding:
ollama pull codellama
const response = await ollama.chat({
model: 'codellama',
messages: [
{
role: 'user',
content: 'Write a TypeScript function to validate email addresses using regex',
},
],
});
Comparing Model Capabilities
| Model | Parameters | RAM Required | Best For |
|---|---|---|---|
| Llama 3.2 | 3B | 4GB | Quick tasks, chat |
| Llama 3.1 | 8B | 8GB | General purpose |
| Mistral 7B | 7B | 6GB | Efficient inference |
| Mixtral 8x7B | 46B (active 12B) | 32GB | Complex reasoning |
| CodeLlama | 7B/13B/34B | 6-32GB | Code tasks |
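As a rough way to apply this table in code, the sketch below picks the most capable installed model that fits in the machine's total memory. The RAM figures mirror the table above and are approximate guidelines, not exact requirements:
import os from 'os';
import { Ollama } from 'ollama';
// Approximate RAM needs (GB) from the table above; rough guidelines, not exact figures
const ramRequirementsGb: Record<string, number> = {
  'llama3.2': 4,
  'llama3.1': 8,
  mistral: 6,
  mixtral: 32,
  codellama: 6,
};
// Pick the most demanding installed model that still fits in total system memory
async function pickLocalModel(): Promise<string | undefined> {
  const ollama = new Ollama();
  const installed = (await ollama.list()).models.map((m) => m.name.split(':')[0]);
  const totalGb = os.totalmem() / 1024 ** 3;
  return Object.entries(ramRequirementsGb)
    .filter(([name, ramGb]) => installed.includes(name) && ramGb <= totalGb)
    .sort(([, a], [, b]) => b - a)
    .map(([name]) => name)[0];
}
console.log(`Selected model: ${await pickLocalModel()}`);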
Hosted Open-Source Model APIs
Several services host open-source models, giving you the benefits without managing infrastructure.
Together AI
Together AI provides fast inference for open-source models:
npm install together-ai
import Together from 'together-ai';
const together = new Together({
apiKey: process.env.TOGETHER_API_KEY,
});
async function chat(prompt: string): Promise<string> {
const response = await together.chat.completions.create({
model: 'meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo',
messages: [{ role: 'user', content: prompt }],
});
return response.choices[0].message?.content || '';
}
Groq
Groq runs open models on its custom inference hardware, which makes it one of the fastest hosted options:
npm install groq-sdk
import Groq from 'groq-sdk';
const groq = new Groq({
apiKey: process.env.GROQ_API_KEY,
});
async function fastChat(prompt: string): Promise<string> {
const response = await groq.chat.completions.create({
model: 'llama-3.1-70b-versatile',
messages: [{ role: 'user', content: prompt }],
});
return response.choices[0].message.content || '';
}
// Groq is known for very low latency
const start = Date.now();
const answer = await fastChat('What is TypeScript?');
console.log(`Response in ${Date.now() - start}ms`);
Hugging Face Inference API
Access thousands of models through Hugging Face:
npm install @huggingface/inference
import { HfInference } from '@huggingface/inference';
const hf = new HfInference(process.env.HF_TOKEN);
async function chat(prompt: string): Promise<string> {
const response = await hf.chatCompletion({
model: 'mistralai/Mistral-7B-Instruct-v0.2',
messages: [{ role: 'user', content: prompt }],
max_tokens: 500,
});
return response.choices[0].message.content || '';
}
// Text generation (non-chat models)
async function generateText(prompt: string): Promise<string> {
const result = await hf.textGeneration({
model: 'bigcode/starcoder',
inputs: prompt,
parameters: { max_new_tokens: 200 },
});
return result.generated_text;
}
Building a Provider-Agnostic Client
Create a unified interface that works with any provider:
import OpenAI from 'openai';
interface ChatMessage {
role: 'system' | 'user' | 'assistant';
content: string;
}
interface ChatOptions {
model: string;
temperature?: number;
maxTokens?: number;
}
interface AIProvider {
chat(messages: ChatMessage[], options: ChatOptions): Promise<string>;
}
// OpenAI-compatible provider (works with Ollama, Together, Groq, etc.)
class OpenAICompatibleProvider implements AIProvider {
private client: OpenAI;
constructor(config: { baseURL?: string; apiKey: string }) {
this.client = new OpenAI(config);
}
async chat(messages: ChatMessage[], options: ChatOptions): Promise<string> {
const response = await this.client.chat.completions.create({
model: options.model,
messages,
temperature: options.temperature,
max_tokens: options.maxTokens,
});
return response.choices[0].message.content || '';
}
}
// Create providers for different services
const providers = {
openai: new OpenAICompatibleProvider({
apiKey: process.env.OPENAI_API_KEY || '',
}),
ollama: new OpenAICompatibleProvider({
baseURL: 'http://localhost:11434/v1',
apiKey: 'ollama',
}),
together: new OpenAICompatibleProvider({
baseURL: 'https://api.together.xyz/v1',
apiKey: process.env.TOGETHER_API_KEY || '',
}),
groq: new OpenAICompatibleProvider({
baseURL: 'https://api.groq.com/openai/v1',
apiKey: process.env.GROQ_API_KEY || '',
}),
};
// Use any provider with the same interface
async function askQuestion(provider: AIProvider, question: string, model: string): Promise<string> {
return provider.chat([{ role: 'user', content: question }], { model, temperature: 0.7 });
}
// Examples
const answer1 = await askQuestion(providers.ollama, 'What is TypeScript?', 'llama3.2');
const answer2 = await askQuestion(providers.groq, 'What is TypeScript?', 'llama-3.1-70b-versatile');
Running Models in Production
Using Docker with Ollama
# Dockerfile.ollama - extends the base image with pre-pulled models
FROM ollama/ollama
# `ollama pull` needs a running server, so start one in the background for this build step
RUN ollama serve & sleep 5 && ollama pull llama3.2 && ollama pull mistral
# docker-compose.yml
version: '3.8'
services:
  ollama:
    build:
      context: .
      dockerfile: Dockerfile.ollama
    ports:
      - '11434:11434'
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
  app:
    build: . # the application's own Dockerfile (not shown here)
    depends_on:
      - ollama
    environment:
      - OLLAMA_HOST=http://ollama:11434
volumes:
  ollama_data:
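Inside the app container, the client should read the OLLAMA_HOST variable set by the compose file. A minimal sketch, falling back to localhost for local development:
import { Ollama } from 'ollama';
// OLLAMA_HOST is injected by docker-compose; fall back to localhost for local development
const host = process.env.OLLAMA_HOST ?? 'http://localhost:11434';
const ollama = new Ollama({ host });
const response = await ollama.chat({
  model: 'llama3.2',
  messages: [{ role: 'user', content: 'Ping from the app container' }],
});
console.log(response.message.content);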
Health Checks and Monitoring
import { Ollama } from 'ollama';
class OllamaHealthCheck {
private ollama: Ollama;
private model: string;
constructor(host: string, model: string) {
this.ollama = new Ollama({ host });
this.model = model;
}
async isHealthy(): Promise<boolean> {
try {
const response = await this.ollama.chat({
model: this.model,
messages: [{ role: 'user', content: 'Hi' }],
});
return response.message.content.length > 0;
} catch {
return false;
}
}
async waitForReady(timeoutMs: number = 60000): Promise<void> {
const start = Date.now();
while (Date.now() - start < timeoutMs) {
if (await this.isHealthy()) {
console.log('Ollama is ready');
return;
}
console.log('Waiting for Ollama...');
await new Promise((resolve) => setTimeout(resolve, 2000));
}
throw new Error('Ollama failed to start within timeout');
}
}
// Usage
const health = new OllamaHealthCheck('http://localhost:11434', 'llama3.2');
await health.waitForReady();
Comparing Open-Source vs. Proprietary
| Aspect | Open-Source (Self-Hosted) | Proprietary (API) |
|---|---|---|
| Data Privacy | Full control | Data sent to provider |
| Cost Structure | Fixed infrastructure cost | Pay per token |
| Latency | Can be lower (local) | Network dependent |
| Quality | Good, improving rapidly | Generally higher |
| Maintenance | You handle everything | Provider handles |
| Scaling | Manual scaling needed | Automatic |
| Availability | Depends on your setup | High SLAs |
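The cost rows above become concrete with a quick break-even estimate. The numbers in this sketch (GPU server cost, blended API price) are illustrative assumptions, not real quotes:
// Rough break-even estimate: self-hosted GPU server vs. pay-per-token API
// Both prices below are illustrative assumptions, not real quotes
const gpuServerMonthlyUsd = 1200; // assumed cost of a dedicated GPU server per month
const apiCostPerMillionTokens = 0.5; // assumed blended API price per 1M tokens
// Monthly token volume at which self-hosting costs the same as the API
function breakEvenTokensPerMonth(): number {
  return (gpuServerMonthlyUsd / apiCostPerMillionTokens) * 1_000_000;
}
const tokens = breakEvenTokensPerMonth();
console.log(`Self-hosting breaks even at ~${(tokens / 1e9).toFixed(1)}B tokens/month`);
console.log('Below that volume, pay-per-token pricing is cheaper on raw compute cost.');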
When to Use Open-Source
- Strict data privacy requirements - Healthcare, finance, legal
- High-volume, predictable workloads - Cost-effective at scale
- Offline requirements - Edge deployments, air-gapped systems
- Customization needs - Fine-tuning for specific domains
- Latency-sensitive - When network round-trip is too slow
When to Use Proprietary
- Highest quality needed - GPT-4, Claude still lead
- Variable workloads - Pay only for what you use
- Quick start - No infrastructure to manage
- Limited ML expertise - Providers handle optimization
- Global scale - Built-in worldwide infrastructure
Exercises
Exercise 1: Local Chat Application
Create a simple chat application using Ollama:
// Your implementation here
async function localChatbot(): Promise<void> {
// TODO: Create an interactive chat loop using Ollama
// - Use llama3.2 model
// - Maintain conversation history
// - Handle exit commands
}
Solution
import { Ollama } from 'ollama';
import * as readline from 'readline';
interface Message {
role: 'system' | 'user' | 'assistant';
content: string;
}
async function localChatbot(): Promise<void> {
const ollama = new Ollama();
const messages: Message[] = [
{
role: 'system',
content: 'You are a helpful assistant. Be concise but thorough.',
},
];
const rl = readline.createInterface({
input: process.stdin,
output: process.stdout,
});
console.log("Chat started. Type 'exit' to quit.\n");
const prompt = (query: string): Promise<string> => {
return new Promise((resolve) => rl.question(query, resolve));
};
while (true) {
const userInput = await prompt('You: ');
if (userInput.toLowerCase() === 'exit') {
console.log('Goodbye!');
rl.close();
break;
}
messages.push({ role: 'user', content: userInput });
try {
process.stdout.write('Assistant: ');
const stream = await ollama.chat({
model: 'llama3.2',
messages,
stream: true,
});
let fullResponse = '';
for await (const chunk of stream) {
process.stdout.write(chunk.message.content);
fullResponse += chunk.message.content;
}
console.log('\n');
messages.push({ role: 'assistant', content: fullResponse });
} catch (error) {
console.error('Error:', error);
}
}
}
localChatbot();
Exercise 2: Model Comparison Tool
Create a tool that sends the same prompt to multiple models and compares responses:
// Your implementation here
interface ComparisonResult {
model: string;
response: string;
latencyMs: number;
}
async function compareModels(prompt: string, models: string[]): Promise<ComparisonResult[]> {
  // TODO: Send prompt to each model and collect results
  return []; // placeholder so the stub type-checks
}
Solution
import { Ollama } from 'ollama';
interface ComparisonResult {
model: string;
response: string;
latencyMs: number;
}
async function compareModels(prompt: string, models: string[]): Promise<ComparisonResult[]> {
const ollama = new Ollama();
const results: ComparisonResult[] = [];
for (const model of models) {
console.log(`Testing ${model}...`);
const start = Date.now();
try {
const response = await ollama.chat({
model,
messages: [{ role: 'user', content: prompt }],
});
results.push({
model,
response: response.message.content,
latencyMs: Date.now() - start,
});
} catch (error) {
results.push({
model,
response: `Error: ${error}`,
latencyMs: Date.now() - start,
});
}
}
return results;
}
// Test
const results = await compareModels(
'Write a TypeScript function to reverse a string. Be concise.',
['llama3.2', 'mistral', 'codellama']
);
console.log('\n=== Comparison Results ===\n');
for (const result of results) {
console.log(`Model: ${result.model}`);
console.log(`Latency: ${result.latencyMs}ms`);
console.log(`Response:\n${result.response}\n`);
console.log('-'.repeat(50));
}
Exercise 3: Fallback Chain
Implement a system that tries multiple providers with fallback:
// Your implementation here
class FallbackChain {
// TODO: Implement a chain that:
// - Tries providers in order
// - Falls back to next on failure
// - Returns first successful response
}
Solution
import OpenAI from 'openai';
interface Provider {
name: string;
client: OpenAI;
model: string;
}
class FallbackChain {
private providers: Provider[];
constructor(providers: Provider[]) {
this.providers = providers;
}
async chat(
messages: Array<{ role: 'user' | 'system' | 'assistant'; content: string }>
): Promise<{ response: string; provider: string }> {
const errors: Array<{ provider: string; error: string }> = [];
for (const provider of this.providers) {
try {
console.log(`Trying ${provider.name}...`);
const response = await provider.client.chat.completions.create({
model: provider.model,
messages,
});
const content = response.choices[0].message.content || '';
console.log(`Success with ${provider.name}`);
return {
response: content,
provider: provider.name,
};
} catch (error) {
const errorMsg = error instanceof Error ? error.message : String(error);
console.log(`${provider.name} failed: ${errorMsg}`);
errors.push({ provider: provider.name, error: errorMsg });
}
}
throw new Error(
`All providers failed:\n${errors.map((e) => `${e.provider}: ${e.error}`).join('\n')}`
);
}
}
// Usage
const chain = new FallbackChain([
{
name: 'Ollama (local)',
client: new OpenAI({ baseURL: 'http://localhost:11434/v1', apiKey: 'ollama' }),
model: 'llama3.2',
},
{
name: 'Groq',
client: new OpenAI({
baseURL: 'https://api.groq.com/openai/v1',
apiKey: process.env.GROQ_API_KEY || '',
}),
model: 'llama-3.1-70b-versatile',
},
{
name: 'OpenAI',
client: new OpenAI({ apiKey: process.env.OPENAI_API_KEY || '' }),
model: 'gpt-4o-mini',
},
]);
const result = await chain.chat([{ role: 'user', content: 'What is TypeScript?' }]);
console.log(`\nUsed: ${result.provider}`);
console.log(`Response: ${result.response}`);
Key Takeaways
- Ollama: The easiest way to run open-source models locally
- Model Variety: Llama, Mistral, CodeLlama each have strengths
- OpenAI-Compatible: Most services use OpenAI-compatible APIs
- Hosted Options: Together AI, Groq, Hugging Face host models for you
- Trade-offs: Balance privacy, cost, quality, and maintenance
- Fallbacks: Use multiple providers for reliability
- Choose Wisely: Match the model to your specific requirements
Resources
| Resource | Type | Description |
|---|---|---|
| Ollama | Tool | Local model runner |
| Hugging Face | Platform | Model hub and hosting |
| Together AI | Platform | Fast inference API |
| Groq | Platform | Ultra-fast inference |
| LM Studio | Tool | GUI for local models |
Next Lesson
You have learned about open-source models and how to run them. In the next lesson, you will learn how to compare and choose between all the providers we have covered, with a practical decision framework.