From Zero to AI

Lesson 4.4: Open Source Models

Duration: 45 minutes

Learning Objectives

By the end of this lesson, you will be able to:

  • Understand the open-source AI model landscape
  • Know the key players: Llama, Mistral, and others
  • Run open-source models locally and via hosted services
  • Use Ollama for local model deployment
  • Access open-source models through hosted APIs
  • Evaluate when to use open-source vs. proprietary models

Introduction

Open-source AI models offer an alternative to proprietary APIs like OpenAI and Anthropic. Models from Meta (Llama), Mistral AI, and others can be run locally or through hosted services, giving you more control over data privacy, costs, and customization. In this lesson, you will learn about the open-source model ecosystem and how to integrate these models into your applications.


The Open-Source Model Landscape

┌─────────────────────────────────────────────────────────────────┐
│              Open Source Model Families                          │
├─────────────────────────────────────────────────────────────────┤
│  Llama 3       │ Meta's flagship open model                     │
│  (Meta)        │ 8B and 70B parameter versions                  │
│                │ Best for: General purpose, coding              │
├─────────────────┼───────────────────────────────────────────────┤
│  Mistral       │ Efficient, high-quality models                 │
│  (Mistral AI)  │ Mistral 7B, Mixtral 8x7B (MoE)                │
│                │ Best for: Cost-effective inference             │
├─────────────────┼───────────────────────────────────────────────┤
│  Qwen          │ Strong multilingual support                    │
│  (Alibaba)     │ Various sizes from 1.8B to 72B                │
│                │ Best for: Asian languages, coding              │
├─────────────────┼───────────────────────────────────────────────┤
│  Gemma         │ Google's open models                           │
│  (Google)      │ 2B and 7B versions                             │
│                │ Best for: Lightweight applications             │
├─────────────────┼───────────────────────────────────────────────┤
│  CodeLlama     │ Specialized for code                           │
│  (Meta)        │ Built on Llama, optimized for coding          │
│                │ Best for: Code generation, completion          │
└─────────────────────────────────────────────────────────────────┘

Why Consider Open-Source Models?

Benefit              Description
Data Privacy         Your data never leaves your infrastructure
Cost Control         No per-token fees after the initial setup
Customization        Fine-tune for your specific use case
No Vendor Lock-in    Switch models without changing providers
Offline Capability   Run without an internet connection
Full Control         You decide on content filtering and usage policies

Trade-offs

Challenge         Description
Infrastructure    Requires GPU hardware or cloud compute
Maintenance       You handle updates, scaling, and monitoring
Performance       Often behind proprietary models in quality
Expertise         Requires MLOps knowledge

Running Models Locally with Ollama

Ollama makes it easy to run open-source models locally. It handles downloading, running, and exposing models through a simple API.

Installing Ollama

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows - download from https://ollama.com

Running Your First Model

# Start the Ollama service
ollama serve

# In another terminal, pull and run a model
ollama pull llama3.2
ollama run llama3.2

# Interactive chat starts
>>> Hello, what can you do?

Using Ollama API from TypeScript

Ollama exposes an OpenAI-compatible API:

import OpenAI from 'openai';

// Point to local Ollama server
const ollama = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama', // Ollama doesn't require a real API key
});

async function chat(prompt: string): Promise<string> {
  const response = await ollama.chat.completions.create({
    model: 'llama3.2',
    messages: [{ role: 'user', content: prompt }],
  });

  return response.choices[0].message.content || '';
}

// Usage
const answer = await chat('Explain TypeScript generics in simple terms');
console.log(answer);

Using the Native Ollama Library

npm install ollama

import { Ollama } from 'ollama';

const ollama = new Ollama({ host: 'http://localhost:11434' });

async function chat(prompt: string): Promise<string> {
  const response = await ollama.chat({
    model: 'llama3.2',
    messages: [{ role: 'user', content: prompt }],
  });

  return response.message.content;
}

// With streaming
async function chatStream(prompt: string): Promise<void> {
  const stream = await ollama.chat({
    model: 'llama3.2',
    messages: [{ role: 'user', content: prompt }],
    stream: true,
  });

  for await (const chunk of stream) {
    process.stdout.write(chunk.message.content);
  }
}

Managing Models with Ollama

import { Ollama } from 'ollama';

const ollama = new Ollama();

// List available models
async function listModels() {
  const models = await ollama.list();
  console.log('Available models:');
  for (const model of models.models) {
    console.log(`- ${model.name} (${model.size} bytes)`);
  }
}

// Pull a new model
async function pullModel(modelName: string) {
  console.log(`Pulling ${modelName}...`);
  const progress = await ollama.pull({ model: modelName, stream: true });

  for await (const status of progress) {
    if (status.total && status.completed) {
      const percent = Math.round((status.completed / status.total) * 100);
      console.log(`Progress: ${percent}%`);
    }
  }
  console.log('Done!');
}

// Delete a model
async function deleteModel(modelName: string) {
  await ollama.delete({ model: modelName });
  console.log(`Deleted ${modelName}`);
}

Llama 3 (Meta)

Meta's Llama 3 family is among the most capable open-source models available:

# Pull Llama 3.2 (3B parameters, runs on most hardware)
ollama pull llama3.2

# Pull Llama 3.1 (8B parameters, needs more RAM)
ollama pull llama3.1

# Pull Llama 3.1 70B (largest, needs powerful GPU)
ollama pull llama3.1:70b

const response = await ollama.chat({
  model: 'llama3.2',
  messages: [
    {
      role: 'system',
      content: 'You are a helpful TypeScript expert.',
    },
    {
      role: 'user',
      content: 'How do I create a generic function in TypeScript?',
    },
  ],
});

Mistral Models

Mistral AI produces efficient, high-quality models:

# Mistral 7B - excellent quality for its size
ollama pull mistral

# Mixtral 8x7B - Mixture of Experts, very capable
ollama pull mixtral

// Mistral works well for coding tasks
const response = await ollama.chat({
  model: 'mistral',
  messages: [
    {
      role: 'user',
      content: `Review this TypeScript code for bugs:

\`\`\`typescript
function divide(a: number, b: number) {
  return a / b;
}
\`\`\``,
    },
  ],
});

CodeLlama (Meta)

Specialized for code generation and understanding:

ollama pull codellama

const response = await ollama.chat({
  model: 'codellama',
  messages: [
    {
      role: 'user',
      content: 'Write a TypeScript function to validate email addresses using regex',
    },
  ],
});

Comparing Model Capabilities

Model          Parameters         RAM Required   Best For
Llama 3.2      3B                 4GB            Quick tasks, chat
Llama 3.1      8B                 8GB            General purpose
Mistral 7B     7B                 6GB            Efficient inference
Mixtral 8x7B   46B (active 12B)   32GB           Complex reasoning
CodeLlama      7B/13B/34B         6-32GB         Code tasks
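
If you are not sure which of these will fit on your machine, you can check what is already installed and how large each model is before choosing one. Below is a minimal sketch using the ollama library's list() call; the details fields (parameter size, quantization level) are reported by recent Ollama versions, so treat those field names as an assumption if you are on an older release.

import { Ollama } from 'ollama';

// Print each installed model with its parameter count, quantization, and
// on-disk size, to compare against the RAM guidance in the table above.
async function printInstalledModels(): Promise<void> {
  const ollama = new Ollama();
  const { models } = await ollama.list();

  for (const m of models) {
    const sizeGb = (m.size / 1024 ** 3).toFixed(1);
    console.log(
      `${m.name}: ${m.details?.parameter_size ?? 'unknown'} params, ` +
        `${m.details?.quantization_level ?? 'unknown'} quantization, ${sizeGb} GB on disk`
    );
  }
}

await printInstalledModels();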

Hosted Open-Source Model APIs

Several services host open-source models, giving you the benefits without managing infrastructure.

Together AI

Together AI provides fast inference for open-source models:

npm install together-ai

import Together from 'together-ai';

const together = new Together({
  apiKey: process.env.TOGETHER_API_KEY,
});

async function chat(prompt: string): Promise<string> {
  const response = await together.chat.completions.create({
    model: 'meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo',
    messages: [{ role: 'user', content: prompt }],
  });

  return response.choices[0].message?.content || '';
}

Groq

Groq offers extremely fast inference:

npm install groq-sdk

import Groq from 'groq-sdk';

const groq = new Groq({
  apiKey: process.env.GROQ_API_KEY,
});

async function fastChat(prompt: string): Promise<string> {
  const response = await groq.chat.completions.create({
    model: 'llama-3.1-70b-versatile',
    messages: [{ role: 'user', content: prompt }],
  });

  return response.choices[0].message.content || '';
}

// Groq is known for very low latency
const start = Date.now();
const answer = await fastChat('What is TypeScript?');
console.log(`Response in ${Date.now() - start}ms`);

Hugging Face Inference API

Access thousands of models through Hugging Face:

npm install @huggingface/inference

import { HfInference } from '@huggingface/inference';

const hf = new HfInference(process.env.HF_TOKEN);

async function chat(prompt: string): Promise<string> {
  const response = await hf.chatCompletion({
    model: 'mistralai/Mistral-7B-Instruct-v0.2',
    messages: [{ role: 'user', content: prompt }],
    max_tokens: 500,
  });

  return response.choices[0].message.content || '';
}

// Text generation (non-chat models)
async function generateText(prompt: string): Promise<string> {
  const result = await hf.textGeneration({
    model: 'bigcode/starcoder',
    inputs: prompt,
    parameters: { max_new_tokens: 200 },
  });

  return result.generated_text;
}

Building a Provider-Agnostic Client

Create a unified interface that works with any provider:

import OpenAI from 'openai';

interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

interface ChatOptions {
  model: string;
  temperature?: number;
  maxTokens?: number;
}

interface AIProvider {
  chat(messages: ChatMessage[], options: ChatOptions): Promise<string>;
}

// OpenAI-compatible provider (works with Ollama, Together, Groq, etc.)
class OpenAICompatibleProvider implements AIProvider {
  private client: OpenAI;

  constructor(config: { baseURL?: string; apiKey: string }) {
    this.client = new OpenAI(config);
  }

  async chat(messages: ChatMessage[], options: ChatOptions): Promise<string> {
    const response = await this.client.chat.completions.create({
      model: options.model,
      messages,
      temperature: options.temperature,
      max_tokens: options.maxTokens,
    });

    return response.choices[0].message.content || '';
  }
}

// Create providers for different services
const providers = {
  openai: new OpenAICompatibleProvider({
    apiKey: process.env.OPENAI_API_KEY || '',
  }),

  ollama: new OpenAICompatibleProvider({
    baseURL: 'http://localhost:11434/v1',
    apiKey: 'ollama',
  }),

  together: new OpenAICompatibleProvider({
    baseURL: 'https://api.together.xyz/v1',
    apiKey: process.env.TOGETHER_API_KEY || '',
  }),

  groq: new OpenAICompatibleProvider({
    baseURL: 'https://api.groq.com/openai/v1',
    apiKey: process.env.GROQ_API_KEY || '',
  }),
};

// Use any provider with the same interface
async function askQuestion(provider: AIProvider, question: string, model: string): Promise<string> {
  return provider.chat([{ role: 'user', content: question }], { model, temperature: 0.7 });
}

// Examples
const answer1 = await askQuestion(providers.ollama, 'What is TypeScript?', 'llama3.2');

const answer2 = await askQuestion(providers.groq, 'What is TypeScript?', 'llama-3.1-70b-versatile');

Running Models in Production

Using Docker with Ollama

# Dockerfile
FROM ollama/ollama

# Pre-pull models during build. `ollama pull` needs a running server, so start
# one temporarily inside the same RUN step (a common workaround).
RUN ollama serve & sleep 5 && ollama pull llama3.2 && ollama pull mistral

# docker-compose.yml
version: '3.8'
services:
  ollama:
    image: ollama/ollama
    ports:
      - '11434:11434'
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  app:
    build: .
    depends_on:
      - ollama
    environment:
      - OLLAMA_HOST=http://ollama:11434

volumes:
  ollama_data:
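
The app service above only receives an OLLAMA_HOST environment variable; it is up to your code to read it. Here is a minimal sketch of how the application container might construct its client from that variable (the localhost fallback is an assumption for running outside Docker):

import { Ollama } from 'ollama';

// Inside Docker Compose, OLLAMA_HOST points at the ollama service;
// outside Docker, fall back to a locally running server.
const host = process.env.OLLAMA_HOST ?? 'http://localhost:11434';
const ollama = new Ollama({ host });

const response = await ollama.chat({
  model: 'llama3.2',
  messages: [{ role: 'user', content: 'Ping from the app container' }],
});
console.log(response.message.content);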

Health Checks and Monitoring

import { Ollama } from 'ollama';

class OllamaHealthCheck {
  private ollama: Ollama;
  private model: string;

  constructor(host: string, model: string) {
    this.ollama = new Ollama({ host });
    this.model = model;
  }

  async isHealthy(): Promise<boolean> {
    try {
      const response = await this.ollama.chat({
        model: this.model,
        messages: [{ role: 'user', content: 'Hi' }],
      });
      return response.message.content.length > 0;
    } catch {
      return false;
    }
  }

  async waitForReady(timeoutMs: number = 60000): Promise<void> {
    const start = Date.now();

    while (Date.now() - start < timeoutMs) {
      if (await this.isHealthy()) {
        console.log('Ollama is ready');
        return;
      }
      console.log('Waiting for Ollama...');
      await new Promise((resolve) => setTimeout(resolve, 2000));
    }

    throw new Error('Ollama failed to start within timeout');
  }
}

// Usage
const health = new OllamaHealthCheck('http://localhost:11434', 'llama3.2');
await health.waitForReady();
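
The chat-based check above also verifies that the selected model can actually generate a response, at the cost of loading it into memory. If you only need to know that the Ollama server is reachable (for example, in a container healthcheck), a lighter probe such as list() is usually enough. A small sketch:

import { Ollama } from 'ollama';

// Liveness-only probe: confirms the server answers, without loading a
// model into memory or spending time on a generation.
async function serverIsUp(host: string): Promise<boolean> {
  const ollama = new Ollama({ host });
  try {
    await ollama.list();
    return true;
  } catch {
    return false;
  }
}

console.log(await serverIsUp('http://localhost:11434'));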

Comparing Open-Source vs. Proprietary

Aspect           Open-Source (Self-Hosted)    Proprietary (API)
Data Privacy     Full control                 Data sent to provider
Cost Structure   Fixed infrastructure cost    Pay per token
Latency          Can be lower (local)         Network dependent
Quality          Good, improving rapidly      Generally higher
Maintenance      You handle everything        Provider handles
Scaling          Manual scaling needed        Automatic
Availability     Depends on your setup        High SLAs
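
The Cost Structure row is easiest to reason about with a quick break-even estimate: a fixed monthly GPU cost wins once your token volume is high enough. The numbers below are hypothetical placeholders, not real quotes; substitute your actual hardware and per-token prices.

// Rough break-even sketch. All figures are hypothetical placeholders.
const gpuMonthlyCost = 600;          // rented GPU instance, USD per month (assumption)
const apiCostPerMillionTokens = 0.5; // blended input/output API price, USD (assumption)

const breakEvenTokens = (gpuMonthlyCost / apiCostPerMillionTokens) * 1_000_000;
console.log(`Self-hosting breaks even around ${breakEvenTokens.toLocaleString()} tokens per month`);
// With these placeholders: around 1,200,000,000 tokens per month.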

When to Use Open-Source

  1. Strict data privacy requirements - Healthcare, finance, legal
  2. High-volume, predictable workloads - Cost-effective at scale
  3. Offline requirements - Edge deployments, air-gapped systems
  4. Customization needs - Fine-tuning for specific domains
  5. Latency-sensitive - When network round-trip is too slow

When to Use Proprietary

  1. Highest quality needed - GPT-4, Claude still lead
  2. Variable workloads - Pay only for what you use
  3. Quick start - No infrastructure to manage
  4. Limited ML expertise - Providers handle optimization
  5. Global scale - Built-in worldwide infrastructure

Exercises

Exercise 1: Local Chat Application

Create a simple chat application using Ollama:

// Your implementation here
async function localChatbot(): Promise<void> {
  // TODO: Create an interactive chat loop using Ollama
  // - Use llama3.2 model
  // - Maintain conversation history
  // - Handle exit commands
}

Solution

import { Ollama } from 'ollama';
import * as readline from 'readline';

interface Message {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

async function localChatbot(): Promise<void> {
  const ollama = new Ollama();
  const messages: Message[] = [
    {
      role: 'system',
      content: 'You are a helpful assistant. Be concise but thorough.',
    },
  ];

  const rl = readline.createInterface({
    input: process.stdin,
    output: process.stdout,
  });

  console.log("Chat started. Type 'exit' to quit.\n");

  const prompt = (query: string): Promise<string> => {
    return new Promise((resolve) => rl.question(query, resolve));
  };

  while (true) {
    const userInput = await prompt('You: ');

    if (userInput.toLowerCase() === 'exit') {
      console.log('Goodbye!');
      rl.close();
      break;
    }

    messages.push({ role: 'user', content: userInput });

    try {
      process.stdout.write('Assistant: ');

      const stream = await ollama.chat({
        model: 'llama3.2',
        messages,
        stream: true,
      });

      let fullResponse = '';
      for await (const chunk of stream) {
        process.stdout.write(chunk.message.content);
        fullResponse += chunk.message.content;
      }
      console.log('\n');

      messages.push({ role: 'assistant', content: fullResponse });
    } catch (error) {
      console.error('Error:', error);
    }
  }
}

localChatbot();

Exercise 2: Model Comparison Tool

Create a tool that sends the same prompt to multiple models and compares responses:

// Your implementation here
interface ComparisonResult {
  model: string;
  response: string;
  latencyMs: number;
}

async function compareModels(prompt: string, models: string[]): Promise<ComparisonResult[]> {
  // TODO: Send prompt to each model and collect results
  return [];
}

Solution

import { Ollama } from 'ollama';

interface ComparisonResult {
  model: string;
  response: string;
  latencyMs: number;
}

async function compareModels(prompt: string, models: string[]): Promise<ComparisonResult[]> {
  const ollama = new Ollama();
  const results: ComparisonResult[] = [];

  for (const model of models) {
    console.log(`Testing ${model}...`);
    const start = Date.now();

    try {
      const response = await ollama.chat({
        model,
        messages: [{ role: 'user', content: prompt }],
      });

      results.push({
        model,
        response: response.message.content,
        latencyMs: Date.now() - start,
      });
    } catch (error) {
      results.push({
        model,
        response: `Error: ${error}`,
        latencyMs: Date.now() - start,
      });
    }
  }

  return results;
}

// Test
const results = await compareModels(
  'Write a TypeScript function to reverse a string. Be concise.',
  ['llama3.2', 'mistral', 'codellama']
);

console.log('\n=== Comparison Results ===\n');
for (const result of results) {
  console.log(`Model: ${result.model}`);
  console.log(`Latency: ${result.latencyMs}ms`);
  console.log(`Response:\n${result.response}\n`);
  console.log('-'.repeat(50));
}

Exercise 3: Fallback Chain

Implement a system that tries multiple providers with fallback:

// Your implementation here
class FallbackChain {
  // TODO: Implement a chain that:
  // - Tries providers in order
  // - Falls back to next on failure
  // - Returns first successful response
}

Solution

import OpenAI from 'openai';

interface Provider {
  name: string;
  client: OpenAI;
  model: string;
}

class FallbackChain {
  private providers: Provider[];

  constructor(providers: Provider[]) {
    this.providers = providers;
  }

  async chat(
    messages: Array<{ role: 'user' | 'system' | 'assistant'; content: string }>
  ): Promise<{ response: string; provider: string }> {
    const errors: Array<{ provider: string; error: string }> = [];

    for (const provider of this.providers) {
      try {
        console.log(`Trying ${provider.name}...`);

        const response = await provider.client.chat.completions.create({
          model: provider.model,
          messages,
        });

        const content = response.choices[0].message.content || '';
        console.log(`Success with ${provider.name}`);

        return {
          response: content,
          provider: provider.name,
        };
      } catch (error) {
        const errorMsg = error instanceof Error ? error.message : String(error);
        console.log(`${provider.name} failed: ${errorMsg}`);
        errors.push({ provider: provider.name, error: errorMsg });
      }
    }

    throw new Error(
      `All providers failed:\n${errors.map((e) => `${e.provider}: ${e.error}`).join('\n')}`
    );
  }
}

// Usage
const chain = new FallbackChain([
  {
    name: 'Ollama (local)',
    client: new OpenAI({ baseURL: 'http://localhost:11434/v1', apiKey: 'ollama' }),
    model: 'llama3.2',
  },
  {
    name: 'Groq',
    client: new OpenAI({
      baseURL: 'https://api.groq.com/openai/v1',
      apiKey: process.env.GROQ_API_KEY || '',
    }),
    model: 'llama-3.1-70b-versatile',
  },
  {
    name: 'OpenAI',
    client: new OpenAI({ apiKey: process.env.OPENAI_API_KEY || '' }),
    model: 'gpt-4o-mini',
  },
]);

const result = await chain.chat([{ role: 'user', content: 'What is TypeScript?' }]);

console.log(`\nUsed: ${result.provider}`);
console.log(`Response: ${result.response}`);

Key Takeaways

  1. Ollama: The easiest way to run open-source models locally
  2. Model Variety: Llama, Mistral, CodeLlama each have strengths
  3. OpenAI-Compatible: Most services use OpenAI-compatible APIs
  4. Hosted Options: Together AI, Groq, Hugging Face host models for you
  5. Trade-offs: Balance privacy, cost, quality, and maintenance
  6. Fallbacks: Use multiple providers for reliability
  7. Choose Wisely: Match the model to your specific requirements

Resources

Resource       Type       Description
Ollama         Tool       Local model runner
Hugging Face   Platform   Model hub and hosting
Together AI    Platform   Fast inference API
Groq           Platform   Ultra-fast inference
LM Studio      Tool       GUI for local models

Next Lesson

You have learned about open-source models and how to run them. In the next lesson, you will learn how to compare and choose between all the providers we have covered, with a practical decision framework.

Continue to Lesson 4.5: Comparing and Choosing Providers