From Zero to AI

Lesson 4.4: Open Source Models

Duration: 45 minutes

Learning Objectives

By the end of this lesson, you will be able to:

  • Understand the open-source AI model landscape
  • Know the key players: Llama, Mistral, and others
  • Run open-source models locally and via hosted services
  • Use Ollama for local model deployment
  • Access open-source models through hosted APIs
  • Evaluate when to use open-source vs. proprietary models

Introduction

Open-source AI models offer an alternative to proprietary APIs like OpenAI and Anthropic. Models from Meta (Llama), Mistral AI, and others can be run locally or through hosted services, giving you more control over data privacy, costs, and customization. In this lesson, you will learn about the open-source model ecosystem and how to integrate these models into your applications.


The Open-Source Model Landscape

┌─────────────────────────────────────────────────────────────────┐
│              Open Source Model Families                          │
├─────────────────────────────────────────────────────────────────┤
│  Llama 3       │ Meta's flagship open model                     │
│  (Meta)        │ 8B and 70B parameter versions                  │
│                │ Best for: General purpose, coding              │
├─────────────────┼───────────────────────────────────────────────┤
│  Mistral       │ Efficient, high-quality models                 │
│  (Mistral AI)  │ Mistral 7B, Mixtral 8x7B (MoE)                │
│                │ Best for: Cost-effective inference             │
├─────────────────┼───────────────────────────────────────────────┤
│  Qwen          │ Strong multilingual support                    │
│  (Alibaba)     │ Various sizes from 1.8B to 72B                │
│                │ Best for: Asian languages, coding              │
├─────────────────┼───────────────────────────────────────────────┤
│  Gemma         │ Google's open models                           │
│  (Google)      │ 2B and 7B versions                             │
│                │ Best for: Lightweight applications             │
├─────────────────┼───────────────────────────────────────────────┤
│  CodeLlama     │ Specialized for code                           │
│  (Meta)        │ Built on Llama, optimized for coding          │
│                │ Best for: Code generation, completion          │
└─────────────────────────────────────────────────────────────────┘

Why Consider Open-Source Models?

Benefit              Description
Data Privacy         Your data never leaves your infrastructure
Cost Control         No per-token fees after the initial setup
Customization        Fine-tune for your specific use case
No Vendor Lock-in    Switch models without changing providers
Offline Capability   Run without an internet connection
Full Control         You decide on content filtering and usage policies

Trade-offs

Challenge         Description
Infrastructure    Requires GPU hardware or cloud compute
Maintenance       You handle updates, scaling, and monitoring
Performance       Often behind proprietary models in quality
Expertise         Requires MLOps knowledge

Running Models Locally with Ollama

Ollama makes it easy to run open-source models locally. It handles downloading, running, and exposing models through a simple API.

Installing Ollama

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows - download from https://ollama.com

Running Your First Model

# Start the Ollama service
ollama serve

# In another terminal, pull and run a model
ollama pull llama3.2
ollama run llama3.2

# Interactive chat starts
>>> Hello, what can you do?

Using Ollama API from TypeScript

Ollama exposes an OpenAI-compatible API:

import OpenAI from 'openai';

// Point to local Ollama server
const ollama = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama', // Ollama doesn't require a real API key
});

async function chat(prompt: string): Promise<string> {
  const response = await ollama.chat.completions.create({
    model: 'llama3.2',
    messages: [{ role: 'user', content: prompt }],
  });

  return response.choices[0].message.content || '';
}

// Usage
const answer = await chat('Explain TypeScript generics in simple terms');
console.log(answer);

Using the Native Ollama Library

npm install ollama

import { Ollama } from 'ollama';

const ollama = new Ollama({ host: 'http://localhost:11434' });

async function chat(prompt: string): Promise<string> {
  const response = await ollama.chat({
    model: 'llama3.2',
    messages: [{ role: 'user', content: prompt }],
  });

  return response.message.content;
}

// With streaming
async function chatStream(prompt: string): Promise<void> {
  const stream = await ollama.chat({
    model: 'llama3.2',
    messages: [{ role: 'user', content: prompt }],
    stream: true,
  });

  for await (const chunk of stream) {
    process.stdout.write(chunk.message.content);
  }
}

Managing Models with Ollama

import { Ollama } from 'ollama';

const ollama = new Ollama();

// List available models
async function listModels() {
  const models = await ollama.list();
  console.log('Available models:');
  for (const model of models.models) {
    console.log(`- ${model.name} (${model.size} bytes)`);
  }
}

// Pull a new model
async function pullModel(modelName: string) {
  console.log(`Pulling ${modelName}...`);
  const progress = await ollama.pull({ model: modelName, stream: true });

  for await (const status of progress) {
    if (status.total && status.completed) {
      const percent = Math.round((status.completed / status.total) * 100);
      console.log(`Progress: ${percent}%`);
    }
  }
  console.log('Done!');
}

// Delete a model
async function deleteModel(modelName: string) {
  await ollama.delete({ model: modelName });
  console.log(`Deleted ${modelName}`);
}

Llama 3 (Meta)

Meta's Llama 3 family is among the most capable open-source models available:

# Pull Llama 3.2 (3B parameters, runs on most hardware)
ollama pull llama3.2

# Pull Llama 3.1 (8B parameters, needs more RAM)
ollama pull llama3.1

# Pull Llama 3.1 70B (largest, needs powerful GPU)
ollama pull llama3.1:70b

const response = await ollama.chat({
  model: 'llama3.2',
  messages: [
    {
      role: 'system',
      content: 'You are a helpful TypeScript expert.',
    },
    {
      role: 'user',
      content: 'How do I create a generic function in TypeScript?',
    },
  ],
});

Mistral Models

Mistral AI produces efficient, high-quality models:

# Mistral 7B - excellent quality for its size
ollama pull mistral

# Mixtral 8x7B - Mixture of Experts, very capable
ollama pull mixtral

// Mistral works well for coding tasks
const response = await ollama.chat({
  model: 'mistral',
  messages: [
    {
      role: 'user',
      content: `Review this TypeScript code for bugs:

\`\`\`typescript
function divide(a: number, b: number) {
  return a / b;
}
\`\`\``,
    },
  ],
});

CodeLlama (Meta)

Specialized for code generation and understanding:

ollama pull codellama

const response = await ollama.chat({
  model: 'codellama',
  messages: [
    {
      role: 'user',
      content: 'Write a TypeScript function to validate email addresses using regex',
    },
  ],
});

Comparing Model Capabilities

Model          Parameters         RAM Required   Best For
Llama 3.2      3B                 4GB            Quick tasks, chat
Llama 3.1      8B                 8GB            General purpose
Mistral 7B     7B                 6GB            Efficient inference
Mixtral 8x7B   46B (active 12B)   32GB           Complex reasoning
CodeLlama      7B/13B/34B         6-32GB         Code tasks
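
If you are not sure which of these will fit on your machine, you can check what is already installed and how large each model is before choosing one. Below is a minimal sketch using the ollama library's list() call; the details fields (parameter size, quantization level) are reported by recent Ollama versions, so treat those field names as an assumption if you are on an older release.

import { Ollama } from 'ollama';

// Print each installed model with its parameter count, quantization, and
// on-disk size, to compare against the RAM guidance in the table above.
async function printInstalledModels(): Promise<void> {
  const ollama = new Ollama();
  const { models } = await ollama.list();

  for (const m of models) {
    const sizeGb = (m.size / 1024 ** 3).toFixed(1);
    console.log(
      `${m.name}: ${m.details?.parameter_size ?? 'unknown'} params, ` +
        `${m.details?.quantization_level ?? 'unknown'} quantization, ${sizeGb} GB on disk`
    );
  }
}

await printInstalledModels();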

Hosted Open-Source Model APIs

Several services host open-source models, giving you the benefits without managing infrastructure.

Together AI

Together AI provides fast inference for open-source models:

npm install together-ai

import Together from 'together-ai';

const together = new Together({
  apiKey: process.env.TOGETHER_API_KEY,
});

async function chat(prompt: string): Promise<string> {
  const response = await together.chat.completions.create({
    model: 'meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo',
    messages: [{ role: 'user', content: prompt }],
  });

  return response.choices[0].message?.content || '';
}

Groq

Groq offers extremely fast inference:

npm install groq-sdk

import Groq from 'groq-sdk';

const groq = new Groq({
  apiKey: process.env.GROQ_API_KEY,
});

async function fastChat(prompt: string): Promise<string> {
  const response = await groq.chat.completions.create({
    model: 'llama-3.1-70b-versatile',
    messages: [{ role: 'user', content: prompt }],
  });

  return response.choices[0].message.content || '';
}

// Groq is known for very low latency
const start = Date.now();
const answer = await fastChat('What is TypeScript?');
console.log(`Response in ${Date.now() - start}ms`);

Hugging Face Inference API

Access thousands of models through Hugging Face:

npm install @huggingface/inference

import { HfInference } from '@huggingface/inference';

const hf = new HfInference(process.env.HF_TOKEN);

async function chat(prompt: string): Promise<string> {
  const response = await hf.chatCompletion({
    model: 'mistralai/Mistral-7B-Instruct-v0.2',
    messages: [{ role: 'user', content: prompt }],
    max_tokens: 500,
  });

  return response.choices[0].message.content || '';
}

// Text generation (non-chat models)
async function generateText(prompt: string): Promise<string> {
  const result = await hf.textGeneration({
    model: 'bigcode/starcoder',
    inputs: prompt,
    parameters: { max_new_tokens: 200 },
  });

  return result.generated_text;
}

Building a Provider-Agnostic Client

Create a unified interface that works with any provider:

import OpenAI from 'openai';

interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

interface ChatOptions {
  model: string;
  temperature?: number;
  maxTokens?: number;
}

interface AIProvider {
  chat(messages: ChatMessage[], options: ChatOptions): Promise<string>;
}

// OpenAI-compatible provider (works with Ollama, Together, Groq, etc.)
class OpenAICompatibleProvider implements AIProvider {
  private client: OpenAI;

  constructor(config: { baseURL?: string; apiKey: string }) {
    this.client = new OpenAI(config);
  }

  async chat(messages: ChatMessage[], options: ChatOptions): Promise<string> {
    const response = await this.client.chat.completions.create({
      model: options.model,
      messages,
      temperature: options.temperature,
      max_tokens: options.maxTokens,
    });

    return response.choices[0].message.content || '';
  }
}

// Create providers for different services
const providers = {
  openai: new OpenAICompatibleProvider({
    apiKey: process.env.OPENAI_API_KEY || '',
  }),

  ollama: new OpenAICompatibleProvider({
    baseURL: 'http://localhost:11434/v1',
    apiKey: 'ollama',
  }),

  together: new OpenAICompatibleProvider({
    baseURL: 'https://api.together.xyz/v1',
    apiKey: process.env.TOGETHER_API_KEY || '',
  }),

  groq: new OpenAICompatibleProvider({
    baseURL: 'https://api.groq.com/openai/v1',
    apiKey: process.env.GROQ_API_KEY || '',
  }),
};

// Use any provider with the same interface
async function askQuestion(provider: AIProvider, question: string, model: string): Promise<string> {
  return provider.chat([{ role: 'user', content: question }], { model, temperature: 0.7 });
}

// Examples
const answer1 = await askQuestion(providers.ollama, 'What is TypeScript?', 'llama3.2');

const answer2 = await askQuestion(providers.groq, 'What is TypeScript?', 'llama-3.1-70b-versatile');

Running Models in Production

Using Docker with Ollama

# Dockerfile
FROM ollama/ollama

# Pre-pull models during build. `ollama pull` needs a running server, so start
# one temporarily inside the same RUN step (a common workaround).
RUN ollama serve & sleep 5 && ollama pull llama3.2 && ollama pull mistral

# docker-compose.yml
version: '3.8'
services:
  ollama:
    image: ollama/ollama
    ports:
      - '11434:11434'
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  app:
    build: .
    depends_on:
      - ollama
    environment:
      - OLLAMA_HOST=http://ollama:11434

volumes:
  ollama_data:
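
The app service above only receives an OLLAMA_HOST environment variable; it is up to your code to read it. Here is a minimal sketch of how the application container might construct its client from that variable (the localhost fallback is an assumption for running outside Docker):

import { Ollama } from 'ollama';

// Inside Docker Compose, OLLAMA_HOST points at the ollama service;
// outside Docker, fall back to a locally running server.
const host = process.env.OLLAMA_HOST ?? 'http://localhost:11434';
const ollama = new Ollama({ host });

const response = await ollama.chat({
  model: 'llama3.2',
  messages: [{ role: 'user', content: 'Ping from the app container' }],
});
console.log(response.message.content);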

Health Checks and Monitoring

import { Ollama } from 'ollama';

class OllamaHealthCheck {
  private ollama: Ollama;
  private model: string;

  constructor(host: string, model: string) {
    this.ollama = new Ollama({ host });
    this.model = model;
  }

  async isHealthy(): Promise<boolean> {
    try {
      const response = await this.ollama.chat({
        model: this.model,
        messages: [{ role: 'user', content: 'Hi' }],
      });
      return response.message.content.length > 0;
    } catch {
      return false;
    }
  }

  async waitForReady(timeoutMs: number = 60000): Promise<void> {
    const start = Date.now();

    while (Date.now() - start < timeoutMs) {
      if (await this.isHealthy()) {
        console.log('Ollama is ready');
        return;
      }
      console.log('Waiting for Ollama...');
      await new Promise((resolve) => setTimeout(resolve, 2000));
    }

    throw new Error('Ollama failed to start within timeout');
  }
}

// Usage
const health = new OllamaHealthCheck('http://localhost:11434', 'llama3.2');
await health.waitForReady();
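
The chat-based check above also verifies that the selected model can actually generate a response, at the cost of loading it into memory. If you only need to know that the Ollama server is reachable (for example, in a container healthcheck), a lighter probe such as list() is usually enough. A small sketch:

import { Ollama } from 'ollama';

// Liveness-only probe: confirms the server answers, without loading a
// model into memory or spending time on a generation.
async function serverIsUp(host: string): Promise<boolean> {
  const ollama = new Ollama({ host });
  try {
    await ollama.list();
    return true;
  } catch {
    return false;
  }
}

console.log(await serverIsUp('http://localhost:11434'));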

Comparing Open-Source vs. Proprietary

Aspect           Open-Source (Self-Hosted)    Proprietary (API)
Data Privacy     Full control                 Data sent to provider
Cost Structure   Fixed infrastructure cost    Pay per token
Latency          Can be lower (local)         Network dependent
Quality          Good, improving rapidly      Generally higher
Maintenance      You handle everything        Provider handles
Scaling          Manual scaling needed        Automatic
Availability     Depends on your setup        High SLAs
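
The Cost Structure row is easiest to reason about with a quick break-even estimate: a fixed monthly GPU cost wins once your token volume is high enough. The numbers below are hypothetical placeholders, not real quotes; substitute your actual hardware and per-token prices.

// Rough break-even sketch. All figures are hypothetical placeholders.
const gpuMonthlyCost = 600;          // rented GPU instance, USD per month (assumption)
const apiCostPerMillionTokens = 0.5; // blended input/output API price, USD (assumption)

const breakEvenTokens = (gpuMonthlyCost / apiCostPerMillionTokens) * 1_000_000;
console.log(`Self-hosting breaks even around ${breakEvenTokens.toLocaleString()} tokens per month`);
// With these placeholders: around 1,200,000,000 tokens per month.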

When to Use Open-Source

  1. Strict data privacy requirements - Healthcare, finance, legal
  2. High-volume, predictable workloads - Cost-effective at scale
  3. Offline requirements - Edge deployments, air-gapped systems
  4. Customization needs - Fine-tuning for specific domains
  5. Latency-sensitive - When network round-trip is too slow

When to Use Proprietary

  1. Highest quality needed - GPT-4, Claude still lead
  2. Variable workloads - Pay only for what you use
  3. Quick start - No infrastructure to manage
  4. Limited ML expertise - Providers handle optimization
  5. Global scale - Built-in worldwide infrastructure

Exercises

Exercise 1: Local Chat Application

Create a simple chat application using Ollama:

// Your implementation here
async function localChatbot(): Promise<void> {
  // TODO: Create an interactive chat loop using Ollama
  // - Use llama3.2 model
  // - Maintain conversation history
  // - Handle exit commands
}

Solution

import { Ollama } from 'ollama';
import * as readline from 'readline';

interface Message {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

async function localChatbot(): Promise<void> {
  const ollama = new Ollama();
  const messages: Message[] = [
    {
      role: 'system',
      content: 'You are a helpful assistant. Be concise but thorough.',
    },
  ];

  const rl = readline.createInterface({
    input: process.stdin,
    output: process.stdout,
  });

  console.log("Chat started. Type 'exit' to quit.\n");

  const prompt = (query: string): Promise<string> => {
    return new Promise((resolve) => rl.question(query, resolve));
  };

  while (true) {
    const userInput = await prompt('You: ');

    if (userInput.toLowerCase() === 'exit') {
      console.log('Goodbye!');
      rl.close();
      break;
    }

    messages.push({ role: 'user', content: userInput });

    try {
      process.stdout.write('Assistant: ');

      const stream = await ollama.chat({
        model: 'llama3.2',
        messages,
        stream: true,
      });

      let fullResponse = '';
      for await (const chunk of stream) {
        process.stdout.write(chunk.message.content);
        fullResponse += chunk.message.content;
      }
      console.log('\n');

      messages.push({ role: 'assistant', content: fullResponse });
    } catch (error) {
      console.error('Error:', error);
    }
  }
}

localChatbot();

Exercise 2: Model Comparison Tool

Create a tool that sends the same prompt to multiple models and compares responses:

// Your implementation here
interface ComparisonResult {
  model: string;
  response: string;
  latencyMs: number;
}

async function compareModels(prompt: string, models: string[]): Promise<ComparisonResult[]> {
  // TODO: Send prompt to each model and collect results
  return [];
}

Solution

import { Ollama } from 'ollama';

interface ComparisonResult {
  model: string;
  response: string;
  latencyMs: number;
}

async function compareModels(prompt: string, models: string[]): Promise<ComparisonResult[]> {
  const ollama = new Ollama();
  const results: ComparisonResult[] = [];

  for (const model of models) {
    console.log(`Testing ${model}...`);
    const start = Date.now();

    try {
      const response = await ollama.chat({
        model,
        messages: [{ role: 'user', content: prompt }],
      });

      results.push({
        model,
        response: response.message.content,
        latencyMs: Date.now() - start,
      });
    } catch (error) {
      results.push({
        model,
        response: `Error: ${error}`,
        latencyMs: Date.now() - start,
      });
    }
  }

  return results;
}

// Test
const results = await compareModels(
  'Write a TypeScript function to reverse a string. Be concise.',
  ['llama3.2', 'mistral', 'codellama']
);

console.log('\n=== Comparison Results ===\n');
for (const result of results) {
  console.log(`Model: ${result.model}`);
  console.log(`Latency: ${result.latencyMs}ms`);
  console.log(`Response:\n${result.response}\n`);
  console.log('-'.repeat(50));
}

Exercise 3: Fallback Chain

Implement a system that tries multiple providers with fallback:

// Your implementation here
class FallbackChain {
  // TODO: Implement a chain that:
  // - Tries providers in order
  // - Falls back to next on failure
  // - Returns first successful response
}

Solution

import OpenAI from 'openai';

interface Provider {
  name: string;
  client: OpenAI;
  model: string;
}

class FallbackChain {
  private providers: Provider[];

  constructor(providers: Provider[]) {
    this.providers = providers;
  }

  async chat(
    messages: Array<{ role: 'user' | 'system' | 'assistant'; content: string }>
  ): Promise<{ response: string; provider: string }> {
    const errors: Array<{ provider: string; error: string }> = [];

    for (const provider of this.providers) {
      try {
        console.log(`Trying ${provider.name}...`);

        const response = await provider.client.chat.completions.create({
          model: provider.model,
          messages,
        });

        const content = response.choices[0].message.content || '';
        console.log(`Success with ${provider.name}`);

        return {
          response: content,
          provider: provider.name,
        };
      } catch (error) {
        const errorMsg = error instanceof Error ? error.message : String(error);
        console.log(`${provider.name} failed: ${errorMsg}`);
        errors.push({ provider: provider.name, error: errorMsg });
      }
    }

    throw new Error(
      `All providers failed:\n${errors.map((e) => `${e.provider}: ${e.error}`).join('\n')}`
    );
  }
}

// Usage
const chain = new FallbackChain([
  {
    name: 'Ollama (local)',
    client: new OpenAI({ baseURL: 'http://localhost:11434/v1', apiKey: 'ollama' }),
    model: 'llama3.2',
  },
  {
    name: 'Groq',
    client: new OpenAI({
      baseURL: 'https://api.groq.com/openai/v1',
      apiKey: process.env.GROQ_API_KEY || '',
    }),
    model: 'llama-3.1-70b-versatile',
  },
  {
    name: 'OpenAI',
    client: new OpenAI({ apiKey: process.env.OPENAI_API_KEY || '' }),
    model: 'gpt-4o-mini',
  },
]);

const result = await chain.chat([{ role: 'user', content: 'What is TypeScript?' }]);

console.log(`\nUsed: ${result.provider}`);
console.log(`Response: ${result.response}`);

Key Takeaways

  1. Ollama: The easiest way to run open-source models locally
  2. Model Variety: Llama, Mistral, CodeLlama each have strengths
  3. OpenAI-Compatible: Most services use OpenAI-compatible APIs
  4. Hosted Options: Together AI, Groq, Hugging Face host models for you
  5. Trade-offs: Balance privacy, cost, quality, and maintenance
  6. Fallbacks: Use multiple providers for reliability
  7. Choose Wisely: Match the model to your specific requirements

Resources

Resource       Type       Description
Ollama         Tool       Local model runner
Hugging Face   Platform   Model hub and hosting
Together AI    Platform   Fast inference API
Groq           Platform   Ultra-fast inference
LM Studio      Tool       GUI for local models

Next Lesson

You have learned about open-source models and how to run them. In the next lesson, you will learn how to compare and choose between all the providers we have covered, with a practical decision framework.

Continue to Lesson 4.5: Comparing and Choosing Providers