Performance Optimization for Ollama¶
Maximize the speed and efficiency of your local AI models with hardware optimization, configuration tuning, and workflow improvements.
Hardware Optimization¶
CPU Performance¶
# Check CPU capabilities
lscpu | grep -E "(Model name|Socket|Core|Thread)"
# Verify CPU features for AI workloads
lscpu | grep -E "(avx|avx2|avx512)"
# AVX2/AVX512 provide significant speedups
# Monitor CPU usage during inference
htop
# Look for high CPU utilization across all cores
Memory Optimization¶
# Check available memory
free -h
# Verify system limits
ulimit -v # Virtual memory limit
sysctl vm.max_map_count # Memory mapping limit
# Monitor memory usage per model
ps aux | grep ollama
# Note RSS (resident memory) usage
# Optimize memory allocation
echo 'vm.swappiness=10' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
GPU Acceleration Setup¶
NVIDIA GPU (CUDA)
# Install NVIDIA drivers
sudo apt update
sudo apt install nvidia-driver-535 # a recent stable branch; run ubuntu-drivers devices to see the recommended version
# Install CUDA toolkit
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt update
sudo apt install cuda
# Verify CUDA installation
nvidia-smi
nvcc --version
# Test GPU acceleration
ollama run codellama:7b --verbose
# --verbose prints token timing stats; watch nvidia-smi in another
# terminal to confirm the GPU is actually being used
AMD GPU (ROCm)
# Install ROCm (Linux)
wget https://repo.radeon.com/amdgpu-install/5.6/ubuntu/jammy/amdgpu-install_5.6.50600-1_all.deb
sudo apt install ./amdgpu-install_5.6.50600-1_all.deb
sudo amdgpu-install --usecase=rocm
# Verify ROCm
rocm-smi
Apple Silicon (Metal)
# macOS with Apple Silicon has built-in optimization
# Verify Metal support
system_profiler SPDisplaysDataType | grep -A2 "Metal"
# Monitor GPU usage
sudo powermetrics --samplers gpu_power -n 1
Model Performance Tuning¶
Memory Management¶
# Keep models loaded in memory
ollama run codellama:7b --keepalive 60m
# Model stays loaded for 60 minutes after last use
# Preload frequently used models
ollama pull codellama:7b
ollama pull mistral:7b
# Start background processes to keep them warm
(sleep 2; ollama run codellama:7b "warmup" >/dev/null) &
(sleep 5; ollama run mistral:7b "warmup" >/dev/null) &
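The same warm-keeping can be done over the API: /api/generate accepts a keep_alive field, and an empty prompt simply loads the model without generating anything. A minimal sketch:

# keep_warm.py - preload a model and keep it resident for 60 minutes
import requests

requests.post('http://localhost:11434/api/generate',
              json={'model': 'codellama:7b',
                    'prompt': '',  # empty prompt loads the model without generating
                    'keep_alive': '60m',
                    'stream': False})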
Context Window Optimization¶
# num_ctx is not an `ollama run` flag; set it per-request via the API,
# per-session with /set, or permanently in a Modelfile

# Reduce context for faster responses (API request)
curl -s http://localhost:11434/api/generate -d '{
  "model": "codellama:7b", "prompt": "Quick code snippet",
  "options": {"num_ctx": 1024}, "stream": false
}'
# The default is 2048 in many builds; a smaller window cuts memory use
# and prompt-processing time

# Increase context for complex tasks (interactive session)
ollama run codellama:7b
# >>> /set parameter num_ctx 4096
# More context = better understanding, slower speed

# Find the optimal context size for your use case
for ctx in 512 1024 2048 4096; do
  echo "Context: $ctx"
  time curl -s http://localhost:11434/api/generate -d "{
    \"model\": \"codellama:7b\", \"prompt\": \"def fibonacci(n):\",
    \"options\": {\"num_ctx\": $ctx}, \"stream\": false
  }" >/dev/null
done
Batch Processing¶
# Process multiple requests efficiently
# Instead of sequential calls:
ollama run codellama:7b "Review file1.py" < file1.py
ollama run codellama:7b "Review file2.py" < file2.py
# Use batch processing:
cat file1.py file2.py | ollama run codellama:7b "Review these Python files"
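When the server is configured with OLLAMA_NUM_PARALLEL greater than 1 (see the settings later in this guide), independent prompts can also be issued concurrently over the API. A sketch using a thread pool; the file names are placeholders:

# parallel_review.py - send independent prompts concurrently
from concurrent.futures import ThreadPoolExecutor
import requests

def review(path):
    with open(path) as f:
        code = f.read()
    r = requests.post('http://localhost:11434/api/generate',
                      json={'model': 'codellama:7b',
                            'prompt': f"Review this Python file:\n{code}",
                            'stream': False})
    return path, r.json().get('response', '')

# file1.py / file2.py stand in for your own files
with ThreadPoolExecutor(max_workers=2) as pool:
    for path, review_text in pool.map(review, ['file1.py', 'file2.py']):
        print(f"=== {path} ===\n{review_text}\n")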
System-Level Optimizations¶
Process Priority¶
# Increase Ollama priority
sudo renice -n -5 -p $(pgrep ollama)
# Set CPU affinity for better performance
for pid in $(pgrep ollama); do sudo taskset -cp 0-7 "$pid"; done # use the first 8 cores
# Persistent priority settings (applies to login sessions)
echo 'ollama soft priority -5' | sudo tee -a /etc/security/limits.conf
# If Ollama runs as a systemd service, set Nice=-5 in a drop-in
# override instead (sudo systemctl edit ollama)
Storage Optimization¶
# Move models to the fastest storage available
# Models are stored under ~/.ollama/models by default (or $OLLAMA_MODELS)
ls ~/.ollama/models
# Move to SSD if needed
sudo mkdir -p /fast-ssd/ollama
sudo chown $USER:$USER /fast-ssd/ollama
export OLLAMA_MODELS="/fast-ssd/ollama"
# OLLAMA_MODELS must be visible to the server process; restart Ollama after changing it
# Verify faster loading
time ollama run codellama:7b "test" >/dev/null
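To confirm the move actually sped up loading, the non-streaming generate response reports load_duration in nanoseconds; a quick sketch (restart the Ollama server between runs to force a cold load):

# check_load_time.py - read the model load time from the API response
import requests

data = requests.post('http://localhost:11434/api/generate',
                     json={'model': 'codellama:7b',
                           'prompt': 'test', 'stream': False}).json()
print(f"load_duration: {data['load_duration'] / 1e9:.2f}s")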
Network Optimization (API calls)¶
# Optimize local network stack
sudo sysctl -w net.core.rmem_max=16777216
sudo sysctl -w net.core.wmem_max=16777216
sudo sysctl -w net.ipv4.tcp_rmem="4096 12582912 16777216"
sudo sysctl -w net.ipv4.tcp_wmem="4096 12582912 16777216"
# Note: stock Ollama listens on TCP only; OLLAMA_HOST takes host:port
export OLLAMA_HOST=127.0.0.1:11434
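For API clients, what usually matters more than kernel tuning is connection reuse: a requests.Session keeps one TCP connection open across calls instead of re-handshaking per request. A minimal sketch with placeholder prompts:

# session_reuse.py - keep one TCP connection open across API calls
import requests

session = requests.Session()  # reuses the underlying connection

for prompt in ["def add(a, b):", "def sub(a, b):"]:
    r = session.post('http://localhost:11434/api/generate',
                     json={'model': 'codellama:7b',
                           'prompt': prompt, 'stream': False})
    print(r.json().get('response', '')[:60])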
Model-Specific Optimizations¶
Choose Right Model Size¶
# Performance benchmarking script
import time
import requests

def benchmark_model(model, prompt, iterations=5):
    times = []
    response = None
    for _ in range(iterations):
        start = time.time()
        response = requests.post('http://localhost:11434/api/generate',
                                 json={
                                     'model': model,
                                     'prompt': prompt,
                                     'stream': False
                                 })
        times.append(time.time() - start)
    avg_time = sum(times) / len(times)
    text = response.json().get('response', '')
    tokens = len(text.split())  # rough whitespace-based token count
    return {
        'model': model,
        'avg_time': avg_time,
        'tokens_per_second': tokens / avg_time if avg_time > 0 else 0,
        'response_quality': len(text)  # proxy: response length in chars
    }

# Test different model sizes
models = ['codellama:7b', 'codellama:13b', 'codellama:34b']
prompt = "Write a Python function to sort a list of dictionaries by a key"

print("Model Performance Comparison:")
print("-" * 60)
for model in models:
    try:
        result = benchmark_model(model, prompt)
        print(f"{result['model']:15} | "
              f"{result['avg_time']:6.2f}s | "
              f"{result['tokens_per_second']:6.1f} tok/s | "
              f"{result['response_quality']:4d} chars")
    except Exception as e:
        print(f"{model:15} | Error: {e}")
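Whitespace-splitting only approximates token counts. The non-streaming /api/generate response carries Ollama's own metrics (eval_count, eval_duration, load_duration; durations are in nanoseconds), which give an exact generation rate. A small sketch:

# api_metrics.py - use Ollama's own timing metadata
import requests

data = requests.post('http://localhost:11434/api/generate',
                     json={'model': 'codellama:7b',
                           'prompt': 'def fibonacci(n):',
                           'stream': False}).json()

# eval_count = tokens generated; eval_duration is in nanoseconds
tok_per_s = data['eval_count'] / (data['eval_duration'] / 1e9)
print(f"load: {data['load_duration'] / 1e9:.2f}s, "
      f"generation: {tok_per_s:.1f} tok/s")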
Quantization Settings¶
# Use different quantization levels to trade speed for quality
# q4_0: fastest, lowest quality
# q4_1: balanced
# q5_0: better quality, slower
# q5_1: best quality for the size
# Newer builds often default to K-quants (e.g. q4_K_M), which improve
# quality at a similar size
# Check the quantization of an installed model
ollama show codellama:7b # look for the "quantization" line in the output
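The quantization level can also be read programmatically from the /api/show endpoint; a short sketch, assuming the details fields exposed by current Ollama releases:

# show_quantization.py - query model metadata over the local API
import requests

info = requests.post('http://localhost:11434/api/show',
                     json={'name': 'codellama:7b'}).json()
details = info.get('details', {})
print(details.get('quantization_level'),  # e.g. "Q4_0"
      details.get('parameter_size'))      # e.g. "7B"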
Performance Monitoring¶
Real-Time Monitoring¶
# Monitor system resources during AI workload
# Terminal 1: Run model
ollama run codellama:13b
# Terminal 2: Monitor performance (mpstat requires the sysstat package)
watch -n 1 'echo "=== CPU ===" && \
mpstat 1 1 | grep -A1 Average && \
echo "=== Memory ===" && \
free -h && \
echo "=== GPU ===" && \
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv,noheader'
Automated Performance Testing¶
#!/usr/bin/env python3
# performance_monitor.py
import psutil
import requests
import time
import subprocess

class OllamaPerformanceMonitor:
    def __init__(self):
        self.base_url = "http://localhost:11434"

    def get_system_metrics(self):
        """Get current system performance metrics"""
        cpu_percent = psutil.cpu_percent(interval=1)
        memory = psutil.virtual_memory()
        # GPU metrics (if NVIDIA)
        try:
            gpu_output = subprocess.check_output([
                'nvidia-smi', '--query-gpu=utilization.gpu,memory.used',
                '--format=csv,noheader,nounits'
            ], text=True)
            # first GPU only; extend for multi-GPU setups
            gpu_util, gpu_mem = map(int, gpu_output.splitlines()[0].split(', '))
        except (OSError, subprocess.CalledProcessError, ValueError):
            gpu_util, gpu_mem = 0, 0
        return {
            'cpu_percent': cpu_percent,
            'memory_percent': memory.percent,
            'memory_used_gb': memory.used / (1024**3),
            'gpu_utilization': gpu_util,
            'gpu_memory_mb': gpu_mem
        }

    def benchmark_inference(self, model, prompt):
        """Benchmark a single inference"""
        start_metrics = self.get_system_metrics()
        start_time = time.time()
        response = requests.post(f"{self.base_url}/api/generate",
                                 json={
                                     'model': model,
                                     'prompt': prompt,
                                     'stream': False
                                 })
        end_time = time.time()
        end_metrics = self.get_system_metrics()
        inference_time = end_time - start_time
        response_text = response.json().get('response', '')
        tokens = len(response_text.split())
        return {
            'model': model,
            'inference_time': inference_time,
            'tokens_generated': tokens,
            'tokens_per_second': tokens / inference_time if inference_time > 0 else 0,
            'peak_cpu': max(start_metrics['cpu_percent'], end_metrics['cpu_percent']),
            'peak_gpu': max(start_metrics['gpu_utilization'], end_metrics['gpu_utilization']),
            'memory_used': end_metrics['memory_used_gb'],
            'response_length': len(response_text)
        }

    def run_comprehensive_benchmark(self):
        """Run benchmarks across different scenarios"""
        test_cases = [
            ("Simple function", "def hello_world():"),
            ("Complex algorithm", "Write a Python implementation of quicksort with comments"),
            ("Code review", "Review this function for bugs:\ndef divide(a, b):\n    return a / b"),
            ("Documentation", "Generate docstring for a REST API endpoint"),
        ]
        models = ['codellama:7b', 'codellama:13b']
        results = []
        for model in models:
            print(f"\nTesting {model}...")
            for test_name, prompt in test_cases:
                try:
                    result = self.benchmark_inference(model, prompt)
                    result['test_case'] = test_name
                    results.append(result)
                    print(f"  {test_name}: {result['tokens_per_second']:.1f} tok/s")
                except Exception as e:
                    print(f"  {test_name}: Error - {e}")
        return results

    def generate_report(self, results):
        """Generate performance report"""
        report = "# Ollama Performance Report\n\n"
        for model in set(r['model'] for r in results):
            model_results = [r for r in results if r['model'] == model]
            avg_speed = sum(r['tokens_per_second'] for r in model_results) / len(model_results)
            avg_cpu = sum(r['peak_cpu'] for r in model_results) / len(model_results)
            report += f"## {model}\n"
            report += f"- Average Speed: {avg_speed:.1f} tokens/second\n"
            report += f"- Average CPU Usage: {avg_cpu:.1f}%\n"
            report += f"- Memory Usage: {model_results[0]['memory_used']:.1f}GB\n\n"
            for result in model_results:
                report += f"### {result['test_case']}\n"
                report += f"- Speed: {result['tokens_per_second']:.1f} tok/s\n"
                report += f"- Response Length: {result['response_length']} chars\n"
                report += f"- Inference Time: {result['inference_time']:.2f}s\n\n"
        return report

# Usage
if __name__ == "__main__":
    monitor = OllamaPerformanceMonitor()
    results = monitor.run_comprehensive_benchmark()
    report = monitor.generate_report(results)
    with open('performance_report.md', 'w') as f:
        f.write(report)
    print("\nPerformance report saved to performance_report.md")
Optimization Strategies by Use Case¶
Interactive Development¶
# Optimize for low latency (these are server-side variables; restart
# ollama serve after changing them)
export OLLAMA_NUM_PARALLEL=1 # Single request processing
export OLLAMA_MAX_LOADED_MODELS=2 # Keep 2 models loaded
export OLLAMA_FLASH_ATTENTION=1 # Enable flash attention
# Use fast models for interactive work
alias code-help="ollama run codellama:7b"
alias quick-review="ollama run mistral:7b"
Batch Processing¶
# Optimize for throughput
export OLLAMA_NUM_PARALLEL=4 # Process multiple requests
export OLLAMA_MAX_LOADED_MODELS=1 # Focus resources on one model
# Use larger models for better quality
alias batch-process="ollama run codellama:13b"
Background Processing¶
# Optimize for efficiency
export OLLAMA_NUM_PARALLEL=2
export OLLAMA_FLASH_ATTENTION=1
# Lower priority to not interfere with other work
nice -n 10 ollama run codellama:7b
Workflow Optimizations¶
Smart Model Selection¶
# model_selector.py
import re

def select_optimal_model(prompt, context_length=0):
    """Select the best model based on task complexity"""
    # Simple tasks - use the fastest model
    simple_patterns = [
        r"def \w+\(",         # Function definition
        r"fix.*bug",          # Simple bug fix
        r"explain.*briefly",  # Brief explanation
    ]
    # Complex tasks - use a larger model
    complex_patterns = [
        r"design.*architecture",  # System design
        r"review.*security",      # Security review
        r"optimize.*performance", # Performance optimization
    ]
    prompt_lower = prompt.lower()
    if any(re.search(pattern, prompt_lower) for pattern in complex_patterns):
        return "codellama:13b"
    elif context_length > 2000:
        return "codellama:13b"  # Large context needs a bigger model
    elif any(re.search(pattern, prompt_lower) for pattern in simple_patterns):
        return "codellama:7b"
    else:
        return "codellama:7b"  # Default to the fast model

# Usage
prompt = "def calculate_fibonacci(n):"
model = select_optimal_model(prompt)
print(f"Using {model} for: {prompt}")
Caching and Reuse¶
# smart_cache.py
import hashlib
import json
import os
import requests

class OllamaCache:
    def __init__(self, cache_dir="~/.ollama_cache"):
        self.cache_dir = os.path.expanduser(cache_dir)
        os.makedirs(self.cache_dir, exist_ok=True)

    def get_cache_key(self, model, prompt):
        """Generate a cache key for a prompt"""
        content = f"{model}:{prompt}"
        return hashlib.sha256(content.encode()).hexdigest()

    def get(self, model, prompt):
        """Get a cached response if available"""
        cache_key = self.get_cache_key(model, prompt)
        cache_file = os.path.join(self.cache_dir, f"{cache_key}.json")
        if os.path.exists(cache_file):
            with open(cache_file, 'r') as f:
                return json.load(f)['response']
        return None

    def set(self, model, prompt, response):
        """Cache a response"""
        cache_key = self.get_cache_key(model, prompt)
        cache_file = os.path.join(self.cache_dir, f"{cache_key}.json")
        with open(cache_file, 'w') as f:
            json.dump({
                'model': model,
                'prompt': prompt,
                'response': response
            }, f)

# Usage
cache = OllamaCache()

def cached_inference(model, prompt):
    # Check the cache first
    cached_response = cache.get(model, prompt)
    if cached_response:
        print("Using cached response")
        return cached_response
    # Generate a new response
    response = requests.post('http://localhost:11434/api/generate',
                             json={'model': model, 'prompt': prompt, 'stream': False})
    result = response.json()['response']
    # Cache for future use
    cache.set(model, prompt, result)
    return result
Hardware Recommendations¶
Budget Setup (8GB RAM)¶
# Optimal configuration for limited resources
# Use 7B models only
ollama pull codellama:7b
ollama pull mistral:7b
# Optimize for single model usage
export OLLAMA_MAX_LOADED_MODELS=1
export OLLAMA_NUM_PARALLEL=1
# Monitor memory closely
watch -n 1 free -h
Performance Setup (16GB+ RAM)¶
# Balanced configuration
ollama pull codellama:13b # Primary coding model
ollama pull mistral:7b # Fast general model
ollama pull codellama:7b # Fallback/quick tasks
export OLLAMA_MAX_LOADED_MODELS=2
export OLLAMA_NUM_PARALLEL=2
Workstation Setup (32GB+ RAM + GPU)¶
# High-performance configuration
ollama pull codellama:34b # Best quality coding
ollama pull codellama:13b # Balanced option
ollama pull mixtral:8x7b # Complex reasoning
ollama pull mistral:7b # Fast responses
export OLLAMA_MAX_LOADED_MODELS=3
export OLLAMA_NUM_PARALLEL=4
export OLLAMA_FLASH_ATTENTION=1
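As a rough sanity check, a q4-quantized model needs on the order of 0.5-0.7 GB of RAM per billion parameters plus context overhead. The helper below encodes the RAM tiers from this section; the thresholds are the same rules of thumb, not measured limits:

# recommend_models.py - rough model-size guidance from total RAM
import psutil

gb = psutil.virtual_memory().total / (1024**3)

# thresholds are rough assumptions for q4-quantized models
if gb >= 32:
    tier = ['codellama:34b', 'mixtral:8x7b', 'codellama:13b']
elif gb >= 16:
    tier = ['codellama:13b', 'mistral:7b']
else:
    tier = ['codellama:7b', 'mistral:7b']

print(f"{gb:.0f}GB RAM -> suggested models: {', '.join(tier)}")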
Troubleshooting Performance Issues¶
Slow Inference¶
# Check model size vs available memory
ollama list # the SIZE column shows each model's disk footprint
free -h
# Monitor during inference
htop # CPU usage
iotop # Disk I/O
nvidia-smi # GPU usage
# Common fixes:
# 1. Use a smaller model
ollama run codellama:7b # instead of codellama:34b
# 2. Reduce the context window (num_ctx is not a CLI flag; set it
#    per-session with /set parameter num_ctx 1024, or via the API)
# 3. Close unnecessary applications
# 4. Upgrade RAM if consistently swapping
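A programmatic version of the size-vs-memory check above, using the /api/tags listing (size is bytes on disk, a lower bound on the memory needed):

# fits_in_ram.py - flag installed models larger than available memory
import psutil
import requests

avail = psutil.virtual_memory().available
models = requests.get('http://localhost:11434/api/tags').json()['models']
for m in models:
    flag = "OK" if m['size'] < avail else "LIKELY TO SWAP"
    print(f"{m['name']:20} {m['size'] / (1024**3):5.1f}GB  {flag}")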
Memory Issues¶
# Monitor memory pressure
cat /proc/meminfo | grep -E "(MemTotal|MemFree|MemAvailable|SwapTotal|SwapFree)"
# Clear system caches (flush dirty pages first)
sync && sudo sysctl -w vm.drop_caches=3
# Restart Ollama service
sudo systemctl restart ollama
GPU Not Used¶
# Verify GPU detection
nvidia-smi
# Check CUDA installation
nvcc --version
# Reinstall Ollama with GPU support
curl -fsSL https://ollama.ai/install.sh | sh
# Force a specific GPU (CUDA_VISIBLE_DEVICES must be set for the
# server process; restart it afterwards)
export CUDA_VISIBLE_DEVICES=0
ollama run codellama:7b --verbose
Related Resources¶
- Model Selection - Choose the right model for your hardware
- Privacy Setup - Secure configuration practices
- API Integration - Optimize API calls
- Ollama Installation - Complete setup guide