
OLOL - Ollama Load Balancing and Clustering

A distributed inference system that allows you to build a powerful multi-host cluster for Ollama AI models with transparent scaling and fault tolerance.

OLOL (Ollama Load Balancer) is a Python package providing gRPC interfaces with both synchronous and asynchronous support for distributed inference across multiple Ollama instances.


Overview

This system provides a unified API endpoint that transparently distributes inference requests across multiple Ollama instances running on different hosts. It maintains compatibility with the Ollama API while adding clustering capabilities.

Key Features

  • Transparent API Compatibility: Drop-in replacement for the Ollama API
  • Automatic Load Balancing: Distributes requests across available servers
  • Model Awareness: Routes requests to servers that have the requested model
  • Session Affinity: Maintains chat history consistently across requests
  • Redundancy: Can pull models to multiple servers for high availability
  • Monitoring: Built-in status endpoint to monitor the cluster
  • Distributed Inference: Automatically splits large models across multiple servers for faster inference

Architecture

The system consists of these main components:

  1. gRPC Server: Runs on each inference host with local Ollama installed
  2. API Proxy Client: Provides a unified Ollama API endpoint for applications
  3. RPC Server: Enables distributed inference by splitting model layers across servers
  4. Inference Coordinator: Manages distributed inference and model partitioning
  5. Protocol Buffer Definition: Defines the communication contract

Distributed Inference

OLOL supports distributed inference, which allows you to split large models across multiple servers:

  • Layer Partitioning: Automatically splits model layers across available servers
  • Auto-Detection: Automatically uses distributed inference for large models (13B+)
  • Hardware Optimization: Allocates model layers based on server hardware capabilities
  • Transparent API: No changes required to your client code
  • Advanced Options: Fine-tune distribution with API options

Distributed inference is particularly useful for:

  • Running very large models (>13B parameters) across multiple smaller machines
  • Speeding up inference by parallelizing model computation
  • Serving models too large to fit in a single machine's memory (see the example below)
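
The distributed path is exposed through the same Ollama-compatible HTTP API served by the proxy. As a minimal sketch, the Python snippet below requests distributed inference explicitly via the "distributed" option shown in the HTTP API section later in this README; the proxy address (localhost:8000) comes from the Quick Start, and the model name is only an example.

import json

import requests

# Assumes an OLOL proxy running on localhost:8000 (see Quick Start below).
resp = requests.post(
    "http://localhost:8000/api/generate",
    json={
        "model": "llama2:13b",
        "prompt": "Write a poem about distributed computing",
        # Request distributed inference explicitly; for 13B+ models the
        # proxy should also select it automatically.
        "options": {"distributed": True},
    },
    stream=True,
)
for line in resp.iter_lines():
    if not line:
        continue
    chunk = json.loads(line)  # the Ollama API streams one JSON object per line
    print(chunk.get("response", ""), end="", flush=True)
    if chunk.get("done"):
        break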

Model Quantization

OLOL intelligently handles model quantization:

  • Smart Quantization: Automatically selects the best quantization level based on hardware and model size
  • Compatibility Detection: Checks if a compatible quantization is already loaded
  • On-Demand Loading: Can load models with appropriate quantization when needed
  • Quantization Fallbacks: Can serve a request for a lower-quality quantization from a higher-quality quantization that is already loaded

Auto-Discovery

OLOL includes an auto-discovery system for zero-configuration clustering:

  • Server Auto-Registration: RPC servers automatically find and register with the proxy
  • Capability Broadcasting: Servers advertise their hardware capabilities and device types
  • Dynamic Scaling: New servers are automatically added to the cluster when they come online
  • Subnet Detection: Servers automatically scan the local subnet to find the proxy (the general idea is sketched below)
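
OLOL's actual discovery protocol is not specified in this README. As a purely illustrative sketch of the "scan the subnet" idea, the snippet below probes every host on a /24 network for something listening on the proxy's HTTP port (8000, from the Quick Start); the subnet, port, and timeout values are assumptions, not OLOL's real wire protocol.

import socket

def find_proxy(subnet: str = "192.168.1", port: int = 8000, timeout: float = 0.2) -> str | None:
    """Return the first host on the /24 subnet accepting connections on the given port."""
    for host_id in range(1, 255):
        addr = f"{subnet}.{host_id}"
        try:
            with socket.create_connection((addr, port), timeout=timeout):
                return addr  # a listener answered; treat it as the proxy candidate
        except OSError:
            continue  # port closed or host unreachable; keep scanning
    return None

print(find_proxy())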

Quantization compatibility rules:

  • q4_0 (smallest memory usage): Compatible with models loaded as q4_0, q4_1, q5_0, q5_1, or q8_0
  • q5_0 (balanced): Compatible with models loaded as q5_0, q5_1, or q8_0
  • q8_0 (highest quality): Only compatible with q8_0
  • f16 (unquantized): Only compatible with f16

When the requested quantization isn't available, OLOL will (see the sketch after this list):

  1. Try to find a model with compatible (higher-quality) quantization
  2. Try to load the model with the requested quantization
  3. If that fails, load with the best quantization for the available hardware
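
The fallback behaviour can be summarised in a few lines of Python. This is an illustrative restatement of the rules above, not OLOL's actual implementation; the function name and the hardware_best parameter are made up for the example, and the real load attempts in steps 2 and 3 are elided.

# Compatibility table from the rules above: a request for the key can be
# served by any quantization in the value set.
COMPATIBLE = {
    "q4_0": {"q4_0", "q4_1", "q5_0", "q5_1", "q8_0"},
    "q5_0": {"q5_0", "q5_1", "q8_0"},
    "q8_0": {"q8_0"},
    "f16": {"f16"},
}

def pick_quantization(requested: str, loaded: list[str], hardware_best: str) -> str:
    # 1. Reuse an already-loaded model with a compatible (equal or higher quality) quantization.
    for quant in loaded:
        if quant in COMPATIBLE.get(requested, set()):
            return quant
    # 2. Otherwise try the requested quantization itself...
    if requested in COMPATIBLE:
        return requested
    # 3. ...and fall back to the best level the available hardware supports.
    return hardware_best

print(pick_quantization("q4_0", loaded=["q8_0"], hardware_best="q5_0"))  # -> q8_0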

System Architecture

flowchart TD
    Client[Client Application] -->|HTTP API Requests| Proxy[API Proxy]
    
    subgraph Load Balancer
        Proxy -->|Model Registry| Registry[Model Registry]
        Proxy -->|Session Tracking| Sessions[Session Manager]
        Proxy -->|Server Monitoring| Monitor[Server Monitor]
    end
    
    Registry -.->|Updates| Proxy
    Sessions -.->|State| Proxy
    Monitor -.->|Status Updates| Proxy
    
    Proxy -->|gRPC| Server1[Inference Server 1]
    Proxy -->|gRPC| Server2[Inference Server 2]
    Proxy -->|gRPC| Server3[Inference Server 3]
    
    subgraph "Inference Server 1"
        Server1 -->|CLI Commands| Ollama1[Ollama]
        Ollama1 -->|Local Models| ModelDir1[Model Storage]
    end
    
    subgraph "Inference Server 2"
        Server2 -->|CLI Commands| Ollama2[Ollama]
        Ollama2 -->|Local Models| ModelDir2[Model Storage]
    end
    
    subgraph "Inference Server 3" 
        Server3 -->|CLI Commands| Ollama3[Ollama]
        Ollama3 -->|Local Models| ModelDir3[Model Storage]
    end
    
    class Client,Proxy,Registry,Sessions,Monitor,Server1,Server2,Server3 componentNode;
    class Ollama1,Ollama2,Ollama3,ModelDir1,ModelDir2,ModelDir3 resourceNode;
    
    classDef componentNode fill:#b3e0ff,stroke:#9cc,stroke-width:2px;
    classDef resourceNode fill:#ffe6cc,stroke:#d79b00,stroke-width:1px;

Entity Relationship Model

erDiagram
    APIProxy ||--o{ InferenceServer : "routes-requests-to"
    APIProxy ||--o{ Session : "manages"
    APIProxy ||--o{ ModelRegistry : "maintains"
    APIProxy ||--o{ LoadBalancer : "uses"
    InferenceServer ||--o{ Model : "hosts"
    InferenceServer ||--o{ Session : "maintains-state-for"
    InferenceServer ||--o{ Metrics : "generates"
    Session }o--|| Model : "uses"
    Model }|--|| ModelRegistry : "registered-in"
    Client }|--o{ APIProxy : "connects-to"
    Session }o--o{ ChatHistory : "contains"
    LoadBalancer }o--|| HealthCheck : "performs"
    
    APIProxy {
        string host
        int port
        array servers
        object session_map
        object model_map
        int max_workers
        bool async_mode
    }
    
    InferenceServer {
        string host
        int port
        int current_load
        bool online
        array loaded_models
        array active_sessions
        float cpu_usage
        float memory_usage
        int gpu_memory
        timestamp last_heartbeat
    }
    
    Model {
        string name
        string tag
        string family
        int size_mb
        string digest
        array compatible_servers
        json parameters
        float quantization
        string architecture
    }
    
    Session {
        string session_id
        string model_name
        array messages
        timestamp created_at
        timestamp last_active
        string server_host
        json model_parameters
        float timeout
        string status
    }
    
    ChatHistory {
        string role
        string content
        timestamp timestamp
        float temperature
        int tokens_used
        float completion_time
        json metadata
    }
    
    Client {
        string application_type
        string api_version
        string client_id
        json preferences
        timestamp connected_at
    }
    
    ModelRegistry {
        map model_to_servers
        int total_models
        timestamp last_updated
        json model_stats
        array pending_pulls
        json version_info
    }

    LoadBalancer {
        string algorithm
        int max_retries
        float timeout
        json server_weights
        bool sticky_sessions
        json routing_rules
    }

    Metrics {
        string server_id
        float response_time
        int requests_per_second
        float error_rate
        json resource_usage
        timestamp collected_at
    }

    HealthCheck {
        string check_id
        string status
        int interval_seconds
        timestamp last_check
        json error_details
        int consecutive_failures
    }

Request Flow Sequence

sequenceDiagram
    participant Client as Client Application
    participant Proxy as API Proxy
    participant Registry as Model Registry
    participant SessionMgr as Session Manager
    participant Server1 as Inference Server 1
    participant Server2 as Inference Server 2
    participant Ollama as Ollama CLI/HTTP
    
    Client->>+Proxy: POST /api/chat (model: llama2)
    Proxy->>+Registry: Find servers with llama2
    Registry-->>-Proxy: Server1 and Server2 available
    
    Proxy->>SessionMgr: Create/Get Session
    alt New Session
        SessionMgr-->>Proxy: New Session ID
        Note over Proxy,Server1: Select Server1 (lowest load)
        Proxy->>+Server1: CreateSession(session_id, llama2)
        Server1->>Ollama: ollama run llama2
        Server1-->>-Proxy: Session Created
    else Existing Session
        SessionMgr-->>Proxy: Existing Session on Server2
        Note over Proxy,Server2: Maintain Session Affinity
    end
    
    Proxy->>+Server1: ChatMessage(session_id, message)
    Server1->>Ollama: ollama run with history
    Ollama-->>Server1: Response
    Server1-->>-Proxy: Chat Response
    Proxy-->>-Client: JSON Response
    
    Note right of Client: Later: Model Update
    
    Client->>+Proxy: POST /api/pull (model: mistral)
    Proxy->>+Registry: Check model status
    Registry-->>-Proxy: Not available
    
    par Pull to Server1
        Proxy->>+Server1: PullModel("mistral")
        Server1->>Ollama: ollama pull mistral
        Server1-->>Proxy: Stream Progress
    and Pull to Server2
        Proxy->>+Server2: PullModel("mistral") 
        Server2->>Ollama: ollama pull mistral
        Server2-->>Proxy: Stream Progress
    end
    
    Server1-->>-Proxy: Pull Complete
    Server2-->>-Proxy: Pull Complete
    Proxy->>Registry: Update model->server map
    Proxy-->>-Client: Pull Complete Response

Installation

# Install from PyPI (once published)
uv pip install olol

# Install with extras
uv pip install "olol[proxy,async]"

# Development installation
git clone https://github.com/K2/olol.git
cd olol
uv pip install -e ".[dev]"

# Build and install from source
cd olol
./tools/build-simple.sh
uv pip install dist/olol-0.1.0-py3-none-any.whl

Quick Start

1. Start Ollama instances

Start multiple Ollama instances on different machines or ports.

2. Start the gRPC servers

# Start a synchronous server
olol server --host 0.0.0.0 --port 50051 --ollama-host http://localhost:11434

# Start an asynchronous server (on another machine)
olol server --host 0.0.0.0 --port 50052 --ollama-host http://localhost:11434 --async

3. Start the load balancing proxy

# Basic proxy with load balancing
olol proxy --host 0.0.0.0 --port 8000 --servers "192.168.1.10:50051,192.168.1.11:50051"

# Start with distributed inference enabled
olol proxy --host 0.0.0.0 --port 8000 --servers "192.168.1.10:50051,192.168.1.11:50051" --distributed

# With custom RPC servers for distributed inference
olol proxy --servers "192.168.1.10:50051,192.168.1.11:50051" --distributed --rpc-servers "192.168.1.10:50052,192.168.1.11:50052"

# Auto-discovery mode (will automatically find and add new servers)
olol proxy --distributed --discovery

# Specify network interface for multi-network-interface setups
olol proxy --distributed --interface 10.0.0.5

4. Set up distributed inference (optional)

For large models, you can shard the model across multiple servers for faster inference:

# Start RPC servers on each machine that will participate in distributed inference
olol rpc-server --host 0.0.0.0 --port 50052 --device auto

# With optimized settings for large models
olol rpc-server --device cuda --flash-attention --context-window 16384 --quantize q5_0

# Auto-discovery mode (servers will automatically find and register with proxies)
olol rpc-server --discovery

# Specify preferred network interface when multiple are available
olol rpc-server --device cuda --interface 192.168.1.10

# Testing distributed inference directly
olol dist --servers "192.168.1.10:50052,192.168.1.11:50052" --model llama2:13b --prompt "Hello, world!"

5. Use the client

# Test with the command-line client
olol client --host localhost --port 8000 --model llama2 --prompt "Hello, world!"

# Or use the async client
olol client --host localhost --port 8000 --model llama2 --prompt "Hello, world!" --async

Python API

Synchronous Client

from olol.sync import OllamaClient

client = OllamaClient(host="localhost", port=8000)
try:
    # Stream text generation
    for response in client.generate("llama2", "What is the capital of France?"):
        if not response.done:
            print(response.response, end="", flush=True)
        else:
            print(f"\nCompleted in {response.total_duration}ms")
finally:
    client.close()

Asynchronous Client

import asyncio
from olol.async import AsyncOllamaClient

async def main():
    client = AsyncOllamaClient(host="localhost", port=8000)
    try:
        # Stream text generation
        async for response in client.generate("llama2", "What is the capital of France?"):
            if not response.done:
                print(response.response, end="", flush=True)
            else:
                print(f"\nCompleted in {response.total_duration}ms")
    finally:
        await client.close()

asyncio.run(main())

HTTP API Usage

Once the proxy is running, connect your client applications to it using the standard Ollama API:

# Example: Chat with a model
curl -X POST http://localhost:8000/api/chat -d '{
  "model": "llama2",
  "messages": [{"role": "user", "content": "Hello, how are you?"}]
}'

# Using distributed inference explicitly
curl -X POST http://localhost:8000/api/generate -d '{
  "model": "llama2:13b",
  "prompt": "Write a poem about distributed computing",
  "options": {
    "distributed": true
  }
}'

Status Endpoint

Monitor the status of your cluster:

curl http://localhost:8000/api/status
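
The response schema isn't documented here, so a monitoring script can simply pretty-print whatever JSON the proxy returns. A minimal sketch, again assuming the Quick Start proxy on localhost:8000:

import json

import requests

status = requests.get("http://localhost:8000/api/status", timeout=5).json()
print(json.dumps(status, indent=2))  # schema is defined by the proxy; print it verbatim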

Configuration

Command-Line Interface

The main command-line interface accepts various arguments:

# Show available commands
olol --help

# Show options for a specific command
olol server --help
olol proxy --help

Direct Command Tools

OLOL also provides direct command tools that can be used with uv run:

# Start a proxy server
uv run olol-proxy --distributed --discovery

# Start an RPC server
uv run olol-rpc --device cuda --quantize q5_0 --context-window 8192

# Start a standard server
uv run olol-server --host 0.0.0.0 --port 50051

# Run distributed inference
uv run olol-dist --servers "server1:50052,server2:50052" --model llama2:13b --prompt "Hello!"

# Use the client
uv run olol-client --model llama2 --prompt "Tell me about distributed systems"

These command tools accept the same options as their corresponding olol commands.

Environment variables (a launcher sketch using them follows the lists below):

OLOL configuration:

  • OLLAMA_SERVERS: Comma-separated list of gRPC server addresses (default: "localhost:50051")
  • OLOL_PORT: HTTP port for the API proxy (default: 8000)
  • OLOL_LOG_LEVEL: Set logging level (default: INFO)

Ollama optimization settings:

  • OLLAMA_FLASH_ATTENTION: Enable FlashAttention for faster inference
  • OLLAMA_NUMA: Enable NUMA optimization if available
  • OLLAMA_KEEP_ALIVE: How long to keep models loaded (e.g., "1h")
  • OLLAMA_MEMORY_LOCK: Lock memory to prevent swapping
  • OLLAMA_LOAD_TIMEOUT: Longer timeout for loading large models
  • OLLAMA_QUANTIZE: Quantization level (e.g., "q8_0", "q5_0", "f16")
  • OLLAMA_CONTEXT_WINDOW: Default context window size (e.g., "8192", "16384")
  • OLLAMA_DEBUG: Enable debug mode with additional logging
  • OLLAMA_LOG_LEVEL: Set Ollama log level
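
As a small illustration of how these variables fit together, the launcher sketch below exports a few of them and then starts the proxy through the documented olol proxy command. The specific values are examples only.

import os
import subprocess

env = os.environ.copy()
env.update({
    "OLLAMA_SERVERS": "192.168.1.10:50051,192.168.1.11:50051",  # gRPC inference servers
    "OLOL_PORT": "8000",                                        # HTTP port for the proxy
    "OLOL_LOG_LEVEL": "DEBUG",
    "OLLAMA_KEEP_ALIVE": "1h",                                  # keep models loaded for an hour
    "OLLAMA_QUANTIZE": "q5_0",
})

# Start the proxy with the documented CLI; it inherits the variables above.
subprocess.run(["olol", "proxy", "--distributed"], env=env, check=True)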

Contributing

Contributions are welcome! Please check out our Contribution Guidelines.

License

This project is licensed under the MIT License - see the LICENSE file for details.
