[Bug]: Context size ignored on PDF analysis & Model load failure after restart (qwen3-vl-4b-instruct / llama-cpp / cuda12-llama-cpp) #7426

@heinsenberg82

LocalAI version: v3.8.0 (c0d1d02)

Environment, CPU architecture, OS, and Version: Linux openmediavault 6.12.9+bpo-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.12.9-1~bpo12+1 (2025-01-19) x86_64 GNU/Linux

❯ nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.247.01             Driver Version: 535.247.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2070        Off | 00000000:02:00.0 Off |                  N/A |
| 29%   40C    P8              15W / 175W |   1437MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   1726766      C   /usr/bin/python3                           1434MiB |
+---------------------------------------------------------------------------------------+

Description
I just installed LocalAI and am running into problems using the qwen3-vl-4b-instruct model to analyze small PDF documents. There are two distinct failure modes:

  1. Context Size Error: When the model loads successfully, attempting to summarize a small PDF results in an immediate error stating the request exceeds the context size. This happens regardless of the context_size defined in the YAML.
  2. Load Failure after Restart: If I restart the container and try to use the model again, it fails to load entirely with a "Canceled" RPC error.

I have explicitly tried setting the backend to both llama-cpp and cuda12-llama-cpp in the YAML configuration, but the issues persist.
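For reference, a minimal model YAML of the kind described above might look like the sketch below. The reporter's actual file was not shared, so the model file name and the exact values are assumptions; only the `backend` and `context_size` keys are the ones discussed in this report.

```yaml
# /models/qwen3-vl-4b-instruct.yaml -- illustrative sketch only; the
# model file name and context_size value here are assumptions, not the
# reporter's actual configuration.
name: qwen3-vl-4b-instruct
backend: cuda12-llama-cpp     # also tried: llama-cpp
context_size: 8192            # reportedly ignored when summarizing PDFs
parameters:
  model: qwen3-vl-4b-instruct-Q4_K_M.gguf
```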

Error Messages

  • Scenario 1 (Processing PDF):
    Internal error: rpc error: code = Internal desc = the request exceeds the available context size, try increasing it
    
  • Scenario 2 (After restarting container):
    Internal error: failed to load model with internal loader: could not load model: rpc error: code = Canceled desc =
    

Docker Compose:

services:
  localai:
    image: localai/localai:latest-gpu-nvidia-cuda-12
    container_name: localai
    hostname: localai
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/readyz"]
      interval: 1m
      timeout: 20m
      retries: 5
    networks:
      - all-services_default
    ports:
      - 8079:8080
    environment:
      - DEBUG=true
      - LOCALAI_SINGLE_ACTIVE_BACKEND=true
    volumes:
      - ${DOCKER_DATA_PATH}/config/localai/models:/models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
