max_pixels parameter ignored when loading Qwen2VL/Qwen3VL image processors via from_pretrained() #41955

@ReinforcedKnowledge

Description

Bug description

When loading a Qwen2VL or Qwen3VL image processor via from_pretrained() with the max_pixels parameter, the parameter is accepted without error but silently ignored during image processing: the kwarg sets the max_pixels attribute, yet the resize logic reads size['longest_edge'], which keeps its default of 16,777,216 (see the reproduction output below). As a result, images are resized against the default cap instead of the user-specified value, producing significantly higher token counts than expected.
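Until this is fixed, a workaround that appears to take effect (a sketch, relying on the observation above that the resize path consults size['longest_edge'] rather than the max_pixels attribute):

from transformers import Qwen3VLProcessor

processor = Qwen3VLProcessor.from_pretrained('Qwen/Qwen3-VL-2B-Instruct')
# Passing max_pixels= to from_pretrained() is ignored; overriding the size
# dict directly targets the value the resize code actually reads.
processor.image_processor.size['longest_edge'] = 200_000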

System Info

  • transformers version: 4.57.1 (and likely all previous versions with Qwen2VL/Qwen3VL support)
  • Affected models: all models that rely on Qwen2VL's image processor (so Qwen2VL and Qwen3VL)
  • CPython version: 3.13
  • Platform: Ubuntu 22.04.5 LTS

Who can help?

@ArthurZucker and @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Steps to reproduce

from transformers import Qwen3VLProcessor
from PIL import Image

# Load processor with custom max_pixels
processor = Qwen3VLProcessor.from_pretrained(
    'Qwen/Qwen3-VL-2B-Instruct',
    trust_remote_code=True,
    max_pixels=200_000  # Expect images to be resized to ~200k pixels
)

# Check the internal state
print(f"max_pixels attribute: {processor.image_processor.max_pixels}")
print(f"size['longest_edge']: {processor.image_processor.size['longest_edge']}")

# Process a 2000×2000 image (4M pixels)
test_image = Image.new('RGB', (2000, 2000), color='red')
print(f"Input image: {test_image.size[0]}×{test_image.size[1]} = {test_image.size[0] * test_image.size[1]:,} pixels")

# Run the full processing pipeline and inspect the output grid
messages = [{"role": "user", "content": [
    {"type": "image", "image": test_image},
    {"type": "text", "text": "Describe this image."}
]}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
result = processor(text=text, images=[test_image], return_tensors="pt")

# Check actual processed dimensions
grid_thw = result['image_grid_thw'][0]
temporal, height_patches, width_patches = grid_thw.tolist()
total_patches = temporal * height_patches * width_patches
effective_pixels = total_patches * 16 * 16  # Each patch is 16×16 pixels

print(f"Grid (T, H, W): {temporal}×{height_patches}×{width_patches}")
print(f"Total patches: {total_patches}")
print(f"Effective pixels: {effective_pixels:,}")

Expected behavior

  • max_pixels attribute: 200000
  • size['longest_edge']: 200000
  • Input image: 2000×2000 = 4,000,000 pixels
  • Grid (T, H, W): 1×26×26
  • Total patches: 676
  • Effective pixels: 173,056
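For reference, these expected numbers follow from the rounding logic of Qwen2VL's smart_resize, sketched below (paraphrased; the real implementation also enforces min_pixels and an aspect-ratio limit). The rounding factor is patch_size × merge_size = 16 × 2 = 32:

import math

def smart_resize_sketch(height, width, factor=32, max_pixels=200_000):
    # Snap each side to a multiple of `factor`, then rescale if over budget.
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    return h_bar, w_bar

# 2000×2000 with max_pixels=200_000 -> 416×416,
# i.e. a 26×26 grid of 16×16 patches = 173,056 effective pixels
print(smart_resize_sketch(2000, 2000))  # (416, 416)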

Actual behavior

  • max_pixels attribute: 200000
  • size['longest_edge']: 16777216
  • Input image: 2000×2000 = 4,000,000 pixels
  • Grid (T, H, W): 1×124×124
  • Total patches: 15376
  • Effective pixels: 3,936,256
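These actual numbers are consistent with the default cap being applied: the 4,000,000-pixel input is below 16,777,216, so the image is only snapped down to a multiple of 32 (2000 → 1984), giving 1984 / 16 = 124 patches per side and 1984² = 3,936,256 effective pixels.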

Related PR

#41954
