Bug description
When loading Qwen2VL or Qwen3VL image processors using from_pretrained() with the max_pixels parameter, the parameter is accepted without error but silently ignored during image processing. This causes images to be resized using the default max_pixels=16,777,216 instead of the user-specified value, resulting in significantly higher token counts than expected.
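Workaround (for reference): writing the limit into `size` directly after loading appears to take effect, since the resize path seems to consult `size['longest_edge']` rather than the stored `max_pixels` attribute (see the reproduction output below). This is a sketch based on that observation, not a verified fix:

```python
from transformers import Qwen3VLProcessor

processor = Qwen3VLProcessor.from_pretrained(
    'Qwen/Qwen3-VL-2B-Instruct',
    trust_remote_code=True,
)

# Unverified workaround: set size["longest_edge"] directly, since the
# processor appears to read this value at call time rather than the
# (ignored) max_pixels attribute.
processor.image_processor.size["longest_edge"] = 200_000
```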
System Info
- transformers version: 4.57.1 (and likely all previous versions with Qwen2VL/Qwen3VL support)
- Affected models: all models that rely on Qwen2VL's image processor (so Qwen2VL and Qwen3VL)
- CPython version: 3.13
- Platform: Ubuntu 22.04.5 LTS
Who can help?
@ArthurZucker and @itazap
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Steps to reproduce
```python
from transformers import Qwen3VLProcessor
from PIL import Image

# Load processor with custom max_pixels
processor = Qwen3VLProcessor.from_pretrained(
    'Qwen/Qwen3-VL-2B-Instruct',
    trust_remote_code=True,
    max_pixels=200_000,  # Expect images to be resized to ~200k pixels
)

# Check the internal state
print(f"max_pixels attribute: {processor.image_processor.max_pixels}")
print(f"size['longest_edge']: {processor.image_processor.size['longest_edge']}")

# Create a 2000×2000 image (4M pixels)
test_image = Image.new('RGB', (2000, 2000), color='red')
print(f"Input image: {test_image.size[0]}×{test_image.size[1]} = {test_image.size[0] * test_image.size[1]:,} pixels")

# Build a chat request containing the image and run the processor
messages = [{"role": "user", "content": [
    {"type": "image", "image": test_image},
    {"type": "text", "text": "Describe this image."},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
result = processor(text=text, images=[test_image], return_tensors="pt")

# Check actual processed dimensions
grid_thw = result['image_grid_thw'][0]
temporal, height_patches, width_patches = grid_thw
total_patches = (temporal * height_patches * width_patches).item()
effective_pixels = total_patches * 16 * 16  # Each patch is 16×16 pixels
print(f"Grid (T, H, W): {temporal}×{height_patches}×{width_patches}")
print(f"Total patches: {total_patches}")
print(f"Effective pixels: {effective_pixels:,}")
```

Expected behavior:
- max_pixels attribute: 200000
- size['longest_edge']: 200000
- Input image: 2000×2000 = 4,000,000 pixels
- Grid (T, H, W): 1×26×26
- Total patches: 676
- Effective pixels: 173,056
Actual behavior:
- max_pixels attribute: 200000
- size['longest_edge']: 16777216
- Input image: 2000×2000 = 4,000,000 pixels
- Grid (T, H, W): 1×124×124
- Total patches: 15376
- Effective pixels: 3,936,256
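
For reference, both sets of grid numbers above can be reproduced with smart_resize-style rounding. This is a minimal sketch, assuming dimensions are rounded to multiples of patch_size × merge_size = 32 (16 px patches with the usual 2×2 patch merge), not the exact library code:

```python
import math

def expected_grid(height, width, max_pixels, patch=16, merge=2):
    # Sketch of smart_resize-style rounding (assumed factor of 32 for
    # Qwen3VL's 16px patches with 2x2 merging): round each side to the
    # nearest multiple of factor, then scale down if over max_pixels.
    factor = patch * merge
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    return h_bar // patch, w_bar // patch

print(expected_grid(2000, 2000, 200_000))     # (26, 26)  -> 676 patches (expected)
print(expected_grid(2000, 2000, 16_777_216))  # (124, 124) -> 15376 patches (actual)
```

With `max_pixels=200_000` this yields the expected 26×26 grid (676 patches, 173,056 effective pixels); with the default 16,777,216 it yields the observed 124×124 grid, which is consistent with `max_pixels` never reaching the resize logic.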