This Gradio-based web application leverages the Qwen3-VL-4B-Instruct model from Alibaba's Qwen series for multimodal tasks involving images and text. It enables users to upload an image and perform various vision-language tasks, such as querying details, generating captions, detecting points of interest, or identifying bounding boxes for objects. The app draws visual annotations for point and detection tasks using the supervision library. Powered by Hugging Face Transformers, PyTorch, and Gradio, this demo showcases the model's capabilities in real-time image understanding.
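
For a sense of what the app does under the hood, here is a minimal sketch of loading the model and running a single Query-style prompt through the standard Transformers chat-template flow. The generic auto classes, the generation settings, and the `examples/boats.jpg` path are illustrative assumptions; the app's actual code may differ.

```python
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "Qwen/Qwen3-VL-4B-Instruct"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",  # uses the GPU when available (requires accelerate)
)

image = Image.open("examples/boats.jpg")  # hypothetical sample image path
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Count the total number of boats and describe the environment."},
    ],
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=512)

# Strip the prompt tokens before decoding the answer.
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)[0]
print(answer)
```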
- Query: Ask open-ended questions about the image (e.g., "Count the total number of boats and describe the environment.").
- Caption: Generate image captions of varying lengths (e.g., short, detailed).
- Point: Detect and annotate 2D keypoints for specific elements (e.g., "The gun held by the person.").
- Detect: Identify and annotate bounding boxes for objects (e.g., "The headlight of the car.").
- Visual Annotations: Automatically overlays keypoints (red dots) or bounding boxes on the output image.
- Custom Theme: Steel-blue themed interface for a modern look.
- Examples: Pre-loaded sample images and prompts to get started quickly.
- GPU Acceleration: Optimized for CUDA devices if available.
The app processes images at a thumbnail resolution (512x512) for efficiency and supports JSON-formatted outputs for structured tasks.
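
The downscaling step can be reproduced with Pillow alone. A minimal sketch, assuming `Image.thumbnail` is used to cap both sides at 512 pixels while preserving aspect ratio (the app's actual resizing code may differ):

```python
from PIL import Image

def load_thumbnail(path: str, max_size: int = 512) -> Image.Image:
    """Open an image and shrink it in place so neither side exceeds max_size."""
    image = Image.open(path).convert("RGB")
    image.thumbnail((max_size, max_size))  # preserves aspect ratio
    return image
```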
Try the live demo on Hugging Face Spaces:
https://huggingface.co/spaces/prithivMLmods/Qwen3-VL-HF-Demo
- Python 3.10+ (recommended)
- pip >= 23.0.0
Install the dependencies using the provided requirements.txt or run the following:
```bash
pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu121  # For CUDA 12.1; adjust for your setup
pip install transformers==4.57.0
pip install supervision==0.26.1
pip install accelerate==1.10.1
pip install Pillow==11.3.0
pip install gradio==5.49.1
```

For a full list, see requirements.txt:

```text
transformers==4.57.0
torchvision==0.23.0
supervision==0.26.1
accelerate==1.10.1
Pillow==11.3.0
gradio==5.49.1
torch==2.8.0
```
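
If you are unsure which PyTorch wheel index to use, a quick check after installation shows whether the GPU is visible and which CUDA runtime the wheel was built against; this is a sanity check, not part of the app itself:

```python
import torch

print(torch.__version__)          # e.g. 2.8.0+cu121
print(torch.cuda.is_available())  # True if a CUDA device is usable
print(torch.version.cuda)         # CUDA runtime the wheel targets, or None on CPU-only builds
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```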
- Clone the repository:

  ```bash
  git clone https://github.com/PRITHIVSAKTHIUR/Qwen-3VL-Multimodal-Understanding.git
  cd Qwen-3VL-Multimodal-Understanding
  ```

- Install dependencies (as above).
- Model weights are downloaded automatically on first run (requires internet).
- Run the app locally:

  ```bash
  python app.py
  ```

  This launches a Gradio interface at http://127.0.0.1:7860.

- In the interface:
  - Upload an image.
  - Select a task category (Query, Caption, Point, Detect).
  - Enter a prompt tailored to the category.
  - Click "Process Image" to generate results.
- Outputs:
  - Text: Generated response or JSON (with copy button).
  - Image: Annotated version if applicable (points or boxes).
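
The interface described above follows the standard Gradio pattern of mapping inputs (image, task category, prompt) to outputs (response text, annotated image). A minimal sketch of that wiring, with a placeholder `run_task` function standing in for the real model call; the component names and layout are illustrative, not the app's exact code:

```python
import gradio as gr
from PIL import Image

def run_task(image: Image.Image, category: str, prompt: str):
    """Placeholder: the real app builds a chat message, runs Qwen3-VL,
    and optionally draws keypoints or boxes with supervision."""
    return f"[{category}] {prompt}", image

demo = gr.Interface(
    fn=run_task,
    inputs=[
        gr.Image(type="pil", label="Input Image"),
        gr.Radio(["Query", "Caption", "Point", "Detect"], value="Query", label="Task"),
        gr.Textbox(label="Prompt"),
    ],
    outputs=[gr.Textbox(label="Response"), gr.Image(label="Annotated Image")],
)

if __name__ == "__main__":
    demo.launch()  # serves on http://127.0.0.1:7860 by default
```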
| Category | Example Prompt | Expected Output |
|---|---|---|
| Query | "Count the total number of boats and describe the environment." | Descriptive text with counts. |
| Caption | "detailed" | A long, descriptive caption. |
| Point | "The gun held by the person." | JSON with normalized (0-1) coordinates; red dots on image. |
| Detect | "Headlight of the car." | JSON with bounding boxes; colored rectangles on image. |
Sample images are included in the examples/ folder for testing.
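
For the Point and Detect categories, the model's JSON output is turned into drawings with supervision. A minimal sketch for the Detect case, assuming the JSON has already been parsed and scaled to pixel coordinates as a list of [x1, y1, x2, y2] boxes (the schema and helper name are assumptions, and the Point case is analogous with dot markers instead of boxes):

```python
import numpy as np
import supervision as sv
from PIL import Image

def draw_boxes(image: Image.Image, boxes_xyxy: list[list[float]]) -> Image.Image:
    """Overlay bounding boxes (pixel xyxy coordinates) on a copy of the image."""
    detections = sv.Detections(
        xyxy=np.array(boxes_xyxy, dtype=float),
        class_id=np.zeros(len(boxes_xyxy), dtype=int),
    )
    scene = np.array(image.convert("RGB"))
    scene = sv.BoxAnnotator(thickness=3).annotate(scene=scene, detections=detections)
    return Image.fromarray(scene)
```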
- Model: Qwen/Qwen3-VL-4B-Instruct (4B parameters, vision-language model).
- Processor: Handles chat templating and tokenization.
- Device: Auto-detects CUDA; falls back to CPU.
- Limitations:
  - Max new tokens: 512.
  - Coordinates are returned by the model in the [0, 1000] range and rescaled to [0, 1] by the app (see the sketch below).
  - No fine-tuning; relies on zero-shot prompting.
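
A minimal sketch of that coordinate handling, assuming the model emits [x1, y1, x2, y2] boxes in the 0-1000 convention (the helper is illustrative, not the app's exact code): dividing by 1000 gives the [0, 1] values shown in the JSON output, and multiplying by the image size gives the pixel coordinates used for drawing.

```python
def rescale_box(box_0_1000: list[float], width: int, height: int) -> list[float]:
    """Convert one [x1, y1, x2, y2] box from the model's 0-1000 range to pixel coordinates."""
    x1, y1, x2, y2 = (v / 1000.0 for v in box_0_1000)           # normalized to [0, 1]
    return [x1 * width, y1 * height, x2 * width, y2 * height]   # scaled to pixels

# Example: a box covering the left half of a 512x512 thumbnail.
print(rescale_box([0, 0, 500, 1000], 512, 512))  # [0.0, 0.0, 256.0, 512.0]
```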
Feel free to fork the repo, open issues, or submit pull requests. Contributions for new tasks, themes, or optimizations are welcome!