Qwen3-VL-HF-Demo

This is a Gradio-based demo application for the Qwen3-VL multimodal model. It runs inference on images, videos, PDFs, and GIFs, and includes a dedicated image-captioning mode. Users can query and analyze uploaded content with the Qwen3-VL-30B-A3B-Instruct model from Hugging Face.

Features

Image Inference: Upload an image and query it for analysis, OCR, captioning, or problem-solving.
Video Inference: Upload a video and describe or explain its content in detail.
PDF Inference: Upload a PDF, preview pages with navigation, and query for summarization, extraction, or analysis.
GIF Inference: Upload a GIF and describe its animation or actions.
Image Captioning: Generate detailed captions and attributes for uploaded images.
Advanced Options: Customize generation parameters like max new tokens, temperature, top-p, top-k, and repetition penalty.
Examples: Pre-loaded examples for each tab to get started quickly.
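
The Advanced Options listed above correspond to keyword arguments accepted by the Transformers generate method. The following is a minimal sketch of how such options could be validated and collected before a generation call. The GenerationOptions class, its default values, and the validation ranges are illustrative assumptions and are not taken from app.py.

```python
from dataclasses import asdict, dataclass


@dataclass
class GenerationOptions:
    """Hypothetical container for the Advanced Options values."""

    max_new_tokens: int = 1024        # assumed default, not from app.py
    temperature: float = 0.7          # assumed default
    top_p: float = 0.9                # assumed default
    top_k: int = 50                   # assumed default
    repetition_penalty: float = 1.1   # assumed default

    def to_generate_kwargs(self) -> dict:
        """Validate the values and return them as generation keyword arguments."""
        if self.max_new_tokens < 1:
            raise ValueError("max_new_tokens must be at least 1")
        if self.temperature <= 0:
            raise ValueError("temperature must be positive when sampling")
        if not 0 < self.top_p <= 1:
            raise ValueError("top_p must be in the interval (0, 1]")
        if self.top_k < 1:
            raise ValueError("top_k must be at least 1")
        if self.repetition_penalty <= 0:
            raise ValueError("repetition_penalty must be positive")
        kwargs = asdict(self)
        # Sampling must be enabled for temperature, top_p, and top_k to take effect.
        kwargs["do_sample"] = True
        return kwargs


options = GenerationOptions(max_new_tokens=256, temperature=0.5)
generate_kwargs = options.to_generate_kwargs()
```

The resulting dictionary could then be unpacked into a generation call, for example model.generate(**inputs, **generate_kwargs), assuming model and inputs have been prepared elsewhere.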

Requirements

To run this app, install the following dependencies:

git+https://github.com/huggingface/accelerate.git
git+https://github.com/huggingface/peft.git
transformers-stream-generator
transformers==4.57.0
huggingface_hub
albumentations
qwen-vl-utils
pyvips-binary
sentencepiece
opencv-python
docling-core
python-docx
torchvision
supervision
matplotlib
pdf2image
num2words
reportlab
html2text
xformers
markdown
requests
pymupdf
loguru
hf_xet
spaces
pyvips
pillow
gradio
einops
httpx
click
torch
fpdf
timm
av

You can install them using pip:

pip install -r requirements.txt

(Note: Use a CUDA-enabled GPU for best performance, as the model uses torch.float16.)
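
Before launching the app, it can help to check whether PyTorch actually sees a CUDA device. The sketch below is a hypothetical helper, not part of app.py. It reports "cuda" with "float16" when a GPU is visible and otherwise falls back to "cpu" with "float32"; that fallback is an assumption for illustration, and the app itself may not support CPU execution.

```python
from typing import Tuple


def pick_device_and_dtype() -> Tuple[str, str]:
    """Return (device, dtype name), preferring CUDA with half precision."""
    try:
        import torch  # imported lazily so the check also runs without PyTorch
    except ImportError:
        return "cpu", "float32"
    if torch.cuda.is_available():
        return "cuda", "float16"
    # Assumed fallback: full precision on CPU.
    return "cpu", "float32"


device, dtype_name = pick_device_and_dtype()
print(f"device={device}, dtype={dtype_name}")
```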

Installation

Clone the repository:

git clone https://github.com/PRITHIVSAKTHIUR/Qwen3-VL-HF-Demo.git
cd Qwen3-VL-HF-Demo

Install the requirements (as listed above).
Downloading the model manually is optional; the script loads it automatically from Hugging Face.
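
If you prefer to fetch the weights ahead of time, for example on a machine with a slow connection, one option is the snapshot_download function from huggingface_hub. This is a sketch of an optional step rather than something the repository documents; the prefetch_model helper is a hypothetical name, and a checkpoint of this size requires substantial disk space.

```python
MODEL_ID = "Qwen/Qwen3-VL-30B-A3B-Instruct"


def prefetch_model(repo_id: str = MODEL_ID) -> str:
    """Download the model snapshot into the local Hugging Face cache.

    Returns the local directory that holds the downloaded files.
    """
    # Imported lazily so defining this helper does not require the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id)


# Uncomment to start the download (large; run only when you intend to fetch it):
# local_dir = prefetch_model()
```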

Usage

Run the app using Python:

python app.py

The app will launch a Gradio interface in your browser.
Select a tab (Image, Video, PDF, GIF, or Caption).
Upload media and enter a query.
Adjust advanced options if desired.
Click "Submit" to generate output.
For PDFs, use navigation buttons to preview pages.
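
The PDF page navigation described above comes down to keeping a page index within bounds as the user clicks the previous and next buttons. A minimal sketch of that logic follows. The step_page and page_label helpers are hypothetical names, and the actual implementation in app.py may differ.

```python
def step_page(current: int, delta: int, total_pages: int) -> int:
    """Move the zero-based page index by delta, clamped to the valid range."""
    if total_pages <= 0:
        return 0
    return max(0, min(current + delta, total_pages - 1))


def page_label(current: int, total_pages: int) -> str:
    """Build a human-readable, one-based page indicator."""
    if total_pages <= 0:
        return "No pages"
    return f"Page {current + 1} / {total_pages}"


page = 0
page = step_page(page, +1, total_pages=5)   # "next" button
page = step_page(page, -1, total_pages=5)   # "previous" button
page = step_page(page, -1, total_pages=5)   # already at the first page, stays there
```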

The app supports streaming output for real-time generation and renders results in both raw text and Markdown formats.
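
In Gradio, streaming is typically achieved by making the event handler a generator that yields the accumulated text each time a new chunk arrives. The sketch below illustrates that pattern with a simulated chunk source. In a real app the chunks might come from a streamer such as TextIteratorStreamer in Transformers; whether app.py does exactly this is an assumption, and stream_response is a hypothetical name.

```python
from typing import Iterable, Iterator, Tuple


def stream_response(chunks: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (raw_text, markdown_text) pairs as the output grows.

    Each yield carries the full text so far, so a UI can simply replace
    the contents of its raw-text and Markdown components on every update.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        yield buffer, buffer


# Simulated model output arriving in pieces.
fake_chunks = ["The image ", "shows a ", "**cat**."]
updates = list(stream_response(fake_chunks))
```

Yielding the same string for both outputs reflects the idea that the raw view shows the text verbatim while the Markdown view renders it.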

Acknowledgements

App by: Prithiv Sakthi U R

Model: Qwen/Qwen3-VL-30B-A3B-Instruct

License

Apache License, Version 2.0
