A demo of Qwen3-VL-30B-A3B-Instruct, the next-generation vision-language model in the Qwen series. The model delivers comprehensive upgrades across the board, including superior text understanding and generation, deeper visual perception and reasoning, extended context length, and enhanced comprehension of spatial relationships and video dynamics.

PRITHIVSAKTHIUR/Qwen3-VL-HF-Demo

Qwen3-VL-HF-Demo

This is a Gradio-based demo application for the Qwen3-VL multimodal model. It lets users run inference on several media types (images, videos, PDFs, and GIFs) and generate image captions. The app queries and analyzes content using the Qwen3-VL-30B-A3B-Instruct model from Hugging Face.

Screenshots: Qwen3 VL HF Demo, a Hugging Face Space by prithivMLmods (captured 2025-10-12).

Features

  • Image Inference: Upload an image and query it for analysis, OCR, captioning, or problem-solving.
  • Video Inference: Upload a video and describe or explain its content in detail.
  • PDF Inference: Upload a PDF, preview pages with navigation, and query for summarization, extraction, or analysis.
  • GIF Inference: Upload a GIF and describe its animation or actions.
  • Image Captioning: Generate detailed captions and attributes for uploaded images.
  • Advanced Options: Customize generation parameters like max new tokens, temperature, top-p, top-k, and repetition penalty.
  • Examples: Pre-loaded examples for each tab to get started quickly.
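
The Advanced Options listed above correspond to standard sampling arguments of the Hugging Face `generate()` method. The following is a minimal sketch of how such settings could be collected and validated before being passed to the model. The `GenerationOptions` class, its default values, and the validation ranges are assumptions made for illustration; they are not taken from the app's source.

```python
from dataclasses import dataclass, asdict


@dataclass
class GenerationOptions:
    """Hypothetical container for the Advanced Options controls."""
    max_new_tokens: int = 1024        # assumed default, not taken from the app
    temperature: float = 0.7          # assumed default
    top_p: float = 0.9                # assumed default
    top_k: int = 50                   # assumed default
    repetition_penalty: float = 1.1   # assumed default

    def validate(self) -> None:
        """Reject values that sampling-based generation cannot use."""
        if self.max_new_tokens < 1:
            raise ValueError("max_new_tokens must be at least 1")
        if self.temperature <= 0:
            raise ValueError("temperature must be positive")
        if not 0 < self.top_p <= 1:
            raise ValueError("top_p must be in (0, 1]")
        if self.top_k < 1:
            raise ValueError("top_k must be at least 1")
        if self.repetition_penalty <= 0:
            raise ValueError("repetition_penalty must be positive")

    def to_kwargs(self) -> dict:
        """Return keyword arguments suitable for model.generate(**kwargs)."""
        self.validate()
        # do_sample=True is needed for temperature/top-p/top-k to take effect.
        return {**asdict(self), "do_sample": True}
```

Lower temperature and top-k values make output more deterministic, which tends to suit OCR and extraction queries; higher values give more varied captions.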

Requirements

To run this app, install the following dependencies:

git+https://github.com/huggingface/accelerate.git
git+https://github.com/huggingface/peft.git
transformers-stream-generator
transformers==4.57.0
huggingface_hub
albumentations
qwen-vl-utils
pyvips-binary
sentencepiece
opencv-python
docling-core
python-docx
torchvision
supervision
matplotlib
pdf2image
num2words
reportlab
html2text
xformers
markdown
requests
pymupdf
loguru
hf_xet
spaces
pyvips
pillow
gradio
einops
httpx
click
torch
fpdf
timm
av

You can install them using pip:

pip install -r requirements.txt

(Note: Use a CUDA-enabled GPU for best performance, as the model uses torch.float16.)
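
After running the pip command, it can help to confirm that every listed dependency actually installed. The sketch below uses only the standard library and is not part of the repository. The helper names `requirement_name` and `missing_packages` are hypothetical, and the sketch assumes that each `git+https://...` entry installs a distribution named after its repository, which holds for `accelerate` and `peft`.

```python
import re
from importlib import metadata


def requirement_name(line: str) -> str:
    """Extract a distribution name from one requirements.txt line."""
    line = line.strip()
    if line.startswith("git+"):
        # e.g. git+https://github.com/huggingface/peft.git -> peft
        name = line.rstrip("/").rsplit("/", 1)[-1]
        return name[:-4] if name.endswith(".git") else name
    # Drop version specifiers, extras, and markers such as ==4.57.0
    return re.split(r"[<>=!~\[; ]", line, maxsplit=1)[0]


def missing_packages(lines):
    """Return the names from `lines` that are not installed."""
    missing = []
    for line in lines:
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        name = requirement_name(line)
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    with open("requirements.txt", encoding="utf-8") as handle:
        print("Missing:", missing_packages(handle) or "none")
```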

Installation

  1. Clone the repository:

    git clone https://github.com/PRITHIVSAKTHIUR/Qwen3-VL-HF-Demo.git
    cd Qwen3-VL-HF-Demo
  2. Install the requirements (as listed above).

  3. No manual model download is required; the script loads the model automatically from Hugging Face.

Usage

Run the app using Python:

python app.py
  • The app will launch a Gradio interface in your browser.
  • Select a tab (Image, Video, PDF, GIF, or Caption).
  • Upload media and enter a query.
  • Adjust advanced options if desired.
  • Click "Submit" to generate output.
  • For PDFs, use navigation buttons to preview pages.
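
The PDF page navigation mentioned in the last step amounts to moving a page index forward or backward without leaving the document. The following is a minimal sketch of that logic. The function names `navigate` and `page_label` and the label text are hypothetical, and the app's actual implementation may differ.

```python
def navigate(current: int, total: int, direction: str) -> int:
    """Return the new zero-based page index, clamped to [0, total - 1].

    `direction` is "prev" or "next"; with no pages loaded the index is 0.
    """
    if total <= 0:
        return 0
    step = {"prev": -1, "next": 1}[direction]
    return max(0, min(total - 1, current + step))


def page_label(current: int, total: int) -> str:
    """Human-readable position indicator for the preview pane."""
    return f"Page {current + 1} / {total}" if total > 0 else "No file loaded"
```

Clamping keeps the buttons safe to click on the first and last pages instead of raising an index error.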

The app supports streaming output for real-time generation and renders results in both raw text and Markdown formats.
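
Streaming output in a Gradio handler is usually written as a generator that yields the accumulated text after every new chunk, so each output component is redrawn as generation proceeds. The sketch below illustrates that pattern with a stand-in list of chunks. In a real app the chunks would likely come from a streamer object such as `TextIteratorStreamer` in `transformers`, but that wiring is an assumption and is not shown here. The name `stream_output` is hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def stream_output(chunks: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (raw_text, markdown_text) after each new chunk arrives."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # The same accumulated string feeds both the raw-text view and
        # the Markdown view; only the rendering component differs.
        yield buffer, buffer


if __name__ == "__main__":
    # Stand-in for model output; a real handler would iterate a streamer.
    fake_chunks = ["The ", "image ", "shows ", "a **cat**."]
    for raw_text, _markdown_text in stream_output(fake_chunks):
        print(raw_text)
```

Yielding the full buffer rather than only the newest chunk matters because each yielded value replaces the component's contents.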

Acknowledgements

App by: Prithiv Sakthi U R

Model: Qwen/Qwen3-VL-30B-A3B-Instruct

License

Apache License, Version 2.0
