A minimal media captioning tool powered by Qwen2.5/3 VL Instruct from Alibaba Group.
This tool uses a Gradio UI to batch process folders of images and videos and generate descriptive captions.
Written by Olli S.
Version 1.0.1
- ✅ Uses Qwen2.5/3 VL Instruct for high-quality understanding
- ✅ Support for:
- Qwen/Qwen3-VL-4B-Instruct
- Qwen/Qwen3-VL-8B-Instruct
- Qwen/Qwen2.5-VL-3B-Instruct
- Qwen/Qwen2.5-VL-7B-Instruct
- ✅ Flash attention 2 support (with toggle)
- ✅ Quantization via BitsAndBytes (None / 8-bit / 4-bit); a loading sketch covering quantization and the attention toggle follows this list
- ✅ Caption multiple images or videos from a selected folder
- ✅ Sub-folder support
- ✅ Supports prompt customization
- ✅ "Summary Mode" and "One-Sentence Mode" options for different caption styles
- ✅ Can skip already-captioned images
- ✅ Image previews with real-time progress
- ✅ Abort long runs safely
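The flash attention and quantization options above correspond to standard Hugging Face loading arguments. As a rough sketch of how such a loader can look (assuming a recent transformers release with AutoModelForImageTextToText and the bitsandbytes package; the model ID, dtype, and defaults are illustrative, not necessarily what app.py does):

```python
# Sketch: load a Qwen VL model with optional BitsAndBytes quantization and
# flash attention 2, falling back to eager attention when flash-attn is missing.
# Illustrative only; see app.py for the actual loader.
import importlib.util

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig

def load_model(model_id="Qwen/Qwen2.5-VL-7B-Instruct", quant="None", use_flash=True):
    # Map the UI quantization choice (None / 8-bit / 4-bit) to a BitsAndBytes config.
    quant_config = None
    if quant == "8-bit":
        quant_config = BitsAndBytesConfig(load_in_8bit=True)
    elif quant == "4-bit":
        quant_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
        )

    # Prefer flash attention 2 when the flash-attn package is installed.
    attn = "flash_attention_2" if use_flash and importlib.util.find_spec("flash_attn") else "eager"

    model = AutoModelForImageTextToText.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",          # place weights on the GPU automatically
        quantization_config=quant_config,
        attn_implementation=attn,
    )
    processor = AutoProcessor.from_pretrained(model_id)
    return model, processor
```

The eager fallback mirrors the auto-fallback noted in the 1.0.1 changelog; it keeps loading functional on machines where flash-attn is not installed.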
- Python 3.9+
- A modern NVIDIA GPU with CUDA (tested on Ampere and newer)
- ~16GB VRAM recommended for smooth operation
- Clone the repository:

      git clone https://github.com/o-l-l-i/simple-captioner.git
      cd simple-captioner

- Create a virtual environment (optional but recommended):

      python -m venv venv
      source venv/bin/activate   # On Windows: venv\Scripts\activate
- Install the dependencies:

      pip install -r requirements.txt
- Install Torch with GPU support:
  - You have to install a GPU-compatible Torch build yourself; get it from:
  - https://pytorch.org/get-started/locally/
  - Copy the "Run this Command" string from the page after selecting the correct options.
  - e.g. if you have CUDA 12.8, select that option (Windows, Pip, Python, CUDA 12.8):

        pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

  - A quick snippet for verifying the GPU install is shown after the installation steps.
- Install Triton:
  - On Windows, install woctordho's Triton fork for Windows:

        pip install triton           # Linux/macOS
        pip install triton-windows   # On Windows use this

- Run the app:

      python app.py
- To run the app later:
  - When you come back to use it, the virtual environment (venv) needs to be activated again.
  - Use or modify the included startup scripts.
Windows:

- run_app.bat

      @echo off
      call venv\Scripts\activate
      python app.py

Linux/macOS:

- run_app.sh

      #!/bin/bash
      source venv/bin/activate
      python app.py

Make it executable:

    chmod +x run_app.sh
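Once the dependencies are installed (and the venv is active), a quick check that the Torch you installed was built with CUDA and can see your GPU:

```python
# Quick sanity check for the GPU-enabled Torch install.
import torch

print("Torch version:", torch.__version__)
print("CUDA build:", torch.version.cuda)            # None means a CPU-only build
print("GPU available:", torch.cuda.is_available())
```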
When you run the app for the first time, the selected model (Qwen/Qwen3-VL-8B-Instruct by default as of 1.0.1) is automatically downloaded from Hugging Face. This download is cached locally, so subsequent runs are much faster and offline-compatible.
By default, Hugging Face stores downloaded models in:
Linux/macOS: ~/.cache/huggingface/
Windows: C:\Users\<YourUsername>\.cache\huggingface\

You can inspect, manage, or clear this cache manually, or change the location by setting the HF_HOME environment variable:
export HF_HOME=/custom/path/to/huggingface
# On Windows: set HF_HOME=E:\huggingface_cache

This is useful if you're working with limited disk space or want to centralize model caches across multiple projects.
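If you'd rather do this from Python (for example at the top of a launcher script), setting the variable before any Hugging Face library is imported has the same effect; the path below is just an example:

```python
# Sketch: redirect the Hugging Face cache from Python.
# This must run before transformers / huggingface_hub are imported.
import os

os.environ["HF_HOME"] = "/custom/path/to/huggingface"  # example location

from transformers import AutoProcessor  # imported only after HF_HOME is set
```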
To enable video processing, make sure qwen-vl-utils is installed. On Linux:
pip install qwen-vl-utils[decord]==0.0.8
On other platforms (Windows/macOS):
pip install qwen-vl-utils

This will fall back to using torchvision for video loading if decord does not work, which is slower. For better performance, you can try to install decord from source.
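For reference, the part of qwen-vl-utils that matters here is process_vision_info, which turns a chat message containing a video (or image) entry into the inputs the processor expects. A rough sketch with an illustrative path and frame rate:

```python
# Sketch: preparing a video for captioning with qwen-vl-utils.
from qwen_vl_utils import process_vision_info

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/clip.mp4", "fps": 1.0},
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Extracts the frames (via decord when available, torchvision otherwise).
image_inputs, video_inputs = process_vision_info(messages)
```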
- Place your images in a folder (it is scanned recursively, so subfolders are supported).
- Text files with the same name (e.g. image1.jpg → image1.txt) are created alongside the images.
- Use the “Skip already captioned” checkbox to avoid reprocessing; the sketch after this list shows the idea.
- Captions can be styled with prompt modifiers or sentence-length constraints.
- Prompt handling is adjustable with toggles.
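As a minimal sketch of the folder and file-naming convention described above (the extension list and function are illustrative, not the app's actual code):

```python
# Sketch: find media files recursively and skip ones that already have a caption.
from pathlib import Path

MEDIA_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp", ".mp4"}  # illustrative set

def pending_media(folder: str, skip_captioned: bool = True):
    """Yield (media_path, caption_path) pairs that still need a caption."""
    for path in sorted(Path(folder).rglob("*")):
        if path.suffix.lower() not in MEDIA_EXTENSIONS:
            continue
        caption_file = path.with_suffix(".txt")  # image1.jpg -> image1.txt
        if skip_captioned and caption_file.exists():
            continue
        yield path, caption_file
```

Each generated caption is then written to the matching .txt path next to the media file.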
- Modify the base prompt or model behavior in generate_caption() inside the code; a rough sketch of its general shape follows after this list.
- Want more control over output format? Adjust the file writing or UI code.
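For orientation, this is the general shape of a Qwen VL captioning call with transformers and qwen-vl-utils. The real generate_caption() in app.py adds the prompt toggles, caption modes, and abort handling, so treat this as a sketch only (prompt text and token limit are illustrative):

```python
# Sketch: the rough shape of a single-image captioning call.
import torch
from qwen_vl_utils import process_vision_info

def generate_caption(model, processor, image_path: str,
                     prompt: str = "Describe this image in detail.",
                     max_new_tokens: int = 256) -> str:
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": f"file://{image_path}"},
                {"type": "text", "text": prompt},
            ],
        }
    ]

    # Build the chat-formatted prompt and the pixel inputs.
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    ).to(model.device)

    with torch.inference_mode():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens and decode only the newly generated caption.
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0].strip()
```

Reducing max_new_tokens here is what the troubleshooting note below means by lowering max_tokens when VRAM runs short.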
- Make sure you’re using a CUDA-compatible GPU.
- On Windows you have to install a GPU-compatible Torch build yourself; get it from:
- https://pytorch.org/get-started/locally/
- Select a Torch version which matches your CUDA version.
- If VRAM usage is too high, reduce max_tokens. The app has only been tested on an RTX 3090 and 5090, but VRAM usage was monitored during testing.
1.0.1 - 2025-10-15

- Model dropdown with multiple model support.
- Quantization (None / 8-bit / 4-bit).
- Attention implementation toggle (flash attention 2 supported) + auto-fallback to eager.
- Model is no longer loaded at import; it loads via the UI or on app start.
- Defaults to Qwen/Qwen3-VL-8B-Instruct; this can be memory intensive, so use quantization or the 4B model.
- Improved VRAM cleanup.
1.0.0
- Initial release.
- Qwen/Qwen2.5-VL-7B-Instruct support for image and video captioning.
This project is currently in a very early phase of development. While it aims to provide useful image and video captioning capabilities, you may encounter bugs, unexpected behavior, or incomplete features.
If you run into any issues:
- Please check the console or logs for error messages.
- Try to use supported media formats as listed.
- Feel free to report problems or request features via the project’s GitHub Issues page.
Copyright (c) 2025 Olli Sorjonen
This project is source-available, but not open-source under a standard open-source license, and not freeware. You may use and experiment with it freely, and any results you create with it are yours to use however you like.
However:
Redistribution, resale, rebranding, or claiming authorship of this code or extension is strictly prohibited without explicit written permission.
Use at your own risk. No warranties or guarantees are provided.
The only official repository for this project is: 👉 https://github.com/o-l-l-i/simple-captioner
Created by @o-l-l-i
