A Gradio-based web UI for voice cloning and voice design, powered by Qwen3-TTS and VibeVoice. Supports both Whisper and VibeVoice-ASR for automatic transcription.
Clone voices from your own audio samples. Just provide a short reference audio clip with its transcript, and generate new speech in that voice.
Choose Your Engine:
- Qwen Small/Fast or VibeVoice Small/Fast
- Voice prompt caching - First generation processes the sample, subsequent ones are instant (see the sketch below)
- Seed control - Reproducible results with saved seeds
- Metadata tracking - Each output saves generation info (sample, seed, text)
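To make the caching, seed, and metadata behavior concrete, here is a minimal sketch of how those pieces could fit together. The helper names and cache layout are illustrative only, not the app's actual internals:

```python
import hashlib
import json
import random
from pathlib import Path

_prompt_cache = {}  # cache key -> processed voice prompt

def get_voice_prompt(sample_wav: Path, transcript: str, process_fn):
    """Process the reference clip once, then reuse the cached prompt on later generations."""
    key = hashlib.sha1(f"{sample_wav}:{transcript}".encode()).hexdigest()
    if key not in _prompt_cache:
        _prompt_cache[key] = process_fn(sample_wav, transcript)  # slow, first generation only
    return _prompt_cache[key]

def pick_seed(seed: int | None = None) -> int:
    """A fixed seed reproduces the same output; None picks a fresh random seed."""
    return seed if seed is not None else random.randint(0, 2**31 - 1)

def save_metadata(out_wav: Path, sample: str, seed: int, text: str):
    """Write a sidecar JSON so every output records how it was made."""
    meta = {"sample": sample, "seed": seed, "text": text}
    out_wav.with_suffix(".json").write_text(json.dumps(meta, indent=2), encoding="utf-8")
```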
Create multi-speaker dialogues using either Qwen's premium voices or your own custom voice samples via VibeVoice:
Choose Your Engine:
- Qwen - Fast generation with 9 preset voices, optimized for their native languages
- VibeVoice - High-quality custom voices, up to 90 minutes continuous, perfect for podcasts/audiobooks
Unified Script Format:
Write scripts using the `[N]:` format - it works seamlessly with both engines:
```
[1]: Hey, how's it going?
[2]: I'm doing great, thanks for asking!
[3]: Mind if I join this conversation?
```
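For reference, a script in this format reduces to (speaker number, line) pairs; the hypothetical parser below shows the idea and is not the app's own code:

```python
import re

LINE_RE = re.compile(r"^\[(\d+)\]:\s*(.+)$")

def parse_script(script: str) -> list[tuple[int, str]]:
    """Turn '[N]: text' lines into (speaker_number, text) pairs, skipping blank lines."""
    turns = []
    for line in script.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            turns.append((int(match.group(1)), match.group(2)))
    return turns

print(parse_script("[1]: Hey, how's it going?\n[2]: I'm doing great!"))
# -> [(1, "Hey, how's it going?"), (2, "I'm doing great!")]
```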
Qwen Mode:
- Mix any of the 9 premium speakers
- Adjustable pause duration between lines
- Fast generation with cached prompts
Speaker Mapping:
- [1] = Vivian, [2] = Serena, [3] = Uncle_Fu, [4] = Dylan, [5] = Eric
- [6] = Ryan, [7] = Aiden, [8] = Ono_Anna, [9] = Sohee
VibeVoice Mode:
- Up to 90 minutes of continuous speech
- Up to 4 distinct speakers using your own voice samples
- Cross-lingual support
- May spontaneously add background music/sounds for realism
- Numbers beyond 4 wrap around (5→1, 6→2, 7→3, 8→4, etc.)
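The wrap-around for speaker numbers above 4 is simple modular arithmetic; a quick illustrative sketch:

```python
def vibevoice_slot(n: int, max_speakers: int = 4) -> int:
    """Map a script speaker number onto one of the available VibeVoice voice slots."""
    return ((n - 1) % max_speakers) + 1

# Speakers 5-8 wrap back onto slots 1-4.
assert [vibevoice_slot(n) for n in range(1, 9)] == [1, 2, 3, 4, 1, 2, 3, 4]
```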
Perfect for:
- Podcasts
- Audiobooks
- Long-form conversations
- Multi-speaker narratives
Models:
- Small - Faster generation (Qwen: 0.6B, VibeVoice: 1.5B)
- Large - Best quality (Qwen: 1.7B, VibeVoice: Large model)
Generate speech with premium pre-built voices and optional style instructions using the Qwen3-TTS CustomVoice model:
| Speaker | Description | Language |
|---|---|---|
| Vivian | Bright, slightly edgy young female | Chinese |
| Serena | Warm, gentle young female | Chinese |
| Uncle_Fu | Seasoned male, low mellow timbre | Chinese |
| Dylan | Youthful Beijing male, clear natural | Chinese (Beijing) |
| Eric | Lively Chengdu male, husky brightness | Chinese (Sichuan) |
| Ryan | Dynamic male, strong rhythmic drive | English |
| Aiden | Sunny American male, clear midrange | English |
| Ono_Anna | Playful Japanese female, light nimble | Japanese |
| Sohee | Warm Korean female, rich emotion | Korean |
- Style instructions supported (emotion, tone, speed)
- Each speaker works best in its native language but supports all languages
Create voices from natural language descriptions - no audio needed, using Qwen3-TTS Voice Design Model:
- Describe age, gender, emotion, accent, speaking style
- Generate unique voices matching your description
Fine-tune your own custom voice models with your training data:
- Dataset Management - Organize training samples in the `datasets/` folder
- Audio Preparation - Auto-converts audio to 24kHz 16-bit mono format (see the sketch after this list)
- Training Pipeline - Complete 3-step workflow (validation → extract codes → train)
- Epoch Selection - Compare different training checkpoints
- Live Progress - Real-time training logs and loss monitoring
- Voice Presets Integration - Use trained models alongside premium speakers
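The 24kHz 16-bit mono conversion can also be reproduced by hand. The sketch below shells out to ffmpeg; the app performs this step automatically and its exact command may differ:

```python
import subprocess
from pathlib import Path

def prepare_audio(src: Path, dst: Path):
    """Convert any input file to 24 kHz, 16-bit, mono WAV using ffmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src),
         "-ar", "24000",        # resample to 24 kHz
         "-ac", "1",            # downmix to mono
         "-c:a", "pcm_s16le",   # 16-bit PCM
         str(dst)],
        check=True,
    )

prepare_audio(Path("raw/clip.mp3"), Path("datasets/YourSpeakerName/clip.wav"))
```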
Requirements:
- CUDA GPU required
- Multiple audio samples with transcripts
- Training time: ~10-30 minutes depending on dataset size
Workflow:
- Prepare audio files (WAV/MP3) and organize them in a `datasets/YourSpeakerName/` folder
- Use Batch Transcribe to automatically transcribe all files at once (see the sketch below)
- Review and edit individual transcripts as needed
- Configure training parameters (model size, epochs, learning rate)
- Monitor training progress in real-time
- Use trained model in Voice Presets tab
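On Windows, where openai-whisper is installed, a batch-transcribe pass amounts to something like the following. This is a simplified sketch using Whisper directly; the app's own implementation may differ and uses VibeVoice ASR on Linux/macOS:

```python
from pathlib import Path
import whisper  # openai-whisper (installed on Windows only)

model = whisper.load_model("medium")

for wav in sorted(Path("datasets/YourSpeakerName").glob("*.wav")):
    text = model.transcribe(str(wav))["text"].strip()
    wav.with_suffix(".txt").write_text(text, encoding="utf-8")  # transcript next to the audio
    print(f"{wav.name}: {text}")
```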
Full audio preparation workspace:
- Trim - Use waveform selection to cut audio
- Normalize - Balance audio levels
- Convert to Mono - Ensure single-channel audio
- Transcribe - Whisper or VibeVoice ASR automatic transcription
- Batch Transcribe - Process entire folders of audio files at once
- Save as Sample - One-click sample creation
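Saving a sample simply stores a .wav and a matching .txt transcript together in the samples/ folder (see Project Structure below). A minimal sketch of that convention, with hypothetical helper and path names:

```python
import shutil
from pathlib import Path

def save_sample(audio: Path, transcript: str, name: str, samples_dir: Path = Path("samples")):
    """Store a reference clip and its transcript as a <name>.wav / <name>.txt pair."""
    samples_dir.mkdir(exist_ok=True)
    shutil.copy(audio, samples_dir / f"{name}.wav")
    (samples_dir / f"{name}.txt").write_text(transcript, encoding="utf-8")

save_sample(Path("prepped/clip.wav"), "Hello there, this is my reference clip.", "my_voice")
```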
View, play back, and manage your previously generated audio files.
- Python 3.10+ (recommended for all platforms)
- CUDA-compatible GPU (recommended: 8GB+ VRAM)
- SOX (Sound eXchange) - Required for audio processing
- FFMPEG - Multimedia framework required for audio format conversion
- Flash Attention 2 (optional but recommended)
Note for Linux users: The Linux installation skips openai-whisper (compatibility issues). VibeVoice ASR is used for transcription instead.
- Clone the repository:
```
git clone https://github.com/FranckyB/Voice-Clone-Studio.git
cd Voice-Clone-Studio
```
- Run the setup script:
```
setup-windows.bat
```
This will automatically:
- Install SOX (audio processing)
- Create virtual environment
- Install PyTorch with CUDA support
- Install all dependencies
- Display your Python version
- Show instructions for optional Flash Attention 2 installation
- Clone the repository:
```
git clone https://github.com/FranckyB/Voice-Clone-Studio.git
cd Voice-Clone-Studio
```
- Make the setup script executable and run it:
```
chmod +x setup-linux.sh
./setup-linux.sh
```
This will automatically:
- Detect your Python version
- Create virtual environment
- Install PyTorch with CUDA support
- Install all dependencies (using requirements file)
- Handle ONNX Runtime installation issues
- Warn about Whisper compatibility if needed
- Clone the repository:
```
git clone https://github.com/FranckyB/Voice-Clone-Studio.git
cd Voice-Clone-Studio
```
- Create a virtual environment:
```
python -m venv venv
# Windows
venv\Scripts\activate
# Linux/macOS
source venv/bin/activate
```
- (NVIDIA GPU) Install PyTorch with CUDA support:
```
# Linux/Windows
pip install torch==2.9.1 torchaudio --index-url https://download.pytorch.org/whl/cu130
```
- Install dependencies:
```
# All platforms (Windows, Linux, macOS)
pip install -r requirements.txt
```
Note: The requirements file uses platform markers to automatically install the correct packages:
- Windows: Includes `openai-whisper` for transcription
- Linux/macOS: Excludes `openai-whisper` (uses VibeVoice ASR instead)
- Install SOX:
```
# Windows
winget install -e --id ChrisBagwell.SoX
# Linux (Debian/Ubuntu)
sudo apt install sox libsox-dev
# Linux (Fedora/RHEL)
sudo dnf install sox sox-devel
# macOS
brew install sox
```
- Install FFMPEG:
```
# Windows
winget install -e --id Gyan.FFmpeg
# Linux (Debian/Ubuntu)
sudo apt install ffmpeg
# Linux (Fedora/RHEL)
sudo dnf install ffmpeg
# macOS
brew install ffmpeg
```
- (Optional) Install Flash Attention 2 for faster generation.
Note: The application automatically detects and uses the best available attention mechanism. Configure it in the Settings tab: `auto` (recommended) → `flash_attention_2` → `sdpa` → `eager`
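As an illustration of what such an "auto" fallback can look like (a hypothetical sketch; the app's actual detection logic may differ), the choice maps onto the standard `attn_implementation` values used by Hugging Face Transformers:

```python
import importlib.util
import torch

def pick_attention(preference: str = "auto") -> str:
    """Pick an attn_implementation string, falling back from fastest to safest."""
    if preference != "auto":
        return preference
    if torch.cuda.is_available() and importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    if hasattr(torch.nn.functional, "scaled_dot_product_attention"):
        return "sdpa"
    return "eager"

print(pick_attention())  # e.g. "flash_attention_2" on a CUDA machine with flash-attn installed
```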
For troubleshooting solutions, see docs/troubleshooting.md.
- Install NVIDIA Drivers (Windows Side)
  - Install the latest standard NVIDIA driver (Game Ready or Studio) for Windows from the NVIDIA Drivers page.
  - Crucial: Do not try to install NVIDIA drivers inside your WSL Linux terminal. It will conflict with the host driver.
- Update WSL 2
  - Open PowerShell as Administrator and ensure your WSL kernel is up to date: `wsl --update`
  - If you don't have WSL installed yet, run `wsl --install` and restart your computer.
- Configure Docker Desktop
  - Install the latest version of Docker Desktop for Windows.
  - Open Docker Desktop Settings (gear icon).
  - Under General, ensure "Use the WSL 2 based engine" is checked.
  - Under Resources > WSL Integration, ensure the switch is enabled for your default Linux distro (e.g., Ubuntu).
- Run with Docker Compose
  - Run the following command in the repository root:
    ```
    docker-compose up --build
    ```
  - The application will be accessible at http://127.0.0.1:7860.
To verify the installation and features (like the DeepFilterNet denoiser), run the integration tests inside the container:
```
# Run the Denoiser Integration Test
docker-compose exec voice-clone-studio python tests/integration_test_denoiser.py
```
To launch the UI, run:
```
python voice_clone_studio.py
```
Or use the batch file (Windows):
```
launch.bat
```
The UI will open at http://127.0.0.1:7860
- Go to the Prep Samples tab
- Upload or record audio (3-10 seconds of clear speech)
- Trim and normalize as needed
- Transcribe or manually enter the text
- Save as a sample with a name
- Go to the Voice Clone tab
- Select your sample from the dropdown
- Enter the text you want to speak
- Click Generate
- Go to the Voice Design tab
- Enter the text to speak
- Describe the voice (e.g., "Young female, warm and friendly, slight British accent")
- Click Generate
```
Qwen3-TTS-Voice-Clone-Studio/
├── voice_clone_ui.py      # Main Gradio application
├── requirements.txt       # Python dependencies
├── __Launch_UI.bat        # Windows launcher
├── samples/               # Voice samples (.wav + .txt pairs)
│   ├── example.wav
│   └── example.txt
├── output/                # Generated audio outputs
└── vendor/                # Included technology
    ├── vibevoice_asr      # Newest version of VibeVoice with ASR support
    └── vibevoice_tts      # Prior version of VibeVoice with TTS support
```
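Since the samples/ layout is just paired files, a list of available voices can be built by scanning for .wav files with a matching .txt transcript. An illustrative sketch, not the app's own code:

```python
from pathlib import Path

def list_samples(samples_dir: Path = Path("samples")) -> dict[str, tuple[Path, str]]:
    """Map sample name -> (audio path, transcript) for every .wav/.txt pair."""
    samples = {}
    for wav in sorted(samples_dir.glob("*.wav")):
        txt = wav.with_suffix(".txt")
        if txt.exists():
            samples[wav.stem] = (wav, txt.read_text(encoding="utf-8").strip())
    return samples

print(list(list_samples()))  # e.g. ['example']
```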
Each tab lets you choose between model sizes:
| Model | Sizes | Use Case |
|---|---|---|
| Qwen3-TTS Base | Small, Large | Voice cloning from samples |
| Qwen3-TTS CustomVoice | Small, Large | Premium speakers with style control |
| Qwen3-TTS VoiceDesign | 1.7B only | Voice design from descriptions |
| VibeVoice-TTS | Small, Large | Voice cloning & Long-form multi-speaker (up to 90 min) |
| VibeVoice-ASR | Large | Audio transcription |
| Whisper | Medium | Audio transcription |
- Small = Faster, less VRAM (Qwen: 0.6B ~4GB, VibeVoice: 1.5B)
- Large = Better quality, more expressive (Qwen: 1.7B ~8GB, VibeVoice: Large model)
- A 4-bit quantized version of the Large model is also included for VibeVoice.
Models are automatically downloaded on first use via HuggingFace.
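If you prefer to pre-fetch weights instead of waiting for the first run, huggingface_hub can download a model repository ahead of time into the same cache. The repo id below is a placeholder; substitute the actual model id you want:

```python
from huggingface_hub import snapshot_download

# Placeholder repo id - substitute the actual model repository you want to pre-fetch.
snapshot_download(repo_id="your-org/your-model-repo")
```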
- Reference Audio: Use clear, noise-free recordings (3-10 seconds)
- Transcripts: Should exactly match what's spoken in the audio
- Caching: Voice prompts are cached - first generation is slow, subsequent ones are fast
- Seeds: Use the same seed to reproduce identical outputs
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
This project is based on and uses code from:
- Qwen3-TTS - Apache 2.0 License (Alibaba)
- VibeVoice - MIT License
- Gradio - Apache 2.0 License
- OpenAI Whisper - MIT License
- Qwen3-TTS by Alibaba
- VibeVoice by Microsoft
- Gradio for the web UI framework
- OpenAI Whisper for transcription
For detailed version history and release notes, see docs/updates.md.
Latest Version: 0.6.0 - Enhanced Model Support & Settings (January 27, 2026)