Orpheus-Edge-TTS-Demo


Orpheus TTS is a Llama-based Speech-LLM designed for high-quality, empathetic text-to-speech generation. The model has been fine-tuned to deliver human-level speech synthesis.

Warning

Set HF_TOKEN in your environment before launching the app; it is required to download the gated Hugging Face models.
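
A minimal sketch of a startup check, assuming the app reads the token from the `HF_TOKEN` environment variable. The helper name `require_hf_token` is hypothetical and is not taken from `app.py`.

```python
import os


def require_hf_token(env=None):
    """Return the Hugging Face token, or raise a clear error if it is missing.

    `require_hf_token` is a hypothetical helper used for illustration only.
    """
    # Allow a custom mapping to be passed in so the check is easy to test.
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; export it before launching to access gated models."
        )
    return token
```

Calling a check like this before any model is loaded fails fast with a readable message, rather than partway through a download.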

Features

1. Multimodal Input Support

Text Input: Process text-based queries with DeepHermes Llama for natural language understanding.
Image Input: Analyze and describe images using Qwen2-VL.
Video Input: Process videos by extracting key frames and summarizing content.

2. Advanced Text-to-Speech (TTS)

Orpheus TTS: Generate realistic speech with customizable voices (tara, dan, emma, josh).
Emotion Support: Add emotions like <laugh>, <sigh>, <gasp>, etc., to make the speech more expressive.
Direct TTS: Convert text to speech directly using @<voice>-tts (e.g., @tara-tts).
LLM-Augmented TTS: Generate a response using DeepHermes Llama and then convert it to speech using @<voice>-llm (e.g., @tara-llm).

3. Video Processing

Use the @video-infer command to analyze and summarize video content. The system extracts key frames and processes them with Qwen2-VL.
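
How key frames are selected is not described here. As a sketch only, assuming frames are sampled at evenly spaced positions and capped at the 10-frame limit listed under Notes, the frame indices could be computed as follows. `key_frame_indices` is a hypothetical helper, not a function from `app.py`.

```python
def key_frame_indices(total_frames, max_frames=10):
    """Pick up to `max_frames` evenly spaced frame indices (assumed strategy)."""
    if total_frames <= 0 or max_frames <= 0:
        return []
    count = min(total_frames, max_frames)
    if count == 1:
        return [0]
    # Integer arithmetic keeps the indices strictly increasing and ensures
    # the first and last frames of the video are always included.
    return [(i * (total_frames - 1)) // (count - 1) for i in range(count)]


print(key_frame_indices(100))  # [0, 11, 22, 33, 44, 55, 66, 77, 88, 99]
```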

4. Customizable Parameters

Adjust generation parameters like temperature, top-p, top-k, and repetition penalty to fine-tune responses.

Usage

Commands

Direct TTS:
- Use @<voice>-tts to directly convert text to speech.
- Example: @tara-tts Hey, I’m Tara, <laugh> and I’m a speech generation model!
LLM-Augmented TTS:
- Use @<voice>-llm to generate a response with DeepHermes Llama and then convert it to speech.
- Example: @tara-llm Explain the causes of rainbows.
Video Processing:
- Use @video-infer to analyze and summarize video content.
- Example: @video-infer Summarize the event in this video.
Regular Chat:
- Input text or upload images/videos for multimodal processing.
- Example: Write a Python program for array rotation.
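
The command prefixes above suggest a simple routing step. The following is a sketch of one way to do it, under the assumption that a command must appear at the start of the message. `parse_command`, `VOICES`, and the returned mode names are hypothetical and not taken from `app.py`.

```python
import re

# Voices listed in this README.
VOICES = ("tara", "dan", "emma", "josh")

# Matches "@<voice>-tts ..." or "@<voice>-llm ..." at the start of a message.
_VOICE_CMD = re.compile(
    r"^@(?P<voice>[a-z]+)-(?P<mode>tts|llm)\b\s*(?P<text>.*)$", re.DOTALL
)


def parse_command(message):
    """Split a chat message into (mode, voice, text).

    mode is "tts", "llm", "video", or "chat"; voice is None unless a
    voice command with a known voice was used.
    """
    stripped = message.strip()
    if stripped.lower().startswith("@video-infer"):
        return "video", None, stripped[len("@video-infer"):].strip()
    match = _VOICE_CMD.match(stripped)
    if match and match.group("voice") in VOICES:
        return match.group("mode"), match.group("voice"), match.group("text").strip()
    # Anything else, including unknown voices, is treated as regular chat.
    return "chat", None, stripped
```

With this sketch, `parse_command("@dan-llm Explain the causes of rainbows.")` yields the mode `"llm"`, the voice `"dan"`, and the remaining prompt text.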

Examples

Text-to-Speech (TTS)

@josh-tts Hey! I’m Josh, <gasp> and wow, did I just surprise you with my realistic voice?
@emma-tts Hey, I’m Emma, <sigh> and yes, I can talk just like a person… even when I’m tired.

LLM-Augmented TTS

@dan-llm Explain the theory of General Relativity in short.
@tara-llm Who was Nikola Tesla, and how did he die?

Video Processing

@video-infer Summarize the event in this video.
@video-infer Describe the video.

Multimodal Input

Summarize the letter (with an uploaded image).
Explain the causes of rainbows (with an uploaded video).

Setup

Install Dependencies: Ensure you have the required Python packages installed:
```
pip install torch gradio transformers huggingface-hub snac python-dotenv
```
Environment Variables:
- Set MAX_INPUT_TOKEN_LENGTH in .env to control the maximum input token length for the LLM.
Run the Application:
```
python app.py
```
Access the Interface:
- The Gradio interface will launch locally. Use the provided examples or input your own queries.
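
As an illustration of the Environment Variables step, the sketch below reads `MAX_INPUT_TOKEN_LENGTH` with a fallback. It assumes the `.env` file has already been loaded into the process environment (for example with `load_dotenv()` from python-dotenv). The helper name `max_input_token_length` and the default of 4096 are assumptions for illustration, not values from `app.py`.

```python
import os


def max_input_token_length(env=None, default=4096):
    """Read MAX_INPUT_TOKEN_LENGTH as a positive integer, else use `default`.

    The default of 4096 is an illustrative assumption.
    """
    env = os.environ if env is None else env
    raw = env.get("MAX_INPUT_TOKEN_LENGTH", "")
    try:
        value = int(raw)
    except ValueError:
        # Missing or non-numeric setting: fall back to the default.
        return default
    return value if value > 0 else default
```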

Models Used

DeepHermes Llama:
- A fine-tuned Llama model for natural language understanding and generation.
- Model ID: prithivMLmods/DeepHermes-3-Llama-3-3B-Preview-abliterated.
Qwen2-VL:
- A multimodal model for image and video processing.
- Model ID: prithivMLmods/Qwen2-VL-OCR2-2B-Instruct.
Orpheus TTS:
- A high-quality text-to-speech model for generating realistic speech.
- Model ID: canopylabs/orpheus-3b-0.1-ft.
SNAC:
- A neural audio codec used for decoding TTS outputs.
- Model ID: hubertsiuzdak/snac_24khz.

Customization

Voices: Choose from tara, dan, emma, or josh for TTS.
Emotions: Add emotions like <laugh>, <sigh>, <gasp>, etc., to make the speech more expressive.
Generation Parameters: Adjust temperature, top-p, top-k, and repetition penalty to fine-tune responses.
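
To make the generation parameters concrete, here is a sketch that validates them and collects them under the keyword names used by the Hugging Face `generate` API (`temperature`, `top_p`, `top_k`, `repetition_penalty`, `max_new_tokens`, `do_sample`). The helper `build_generation_kwargs` and every default value are assumptions for illustration, not the settings used in `app.py`.

```python
def build_generation_kwargs(temperature=0.6, top_p=0.9, top_k=50,
                            repetition_penalty=1.2, max_new_tokens=1024):
    """Validate sampling settings and return keyword arguments for generate().

    All default values here are illustrative assumptions.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in the range (0, 1]")
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    if repetition_penalty <= 0:
        raise ValueError("repetition_penalty must be positive")
    return {
        "do_sample": True,  # sampling must be on for temperature/top-p/top-k to apply
        "temperature": float(temperature),
        "top_p": float(top_p),
        "top_k": int(top_k),
        "repetition_penalty": float(repetition_penalty),
        "max_new_tokens": int(max_new_tokens),
    }
```

Lower temperature and top-p values make output more deterministic, while a repetition penalty above 1.0 discourages repeated tokens. The resulting dictionary can be unpacked into a call such as `model.generate(**inputs, **kwargs)`.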

Notes

Hardware Requirements: A GPU is recommended for optimal performance, especially for TTS and video processing.
Limitations:
- Video processing is limited to 10 key frames per video.
- TTS generation time increases with the length of the input text.

Acknowledgments

Hugging Face for providing the models and tools.
Gradio for the intuitive interface.
SNAC and Orpheus TTS for high-quality speech synthesis.
