# Orpheus TTS Edge

(Demo video: `Orpheus-TTS-Edge.mp4`)

Orpheus TTS is a Llama-based Speech-LLM designed for high-quality, empathetic text-to-speech generation. The model has been fine-tuned to deliver human-level speech synthesis.
> **Warning:** Don't forget to add the `HF_TOKEN` to the environment to access the gated Hugging Face models.
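For example, the token can be exported in the shell before launching the app (the value below is a placeholder, not a real token):

```shell
# Export the Hugging Face token so gated models can be downloaded.
# Replace the placeholder with your own token from huggingface.co/settings/tokens.
export HF_TOKEN="hf_your_token_here"
```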
## Features

- Text Input: Process text-based queries with DeepHermes Llama for natural language understanding.
- Image Input: Analyze and describe images using Qwen2-VL.
- Video Input: Process videos by extracting key frames and summarizing content.
- Orpheus TTS: Generate realistic speech with customizable voices (`tara`, `dan`, `emma`, `josh`).
- Emotion Support: Add emotion tags like `<laugh>`, `<sigh>`, and `<gasp>` to make the speech more expressive.
- Direct TTS: Convert text to speech directly using `@<voice>-tts` (e.g., `@tara-tts`).
- LLM-Augmented TTS: Generate a response using DeepHermes Llama and then convert it to speech using `@<voice>-llm` (e.g., `@tara-llm`).
- Video Inference: Use the `@video-infer` command to analyze and summarize video content. The system extracts key frames and processes them with Qwen2-VL.
- Generation Parameters: Adjust `temperature`, `top-p`, `top-k`, and `repetition penalty` to fine-tune responses.
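The command prefixes above can be illustrated with a small routing sketch. This is a hypothetical helper (`route_command` is not part of the app's code) showing how an input string might be dispatched to the different modes:

```python
import re

# Voices supported by Orpheus TTS, per the feature list above.
VOICES = {"tara", "dan", "emma", "josh"}

def route_command(text: str):
    """Dispatch an input string to a processing mode (illustrative only)."""
    match = re.match(r"@(\w+)-(tts|llm)\s+(.*)", text, re.DOTALL)
    if match and match.group(1) in VOICES:
        mode = "direct-tts" if match.group(2) == "tts" else "llm-tts"
        return mode, match.group(1), match.group(3)
    if text.startswith("@video-infer"):
        return "video", None, text[len("@video-infer"):].strip()
    return "chat", None, text

print(route_command("@tara-tts Hey, I'm Tara!"))
print(route_command("@video-infer Describe the video."))
```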
## How to Use

- **Direct TTS:**
  - Use `@<voice>-tts` to convert text directly to speech.
  - Example: `@tara-tts Hey, I’m Tara, [laugh] and I’m a speech generation model!`
- **LLM-Augmented TTS:**
  - Use `@<voice>-llm` to generate a response with DeepHermes Llama and then convert it to speech.
  - Example: `@tara-llm Explain the causes of rainbows.`
- **Video Processing:**
  - Use `@video-infer` to analyze and summarize video content.
  - Example: `@video-infer Summarize the event in this video.`
- **Regular Chat:**
  - Input text or upload images/videos for multimodal processing.
  - Example: `Write a Python program for array rotation.`
## Example Prompts

- `@josh-tts Hey! I’m Josh, [gasp] and wow, did I just surprise you with my realistic voice?`
- `@emma-tts Hey, I’m Emma, [sigh] and yes, I can talk just like a person… even when I’m tired.`
- `@dan-llm Explain the General Relativity theorem in short.`
- `@tara-llm Who is Nikola Tesla, and why did he die?`
- `@video-infer Summarize the event in this video.`
- `@video-infer Describe the video.`
- `summarize the letter` (with an uploaded image)
- `Explain the causes of rainbows` (with an uploaded video)
## Installation and Setup

1. **Install Dependencies:** Ensure you have the required Python packages installed (the `dotenv` module is provided by the `python-dotenv` package):

   ```
   pip install torch gradio transformers huggingface-hub snac python-dotenv
   ```

2. **Environment Variables:**
   - Set `MAX_INPUT_TOKEN_LENGTH` in `.env` to control the maximum input token length for the LLM.
   - Set `HF_TOKEN` to access the gated Hugging Face models.

3. **Run the Application:**

   ```
   python app.py
   ```

4. **Access the Interface:** The Gradio interface will launch locally. Use the provided examples or input your own queries.
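A minimal `.env` for the setup steps above might look like the following (the token and the `4096` limit are illustrative values, not the app's actual defaults):

```ini
HF_TOKEN=hf_your_token_here
MAX_INPUT_TOKEN_LENGTH=4096
```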
## Models

- **DeepHermes Llama:** A fine-tuned Llama model for natural language understanding and generation.
  - Model ID: `prithivMLmods/DeepHermes-3-Llama-3-3B-Preview-abliterated`
- **Qwen2-VL:** A multimodal model for image and video processing.
  - Model ID: `prithivMLmods/Qwen2-VL-OCR2-2B-Instruct`
- **Orpheus TTS:** A high-quality text-to-speech model for generating realistic speech.
  - Model ID: `canopylabs/orpheus-3b-0.1-ft`
- **SNAC:** A neural audio codec used for decoding TTS outputs.
  - Model ID: `hubertsiuzdak/snac_24khz`
## Customization

- Voices: Choose from `tara`, `dan`, `emma`, or `josh` for TTS.
- Emotions: Add emotion tags like `<laugh>`, `<sigh>`, and `<gasp>` to make the speech more expressive.
- Generation Parameters: Adjust `temperature`, `top-p`, `top-k`, and `repetition penalty` to fine-tune responses.
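As a sketch of what adjusting these parameters looks like in code (the values below are illustrative, not the app's defaults; with a Hugging Face `transformers` model they would be passed to `model.generate`):

```python
# Illustrative sampling settings; tune per use case.
generation_config = {
    "temperature": 0.7,         # higher values -> more varied output
    "top_p": 0.9,               # nucleus sampling: keep smallest token set with 90% mass
    "top_k": 50,                # sample only from the 50 most likely tokens
    "repetition_penalty": 1.1,  # > 1.0 discourages repeated tokens
    "do_sample": True,          # sampling must be enabled for the above to apply
}

# With transformers this would typically be used as:
#   output_ids = model.generate(**inputs, **generation_config)
print(sorted(generation_config))
```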
## Notes

- Hardware Requirements: A GPU is recommended for optimal performance, especially for TTS and video processing.
- Limitations:
  - Video processing is limited to 10 key frames per video.
  - TTS generation may take longer for longer texts.
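The 10-key-frame limit amounts to sampling a fixed number of frames across the whole video. A minimal sketch, assuming evenly spaced sampling (`key_frame_indices` is a hypothetical helper, not the app's actual function):

```python
import numpy as np

def key_frame_indices(total_frames: int, num_frames: int = 10) -> list:
    """Pick up to `num_frames` evenly spaced frame indices (illustrative)."""
    if total_frames <= num_frames:
        return list(range(total_frames))
    return np.linspace(0, total_frames - 1, num_frames, dtype=int).tolist()

print(key_frame_indices(300))  # 10 evenly spaced indices from frame 0 to frame 299
```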
## Acknowledgements

- Hugging Face for providing the models and tools.
- Gradio for the intuitive interface.
- SNAC and Orpheus TTS for high-quality speech synthesis.