A simple POC of a fast real-time voice chat application using FastAPI and FastRTC, by rohanprichard. I wanted to make one as an example built with a more production-ready stack, rather than just Gradio.
- Set your OpenAI and ElevenLabs API keys in a `.env` file based on the `.env.example` file (a sample sketch follows this list).
- Create a virtual environment and install the dependencies:

  ```bash
  python3 -m venv env
  source env/bin/activate
  pip install -r requirements.txt
  ```

  For Windows:

  ```bash
  python -m venv env
  .\env\Scripts\activate
  pip install -r requirements.txt
  ```
- Run the server:

  ```bash
  ./run.sh
  ```

  Windows:

  ```bash
  uvicorn backend.server:app --host 0.0.0.0 --port 8000
  ```
- Navigate into the frontend directory:

  ```bash
  cd frontend/fastrtc-demo
  ```
- Run the frontend:

  ```bash
  npm install
  npm run dev
  ```
- Click the microphone icon and start chatting!
- Reset chats by clicking the trash button on the bottom right.
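As mentioned in the first step, the backend reads its provider credentials from `.env`. A minimal sketch of that file, assuming standard variable names (check `.env.example` for the exact names the code expects):

```
# Hypothetical key names; confirm against .env.example
OPENAI_API_KEY=sk-...
ELEVENLABS_API_KEY=...
```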
- The STT is currently using the ElevenLabs API.
- The LLM is currently using the OpenAI API.
- The TTS is currently using the ElevenLabs API.
- The VAD is currently using the Silero VAD model.
- You may need to install ffmpeg if you get errors during STT.
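Putting those pieces together: each utterance detected by the Silero VAD is transcribed, sent to the LLM, and spoken back via TTS. Below is a minimal sketch of the shape of such a handler; `transcribe`, `generate_reply`, and `synthesize` are hypothetical placeholders, not this repo's actual functions.

```python
import numpy as np

def transcribe(sample_rate: int, frames: np.ndarray) -> str:
    """Placeholder for the ElevenLabs STT call."""
    return "hello there"

def generate_reply(text: str) -> str:
    """Placeholder for the OpenAI chat completion call."""
    return f"You said: {text}"

def synthesize(text: str, sample_rate: int = 24000) -> tuple[int, np.ndarray]:
    """Placeholder for the ElevenLabs TTS call; returns a short beep instead of speech."""
    t = np.linspace(0, 0.5, int(sample_rate * 0.5), endpoint=False)
    tone = (0.2 * np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)
    return sample_rate, tone

def voice_handler(audio: tuple[int, np.ndarray]):
    """ReplyOnPause-style handler: STT -> LLM -> TTS, yielding audio back to the caller."""
    sample_rate, frames = audio
    user_text = transcribe(sample_rate, frames)
    reply_text = generate_reply(user_text)
    yield synthesize(reply_text)
```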
The prompt lives in the `backend/server.py` file and can be modified as you like.
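For example, a sketch of what that might look like; the `SYSTEM_PROMPT` name, the model choice, and the helper function are assumptions rather than the repo's exact code, but this is the kind of call the `generate_reply` placeholder above would wrap:

```python
# Hypothetical excerpt in the spirit of backend/server.py
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

# Edit this string to change the assistant's behavior.
SYSTEM_PROMPT = (
    "You are a friendly voice assistant. Keep replies short and conversational, "
    "since they will be spoken aloud."
)

def generate_reply(user_text: str) -> str:
    """Send the transcribed user speech to the LLM and return the reply text."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any chat-capable model works
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_text},
        ],
    )
    return response.choices[0].message.content
```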
The following parameters control voice activity detection and turn-taking:

- `audio_chunk_duration`: Length of audio chunks in seconds. Smaller values allow for faster processing but may be less accurate.
- `started_talking_threshold`: If a chunk has more than this many seconds of speech, the system considers that the user has started talking.
- `speech_threshold`: After the user has started speaking, if a chunk has less than this many seconds of speech, the system considers that the user has paused.
- `threshold`: Speech probability threshold (0.0-1.0). Values above this are considered speech. Higher values are more strict.
- `min_speech_duration_ms`: Speech segments shorter than this (in milliseconds) are filtered out.
- `min_silence_duration_ms`: The system waits for this duration of silence (in milliseconds) before considering speech to be finished.
- `speech_pad_ms`: Padding added to both ends of detected speech segments to prevent cutting off words.
- `max_speech_duration_s`: Maximum allowed duration for a speech segment, in seconds. Prevents indefinite listening.
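In FastRTC, the first three of these are fields on `AlgoOptions` and the rest on `SileroVadOptions`, both of which are passed to `ReplyOnPause`. A minimal sketch of the wiring, with illustrative values and a trivial echo handler standing in for the real voice pipeline:

```python
from fastrtc import AlgoOptions, ReplyOnPause, SileroVadOptions, Stream

def echo_handler(audio):
    # Trivial stand-in for the real STT -> LLM -> TTS pipeline: echo the speech back.
    yield audio

stream = Stream(
    handler=ReplyOnPause(
        echo_handler,
        algo_options=AlgoOptions(
            audio_chunk_duration=0.6,       # seconds of audio per analysed chunk
            started_talking_threshold=0.2,  # this much speech in a chunk => user started talking
            speech_threshold=0.1,           # less speech than this in a chunk => user paused
        ),
        model_options=SileroVadOptions(
            threshold=0.5,                  # speech probability cutoff (0.0-1.0)
            min_speech_duration_ms=250,     # drop detected segments shorter than this
            min_silence_duration_ms=2000,   # wait this long in silence before ending a turn
            speech_pad_ms=400,              # padding around detected speech segments
            max_speech_duration_s=30,       # hard cap on a single speech segment
        ),
    ),
    modality="audio",
    mode="send-receive",
)

# In the FastAPI app, the stream's endpoints are typically exposed with:
# stream.mount(app)
```

The tuning advice below then amounts to nudging these numbers in one direction or the other.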
- If the AI interrupts you too early:
  - Increase `min_silence_duration_ms`
  - Increase `speech_threshold`
  - Increase `speech_pad_ms`
- If the AI is slow to respond after you finish speaking:
  - Decrease `min_silence_duration_ms`
  - Decrease `speech_threshold`
- If the system fails to detect some speech:
  - Lower the `threshold` value
  - Decrease `started_talking_threshold`
Credit for the UI components goes to Shadcn, Aceternity UI and Kokonut UI.