Voice mode, recreated.
The Inspiration
ChatGPT's voice mode changed how I interact with AI. Conversational, natural, no typing required. I wanted to understand how it works by building my own.
What I Built
A voice interface for AI conversations:
- Real-time speech-to-text
- LLM response generation
- Text-to-speech output
- Continuous conversation flow
Stack: Web Speech API, OpenAI API, ElevenLabs
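At its core it's a loop: listen, think, speak, repeat. A minimal sketch of the orchestration, where `transcribe`, `complete`, and `speak` are hypothetical wrappers around the three services:

```javascript
// Sketch of the conversation loop. transcribe(), complete(), and speak()
// are hypothetical wrappers around the Web Speech API, OpenAI, and
// ElevenLabs respectively -- see the sketches further down.
const history = [];

async function conversationTurn() {
  const userText = await transcribe();                // 1. speech-to-text
  history.push({ role: "user", content: userText });

  const reply = await complete(history);              // 2. LLM response
  history.push({ role: "assistant", content: reply });

  await speak(reply);                                 // 3. text-to-speech
}

// Continuous conversation flow: start listening again as soon as
// playback ends.
async function run() {
  while (true) {
    await conversationTurn();
  }
}
```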
Key Learnings
1. Latency kills conversation
Humans expect a reply to start within roughly 300 ms; anything longer feels laggy. Had to optimize every step of the pipeline.
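The single biggest win was streaming: the OpenAI API can return tokens as they're generated, so audio synthesis can start before the full reply exists. A sketch assuming the official `openai` Node SDK (v4-style API); in the browser you'd proxy this through a server so the key stays secret:

```javascript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Stream tokens as they arrive instead of waiting for the full reply.
// onText is a callback that forwards text onward to TTS.
async function streamReply(messages, onText) {
  const stream = await openai.chat.completions.create({
    model: "gpt-4o-mini", // any chat model works here
    messages,
    stream: true,
  });
  for await (const chunk of stream) {
    const delta = chunk.choices[0]?.delta?.content ?? "";
    if (delta) onText(delta);
  }
}
```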
2. Wake words are hard
"Hey Siri" style activation is surprisingly complex. False positives are annoying, false negatives are frustrating. Settled on push-to-talk.
3. Voice has different needs than text
Spoken responses should be shorter and more conversational. A long text paragraph becomes unbearable when read aloud.
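In practice this is just a different system prompt. Something like the following; the exact wording is illustrative:

```javascript
// Voice-specific system prompt: constrain length and register so the
// reply sounds natural when read aloud. Wording is illustrative.
const messages = [
  {
    role: "system",
    content:
      "You are a voice assistant. Answer in one to three short, " +
      "conversational sentences. No lists, no markdown, no headings.",
  },
  { role: "user", content: "Explain how a microwave works." },
];
```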
4. Web Speech API is good enough
Browser-native speech recognition is surprisingly accurate for most use cases. Don't need expensive APIs for basic voice input.
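Getting a transcript takes about a dozen lines, no API key required. A minimal sketch (Chrome prefixes the constructor as `webkitSpeechRecognition`):

```javascript
const SpeechRecognition =
  window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
recognition.lang = "en-US";
recognition.interimResults = true; // partial transcripts while speaking

recognition.onresult = (event) => {
  const result = event.results[event.results.length - 1];
  console.log(result.isFinal ? "final:" : "partial:", result[0].transcript);
};

recognition.onerror = (event) =>
  console.error("recognition error:", event.error);

recognition.start();
```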
Agent Quick Start
# Voice Mode Replica
Voice interface for AI conversations.
## Pipeline
1. Speech-to-text (Web Speech API)
2. LLM processing (OpenAI)
3. Text-to-speech (ElevenLabs)
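Step 3 is a single HTTP call. A sketch of the ElevenLabs request (endpoint shape from their public REST API; the voice ID and key are placeholders, and in a real deployment the key belongs behind a server):

```javascript
const VOICE_ID = "your-voice-id";          // placeholder
const ELEVENLABS_API_KEY = "your-api-key"; // placeholder -- keep server-side

// Synthesize text and play it; resolves when playback finishes so the
// loop can start listening again.
async function speak(text) {
  const res = await fetch(
    `https://api.elevenlabs.io/v1/text-to-speech/${VOICE_ID}`,
    {
      method: "POST",
      headers: {
        "xi-api-key": ELEVENLABS_API_KEY,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ text }),
    }
  );
  const audio = new Audio(URL.createObjectURL(await res.blob()));
  await audio.play();
  await new Promise((resolve) => (audio.onended = resolve));
}
```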
## Stack
JavaScript, Web Speech API, OpenAI, ElevenLabs
## Latency Optimization
- Streaming responses
- Audio chunking
- Parallel processing
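These three compose: stream tokens from the LLM, cut the stream at sentence boundaries, and synthesize each chunk while the previous one is still playing. A sketch of the chunking and in-order playback; `synthesize` is a hypothetical function returning a promise of an audio blob, and the sentence split is deliberately naive:

```javascript
// Flush a chunk to TTS at each sentence boundary, so synthesis of
// sentence N overlaps playback of sentence N-1.
function makeChunker(synthesize, enqueue) {
  let buffer = "";
  return (delta) => {
    buffer += delta;
    let match;
    while ((match = buffer.match(/^(.*?[.!?])\s/s))) {
      buffer = buffer.slice(match[0].length);
      enqueue(synthesize(match[1])); // fire off TTS without awaiting
    }
  };
}

// Play queued audio strictly in order while later chunks synthesize
// in parallel.
const queue = [];
let playing = false;

async function enqueue(audioPromise) {
  queue.push(audioPromise);
  if (playing) return;
  playing = true;
  while (queue.length) {
    const blob = await queue.shift();
    const audio = new Audio(URL.createObjectURL(blob));
    await audio.play();
    await new Promise((resolve) => (audio.onended = resolve));
  }
  playing = false;
}
```

Wired to the streaming sketch earlier: `streamReply(messages, makeChunker(synthesize, enqueue))`.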
## Links
- Repo: https://github.com/sergiopesch/VoiceModeReplica