Voice mode, recreated.
The Inspiration
ChatGPT's voice mode changed how I interact with AI. Conversational, natural, no typing required. I wanted to understand how it works by building my own.
What I Built
A voice interface for AI conversations:
- Real-time speech-to-text
- LLM response generation
- Text-to-speech output
- Continuous conversation flow
Stack: Web Speech API, OpenAI API, ElevenLabs
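At its core it's a loop: listen, think, speak, repeat. A minimal sketch of the orchestration, where `transcribe`, `complete`, and `speak` are hypothetical wrappers around the three services:

```javascript
// Sketch of the conversation loop. transcribe(), complete(), and speak()
// are hypothetical wrappers around the Web Speech API, OpenAI, and
// ElevenLabs respectively -- see the sketches further down.
const history = [];

async function conversationTurn() {
  const userText = await transcribe();                // 1. speech-to-text
  history.push({ role: "user", content: userText });

  const reply = await complete(history);              // 2. LLM response
  history.push({ role: "assistant", content: reply });

  await speak(reply);                                 // 3. text-to-speech
}

// Continuous conversation flow: start listening again as soon as
// playback ends.
async function run() {
  while (true) {
    await conversationTurn();
  }
}
```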
Key Learnings
1. Latency kills conversation
Humans expect a reply to start within roughly 300 ms; anything longer feels laggy. Had to optimize every step of the pipeline.
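The single biggest win was streaming: the OpenAI API can return tokens as they're generated, so audio synthesis can start before the full reply exists. A sketch assuming the official `openai` Node SDK (v4-style API); in the browser you'd proxy this through a server so the key stays secret:

```javascript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Stream tokens as they arrive instead of waiting for the full reply.
// onText is a callback that forwards text onward to TTS.
async function streamReply(messages, onText) {
  const stream = await openai.chat.completions.create({
    model: "gpt-4o-mini", // any chat model works here
    messages,
    stream: true,
  });
  for await (const chunk of stream) {
    const delta = chunk.choices[0]?.delta?.content ?? "";
    if (delta) onText(delta);
  }
}
```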
2. Wake words are hard
"Hey Siri" style activation is surprisingly complex. False positives are annoying, false negatives are frustrating. Settled on push-to-talk.
3. Voice has different needs than text
Spoken responses should be shorter and more conversational. A long text paragraph becomes unbearable when read aloud.
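In practice this is just a different system prompt. Something like the following; the exact wording is illustrative:

```javascript
// Voice-specific system prompt: constrain length and register so the
// reply sounds natural when read aloud. Wording is illustrative.
const messages = [
  {
    role: "system",
    content:
      "You are a voice assistant. Answer in one to three short, " +
      "conversational sentences. No lists, no markdown, no headings.",
  },
  { role: "user", content: "Explain how a microwave works." },
];
```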
4. Web Speech API is good enough
Browser-native speech recognition is surprisingly accurate for most use cases. Don't need expensive APIs for basic voice input.
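Getting a transcript takes about a dozen lines, no API key required. A minimal sketch (Chrome prefixes the constructor as `webkitSpeechRecognition`):

```javascript
const SpeechRecognition =
  window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
recognition.lang = "en-US";
recognition.interimResults = true; // partial transcripts while speaking

recognition.onresult = (event) => {
  const result = event.results[event.results.length - 1];
  console.log(result.isFinal ? "final:" : "partial:", result[0].transcript);
};

recognition.onerror = (event) =>
  console.error("recognition error:", event.error);

recognition.start();
```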
Agent Quick Start
# Voice Mode Replica
Voice interface for AI conversations.
## Pipeline
1. Speech-to-text (Web Speech API)
2. LLM processing (OpenAI)
3. Text-to-speech (ElevenLabs)
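Step 3 is a single HTTP call. A sketch of the ElevenLabs request (endpoint shape from their public REST API; the voice ID and key are placeholders, and in a real deployment the key belongs behind a server):

```javascript
const VOICE_ID = "your-voice-id";          // placeholder
const ELEVENLABS_API_KEY = "your-api-key"; // placeholder -- keep server-side

// Synthesize text and play it; resolves when playback finishes so the
// loop can start listening again.
async function speak(text) {
  const res = await fetch(
    `https://api.elevenlabs.io/v1/text-to-speech/${VOICE_ID}`,
    {
      method: "POST",
      headers: {
        "xi-api-key": ELEVENLABS_API_KEY,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ text }),
    }
  );
  const audio = new Audio(URL.createObjectURL(await res.blob()));
  await audio.play();
  await new Promise((resolve) => (audio.onended = resolve));
}
```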
## Stack
JavaScript, Web Speech API, OpenAI, ElevenLabs
## Latency Optimization
- Streaming responses
- Audio chunking
- Parallel processing
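These three compose: stream tokens from the LLM, cut the stream at sentence boundaries, and synthesize each chunk while the previous one is still playing. A sketch of the chunking and in-order playback; `synthesize` is a hypothetical function returning a promise of an audio blob, and the sentence split is deliberately naive:

```javascript
// Flush a chunk to TTS at each sentence boundary, so synthesis of
// sentence N overlaps playback of sentence N-1.
function makeChunker(synthesize, enqueue) {
  let buffer = "";
  return (delta) => {
    buffer += delta;
    let match;
    while ((match = buffer.match(/^(.*?[.!?])\s/s))) {
      buffer = buffer.slice(match[0].length);
      enqueue(synthesize(match[1])); // fire off TTS without awaiting
    }
  };
}

// Play queued audio strictly in order while later chunks synthesize
// in parallel.
const queue = [];
let playing = false;

async function enqueue(audioPromise) {
  queue.push(audioPromise);
  if (playing) return;
  playing = true;
  while (queue.length) {
    const blob = await queue.shift();
    const audio = new Audio(URL.createObjectURL(blob));
    await audio.play();
    await new Promise((resolve) => (audio.onended = resolve));
  }
  playing = false;
}
```

Wired to the streaming sketch earlier: `streamReply(messages, makeChunker(synthesize, enqueue))`.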
## Links
- Repo: https://github.com/sergiopesch/VoiceModeReplica