Who said what. Solved.
The Problem
Transcription tells you what was said. Diarization tells you who said it. For meeting notes, interviews, and podcasts, knowing the speaker is essential.
What I Built
A demo exploring speaker diarization:
- Upload audio with multiple speakers
- AI identifies unique voices
- Transcript labeled by speaker
- Export formatted notes
Stack: Python, pyannote.audio, Whisper
Key Learnings
1. Diarization is hard
Speakers interrupting each other, similar voices, and background noise all make it harder to tell who is talking. Accuracy drops on real-world recordings compared with clean audio.
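To make the interruption problem concrete, here is a minimal sketch that flags stretches where two different speakers talk at once. It assumes diarization output has already been reduced to plain `(start, end, speaker)` tuples in seconds; the `Turn` alias and the `find_overlaps` helper are hypothetical names for illustration, not part of the demo or of pyannote.audio.

```python
from typing import List, Tuple

# Assumed shape of one diarization turn: (start_sec, end_sec, speaker_label)
Turn = Tuple[float, float, str]


def find_overlaps(turns: List[Turn]) -> List[Tuple[float, float, str, str]]:
    """Return (start, end, speaker_a, speaker_b) for each pair of turns
    from different speakers that overlap in time."""
    ordered = sorted(turns)  # sort by start time
    overlaps = []
    for i, (a_start, a_end, a_spk) in enumerate(ordered):
        for b_start, b_end, b_spk in ordered[i + 1:]:
            if b_start >= a_end:
                break  # sorted by start, so no later turn can overlap turn a
            if a_spk != b_spk:
                overlaps.append((b_start, min(a_end, b_end), a_spk, b_spk))
    return overlaps


turns = [(0.0, 5.0, "A"), (4.0, 7.0, "B"), (7.0, 9.0, "A")]
print(find_overlaps(turns))
```

Regions flagged this way are where speaker labels are least trustworthy, so they are a reasonable place to start when reviewing a transcript by hand.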
2. pyannote.audio is impressive
Open-source diarization that actually works. Not perfect, but far better than building from scratch.
3. Combining it with transcription is powerful
Whisper for text + pyannote for speakers = structured meeting notes. The combination is more valuable than either alone.
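One common way to join the two outputs is to give each transcript segment the speaker whose turn overlaps it the most. The sketch below assumes transcript segments are dictionaries with `start`, `end`, and `text` keys (the shape Whisper's segment output is expected to have) and that speaker turns are `(start, end, speaker)` tuples. The `assign_speakers` helper and the `"UNKNOWN"` fallback label are hypothetical and may differ from what the demo does.

```python
from typing import Dict, List, Tuple


def assign_speakers(
    segments: List[Dict],
    turns: List[Tuple[float, float, str]],
) -> List[Dict]:
    """Label each transcript segment with the speaker whose turn
    overlaps it for the longest time."""
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for t_start, t_end, speaker in turns:
            # Overlap length in seconds; negative means no overlap at all.
            overlap = min(seg["end"], t_end) - max(seg["start"], t_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append({**seg, "speaker": best_speaker})
    return labeled


segments = [
    {"start": 0.0, "end": 3.0, "text": "Hi, thanks for joining."},
    {"start": 3.0, "end": 6.0, "text": "Happy to be here."},
]
turns = [(0.0, 3.5, "SPEAKER_00"), (3.5, 6.0, "SPEAKER_01")]
for seg in assign_speakers(segments, turns):
    print(seg["speaker"], seg["text"])
```

Segment-level assignment is the simplest option. When a single segment spans a speaker change, the same overlap rule can be applied to word-level timestamps instead, if they are available.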
4. Edge cases everywhere
One person quoting another person? Accents changing mid-sentence? Laughter? Real audio is messy.
Agent Quick Start
# Diarization Demo
Speaker identification + transcription.
## Pipeline
1. Audio input (any format)
2. Voice activity detection
3. Speaker embeddings
4. Clustering into speakers
5. Combine with Whisper transcript
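Steps 3 and 4 can be illustrated with a toy example. This is a deliberately simple greedy clustering over cosine similarity, not the algorithm pyannote.audio uses; the `cluster_embeddings` helper, the `threshold` value, and the 2-dimensional vectors are all assumptions made for illustration (real speaker embeddings have hundreds of dimensions).

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_embeddings(
    embeddings: Sequence[Sequence[float]], threshold: float = 0.8
) -> List[int]:
    """Assign each embedding to the most similar existing cluster, or
    start a new cluster when nothing is similar enough."""
    centroids: List[List[float]] = []  # running mean per cluster
    counts: List[int] = []
    labels: List[int] = []
    for emb in embeddings:
        sims = [cosine(emb, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= threshold:
            n = counts[best]
            # Update the running mean of the matched cluster.
            centroids[best] = [(c * n + x) / (n + 1) for c, x in zip(centroids[best], emb)]
            counts[best] = n + 1
            labels.append(best)
        else:
            centroids.append(list(emb))
            counts.append(1)
            labels.append(len(centroids) - 1)
    return labels


# Toy embeddings: two "voices" pointing in roughly orthogonal directions.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.05]]
print(cluster_embeddings(embeddings))
```

The threshold is the knob that matters: set it too high and one speaker splits into several clusters, too low and similar voices merge into one.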
## Stack
Python, pyannote.audio, Whisper
## Output
Speaker-labeled transcript with timestamps
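As a rough idea of what that output could look like, the sketch below renders labeled segments as one line per utterance. The `[HH:MM:SS] SPEAKER: text` layout and the `format_timestamp` and `format_transcript` helpers are hypothetical; the demo's actual export format may differ.

```python
from typing import Dict, List


def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as HH:MM:SS (fractions truncated)."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_transcript(segments: List[Dict]) -> str:
    """Turn segments with 'start', 'speaker', and 'text' keys into
    one '[HH:MM:SS] SPEAKER: text' line each."""
    lines = [
        f"[{format_timestamp(seg['start'])}] {seg['speaker']}: {seg['text'].strip()}"
        for seg in segments
    ]
    return "\n".join(lines)


segments = [
    {"start": 0.0, "speaker": "SPEAKER_00", "text": " Welcome, everyone."},
    {"start": 65.4, "speaker": "SPEAKER_01", "text": " Thanks for having me."},
]
print(format_transcript(segments))
```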
## Links
- Repo: https://github.com/sergiopesch/diarization-demo