
Diarization Demo

Started: February 07, 2025

Who said what. Solved.


The Problem

Transcription tells you what was said. Diarization tells you who said it. For meeting notes, interviews, and podcasts, knowing the speaker is essential.

What I Built

A demo exploring speaker diarization:

  • Upload audio with multiple speakers
  • AI identifies unique voices
  • Transcript labeled by speaker
  • Export formatted notes

Stack: Python, pyannote.audio, Whisper

Key Learnings

1. Diarization is hard

Interruptions, similar-sounding voices, and background noise all make it harder to tell speakers apart. Accuracy drops in real-world conditions.
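To make the interruption case concrete, here is a minimal sketch that flags stretches where two different speakers talk at once. It assumes the diarization output has already been flattened into plain `(start, end, speaker)` tuples; the `find_overlaps` name and the tuple layout are illustrative, not part of the demo or of pyannote.audio.

```python
from typing import List, Tuple

# Hypothetical flattened diarization output: (start_seconds, end_seconds, speaker_label)
Turn = Tuple[float, float, str]


def find_overlaps(turns: List[Turn]) -> List[Tuple[float, float, str, str]]:
    """Return (start, end, speaker_a, speaker_b) for each pair of turns
    from different speakers that overlap in time."""
    overlaps = []
    ordered = sorted(turns)  # sort by start time
    for i, (start_a, end_a, spk_a) in enumerate(ordered):
        for start_b, end_b, spk_b in ordered[i + 1:]:
            if start_b >= end_a:
                break  # sorted by start, so no later turn can overlap this one
            if spk_a != spk_b:
                overlaps.append((start_b, min(end_a, end_b), spk_a, spk_b))
    return overlaps


turns = [
    (0.0, 4.0, "SPEAKER_00"),
    (3.2, 6.5, "SPEAKER_01"),  # starts before SPEAKER_00 has finished
    (6.5, 9.0, "SPEAKER_00"),
]
print(find_overlaps(turns))
```

Regions like the one reported here are where a single-label-per-segment transcript is most likely to be wrong, so they are worth surfacing for manual review.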

2. pyannote.audio is impressive

Open-source diarization that actually works. Not perfect, but far better than building from scratch.

3. Combining it with transcription is powerful

Whisper for text + pyannote for speakers = structured meeting notes. The combination is more valuable than either alone.
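One way to combine the two outputs is to give each transcript segment the speaker whose turn overlaps it the most. The sketch below assumes both results have been reduced to plain tuples first; `assign_speakers`, the tuple layouts, and the `UNKNOWN` fallback are illustrative choices, not necessarily how the demo does it.

```python
from typing import List, Tuple

# Assumed inputs, already converted to plain tuples:
#   segments: (start, end, text)    e.g. taken from Whisper's timestamped segments
#   turns:    (start, end, speaker) e.g. taken from a pyannote diarization result
Segment = Tuple[float, float, str]
Turn = Tuple[float, float, str]


def assign_speakers(
    segments: List[Segment], turns: List[Turn], unknown: str = "UNKNOWN"
) -> List[Tuple[float, float, str, str]]:
    """Label each transcript segment with the speaker whose turn overlaps it most."""
    labeled = []
    for seg_start, seg_end, text in segments:
        best_speaker, best_overlap = unknown, 0.0
        for turn_start, turn_end, speaker in turns:
            # Length of the time shared by segment and turn (negative if disjoint)
            overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append((seg_start, seg_end, best_speaker, text.strip()))
    return labeled


segments = [
    (0.0, 3.5, " Shall we start?"),
    (3.5, 6.0, " Yes, go ahead."),
    (20.0, 21.0, " Hmm."),  # no diarization turn covers this one
]
turns = [(0.0, 3.4, "SPEAKER_00"), (3.4, 6.2, "SPEAKER_01")]

for row in assign_speakers(segments, turns):
    print(row)
```

Maximum overlap is a simple heuristic: a segment that spans a speaker change still gets a single label, which is one reason interruptions hurt the final transcript.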

4. Edge cases everywhere

One person quoting another person? Accents changing mid-sentence? Laughter? Real audio is messy.


Agent Quick Start

# Diarization Demo

Speaker identification + transcription.

## Pipeline
1. Audio input (any format)
2. Voice activity detection
3. Speaker embeddings
4. Clustering into speakers
5. Combine with Whisper transcript
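
Steps 3 and 4 can be illustrated with a toy example: group per-segment embedding vectors by cosine similarity so that each group stands for one speaker. This is a deliberately simple greedy stand-in written for illustration, not the clustering pyannote.audio performs; the function names, the 2-D vectors, and the `0.8` threshold are all assumptions.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_embeddings(
    embeddings: Sequence[Sequence[float]], threshold: float = 0.8
) -> List[int]:
    """Greedy clustering: join the most similar existing cluster if it is
    similar enough, otherwise start a new cluster. Returns one label per input."""
    centroids: List[List[float]] = []
    counts: List[int] = []
    labels: List[int] = []
    for emb in embeddings:
        scores = [cosine(emb, c) for c in centroids]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] >= threshold:
            n = counts[best]
            # Update the running mean of the matched cluster
            centroids[best] = [(c * n + x) / (n + 1) for c, x in zip(centroids[best], emb)]
            counts[best] = n + 1
            labels.append(best)
        else:
            centroids.append(list(emb))
            counts.append(1)
            labels.append(len(centroids) - 1)
    return labels


# Toy 2-D "embeddings"; real speaker embeddings have hundreds of dimensions.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.05]]
print(cluster_embeddings(embeddings))  # → [0, 0, 1, 1, 0]
```

The number of speakers falls out of the threshold rather than being fixed in advance, which is also why similar voices can collapse into one label.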

## Stack
Python, pyannote.audio, Whisper

## Output
Speaker-labeled transcript with timestamps
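
A sketch of what that output could look like as plain text follows. The `[HH:MM:SS - HH:MM:SS] SPEAKER: text` layout and both function names are assumptions for illustration; the demo's actual export format may differ.

```python
from typing import List, Tuple

# Assumed input rows: (start_seconds, end_seconds, speaker_label, text)
LabeledSegment = Tuple[float, float, str, str]


def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS, dropping fractional seconds."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def format_transcript(rows: List[LabeledSegment]) -> str:
    """Produce one line per segment: [start - end] SPEAKER: text."""
    return "\n".join(
        f"[{format_timestamp(start)} - {format_timestamp(end)}] {speaker}: {text}"
        for start, end, speaker, text in rows
    )


rows = [
    (0.0, 3.5, "SPEAKER_00", "Shall we start?"),
    (3.5, 6.0, "SPEAKER_01", "Yes, go ahead."),
]
print(format_transcript(rows))
```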

## Links
- Repo: https://github.com/sergiopesch/diarization-demo