Multilingual automatic speech recognition (ASR) with speaker segmentation (SS) / speaker diarization (SD) and word-level timestamps (WLT)
```
pip install git+https://github.com/urroxyz/whisper@v0.3.0
```
Latest update: `whisperer`'s `transcript` parameter allows you to align existing transcripts!
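For instance, a minimal sketch of aligning a known transcript (the exact type accepted by `transcript` is an assumption here; the other arguments mirror the examples further down):

```python
from urro_whisper import whisperer

# hypothetical usage: align an existing transcript instead of generating one
result = whisperer(
    model="tiny",
    audio="audio.wav",
    language="en",
    transcript="Down in front. Hey, sit down...",  # assumed: the reference text as a plain string
    verbose=False,
)
# result["words"] should then carry word-level timestamps for the supplied text
```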
```
pip install git+https://github.com/urroxyz/whisper
```

Yes, Whisper can segment speakers and timestamp words! And WHISPER + URRO is here to offer an easy way to do just that.
By modifying the thinking process of the OpenAI model, we can force it to delimit new speakers with symbols like hyphens (`-`) or greater-than signs (`>`), or even with complete labels such as `[SPEAKER 1]` and `[SPEAKER 2]`, to keep track of who is speaking and when.[^1] By extracting cross-attentions and processing them with dynamic time warping, we can reconstruct timestamps at the word level rather than relying on occasional generated time tokens.[^2]
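For intuition, here is a toy sketch of that second step, not the library's internal code: dynamic time warping turns a token-to-frame cross-attention matrix into a monotonic alignment, and the first/last frame on the path for each token gives its start/end time. The `attention` matrix and the 20 ms frame duration are illustrative assumptions.

```python
import numpy as np

def dtw_align(attention: np.ndarray):
    """Find a monotonic (token index, frame index) path that maximizes total attention."""
    cost = -attention  # higher attention = lower cost
    n_tokens, n_frames = cost.shape
    acc = np.full((n_tokens + 1, n_frames + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n_tokens + 1):
        for j in range(1, n_frames + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1],  # advance token and frame
                acc[i - 1, j],      # advance token only
                acc[i, j - 1],      # advance frame only
            )
    # backtrack from the end to recover the path
    i, j, path = n_tokens, n_frames, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# toy example: 3 decoded tokens over 6 encoder frames (real matrices come from
# the decoder's cross-attention heads, averaged and smoothed)
attention = np.array([
    [0.9, 0.8, 0.1, 0.0, 0.0, 0.0],
    [0.0, 0.1, 0.9, 0.7, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.2, 0.8, 0.9],
])
path = dtw_align(attention)
frame_duration = 0.02  # assumed: each encoder frame spans roughly 20 ms of audio
for token_idx in range(attention.shape[0]):
    frames = [f for t, f in path if t == token_idx]
    print(token_idx, frames[0] * frame_duration, (frames[-1] + 1) * frame_duration)
```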
| Size | Parameters | New-speaker segmentation | Speaker diarization | Word-level timestamps |
|---|---|---|---|---|
| `tiny`[^3] `tiny.en`[^4] | 39 M | ✓ | x | ✓ |
| `base`[^5] `base.en`[^6] | 74 M | ✓ | x | ✓ |
| `small`[^7] `small.en`[^8] | 244 M | ✓ | ✓ x | ✓ |
| `medium`[^9] `medium.en`[^10] | 769 M | ✓ | ✓ x | ✓ |
| `large-v3`[^11] | 1550 M | ✓ | ✓ | x |
| `large-v3-turbo`[^12] | 809 M | ✓ | x | ✓ |
| Model | Parameters | New-speaker segmentation | Speaker diarization | Word-level timestamps |
|---|---|---|---|---|
| `whisper-d-v1a`[^13] | 1550 M | ✓ | ✓ | x |
video.mp4
| Source | Transcript |
|---|---|
| Ground truth | [SPEAKER 1] Down in front. [SPEAKER 2] Hey, sit down, that’s wrong of you. [SPEAKER 1] The little lady who is to become Mrs. Harvey Yates over my dead body. [SPEAKER 3] I know I have the sincere wishes of all my friends… and can only tell you how much I appreciate it. I think I can honestly say this is the happiest moment of my life. Look what I have here. |
| Pretrained model (medium)<br>*No speaker labels* | Down in front. Hey, sit down, The little lady who is to become Mrs. Harvey Yates over my dead body. [APPLAUSE] I know I have the sincere wishes of all my friends, and can only tell you how much I appreciate it. I think I can honestly say this is the happiest moment of my life. Look what I have here… |
| Pretrained model (medium) with WHISPER + URRO<br>`delimiter=SPEAKER()`<br>`prompt=SPEAKERS(3, "en")`<br>*Correct speaker labels* | [SPEAKER 1] Down in front. [SPEAKER 2] Hey, sit down, [SPEAKER 1] The little lady who is to become Mrs. Harvey Yates over my dead body. [APPLAUSE] [SPEAKER 3] I know I have the sincere wishes of all my friends, and can only tell you how much I appreciate it. I think I can honestly say this is the happiest moment of my life. Look what I have here… |
| Finetuned model (d-v1a)<br>*Incorrect speaker labels* | [S1] Down in front. [S2] Hey, sit down, [S1] The little lady who is to become Mrs. Harvey Yates, over my dead body. and can only tell you how much I appreciate it. I think I can honestly say this is the happiest moment of my life. Look what I have here. |
| Finetuned model (d-v1a) with WHISPER + URRO<br>`delimiter=SPEAKER(short=True)`<br>`prompt=SPEAKERS(3, "en", short=True)`<br>*Correct speaker labels* | [S1] Down in front. [S2] Hey, sit down, [S1] The little lady who is to become Mrs. Harvey Yates, over my dead body. [S3] I know I have the sincere wishes of all my friends, and can only tell you how much I appreciate it. I think I can honestly say this is the happiest moment of my life. Look what I have here. |
```python
from urro_whisper import whisperer
from urro_whisper.delimiters import HYPHEN, GREATER_THAN, SPEAKER, PERSON
from urro_whisper.prompts import SPEAKERS, PERSONS
```

To segment speakers:

```python
model = "tiny"
audio = "audio.wav"
language = "en"
delimiter = HYPHEN
```

To label speakers:

```python
model = "medium"
audio = "audio.wav"
language = "en"
prompt = SPEAKERS
delimiter = SPEAKER
speakers = 3
```
To segment speakers:

```python
result = whisperer(
    model=model,
    audio=audio,
    language=language,
    delimiter=delimiter(),
    verbose=False,
)
```

To label speakers:

```python
result = whisperer(
    model=model,
    audio=audio,
    language=language,
    prompt=prompt(speakers, language),
    delimiter=delimiter(),
    verbose=False,
)
```

```python
import re

print("\n--- Transcript ---")
# split the transcript on the speaker delimiter and print one segment per line
texts = re.split(delimiter.regex, result["text"])
for text in texts:
    if len(text) > 0:
        print(text)
```
```python
def format_timestamp(seconds):
    if seconds is None:
        return "N/A"
    milliseconds = round(seconds * 1000)
    ss = milliseconds // 1000
    ms = milliseconds % 1000
    mm = ss // 60
    ss %= 60
    hh = mm // 60
    mm %= 60
    return f"{hh:02d}:{mm:02d}:{ss:02d}.{ms:03d}"
```
```python
try:
    from IPython.display import display, HTML, Audio
    import soundfile as sf
    import math
    import numpy as np
    import librosa

    # load the audio and downmix to mono for playback
    audio_original, sr_original = sf.read(audio)
    if audio_original.ndim > 1:
        audio_original = audio_original.mean(axis=1)

    # resample to 16 kHz so sample indices line up with the timestamps
    target_sample_rate = 16000
    if sr_original != target_sample_rate:
        audio_playback = librosa.resample(
            y=audio_original.astype(np.float32),
            orig_sr=sr_original,
            target_sr=target_sample_rate,
        )
    else:
        audio_playback = audio_original.astype(np.float32)

    # one table row per word: timestamp, text, and an inline player for that span
    html_rows = []
    html_rows.append("<tr><th>Timestamp</th><th>Text</th><th>Audio</th></tr>")
    for word_info in result["words"]:
        start_time = word_info['start']
        end_time = word_info['end']
        word_text = word_info['text']
        ts_str = f"[{format_timestamp(start_time)} --> {format_timestamp(end_time)}]"
        audio_player_html = "N/A"
        if (
            start_time is not None
            and end_time is not None
            and end_time > start_time
        ):
            start_sample = max(0, math.floor(start_time * target_sample_rate))
            end_sample = min(len(audio_playback), math.ceil(end_time * target_sample_rate))
            if end_sample > start_sample:
                audio_segment = audio_playback[start_sample:end_sample]
                # normalize only if the segment clips
                max_abs = np.max(np.abs(audio_segment))
                if max_abs > 1.0:
                    audio_segment = audio_segment / max_abs
                try:
                    audio_obj = Audio(data=audio_segment, rate=target_sample_rate, autoplay=False)
                    audio_player_html = audio_obj._repr_html_()
                except Exception as audio_err:
                    print(f"Warning: Could not create audio player for segment '{word_text}': {audio_err}")
                    audio_player_html = "(Error creating player)"
            else:
                audio_player_html = "(empty segment)"
        html_rows.append(
            f"<tr><td>{ts_str}</td><td>{word_text}</td><td>{audio_player_html}</td></tr>"
        )

    html_table = (
        "<table border='1' style='border-collapse: collapse; width: 100%;'>"
        "<thead></thead><tbody>"
        + "".join(html_rows)
        + "</tbody></table>"
    )
    display(HTML(html_table))
except ImportError as e:
    print(f"\nSkipping HTML table generation due to missing libraries: {e}")
    print("You might need to install: pip install ipython soundfile librosa")
    print("\n--- Word-level Timestamps (Text Fallback) ---")
    if "words" in result:
        for word_info in result["words"]:
            start = word_info['start']
            end = word_info['end']
            text_ = word_info['text']
            print(f"[{format_timestamp(start)} --> {format_timestamp(end)}]\t{text_}")
    else:
        print("No word timestamp information available in results.")
except FileNotFoundError:
    print(f"\nError: Audio file not found at '{audio}'. Please provide a valid path.")
except Exception as e:
    print(f"\nAn error occurred during HTML table generation or fallback: {e}")
    import traceback
    traceback.print_exc()
```
```python
print("\n--- transcript stream ---")

tokens = whisperer(
    model=model,
    audio=audio,
    language=language,
    delimiter=delimiter(),
    prompt=prompt(speakers, language),
    stream=True,
    verbose=False,
)

# print tokens as they arrive; prepend the delimiter so the first speaker is labelled too
i = 0
for token in tokens:
    if i == 0:
        print(delimiter() + token, end="", flush=True)
    else:
        print(token, end="", flush=True)
    i += 1

print("\n--- end of stream ---")
```

- Regroup word output
- Speaker diarization
- User prompting
- Stream text output
- Align existing transcript
- Stream audio input
- openai-whisper by OpenAI
  - mel spectrogram handling
- whisper-timestamped by Linto AI
  - word-level timestamp extraction
[^1]: Unique to WHISPER + URRO.
[^2]: As explicitly implemented in `whisper-timestamped`, alongside other libraries, such as `openai-whisper`.
[^3]: https://huggingface.co/onnx-community/whisper-tiny_timestamped
[^4]: https://huggingface.co/onnx-community/whisper-tiny.en_timestamped
[^5]: https://huggingface.co/onnx-community/whisper-base_timestamped
[^6]: https://huggingface.co/onnx-community/whisper-base.en_timestamped
[^7]: https://huggingface.co/onnx-community/whisper-small_timestamped
[^8]: https://huggingface.co/onnx-community/whisper-small.en_timestamped
[^10]: https://huggingface.co/urroxyz/whisper-medium.en_timestamped
[^11]: https://huggingface.co/onnx-community/whisper-large-v3-ONNX
[^12]: https://huggingface.co/onnx-community/whisper-large-v3-turbo_timestamped