WHISPER + URRO

Multilingual automatic speech recognition (ASR) with speaker segmentation (SS) / speaker diarization (SD) and word-level timestamps (WLT)

Installation

Latest

pip install git+https://github.com/urroxyz/whisper@v0.3.0

Development

Latest update: the whisperer's transcript parameter lets you align existing transcripts!

pip install git+https://github.com/urroxyz/whisper
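
For example, aligning text you already have might look like this (a minimal sketch: transcript is the parameter named in the release note above; the remaining arguments follow the Quickstart below):

from urro_whisper import whisperer

# align existing text against the audio instead of transcribing from scratch;
# "transcript" is the parameter announced above, the rest mirrors the Quickstart
result = whisperer(
    model="tiny",
    audio="audio.wav",
    language="en",
    transcript="Down in front. Hey, sit down.",
    verbose=False,
)
print(result["words"])  # word-level timestamps for the supplied text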

Introduction

Yes, Whisper can segment speakers and timestamp words! And WHISPER + URRO offers an easy way to do both.

By modifying the decoding process of the OpenAI model, we can force it to delimit new speakers with symbols like hyphens (-) or greater-than signs (>), or even with complete labels such as [SPEAKER 1] and [SPEAKER 2] to keep track of who is speaking and when.1 By extracting cross-attentions and processing them with dynamic time warping, we can reconstruct timestamps at the word level rather than relying on occasional generated time tokens.2
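
To make the timestamp reconstruction concrete, here is a minimal, self-contained sketch of dynamic time warping over a toy token-by-frame attention matrix. This is not the library's implementation, only the shape of the idea: high attention becomes low cost, and the cheapest monotonic warping path assigns each token a span of audio frames.

import numpy as np

def dtw_token_spans(attention):
    # high attention = low cost
    cost = -attention
    n_tokens, n_frames = cost.shape
    acc = np.full((n_tokens + 1, n_frames + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n_tokens + 1):
        for j in range(1, n_frames + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1],  # advance token and frame
                acc[i - 1, j],      # advance token only
                acc[i, j - 1],      # advance frame only
            )
    # backtrack the cheapest monotonic path
    i, j, path = n_tokens, n_frames, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    # collapse the path into one (first frame, last frame) span per token
    spans = {}
    for token, frame in path:
        first, last = spans.get(token, (frame, frame))
        spans[token] = (min(first, frame), max(last, frame))
    return spans

# three tokens attending over eight 20 ms frames (made-up weights)
attention = np.array([
    [0.9, 0.8, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.1, 0.9, 0.9, 0.2, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.1, 0.8, 0.9, 0.9, 0.7],
])
print(dtw_token_spans(attention))  # token index -> (first frame, last frame)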

Supported models

Official

Size               | Parameters | New-speaker segmentation | Speaker diarization | Word-level timestamps
tiny / tiny.en     | 39 M       | ✓                        | ✗                   | ✓
base / base.en     | 74 M       | ✓                        | ✗                   | ✓
small / small.en   | 244 M      | ✓                        | ✗                   | ✓
medium / medium.en | 769 M      | ✓                        | ✗                   | ✓
large-v3           | 1550 M     | ✓                        | ✗                   | ✓
large-v3-turbo     | 809 M      | ✓                        | ✗                   | ✓

Third-party

Model         | Parameters | New-speaker segmentation | Speaker diarization | Word-level timestamps
whisper-d-v1a | 1550 M     | ✓                        | ✓                   | ✗

Comparison

Demo video: video.mp4

Ground truth

[SPEAKER 1] Down in front.

[SPEAKER 2] Hey, sit down, that’s wrong of you.

[SPEAKER 1] The little lady who is to become Mrs. Harvey Yates over my dead body.

[SPEAKER 3] I know I have the sincere wishes of all my friends…

and can only tell you how much I appreciate it.

I think I can honestly say this is the happiest moment of my life.

Look what I have here.

Pretrained model (medium): no speaker labels

Down in front.

Hey, sit down, that’s fine.

The little lady who is to become Mrs. Harvey Yates over my dead body.

[APPLAUSE]

I know I have the sincere wishes of all my friends,

and can only tell you how much I appreciate it.

I think I can honestly say this is the happiest moment of my life.

Look what I have here…

Pretrained model (medium) with WHISPER + URRO (delimiter=SPEAKER(), prompt=SPEAKERS(3, "en")): correct speaker labels

[SPEAKER 1] Down in front.

[SPEAKER 2] Hey, sit down, that’s fine.

[SPEAKER 1] The little lady who is to become Mrs. Harvey Yates over my dead body.

[APPLAUSE]

[SPEAKER 3] I know I have the sincere wishes of all my friends,

and can only tell you how much I appreciate it.

I think I can honestly say this is the happiest moment of my life.

Look what I have here…

Finetuned model (d-v1a): incorrect speaker labels

[S1] Down in front.

[S2] Hey, sit down, it’s warm.

[S1] The little lady who is to become Mrs. Harvey Yates, over my dead body.

[S2] I know I have the sincere wishes of all my friends,

and can only tell you how much I appreciate it.

I think I can honestly say this is the happiest moment of my life.

Look what I have here.

Finetuned model (d-v1a) with WHISPER + URRO (delimiter=SPEAKER(short=True), prompt=SPEAKERS(3, "en", short=True)): correct speaker labels

[S1] Down in front.

[S2] Hey, sit down, it’s warm.

[S1] The little lady who is to become Mrs. Harvey Yates, over my dead body.

[S3] I know I have the sincere wishes of all my friends,

and can only tell you how much I appreciate it.

I think I can honestly say this is the happiest moment of my life.

Look what I have here.
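
The settings shown above map onto whisperer calls like the following (a sketch: the arguments come from the comparison and the Quickstart below, but the identifier for the third-party checkpoint is an assumption):

from urro_whisper import whisperer
from urro_whisper.delimiters import SPEAKER
from urro_whisper.prompts import SPEAKERS

# pretrained medium with full labels, as in the second comparison row
result = whisperer(
    model="medium",
    audio="audio.wav",
    language="en",
    delimiter=SPEAKER(),
    prompt=SPEAKERS(3, "en"),
)

# finetuned d-v1a with short labels ([S1], [S2], ...), as in the last row;
# "whisper-d-v1a" is a placeholder for the checkpoint's actual identifier
result = whisperer(
    model="whisper-d-v1a",
    audio="audio.wav",
    language="en",
    delimiter=SPEAKER(short=True),
    prompt=SPEAKERS(3, "en", short=True),
)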

Quickstart

1. Import the library

from urro_whisper import whisperer
from urro_whisper.delimiters import HYPHEN, GREATER_THAN, SPEAKER, PERSON
from urro_whisper.prompts import SPEAKERS, PERSONS

2. Set variables

to segment speakers:

model = "tiny"
audio = "audio.wav"
language = "en"
delimiter = HYPHEN

to label speakers:

model = "medium"
audio = "audio.wav"
language = "en"
prompt = SPEAKERS
delimiter = SPEAKER
speakers = 3

3. Create the whisperer

to segment speakers:

result = whisperer(
    model=model,
    audio=audio,
    language=language,
    delimiter=delimiter(),
    verbose=False,
)

to label speakers:

result = whisperer(
    model=model,
    audio=audio,
    language=language,
    prompt=prompt(speakers, language),
    delimiter=delimiter(),
    verbose=False,
)

4. Print results

import re

print("\n--- Transcript ---")
texts = re.split(delimiter.regex, result["text"])
for text in texts:
    if text:
        print(text)

def format_timestamp(seconds):
    if seconds is None:
        return "N/A"
    milliseconds = round(seconds * 1000)
    ss = milliseconds // 1000
    ms = milliseconds % 1000
    mm = ss // 60
    ss %= 60
    hh = mm // 60
    mm %= 60
    return f"{hh:02d}:{mm:02d}:{ss:02d}.{ms:03d}"

try:
    from IPython.display import display, HTML, Audio
    import soundfile as sf
    import math
    import numpy as np
    import librosa

    audio_original, sr_original = sf.read(audio)
    if audio_original.ndim > 1:
        audio_original = audio_original.mean(axis=1)

    target_sample_rate = 16000

    if sr_original != target_sample_rate:
        audio_playback = librosa.resample(
            y=audio_original.astype(np.float32),
            orig_sr=sr_original,
            target_sr=target_sample_rate
        )
    else:
        audio_playback = audio_original.astype(np.float32)

    html_rows = []
    header_html = "<tr><th>Timestamp</th><th>Text</th><th>Audio</th></tr>"

    for word_info in result["words"]:
        start_time = word_info['start']
        end_time = word_info['end']
        word_text = word_info['text']
        ts_str = f"[{format_timestamp(start_time)} --> {format_timestamp(end_time)}]"
        audio_player_html = "N/A"
        if (
            start_time is not None
            and end_time is not None
            and end_time > start_time
        ):
            start_sample = max(0, math.floor(start_time * target_sample_rate))
            end_sample = min(len(audio_playback), math.ceil(end_time * target_sample_rate))

            if end_sample > start_sample:
                audio_segment = audio_playback[start_sample:end_sample]

                max_abs = np.max(np.abs(audio_segment))
                if max_abs > 1.0:
                    # rescale so the clip doesn't exceed the playable range
                    audio_segment = audio_segment / max_abs
                # an all-zero (silent) segment needs no rescaling

                try:
                    audio_obj = Audio(data=audio_segment, rate=target_sample_rate, autoplay=False)
                    audio_player_html = audio_obj._repr_html_()
                except Exception as audio_err:
                    print(f"Warning: Could not create audio player for segment '{word_text}': {audio_err}")
                    audio_player_html = "(Error creating player)"

            else:
                audio_player_html = "(empty segment)"
        html_rows.append(
            f"<tr><td>{ts_str}</td><td>{word_text}</td><td>{audio_player_html}</td></tr>"
        )
    html_table = (
        "<table border='1' style='border-collapse: collapse; width: 100%;'>"
        "<thead>" + header_html + "</thead><tbody>"
        + "".join(html_rows)
        + "</tbody></table>"
    )
    display(HTML(html_table))

except ImportError as e:
    print(f"\nSkipping HTML table generation due to missing libraries: {e}")
    print("You might need to install: pip install ipython soundfile librosa")
    print("\n--- Word-level Timestamps (Text Fallback) ---")
   
    if "words" in result:
        for word_info in result["words"]:
            start = word_info['start']
            end = word_info['end']
            text_ = word_info['text']
            print(f"[{format_timestamp(start)} --> {format_timestamp(end)}]\t{text_}")
    else:
        print("No word timestamp information available in results.")

except FileNotFoundError:
    print(f"\nError: Audio file not found at '{audio}'. Please provide a valid path.")
except Exception as e:
    print(f"\nAn error occurred during HTML table generation or fallback: {e}")
    import traceback
    traceback.print_exc()
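
If you want the word timings in a standard subtitle format, result["words"] maps directly onto SRT. Here is a minimal sketch reusing format_timestamp from above (SRT uses a comma before the milliseconds; the start/end/text keys are the ones the table code already reads):

def to_srt(words, path="words.srt"):
    # one SRT cue per word; words without usable timestamps are skipped
    with open(path, "w", encoding="utf-8") as f:
        n = 0
        for word in words:
            if word["start"] is None or word["end"] is None:
                continue
            n += 1
            start = format_timestamp(word["start"]).replace(".", ",")
            end = format_timestamp(word["end"]).replace(".", ",")
            f.write(f"{n}\n{start} --> {end}\n{word['text'].strip()}\n\n")

to_srt(result["words"])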

Stream

print("\n--- transcript stream ---")
tokens = whisperer(
    model=model,
    audio=audio,
    language=language,
    delimiter=delimiter(),
    prompt=prompt(speakers, language),
    stream=True,
    verbose=False,
)

for i, token in enumerate(tokens):
    if i == 0:
        # prepend the opening delimiter, which the stream itself omits
        print(delimiter() + token, end="", flush=True)
    else:
        print(token, end="", flush=True)

print("\n--- end of stream ---")
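
Because the stream is plain text, the per-speaker regrouping from step 4 works on it too. A sketch that collects a fresh stream (the generator above is already consumed) and prints one line per speaker turn:

import re

# a fresh stream; the generator above has already been exhausted
tokens = whisperer(
    model=model,
    audio=audio,
    language=language,
    delimiter=delimiter(),
    prompt=prompt(speakers, language),
    stream=True,
    verbose=False,
)

# split the accumulated text on the same delimiter.regex used in step 4
for turn in re.split(delimiter.regex, "".join(tokens)):
    if turn.strip():
        print(turn.strip())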

To-Do

  • Regroup word output
  • Speaker diarization
  • User prompting
  • Stream text output
  • Align existing transcript
  • Stream audio input

Acknowledgements


Footnotes

  1. Unique to WHISPER + URRO.

  2. As implemented in whisper-timestamped, and in other libraries such as openai-whisper.
