WHISPER + URRO

Multilingual automatic speech recognition (ASR) with speaker segmentation (SS) / speaker diarization (SD) and word-level timestamps (WLT)

Installation

Latest

pip install git+https://github.com/urroxyz/whisper@v0.3.0

Development

Latest update: the whisperer's transcript parameter lets you align existing transcripts!

pip install git+https://github.com/urroxyz/whisper
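
For example, aligning text you already have might look like this (a minimal sketch: transcript is the parameter named in the release note above; the remaining arguments follow the Quickstart below):

from urro_whisper import whisperer

# align existing text against the audio instead of transcribing from scratch;
# "transcript" is the parameter announced above, the rest mirrors the Quickstart
result = whisperer(
    model="tiny",
    audio="audio.wav",
    language="en",
    transcript="Down in front. Hey, sit down.",
    verbose=False,
)
print(result["words"])  # word-level timestamps for the supplied text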

Introduction

Yes, Whisper can segment speakers and timestamp words! And WHISPER + URRO offers an easy way to do both.

By modifying the decoding process of the OpenAI model, we can force it to delimit new speakers with symbols like hyphens (-) or greater-than signs (>), or even with complete labels such as [SPEAKER 1] and [SPEAKER 2] to keep track of who is speaking and when.1 By extracting cross-attentions and processing them with dynamic time warping, we can reconstruct timestamps at the word level rather than relying on occasional generated time tokens.2
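
To make the timestamp reconstruction concrete, here is a minimal, self-contained sketch of dynamic time warping over a toy token-by-frame attention matrix. This is not the library's implementation, only the shape of the idea: high attention becomes low cost, and the cheapest monotonic warping path assigns each token a span of audio frames.

import numpy as np

def dtw_token_spans(attention):
    # high attention = low cost
    cost = -attention
    n_tokens, n_frames = cost.shape
    acc = np.full((n_tokens + 1, n_frames + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n_tokens + 1):
        for j in range(1, n_frames + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1],  # advance token and frame
                acc[i - 1, j],      # advance token only
                acc[i, j - 1],      # advance frame only
            )
    # backtrack the cheapest monotonic path
    i, j, path = n_tokens, n_frames, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    # collapse the path into one (first frame, last frame) span per token
    spans = {}
    for token, frame in path:
        first, last = spans.get(token, (frame, frame))
        spans[token] = (min(first, frame), max(last, frame))
    return spans

# three tokens attending over eight 20 ms frames (made-up weights)
attention = np.array([
    [0.9, 0.8, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.1, 0.9, 0.9, 0.2, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.1, 0.8, 0.9, 0.9, 0.7],
])
print(dtw_token_spans(attention))  # token index -> (first frame, last frame)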

Supported models

Official

Size               | Parameters | New-speaker segmentation | Speaker diarization | Word-level timestamps
tiny / tiny.en     | 39 M       | ✓                        | ✗                   | ✓
base / base.en     | 74 M       | ✓                        | ✗                   | ✓
small / small.en   | 244 M      | ✓                        | ✗                   | ✓
medium / medium.en | 769 M      | ✓                        | ✗                   | ✓
large-v3           | 1550 M     | ✓                        | ✗                   | ✓
large-v3-turbo     | 809 M      | ✓                        | ✗                   | ✓

Third-party

Model         | Parameters | New-speaker segmentation | Speaker diarization | Word-level timestamps
whisper-d-v1a | 1550 M     | ✓                        | ✓                   | ✗

Comparison

Demo video: video.mp4

Ground truth

[SPEAKER 1] Down in front.

[SPEAKER 2] Hey, sit down, that’s wrong of you.

[SPEAKER 1] The little lady who is to become Mrs. Harvey Yates over my dead body.

[SPEAKER 3] I know I have the sincere wishes of all my friends…

and can only tell you how much I appreciate it.

I think I can honestly say this is the happiest moment of my life.

Look what I have here.

Pretrained model (medium): no speaker labels

Down in front.

Hey, sit down, that’s fine.

The little lady who is to become Mrs. Harvey Yates over my dead body.

[APPLAUSE]

I know I have the sincere wishes of all my friends,

and can only tell you how much I appreciate it.

I think I can honestly say this is the happiest moment of my life.

Look what I have here…

Pretrained model (medium) with WHISPER + URRO (delimiter=SPEAKER(), prompt=SPEAKERS(3, "en")): correct speaker labels

[SPEAKER 1] Down in front.

[SPEAKER 2] Hey, sit down, that’s fine.

[SPEAKER 1] The little lady who is to become Mrs. Harvey Yates over my dead body.

[APPLAUSE]

[SPEAKER 3] I know I have the sincere wishes of all my friends,

and can only tell you how much I appreciate it.

I think I can honestly say this is the happiest moment of my life.

Look what I have here…

Finetuned model (d-v1a): incorrect speaker labels

[S1] Down in front.

[S2] Hey, sit down, it’s warm.

[S1] The little lady who is to become Mrs. Harvey Yates, over my dead body.

[S2] I know I have the sincere wishes of all my friends,

and can only tell you how much I appreciate it.

I think I can honestly say this is the happiest moment of my life.

Look what I have here.

Finetuned model (d-v1a) with WHISPER + URRO (delimiter=SPEAKER(short=True), prompt=SPEAKERS(3, "en", short=True)): correct speaker labels

[S1] Down in front.

[S2] Hey, sit down, it’s warm.

[S1] The little lady who is to become Mrs. Harvey Yates, over my dead body.

[S3] I know I have the sincere wishes of all my friends,

and can only tell you how much I appreciate it.

I think I can honestly say this is the happiest moment of my life.

Look what I have here.
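
The settings shown above map onto whisperer calls like the following (a sketch: the arguments come from the comparison and the Quickstart below, but the identifier for the third-party checkpoint is an assumption):

from urro_whisper import whisperer
from urro_whisper.delimiters import SPEAKER
from urro_whisper.prompts import SPEAKERS

# pretrained medium with full labels, as in the second comparison row
result = whisperer(
    model="medium",
    audio="audio.wav",
    language="en",
    delimiter=SPEAKER(),
    prompt=SPEAKERS(3, "en"),
)

# finetuned d-v1a with short labels ([S1], [S2], ...), as in the last row;
# "whisper-d-v1a" is a placeholder for the checkpoint's actual identifier
result = whisperer(
    model="whisper-d-v1a",
    audio="audio.wav",
    language="en",
    delimiter=SPEAKER(short=True),
    prompt=SPEAKERS(3, "en", short=True),
)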

Quickstart

1. Import the library

from urro_whisper import whisperer
from urro_whisper.delimiters import HYPHEN, GREATER_THAN, SPEAKER, PERSON
from urro_whisper.prompts import SPEAKERS, PERSONS

2. Set variables

to segment speakers:

model = "tiny"
audio = "audio.wav"
language = "en"
delimiter = HYPHEN

to label speakers:

model = "medium"
audio = "audio.wav"
language = "en"
prompt = SPEAKERS
delimiter = SPEAKER
speakers = 3

3. Create the whisperer

to segment speakers:

result = whisperer(
    model=model,
    audio=audio,
    language=language,
    delimiter=delimiter(),
    verbose=False,
)

to label speakers:

result = whisperer(
    model=model,
    audio=audio,
    language=language,
    prompt=prompt(speakers, language),
    delimiter=delimiter(),
    verbose=False,
)

4. Print results

import re

print("\n--- Transcript ---")
texts = re.split(delimiter.regex, result["text"])
for text in texts:
    if text:
        print(text)

def format_timestamp(seconds):
    if seconds is None:
        return "N/A"
    milliseconds = round(seconds * 1000)
    ss = milliseconds // 1000
    ms = milliseconds % 1000
    mm = ss // 60
    ss %= 60
    hh = mm // 60
    mm %= 60
    return f"{hh:02d}:{mm:02d}:{ss:02d}.{ms:03d}"

try:
    from IPython.display import display, HTML, Audio
    import soundfile as sf
    import math
    import numpy as np
    import librosa

    audio_original, sr_original = sf.read(audio)
    if audio_original.ndim > 1:
        audio_original = audio_original.mean(axis=1)

    target_sample_rate = 16000

    if sr_original != target_sample_rate:
        audio_playback = librosa.resample(
            y=audio_original.astype(np.float32),
            orig_sr=sr_original,
            target_sr=target_sample_rate
        )
    else:
        audio_playback = audio_original.astype(np.float32)

    html_rows = []
    header_html = "<tr><th>Timestamp</th><th>Text</th><th>Audio</th></tr>"

    for word_info in result["words"]:
        start_time = word_info['start']
        end_time = word_info['end']
        word_text = word_info['text']
        ts_str = f"[{format_timestamp(start_time)} --> {format_timestamp(end_time)}]"
        audio_player_html = "N/A"
        if (
            start_time is not None
            and end_time is not None
            and end_time > start_time
        ):
            start_sample = max(0, math.floor(start_time * target_sample_rate))
            end_sample = min(len(audio_playback), math.ceil(end_time * target_sample_rate))

            if end_sample > start_sample:
                audio_segment = audio_playback[start_sample:end_sample]

                max_abs = np.max(np.abs(audio_segment))
                if max_abs > 1.0:
                    # rescale so the clip doesn't exceed the playable range
                    audio_segment = audio_segment / max_abs
                # an all-zero (silent) segment needs no rescaling

                try:
                    audio_obj = Audio(data=audio_segment, rate=target_sample_rate, autoplay=False)
                    audio_player_html = audio_obj._repr_html_()
                except Exception as audio_err:
                    print(f"Warning: Could not create audio player for segment '{word_text}': {audio_err}")
                    audio_player_html = "(Error creating player)"

            else:
                audio_player_html = "(empty segment)"
        html_rows.append(
            f"<tr><td>{ts_str}</td><td>{word_text}</td><td>{audio_player_html}</td></tr>"
        )
    html_table = (
        "<table border='1' style='border-collapse: collapse; width: 100%;'>"
        "<thead>" + header_html + "</thead><tbody>"
        + "".join(html_rows)
        + "</tbody></table>"
    )
    display(HTML(html_table))

except ImportError as e:
    print(f"\nSkipping HTML table generation due to missing libraries: {e}")
    print("You might need to install: pip install ipython soundfile librosa")
    print("\n--- Word-level Timestamps (Text Fallback) ---")
   
    if "words" in result:
        for word_info in result["words"]:
            start = word_info['start']
            end = word_info['end']
            text_ = word_info['text']
            print(f"[{format_timestamp(start)} --> {format_timestamp(end)}]\t{text_}")
    else:
        print("No word timestamp information available in results.")

except FileNotFoundError:
    print(f"\nError: Audio file not found at '{audio}'. Please provide a valid path.")
except Exception as e:
    print(f"\nAn error occurred during HTML table generation or fallback: {e}")
    import traceback
    traceback.print_exc()
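
If you want the word timings in a standard subtitle format, result["words"] maps directly onto SRT. Here is a minimal sketch reusing format_timestamp from above (SRT uses a comma before the milliseconds; the start/end/text keys are the ones the table code already reads):

def to_srt(words, path="words.srt"):
    # one SRT cue per word; words without usable timestamps are skipped
    with open(path, "w", encoding="utf-8") as f:
        n = 0
        for word in words:
            if word["start"] is None or word["end"] is None:
                continue
            n += 1
            start = format_timestamp(word["start"]).replace(".", ",")
            end = format_timestamp(word["end"]).replace(".", ",")
            f.write(f"{n}\n{start} --> {end}\n{word['text'].strip()}\n\n")

to_srt(result["words"])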

Stream

print("\n--- transcript stream ---")
tokens = whisperer(
    model=model,
    audio=audio,
    language=language,
    delimiter=delimiter(),
    prompt=prompt(speakers, language),
    stream=True,
    verbose=False,
)

for i, token in enumerate(tokens):
    if i == 0:
        # prepend the opening delimiter, which the stream itself omits
        print(delimiter() + token, end="", flush=True)
    else:
        print(token, end="", flush=True)

print("\n--- end of stream ---")
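
Because the stream is plain text, the per-speaker regrouping from step 4 works on it too. A sketch that collects a fresh stream (the generator above is already consumed) and prints one line per speaker turn:

import re

# a fresh stream; the generator above has already been exhausted
tokens = whisperer(
    model=model,
    audio=audio,
    language=language,
    delimiter=delimiter(),
    prompt=prompt(speakers, language),
    stream=True,
    verbose=False,
)

# split the accumulated text on the same delimiter.regex used in step 4
for turn in re.split(delimiter.regex, "".join(tokens)):
    if turn.strip():
        print(turn.strip())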

To-Do

  • Regroup word output
  • Speaker diarization
  • User prompting
  • Stream text output
  • Align existing transcript
  • Stream audio input

Acknowledgements


Footnotes

  1. Unique to WHISPER + URRO.

  2. As implemented in whisper-timestamped, and in other libraries such as openai-whisper.
