Speech Rec With Google, PHP and Python

I wanted to move away from relying only on key presses to navigate a telephone service. Wouldn't it be great to use speech as well as DTMF to jump around the service? The scenario is that our telephone service asks the caller to say a word or press a key on their telephone keypad to select a menu option. The caller's voice or DTMF tone is saved as a wav file on the server. We then analyse this wav file, first for DTMF tones and then for speech, and send the result back to the telephone service so it can decide which menu option to follow. Below you will find all the PHP and Python code to achieve this. All the code is PHP unless otherwise stated. Scroll to the end for the full code examples or view Google's online documentation.

Code Explained

We wait for 1 second before running the speech recognition code to make sure our file has saved properly. Then we load our Google credentials and import the Google Cloud client library that will make all of the speech rec magic work.

sleep(1);
putenv('GOOGLE_APPLICATION_CREDENTIALS=C:\Users\User\.aws\creds.json');
require __DIR__ . '/vendor/autoload.php';
use Google\Cloud\Speech\V1\SpeechClient;
use Google\Cloud\Speech\V1\RecognitionAudio;
use Google\Cloud\Speech\V1\RecognitionConfig;
use Google\Cloud\Speech\V1\RecognitionConfig\AudioEncoding;

The 3 variables we will use for our code are the file name, the service/directory this is saved in and which DTMF keys are allowed. If we're just using speech rec or we have forgotten to set which keys are allowed, we can set our DTMF variable to be 99 which means it won't check for key entries.

$fileName     = $_GET['recording'];
$service      = $_GET['service'];
$alloweddtmf  = $_GET['alloweddtmf'];
if(empty($alloweddtmf)) $alloweddtmf = "99";

Our caller's recording is stored in a specific place depending on the service they rang so we need to go and find that recording and then get the contents of that file into a string.

$fileName =  "\\xampp\\htdocs\\Recordings\\".$service."\\".$fileName;
$audioFile = $fileName;
$content = file_get_contents($audioFile);
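Because the service and file name arrive as request parameters, it can be worth checking that the finished path still sits inside the recordings folder before opening it. The Python sketch below is only an illustration of that idea and is not part of the original service. The function name recording_path and the RECORDINGS_ROOT constant are made up for this example.

```python
from pathlib import Path

# Hypothetical base directory, standing in for the Recordings folder
# used on the author's server.
RECORDINGS_ROOT = Path("/xampp/htdocs/Recordings")


def recording_path(service, file_name, root=RECORDINGS_ROOT):
    """Return root/service/file_name, refusing anything outside root."""
    root = Path(root).resolve()
    candidate = (root / service / file_name).resolve()
    try:
        # relative_to() raises ValueError when candidate is not under root
        candidate.relative_to(root)
    except ValueError:
        raise ValueError("recording path escapes the recordings directory") from None
    return candidate
```

A request for service ".." or an absolute file name would raise ValueError here instead of reading a file from somewhere else on the disk.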

Next we need to use an external Python script to check for DTMF tones. There's no point doing this though if we've told the service we're not using DTMF so we put a little if statement in there that writes what we're doing to the log.

if($alloweddtmf === "99") {
  $log.= "\nNOT checking for DTMF.";
} else {
  $log.= "\nChecking for DTMF [$alloweddtmf]...";

If we're not checking for DTMF then we skip past the else and start checking for ASR. Under the else, we run a Python script using the command line. This Python script can be found below. It uses the Goertzel algorithm to decode DTMF tones and prints each key it detects on its own line. We split that output into an array and keep only the first result, which is why we use $outArray[0].

  $command  = escapeshellcmd("python dtmf/dtmf-decoder.py $service $fileName");
  $output   = shell_exec($command);
  $outArray = explode("\n", $output);
  $output   = $outArray[0];
  $log.= "\ndtmf result = [$output]";
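For comparison, here is a rough Python sketch of the same step: run a command, capture what it prints and keep only the first line. It assumes Python 3.7 or later, and the helper name first_output_line is made up for this example. Passing the arguments as a list means no shell is involved, so nothing needs escaping.

```python
import subprocess
import sys


def first_output_line(args):
    """Run a command (given as a list of arguments) and return its first output line."""
    completed = subprocess.run(args, capture_output=True, text=True, check=False)
    lines = completed.stdout.splitlines()
    # splitlines() copes with both \n and \r\n, so no stray \r is left behind
    return lines[0].strip() if lines else ""


# Hypothetical call mirroring the PHP command above:
# first_output_line([sys.executable, "dtmf/dtmf-decoder.py", service, file_name])
```

An empty string comes back when the command prints nothing, which plays the same role as a non-numeric result in the PHP code.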

If the result from that Python script is numeric and is one of the allowed keys, then we know that we've got a valid DTMF tone. We echo it back to the phone service as XML and skip to the end of the code.

  if(is_numeric($output)) {
    if(strpos($alloweddtmf, $output) !== false) {
      # send the IVR reply
      echo "<root><row>";
      echo "<col name='transcription'>$output</col>";
      echo "<col name='status'>0</col>";
      echo "<col name='confidence'>0</col>";
      echo "</row></root>";
      # skips the asr code
      goto finish;
    }
  }
}
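The allowed-key check can be sketched in Python as below. This is a simplified illustration rather than a line-for-line port: it only accepts a single digit, whereas the PHP version does a substring search with strpos. The names accepted_key and IGNORE_DTMF are made up for this example.

```python
IGNORE_DTMF = "99"  # the value the article uses to mean "don't check for keys"


def accepted_key(decoder_output, allowed=IGNORE_DTMF):
    """Return the pressed key if it is a single allowed digit, otherwise None."""
    if allowed == IGNORE_DTMF:
        return None
    key = decoder_output.strip()  # drop any trailing newline or \r
    if len(key) == 1 and key in "0123456789" and key in allowed:
        return key
    return None
```

With allowed set to "12", a decoder result of "1" is accepted, while "3" or "A" falls through to speech recognition.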

If the result from the DTMF checker isn't numeric, we want to run speech recognition on it. We set the audio content to be a string. We need to know what the sample rate of the file is because Google demands we set the sample rate before analysing the recording. To do this I used SoX which is a command line tool that can analyse and manipulate audio. We use shell_exec to run the simple sox command.

$audio = (new RecognitionAudio())->setContent($content);
$soxSampleRate = shell_exec("sox --i -r ".$audioFile);

Once we have the sample rate we can set the variables Google needs. Anything that isn't 8000 Hz is treated as 16000 Hz.

if ($soxSampleRate == 8000) {
  $config = new RecognitionConfig([
    'encoding' => AudioEncoding::LINEAR16,
    'sample_rate_hertz' => 8000,
    'language_code' => 'en-GB'
  ]);
} else {
  $config = new RecognitionConfig([
    'encoding' => AudioEncoding::LINEAR16,
    'sample_rate_hertz' => 16000,
    'language_code' => 'en-GB'
  ]);
}
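If SoX isn't available, the sample rate can also be read straight from the wav header. The sketch below uses Python's built-in wave module, which is a different technique from the SoX call above and only handles plain PCM wav files. Unlike the PHP code it passes on whatever rate it finds rather than falling back to 16000. The function names and the plain dict are assumptions made for this example, not Google's API.

```python
import wave


def wav_sample_rate(path):
    """Return the sample rate in Hz stored in a PCM wav file's header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate()


def recognition_settings(path, language_code="en-GB"):
    """Collect the three settings the article sends to Google, as a plain dict."""
    return {
        "encoding": "LINEAR16",
        "sample_rate_hertz": wav_sample_rate(path),
        "language_code": language_code,
    }
```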

Finally we use the code Google supplied to instantiate a client, detect speech and transcribe it.

$client = new SpeechClient();
$response = $client->recognize($config, $audio);
$transcript = "";
$status = 1;
$confidence = 0;
$log.= "\nDoing ASR...";

The telephone service needs to receive the result of this transcription so we print the most likely transcription with the confidence number from Google, echo the results back to the IVR as XML and write to a log before finishing the code.

foreach ($response->getResults() as $result) {
    $alternatives = $result->getAlternatives();
    $mostLikely = $alternatives[0];
    $transcript = $mostLikely->getTranscript();
    $confidence = $mostLikely->getConfidence();
    $status = 0;
}
$client->close();
$log.= "\nstatus (0 is good) = $status";
$log.= "\nTranscription = $transcript";

echo "<root><row>";
echo "<col name='transcription'>$transcript</col>";
echo "<col name='status'>$status</col>";
echo "<col name='confidence'>$confidence</col>";
echo "</row></root>";
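One thing to watch with building the XML reply by hand is that a transcription containing a character such as & would produce invalid XML. As a hedged illustration, the Python sketch below builds the same root/row/col structure with the standard library, which escapes the text for us. The function name build_reply is made up for this example.

```python
import xml.etree.ElementTree as ET


def build_reply(transcription, status, confidence):
    """Build the <root><row><col name='...'> reply as an XML string."""
    root = ET.Element("root")
    row = ET.SubElement(root, "row")
    fields = (
        ("transcription", transcription),
        ("status", status),
        ("confidence", confidence),
    )
    for name, value in fields:
        col = ET.SubElement(row, "col", {"name": name})
        col.text = str(value)  # special characters are escaped on output
    return ET.tostring(root, encoding="unicode")
```

The IVR would receive the same three columns as before, just with any special characters in the transcription safely escaped.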

finish:
$timerStop = microtime(true);
$timer = $timerStop - $timerStart;
$log.= "\nFinished in ".round($timer,4)." seconds.";
file_put_contents($logFile, $log, FILE_APPEND);

?>

PHP script for ASR

<?php
# Sleep for 1 second to make sure the audio file has had time to save
sleep(1);
# Load your Google credentials
putenv('GOOGLE_APPLICATION_CREDENTIALS=C:\Users\User\.aws\creds.json');

# Includes the autoloader for libraries installed with composer
require __DIR__ . '/vendor/autoload.php';

# Imports the Google Cloud client library
use Google\Cloud\Speech\V1\SpeechClient;
use Google\Cloud\Speech\V1\RecognitionAudio;
use Google\Cloud\Speech\V1\RecognitionConfig;
use Google\Cloud\Speech\V1\RecognitionConfig\AudioEncoding;

# start writing to a log file
$timerStart = microtime(true);
$logFile = 'dtmfwill.log';
$log = "\n\n".date("d/m/Y H:i:s");
$log.= "\nasr_ivr_dtmf_will.php";

# Use the commented code below to test specific audio files
# The name of the audio file to transcribe for testing
#$_GET['recording'] = "english8000.wav";
# The name of the service for testing
#$_GET['service'] = "servicename";
# Setting DTMF keys 1 and 2 to be allowed
#$_GET['alloweddtmf'] = "12";

$fileName     = $_GET['recording'];
$service      = $_GET['service'];
$alloweddtmf  = $_GET['alloweddtmf'];

# $alloweddtmf = 99 means ignore dtmf. If it hasn't been set then default to ignore
if(empty($alloweddtmf)) $alloweddtmf = "99";

# The recorded file name and location have been set as variables so we can easily search for them with 1 line of code
$fileName =  "\\xampp\\htdocs\\Recordings\\".$service."\\".$fileName;
$audioFile = $fileName;

# get contents of a file into a string
$content = file_get_contents($audioFile);

######
# check for dtmf tones before running speech rec
# using external python code
######

if($alloweddtmf === "99") {
  $log.= "\nNOT checking for DTMF.";
} else {
  $log.= "\nChecking for DTMF [$alloweddtmf]...";

  # execute python dtmf code as a command line function
  $command  = escapeshellcmd("python dtmf/dtmf-decoder.py $service $fileName");
  $output   = shell_exec($command);
  $outArray = explode("\n", $output);
  $output   = $outArray[0];
  $log.= "\ndtmf result = [$output]";

  # if the result is numeric, echo result then end the script
  # no need to check for speech if we recognise dtmf
  if(is_numeric($output)) {
    if(strpos($alloweddtmf, $output) !== false) {
      # send the IVR reply
      echo "<root><row>";
      echo "<col name='transcription'>$output</col>";
      echo "<col name='status'>0</col>";
      echo "<col name='confidence'>0</col>";
      echo "</row></root>";
      # skips the asr code
      goto finish;
    }
  }
}

#####
# use google speech rec to do asr and return the results
#####

# set string as audio content
$audio = (new RecognitionAudio())->setContent($content);
# use sox to get the sample rate of the audio file
$soxSampleRate = shell_exec("sox --i -r ".$audioFile);
# Google needs the sample rate to be set
# set sample rate by using our sox result
if ($soxSampleRate == 8000) {
  # echo "sample rate is 8000";
  $config = new RecognitionConfig([
    'encoding' => AudioEncoding::LINEAR16,
    'sample_rate_hertz' => 8000,
    'language_code' => 'en-GB'
  ]);
} else {
  # echo "sample rate is 16000";
  $config = new RecognitionConfig([
    'encoding' => AudioEncoding::LINEAR16,
    'sample_rate_hertz' => 16000,
    'language_code' => 'en-GB'
  ]);
}

# Instantiates a client
$client = new SpeechClient();
#$client->useApplicationDefaultCredentials();
# Detects speech in the audio file
$response = $client->recognize($config, $audio);

$transcript = "";
$status = 1;
$confidence = 0;
$log.= "\nDoing ASR...";

# Print most likely transcription
foreach ($response->getResults() as $result) {
    $alternatives = $result->getAlternatives();
    $mostLikely = $alternatives[0];
    $transcript = $mostLikely->getTranscript();
    $confidence = $mostLikely->getConfidence();
    #printf('Transcript: %s' . PHP_EOL, $transcript);
    $status = 0;
}
$client->close();
$log.= "\nstatus (0 is good) = $status";
$log.= "\nTranscription = $transcript";
# echo the response to the IVR
echo "<root><row>";
echo "<col name='transcription'>$transcript</col>";
echo "<col name='status'>$status</col>";
echo "<col name='confidence'>$confidence</col>";
echo "</row></root>";

# this is where dtmf skips to and writes to a log
finish:
$timerStop = microtime(true);
$timer = $timerStop - $timerStart;
$log.= "\nFinished in ".round($timer,4)." seconds.";
file_put_contents($logFile, $log, FILE_APPEND);

?>

Python script for DTMF recognition

'''
A python implementation of the Goertzel algorithm to decode DTMF tones.
The wave file is split into bins and each bin is analyzed
for all the DTMF frequencies. The method run() will return a numeric
representation of the DTMF tone.
'''
import wave
import struct
import math
import sys
import os.path
class pygoertzel_dtmf:
    def __init__(self, samplerate):
        self.samplerate = samplerate
        self.goertzel_freq = [1209.0,1336.0,1477.0,1633.0,697.0,770.0,852.0,941.0]
        self.s_prev = {}
        self.s_prev2 = {}
        self.totalpower = {}
        self.N = {}
        self.coeff = {}
        # create goertzel parameters for each frequency so that
        # all the frequencies are analyzed in parallel
        for k in self.goertzel_freq:
            self.s_prev[k] = 0.0
            self.s_prev2[k] = 0.0
            self.totalpower[k] = 0.0
            self.N[k] = 0.0
            normalizedfreq = k / self.samplerate
            self.coeff[k] = 2.0*math.cos(2.0 * math.pi * normalizedfreq)
    def __get_number(self, freqs):
        hi = [1209.0,1336.0,1477.0,1633.0]
        lo = [697.0,770.0,852.0,941.0]
        # get hi freq
        hifreq = 0.0
        hifreq_v = 0.0
        for f in hi:
            if freqs[f]>hifreq_v:
                hifreq_v = freqs[f]
                hifreq = f
        # get lo freq
        lofreq = 0.0
        lofreq_v = 0.0
        for f in lo:
            if freqs[f]>lofreq_v:
                lofreq_v = freqs[f]
                lofreq = f
        if lofreq==697.0:
            if hifreq==1209.0:
                return "1"
            elif hifreq==1336.0:
                return "2"
            elif hifreq==1477.0:
                return "3"
            elif hifreq==1633.0:
                return "A"
        elif lofreq==770.0:
            if hifreq==1209.0:
                return "4"
            elif hifreq==1336.0:
                return "5"
            elif hifreq==1477.0:
                return "6"
            elif hifreq==1633.0:
                return "B"
        elif lofreq==852.0:
            if hifreq==1209.0:
                return "7"
            elif hifreq==1336.0:
                return "8"
            elif hifreq==1477.0:
                return "9"
            elif hifreq==1633.0:
                return "C"
        elif lofreq==941.0:
            if hifreq==1209.0:
                return "*"
            elif hifreq==1336.0:
                return "0"
            elif hifreq==1477.0:
                return "#"
            elif hifreq==1633.0:
                return "D"
    def run(self, sample):
        freqs = {}
        for freq in self.goertzel_freq:
            s = sample + (self.coeff[freq] * self.s_prev[freq]) - self.s_prev2[freq]
            self.s_prev2[freq] = self.s_prev[freq]
            self.s_prev[freq] = s
            self.N[freq]+=1
            power = (self.s_prev2[freq]*self.s_prev2[freq]) + (self.s_prev[freq]*self.s_prev[freq]) - (self.coeff[freq]*self.s_prev[freq]*self.s_prev2[freq])
            self.totalpower[freq]+=sample*sample
            if (self.totalpower[freq] == 0):
                self.totalpower[freq] = 1
            freqs[freq] = power / self.totalpower[freq] / self.N[freq]
        return self.__get_number(freqs)
if __name__ == '__main__':
    # load wav file
    fullPath = os.path.join('\\xampp\\htdocs\\informcomms.co.uk\\voxeo\\Recordings', sys.argv[1], sys.argv[2])
    wav = wave.open(fullPath, 'r')
    #wav = wave.open('/home/michael/Downloads/dtmf.wav', 'r')
    (nchannels, sampwidth, framerate, nframes, comptype, compname) = wav.getparams()
    frames = wav.readframes(nframes * nchannels)
    # convert wave file to array of integers
    frames = struct.unpack_from("%dH" % (nframes * nchannels), frames)
    # if stereo get left/right
    if nchannels == 2:
        left = [frames[i] for i in range(0,len(frames),2)]
        right = [frames[i] for i in range(1,len(frames),2)]
    else:
        left = frames
        right = left
    binsize = 400
    # Split the bin in 4 to average out errors due to noise
    binsize_split = 4
    prevvalue = ""
    prevcounter = 0
    for i in range(0,len(left)-binsize,binsize//binsize_split):
        goertzel = pygoertzel_dtmf(framerate)
        for j in left[i:i+binsize]:
            value = goertzel.run(j)
        if value==prevvalue:
            prevcounter+=1
            if prevcounter==10:
                print(value)
        else:
            prevcounter=0
            prevvalue=value
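To show the core idea of the script above in isolation, here is a stripped-down Python sketch of the Goertzel algorithm applied to a single block of samples. It is a simplification, not a replacement: it leaves out the power normalisation, the overlapping bins and the repeat counter, and it always picks a key, even for silence. The names goertzel_power, detect_key and make_tone are made up for this example, and the test tone is generated rather than read from a wav file.

```python
import math

LOW = [697.0, 770.0, 852.0, 941.0]       # keypad row frequencies
HIGH = [1209.0, 1336.0, 1477.0, 1633.0]  # keypad column frequencies
KEYS = ["123A", "456B", "789C", "*0#D"]  # KEYS[row][column]


def goertzel_power(samples, freq, sample_rate):
    """Return the signal power at one frequency for a block of samples."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / sample_rate)
    s_prev = 0.0
    s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2 = s_prev
        s_prev = s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2


def detect_key(samples, sample_rate):
    """Pick the strongest row and column frequency and look up the key."""
    row = max(range(4), key=lambda i: goertzel_power(samples, LOW[i], sample_rate))
    col = max(range(4), key=lambda i: goertzel_power(samples, HIGH[i], sample_rate))
    return KEYS[row][col]


def make_tone(low, high, sample_rate=8000, n=400):
    """Generate n samples of two summed sine waves, like a DTMF key press."""
    return [
        math.sin(2 * math.pi * low * t / sample_rate)
        + math.sin(2 * math.pi * high * t / sample_rate)
        for t in range(n)
    ]
```

Calling detect_key(make_tone(770.0, 1336.0), 8000) picks out the 770 Hz row and the 1336 Hz column, which is the "5" key.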
