BiGRU encoder + attention decoder, based on "Listen, Attend and Spell" [1].
The acoustic features are 80-dimensional filter banks. Every 3 consecutive frames are stacked, reducing the time resolution by a factor of 3.
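For clarity, the frame stacking can be sketched as follows (a minimal NumPy illustration, assuming the filter-bank features are an array of shape (num_frames, 80); this is not necessarily the repo's exact implementation):

```python
import numpy as np

def stack_frames(feats: np.ndarray, stack: int = 3) -> np.ndarray:
    """Concatenate every `stack` consecutive frames, reducing time resolution by `stack`."""
    num_frames, dim = feats.shape          # e.g. (T, 80) filter banks
    num_frames -= num_frames % stack       # drop trailing frames that don't fill a full stack
    return feats[:num_frames].reshape(num_frames // stack, dim * stack)
```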
Following the standard recipe, we use the 462-speaker training set with all SA records removed. Outputs are mapped to 39 phonemes for evaluation.
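The 61-to-39 phoneme folding is typically done with a small lookup table. Below is a hedged sketch in the style of the standard Lee & Hon mapping; it is illustrative only, and the exact table used by this repo may differ:

```python
# Illustrative subset of the standard 61 -> 39 phoneme folding (Lee & Hon style).
# The actual table used by this repo may differ.
FOLD_61_TO_39 = {
    "ao": "aa", "ax": "ah", "ax-h": "ah", "axr": "er",
    "hv": "hh", "ix": "ih", "el": "l", "em": "m",
    "en": "n", "nx": "n", "eng": "ng", "zh": "sh", "ux": "uw",
    # Closures and silences are collapsed into a single silence symbol.
    "bcl": "sil", "dcl": "sil", "gcl": "sil",
    "pcl": "sil", "tcl": "sil", "kcl": "sil",
    "pau": "sil", "epi": "sil", "h#": "sil",
}

def map_to_39(phones):
    """Map 61 TIMIT phones to the 39-phone evaluation set; 'q' is usually dropped."""
    return [FOLD_61_TO_39.get(p, p) for p in phones if p != "q"]
```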
With this code you can achieve ~22% PER (phone error rate) on the core test set.
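PER is the Levenshtein (edit) distance between the predicted and reference phone sequences, divided by the reference length. A minimal sketch of the standard definition (not the repo's eval.py):

```python
def phone_error_rate(ref, hyp):
    """Edit distance between phone sequences divided by reference length."""
    # Dynamic-programming table: d[i][j] = edits to turn ref[:i] into hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# Example: 1 substitution over 4 reference phones -> 25% PER.
print(phone_error_rate(["hh", "eh", "l", "ow"], ["hh", "ih", "l", "ow"]))  # 0.25
```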
$ pip install -r requirements.txt

This will create lists (*.csv) of audio file paths along with their transcripts:
$ python prepare_data.py --root ${DIRECTORY_OF_TIMIT}

Check available options:
$ python train.py -h

Use the default configuration for training:
$ python train.py exp/default.yaml

You can also write your own configuration file based on exp/default.yaml.
$ python train.py ${PATH_TO_YOUR_CONFIG}

With the default configuration, the training logs are stored in exp/default/history.csv.
To plot the training history, specify the path to your log file accordingly:
$ python show_history.py exp/default/history.csv
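If you prefer to inspect the log directly, here is a minimal sketch with pandas and matplotlib (assuming both are installed; the column names in history.csv depend on the training code):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Plot every column of the training log against its row index (typically the epoch).
history = pd.read_csv("exp/default/history.csv")
history.plot(subplots=True, figsize=(8, 2 * len(history.columns)))
plt.tight_layout()
plt.show()
```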
During training, the program keeps monitoring the error rate on the development set. The checkpoint with the lowest error rate is saved in the logging directory (by default exp/default/best.pth).
To evaluate the checkpoint on the test set, run:
$ python eval.py exp/default/best.pth

Alternatively, you can run inference on a random utterance from the test set and visualize the attention weights:
$ python inference.py exp/default/best.pth
Predict:
h# hh ih l pcl p gcl g r ey tcl d ix pcl p ih kcl k ix pcl p eh kcl k ix v dcl d ix tcl t ey dx ah v z h#
Ground-truth:
h# hh eh l pcl p gcl g r ey gcl t ix pcl p ih kcl k ix pcl p eh kcl k ix v pcl p ix tcl t ey dx ow z h#

[1] W. Chan et al., "Listen, Attend and Spell", https://arxiv.org/pdf/1508.01211.pdf
[2] J. Chorowski et al., "Attention-Based Models for Speech Recognition", https://arxiv.org/pdf/1506.07503.pdf
[3] M. Luong et al., "Effective Approaches to Attention-based Neural Machine Translation", https://arxiv.org/pdf/1508.04025.pdf

