Projects:2018s1-103 Improving Usability and User Interaction with KALDI Open-Source Speech Recogniser
Project Team
Students
- Shi Yik Chin
- Yasasa Saman Tennakoon
Supervisors
- Dr. Said Al-Sarawi
- Dr. Ahmad Hashemi-Sakhtsari (DST Group)
Abstract
This project aims to refine and improve the capabilities of KALDI (an Open Source Speech Recogniser). This will require:
- Improving the current GUI's flexibility
- Introducing new elements or replacing older elements in the GUI for ease of use
- Including a methodology that users (of any skill level) can use to improve or introduce Language or Acoustic models into the software
- Refining current Language and Acoustic models in the software to reduce the Word Error Rate (WER)
- Introducing a neural network in the software to reduce the Word Error Rate (WER)
- Introducing a feedback loop into the software to reduce the Word Error Rate (WER)
- Introducing Binarized Neural Networks into the training methods to reduce training times and increase efficiency
This project will involve the use of Deep Learning algorithms related to Automatic Speech Recognition, software development in C++ and performance evaluation using the Word Error Rate formula. Very little hardware will be involved throughout the project.
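As an illustration of the Word Error Rate measure mentioned above, the following is a minimal Python sketch under the usual definition: the number of word substitutions, deletions and insertions needed to turn the reference transcript into the recognised one, divided by the number of words in the reference. The function name `word_error_rate` and the whitespace tokenisation are assumptions made for this example; this is not the scoring code used in the project.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()  # assumption: words are separated by whitespace
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

In the usage line, the hypothesis contains one substitution and one deletion against a six-word reference, so the measure is 2/6. Because insertions are counted, the value can exceed 1 when the hypothesis is much longer than the reference.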
Introduction
KALDI is an open source speech transcription toolkit intended for use by speech recognition researchers. The software allows newly developed speech transcription algorithms to be integrated. Its usability is limited, however, because it requires complex scripting languages and operating-system-specific commands. In this project, a Graphical User Interface (GUI) has been developed that makes the software easier to interact with and allows non-technical individuals to use it. The GUI supports the selection of different language and acoustic models, and transcription either from a file or from live input (live decoding). In addition, two newly trained models have been added: one uses Gaussian Mixture Models, while the other uses a Neural Network. The first was selected to provide a benchmark for performance evaluation, and the second to demonstrate improved transcription accuracy.
Background
Automatic Speech Recognition (ASR)
Automatic Speech Recognition (ASR) is the process of converting audio into text [1]. In general, ASR proceeds through the following stages:
- Feature Representation
- Mapping the features to phonemes through the Acoustic Model (AM)
- Mapping the phonemes to words through the Dictionary Model (Lexicon)
- Constructing sentences from the words through the Language Model (LM)
The final three stages are collectively known as 'Decoding', as shown in the diagram provided.
Feature Representation
Feature Representation is the process of extracting the important frequency content of sound files, mainly using spectrograms and other frequency analysis tools. A good Feature Representation captures only the salient spectral characteristics (features that are not speaker-specific).
Such features can be obtained in several ways. The method used by the KALDI program is the Mel-frequency Cepstral Coefficients (MFCC) method.
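The "Mel" in MFCC refers to the mel scale, a perceptual frequency scale that is roughly linear at low frequencies and compressed at high ones. The sketch below covers only this one step of an MFCC front end: it converts between hertz and mels using one common form of the mel formula, and places filter centre frequencies evenly on the mel scale. The function names, the number of filters and the frequency range are assumptions for illustration and are not taken from the KALDI configuration used in the project.

```python
import math
from typing import List


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in hertz to mels (a common form of the formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_centres(num_filters: int, low_hz: float, high_hz: float) -> List[float]:
    """Return filter centre frequencies (Hz) spaced evenly on the mel scale."""
    low_mel = hz_to_mel(low_hz)
    high_mel = hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (num_filters + 1)
    return [mel_to_hz(low_mel + step * (i + 1)) for i in range(num_filters)]


if __name__ == "__main__":
    # Hypothetical settings: 10 filters between 0 Hz and 8 kHz.
    for centre in mel_filter_centres(10, 0.0, 8000.0):
        print(round(centre, 1))
```

The printed centres are close together at low frequencies and further apart at high frequencies, which is how a mel filterbank gives more resolution to the range where speech carries most of its distinguishing detail. A full MFCC computation would also include framing, windowing, a Fourier transform, a logarithm and a discrete cosine transform, none of which are shown here.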
The Acoustic Model
The Acoustic Model (AM) essentially converts the values of the parameterised waveform into phonemes. A phoneme is, by definition, a unit of sound in speech [4]. Phonemes do not have any inherent meaning by themselves, but words are constructed when they are combined in different patterns. English is estimated to consist of roughly 40 phonemes.
The Dictionary Model
The Dictionary Model, also known as a Lexicon, maps the written representations of words or phrases to their pronunciations. The pronunciations are described using the phonemes of the specific language the lexicon is built for [5].
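A lexicon of this kind can be pictured as a lookup table from each word to one or more phoneme sequences. The sketch below parses a small text in which each line holds a word followed by its phonemes, separated by spaces. The four entries, their ARPAbet-style phoneme symbols and the helper name `parse_lexicon` are illustrative assumptions and are not taken from the lexicon used in the project.

```python
from collections import defaultdict

# Illustrative entries only: one pronunciation per line, word first.
LEXICON_TEXT = """\
hello HH AH L OW
world W ER L D
read R IY D
read R EH D
"""


def parse_lexicon(text):
    """Map each word to a list of pronunciations (each a list of phonemes)."""
    lexicon = defaultdict(list)
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank lines and words without a pronunciation
        word, phonemes = parts[0], parts[1:]
        lexicon[word].append(phonemes)
    return dict(lexicon)


if __name__ == "__main__":
    lexicon = parse_lexicon(LEXICON_TEXT)
    for word, pronunciations in lexicon.items():
        print(word, pronunciations)
```

Storing a list of pronunciations per word lets the same spelling carry several pronunciations, as with "read" in the example, which the decoder can then choose between using the acoustic evidence.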
The Language Model
The task of a Language Model (LM) is to predict the next character or word that may occur in a sentence, given the previous words that have been spoken [6]. A good LM can change the result considerably, for example by favouring a likely word sequence over an acoustically similar but improbable one.
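One simple way to make such predictions is an n-gram model, which estimates the probability of a word from how often it followed the preceding words in a training text. The sketch below counts 3-grams in a tiny hypothetical corpus and returns an unsmoothed relative-frequency estimate. The function names, the sentence boundary markers `<s>` and `</s>`, and the corpus are assumptions for this example; practical language models add smoothing or back-off so that unseen word sequences do not receive zero probability.

```python
from collections import Counter, defaultdict


def train_trigram_counts(sentences):
    """Count how often each word follows each pair of preceding words."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Pad so the first real word also has a two-word context.
        words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        for first, second, third in zip(words, words[1:], words[2:]):
            counts[(first, second)][third] += 1
    return counts


def next_word_probability(counts, first, second, candidate):
    """Relative frequency of `candidate` after the context (first, second)."""
    following = counts.get((first, second))
    if not following:
        return 0.0  # context never seen in training (no smoothing here)
    return following[candidate] / sum(following.values())


if __name__ == "__main__":
    corpus = [
        "recognise speech with kaldi",
        "recognise speech with ease",
        "recognise speech today",
    ]
    counts = train_trigram_counts(corpus)
    print(next_word_probability(counts, "recognise", "speech", "with"))
```

In this corpus the context "recognise speech" occurs three times and is followed by "with" twice, so the estimate printed is 2/3.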
The elements described above function together to produce an automatic speech recogniser/transcriber. This methodology is also adopted by the KALDI ASR toolkit used in this project.
Research and Development
Results
Improvement of the GUI
- Editable transcript
- Interactive display window
- Ability to choose AM/LM
- Audio recording & playback
- Timestamp usage on utterances
- Recording status indicator
- Microphone mute/unmute control
- Detailed documentation of decoding session
Improvement of Functionality of the ASR
- New Acoustic Models trained
- Librispeech AM
- Librispeech corpus – 960 hours of audio
- Several models – from Mono to Tri6b
- Pre-trained nnet2
- Fisher corpus – over 2000 hours of audio
- Librispeech AM
- New Language Model
- 3-gram LM
- Trained on 34 MB of plain text
- Ability to transcribe from recorded or live audio