Projects:2018s1-103 Improving Usability and User Interaction with KALDI Open-Source Speech Recogniser


Project Team

Students

  • Shi Yik Chin
  • Yasasa Saman Tennakoon

Supervisors

  • Dr. Said Al-Sarawi
  • Dr. Ahmad Hashemi-Sakhtsari (DST Group)

Abstract

This project aims to refine and improve the capabilities of KALDI (an Open Source Speech Recogniser). This will require:

  • Improving the current GUI's flexibility
  • Introducing new elements or replacing older elements in the GUI for ease of use
  • Including a methodology that users (of any skill level) can use to improve or introduce Language or Acoustic models into the software
  • Refining current Language and Acoustic models in the software to reduce the Word Error Rate (WER)
  • Introducing a neural network in the software to reduce the Word Error Rate (WER)
  • Introducing a feedback loop into the software to reduce the Word Error Rate (WER)
  • Introducing Binarized Neural Networks into the training methods to reduce training times and increase efficiency

This project will involve the use of Deep Learning algorithms (related to Automatic Speech Recognition), software development (C++), and performance evaluation using the Word Error Rate (WER) metric. Very little hardware is involved throughout the project.
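The WER metric counts the substitutions, deletions and insertions needed to turn the reference transcript into the hypothesis, divided by the number of reference words. As a minimal sketch of that computation via word-level edit distance (assuming whitespace-tokenised transcripts; this is not the project's actual evaluation code):

```python
# Minimal sketch: Word Error Rate (WER) via Levenshtein distance over words.
# Assumes whitespace-tokenised reference and hypothesis transcripts.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion -> ~0.167
```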

Introduction

KALDI is an open-source speech transcription toolkit intended for use by speech recognition researchers. The software allows the integration of newly developed speech transcription algorithms. Its usability is limited by the need to use complex scripting languages and operating-system-specific commands. In this project, a Graphical User Interface (GUI) has been developed that allows non-technical individuals to use the software and makes it easier to interact with. The GUI allows selection of different language and acoustic models, and transcription either from a file or from live input (live decoding). In addition, two newly trained models have been added: one uses Gaussian Mixture Models, while the second uses a Neural Network model. The first was selected to allow for benchmark performance evaluation, while the second demonstrates improved transcription accuracy.

Background

The KALDI Toolkit

Initially developed in 2009 by Dan Povey and several others [20], KALDI aims to be a low-development-cost, high-quality speech recognition toolkit. Having undergone several improvements over the years, it now includes recipes for creating custom Acoustic and Language Models based on several large corpora such as the Wall Street Journal, Fisher and TIMIT. Its latest addition is Neural Network Acoustic/Language Models, which are still in development [8].

KALDI's directory structure and overall code layout are quite complex. The toolkit also requires the user to have a sound knowledge of Ubuntu terminal commands as well as of C++ and Python. It is this complexity facing first-time users that led to the inception of the previous iterations of the ASR program.

Automatic Speech Recognition (ASR)

ASR Process[3]

Automatic Speech Recognition (ASR) is the process of converting audio into text [1]. In general, ASR proceeds through the following steps:

  • Feature Representation
  • Mapping the features to phonemes through the Acoustic Model (AM)
  • Mapping the phonemes to words through the Dictionary Model (Lexicon)
  • Constructing sentences from the words through the Language Model (LM)

The final three processes are collectively known as 'Decoding', as highlighted in the diagram provided.
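Formally, decoding searches for the word sequence that best explains the observed acoustic features. In the standard formulation (not specific to this project), the three models combine as:

```latex
% Standard ASR decoding objective: find the most likely word sequence W
% given acoustic features X, factored by Bayes' rule (P(X) is constant).
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \underbrace{P(X \mid W)}_{\text{AM + Lexicon}} \;
                       \underbrace{P(W)}_{\text{LM}}
```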


Feature Representation

Spectrogram


Feature Representation is the process of extracting the important characteristics of a sound file's frequency content, mainly using spectrograms and other frequency-analysis tools. A good feature representation captures only the salient spectral characteristics (features that are not speaker-specific).

Such features can be obtained in several ways; the method used by the KALDI program is Mel-Frequency Cepstral Coefficients (MFCC).

Method of obtaining MFCC[3]
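KALDI extracts MFCCs with its own compute-mfcc-feats tool. Purely as an illustration of the idea, the sketch below uses the third-party librosa library (an assumption, not part of the project's pipeline) with a typical 25 ms window and 10 ms hop at 16 kHz:

```python
# Illustrative only: extracting 13 MFCCs with the third-party librosa library.
# KALDI uses its own compute-mfcc-feats tool; librosa is an assumption here,
# and "utterance.wav" is a hypothetical input file.
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)  # 25 ms window, 10 ms hop
print(mfcc.shape)  # (13, number_of_frames)
```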

The Acoustic Model

The Acoustic Model (AM) converts the parameterised waveform values into phonemes. Phonemes, by definition, are units of sound in speech [4]. They have no inherent meaning by themselves, but words are constructed when they are combined in different patterns. English is estimated to consist of roughly 40 phonemes.
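As a toy illustration of the idea, the sketch below scores a single feature frame against one Gaussian per phoneme. A real KALDI AM uses GMM-HMMs over context-dependent states; the single Gaussian per phoneme and all parameters here are invented for illustration:

```python
# Toy sketch: scoring one feature frame against per-phoneme Gaussians.
# Real KALDI AMs use GMM-HMMs over context-dependent states; the single
# Gaussian per phoneme and all parameters here are made up for illustration.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
DIM = 13  # dimensionality of one MFCC frame

# One (made-up) Gaussian per phoneme, standing in for a trained model
phoneme_models = {
    p: multivariate_normal(mean=rng.normal(size=DIM), cov=np.eye(DIM))
    for p in ("AH", "K", "S")
}

frame = rng.normal(size=DIM)  # a stand-in for one extracted MFCC frame
scores = {p: m.logpdf(frame) for p, m in phoneme_models.items()}
print(max(scores, key=scores.get))  # most likely phoneme for this frame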

The Dictionary Model

The Dictionary Model, also known as a Lexicon, maps the written representations of words or phrases to their pronunciations. The pronunciations are described using phonemes relevant to the specific language the lexicon is built upon [5].
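KALDI lexicons are plain-text files (lexicon.txt) in which each line maps a word to its phoneme sequence. A minimal sketch of parsing and querying such a file follows; the sample entries are illustrative:

```python
# Minimal sketch: parsing a KALDI-style lexicon.txt, where each line maps
# a word to its phoneme sequence. The sample entries are illustrative.
sample = """cat K AE T
speech S P IY CH
kaldi K AA L D IY"""

lexicon = {}
for line in sample.splitlines():
    word, *phones = line.split()
    # A word may have several pronunciation variants, so store a list
    lexicon.setdefault(word, []).append(phones)

print(lexicon["speech"])  # [['S', 'P', 'IY', 'CH']]
```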

The Language Model

The task of a Language Model (LM) is to predict the next character or word that may occur in a sentence, given the words that precede it [6]. A good LM can produce strikingly different transcription results, as the figure below illustrates.
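An n-gram LM estimates the probability of a word from the n-1 words before it. A minimal sketch follows, shown with bigrams for brevity (the project's LM is a 3-gram); it uses a toy corpus with add-one smoothing and no backoff, unlike the tooling used to build real KALDI LMs:

```python
# Minimal sketch: a bigram language model with add-one (Laplace) smoothing.
# Toy corpus, no backoff; real KALDI recipes build n-gram LMs with external
# tools and compile them into decoding graphs.
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the rug".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = len(unigrams)

def p_bigram(w1: str, w2: str) -> float:
    # P(w2 | w1) with add-one smoothing over the vocabulary
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab)

print(p_bigram("the", "cat"))  # seen pair: relatively high probability
print(p_bigram("the", "sat"))  # unseen pair: small smoothed probability
```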

The above-mentioned elements function together to produce an automatic speech recogniser/transcriber. This methodology is also adopted by the KALDI ASR toolkit used here.

Transcription before and after LM addition[7]

References

[1] R. E. Gruhn, W. Minker, S. Nakamura, Statistical Pronunciation Modeling for Non-Native Speech Processing, ch. 2, p. 5, Harman/Becker Automotive Systems GmbH, Ulm, Germany: 2011 [E-book]. Available: Springer, https://goo.gl/peFMrZ [Accessed 22 May 2018].

[3] J. Chiu. GHC 4012. Class Lecture, Topic: “Design and Implementation of Speech Recognition Systems – Feature Implementation”, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 27 Jan. 2014.

[4] M. Breum, “Playing with Sounds in Words – Part 3 {Phoneme Segmentation}.” thisreadingmama.com, para. 3, 26 Sept. 2016. [Online]. Available: thisreadingmama.com/playing-with-phonemes-part-3/ [Accessed 22 May 2018].

[5] “About Lexicons and Phonetic Alphabets (Microsoft.Speech)”, Microsoft Research Blog, 2018. [Online]. Available: https://goo.gl/SDqZtU [Accessed 22 May 2018].

[6] R. J. Mooney. CS 388. Class Lecture, Topic: “Natural Language Processing: N-Gram Language Models”, School of Computer Science, University of Texas at Austin, Austin, TX, 2018.

[7] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, A. Y. Ng, “Deep Speech: Scaling up end-to-end speech recognition”, p. 3, 19 Dec. 2014 [Abstract]. Available: https://arxiv.org/abs/1412.5567?context=cs [Accessed 22 May 2018].

[8] KALDI, “Deep Neural Networks in Kaldi – Introduction”, KALDI, n.d. [Online]. Available: http://kaldi-asr.org/doc/dnn.html [Accessed 22 May 2018].

[20] KALDI, “History of the KALDI Project”, KALDI, n.d. [Online]. Available: http://kaldi-asr.org/doc/history.html [Accessed 22 May 2018].

Research and Development

Results

Improvements to the GUI

Graphical User Interface
  1. Editable transcript
  2. Interactive display window
  3. Ability to choose AM/LM
  4. Audio recording & playback
  5. Timestamp usage on utterances
  6. Recording status indicator
  7. Microphone mute/unmute control
  8. Detailed documentation of decoding session

Improvement of Functionality of the ASR

New Acoustic and Language Models are added to the existing program:

  1. Two new Acoustic Models
    • The first Acoustic Model is a GMM model, trained using the corpus provided by LibriSpeech. The LibriSpeech corpus is a large (1000-hour) corpus of English read speech derived from audiobooks, sampled at 16 kHz. The accents are varied and not marked, but the majority are US English. Several models are trained from this corpus, including both monophone and triphone models (tri2b, tri3b, ..., to tri6b).
    • The other Acoustic Model is an NNET2 model. This model is pre-trained using more than 2000 hours of the Fisher corpus and is available for download from the KALDI website.
  2. A new Language Model
    • 3-gram LM
    • Trained by plain text of size 34 MB
  3. Ability to transcribe from recorded or live audio (a hypothetical sketch of this flow is given below)
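As a rough illustration of how a front-end can drive such a transcription, the sketch below records a few seconds of audio and hands the resulting WAV file to a decoding script. The sounddevice library and the decode.sh wrapper (its name and interface) are assumptions for illustration, not the project's actual code:

```python
# Hypothetical sketch of a GUI-style record-then-decode flow.
# The sounddevice library and the decode.sh wrapper are assumptions
# made for illustration; they are not the project's actual code.
import subprocess
import sounddevice as sd
from scipy.io import wavfile

SR = 16000  # sample rate matching the 16 kHz acoustic models

audio = sd.rec(int(5 * SR), samplerate=SR, channels=1, dtype="int16")
sd.wait()                                  # block until recording finishes
wavfile.write("utterance.wav", SR, audio)

# decode.sh (hypothetical) wraps the chosen AM/LM and prints a transcript
result = subprocess.run(["./decode.sh", "utterance.wav"],
                        capture_output=True, text=True)
print(result.stdout)
```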

Conclusion