Projects:2017s1-103 Improving Usability and User Interaction with KALDI Open-Source Speech Recogniser
Project Team
Students
- George Mao (usability design, implementation and evaluation)
- Vinil Chukkapally (live decoding investigation and integration)
Supervisors
- Dr. Said Al-Sarawi
- Dr. Ahmad Hashemi-Sakhtsari (DST Group)
Project Context
This project follows from a 2015 project which examined the performance of the KALDI speech recognition toolkit, with the support of DST Group.
What is KALDI?
KALDI is a free and open-source software toolkit for automatic speech recognition. It is designed for speech recognition researchers [1], and so requires speech recognition knowledge and familiarity with scripting to operate. As such, it is difficult for those without such knowledge or familiarity to use.
Previous Work
While the 2015 project did produce a user interface, it was primarily a proof of concept. It was not created with usability in mind and so is not particularly user-friendly. The "on-line" (live) decoding function featured by the interface is also limited in functionality: it runs for a fixed duration of approximately 60 seconds and gives no choice of models.
Aim of the Project
The aim of this project is to design and develop a graphical user interface (GUI) that lets users access the functionality of KALDI without knowledge of scripting in a language such as Bash, and without detailed knowledge of the internal algorithms of KALDI.
Furthermore, the project attempts to transcribe live speech continuously, with this function operable from the interface.
Background
Acoustic and Language Models
Speech can be regarded as language that is "coded" into sounds, and thus the transcription of speech can be considered as "decoding" the audio. In speech recognition, such "code" is characterised by acoustic and language models. Acoustic models describe the words or sounds of a language by their acoustic features (properties). Language models define the structure of a language, using statistics on the probability that a sequence of words forms a valid sentence. By using appropriate acoustic and language models, a speech recogniser can, in theory, decode speech for any combination of language and speaker.
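As an illustration of the statistical idea behind a language model, the following is a minimal sketch of a bigram model, in which the probability of a sentence is the product of the probabilities of each word given the word before it. This is not how KALDI builds or stores its language models; the function names, the `<s>` and `</s>` boundary markers and the two-sentence corpus are all assumptions made for this example, and no smoothing is applied.

```python
from collections import Counter


def train_bigram(sentences):
    """Count context words and word pairs over whitespace-tokenised sentences."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        # Boundary markers (an assumption of this sketch) mark sentence start/end.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return contexts, bigrams


def sentence_probability(sentence, contexts, bigrams):
    """Multiply P(word | previous word) over the sentence (no smoothing)."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    probability = 1.0
    for previous, word in zip(tokens[:-1], tokens[1:]):
        if contexts[previous] == 0:
            return 0.0  # the previous word was never seen in training
        probability *= bigrams[(previous, word)] / contexts[previous]
    return probability


# Hypothetical toy corpus, for illustration only.
contexts, bigrams = train_bigram(["the cat sat", "the dog sat"])
likely = sentence_probability("the cat sat", contexts, bigrams)
unlikely = sentence_probability("sat the cat", contexts, bigrams)
```

With this toy corpus, a word order seen in training scores higher than a scrambled one, which is the property a recogniser relies on when choosing between competing transcriptions.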
Usability
Usability is commonly described as how easy a system is to use by its target demographic in its intended operating environment [3]. In general, usability engineering models describe measures of usability in terms of [3][4][5]:
- Learnability (ease of learning).
- Efficiency of use after learning.
- Ability for infrequent use without needing to relearn.
- The frequency and severity of user errors.
- Subjective user satisfaction.
It then follows that a system with high usability is one that is easy to learn, efficient to use once learned, intuitive, resistant to user errors and subjectively satisfying to the user.
Principles for developing usable systems are well-established, and the project design is informed by the following usability principles [6]:
- The interface should give the user visual feedback on the system state in reasonable time.
- The interface should provide a parallel between the system and real world by presenting information in a natural and logical order.
- Language within the interface should be consistent, and platform conventions followed.
- The interface should minimise the cognitive burden on the user by encouraging recognition rather than recollection.
- The interface should not present irrelevant information.
- The interface should assist the user in error recognition and recovery by using familiar and constructive language, and by indicating problems precisely.
Live Decoding
The approach towards live decoding is to use an audio input buffer to capture incoming data from the active audio recording device. Voice activity detection checks for periods of silence, and a sufficiently long silence is treated as the end of a sentence. Audio is then sent to KALDI for decoding one sentence at a time.
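The sentence segmentation described above can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes a simple energy threshold as the voice activity detector, and the function names, the threshold and the `min_silence_frames` parameter are all hypothetical. Audio is represented as a list of frames, each a list of samples.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(sample * sample for sample in frame) / len(frame)


def split_on_silence(frames, energy_threshold=0.01, min_silence_frames=3):
    """Group frames into sentences, closing a sentence after a long silence.

    A run of at least `min_silence_frames` quiet frames ends the current
    sentence; shorter pauses are kept inside it.
    """
    sentences = []
    current = []          # frames of the sentence being built
    pending_silence = []  # quiet frames seen since the last speech frame
    for frame in frames:
        if frame_energy(frame) >= energy_threshold:
            # Speech resumed: a short pause belongs to the sentence.
            current.extend(pending_silence)
            pending_silence = []
            current.append(frame)
        elif current:
            pending_silence.append(frame)
            if len(pending_silence) >= min_silence_frames:
                sentences.append(current)  # long silence: sentence ended
                current = []
                pending_silence = []
    if current:
        sentences.append(current)  # flush speech left when input stops
    return sentences


# Hypothetical input: 'loud' frames stand in for speech, 'quiet' for silence.
loud, quiet = [0.5] * 4, [0.0] * 4
frames = [quiet, loud, loud, quiet, loud, quiet, quiet, quiet, loud, quiet]
sentences = split_on_silence(frames)
```

In a live setting, each completed sentence would be handed to the decoder as soon as it is closed, rather than collected in a list as here.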
System Approach
The approach used towards the development of the user interface is to have the interface act as a configuration "front-end" for the user. The interface constructs an appropriate command to execute KALDI scripts based on user input.
This approach offers the advantage of cleanly separated layers of abstraction, where the interface would be concerned with the interface-scripting level and live decoding efforts with the script-KALDI level. It also allows operational functionality to be changed independently of the interface and reduces the required learning efforts in design and implementation. Furthermore, as the interface would be behaving the same way as a technical user would in operation, there is a stronger parallel between the system and the use case.
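The "front-end" idea can be sketched as a function that turns the user's selections into a command line. The script name, the option names and the paths below are hypothetical placeholders chosen for this example; they are not actual KALDI scripts or flags, and the project's real command format is not described here.

```python
import shlex
from pathlib import Path


def build_decode_command(kaldi_root, script_name, acoustic_model,
                         language_model, audio_dir):
    """Assemble an argument list from values chosen in the interface.

    The option names are placeholders for this sketch, not KALDI options.
    """
    script = Path(kaldi_root) / script_name
    return [
        "bash", str(script),
        "--acoustic-model", str(acoustic_model),
        "--language-model", str(language_model),
        str(audio_dir),
    ]


# Hypothetical values standing in for the user's selections.
command = build_decode_command(
    kaldi_root="/opt/kaldi",
    script_name="run_decode.sh",
    acoustic_model="models/am",
    language_model="models/lm",
    audio_dir="recordings",
)
printable = shlex.join(command)  # safely quoted form for display or logging
```

Keeping the command as a list of arguments means it could be passed directly to `subprocess.run(command, check=True)` without shell quoting issues, while the scripts themselves can change without any change to the interface code.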
Usability Tests
Methodology
Usability of the interface was evaluated through the use of feature inspection, cognitive walkthrough and heuristic analysis.
- Feature inspection lists the sequence of features of a system used to perform typical tasks, and checks for long, cumbersome or unnatural steps [7].
- In a cognitive walkthrough, the problem-solving process of a user is simulated, and each step is checked to see whether it can be expected to lead to the next correct action [2][7].
- Heuristic analysis involves evaluating the usability of a system against a set of guidelines [2][7][8].
Feature Inspection: Features
The features used to accomplish typical tasks were identified as:
- Line edit elements paired with browse buttons to add KALDI and decoding directories
- Radio buttons to select decoding method
- Combo boxes paired with buttons to add language/acoustic models
- The decoding button and microphone on/off controls
Cognitive Walkthrough: Simulated Use Case
The steps presented during the walkthrough, as a simulation of the use case for decoding from a directory, were:
- The user specifies a directory for the KALDI installation by using the browse function.
- The user selects a decoding method (decoding from files).
- The user switches between decoding methods.
- The user adds a language model using the browse function.
- The user selects an acoustic model that has been added previously.
- The user engages the decoding process by pressing the decode button.
- The user edits the transcript presented by the interface after decoding has completed.
Heuristic Analysis: Guidelines
The guidelines used for heuristic analysis were defined based on the outlined usability principles:
- Provide relevant visual feedback, where possible.
- Interface controls should respond to user interaction in an expected manner.
- The user should be aware of the actions of the system.
- Present information to the user where applicable to enable learning of interface usage.
- The functionality of interface elements should be unambiguous.
- The user should be prevented from performing actions that are not relevant to the system.
References
[1] Kaldi, "About the Kaldi project." [Online]. Available: http://kaldi-asr.org/doc/about.html [Accessed: 13 March 2017].
[2] L. R. Rabiner, B. H. Juang, B. Keith, "Speech recognition: Statistical methods," Encyclopedia of Language & Linguistics, Elsevier, pp. 1-18, 2006.
[3] A. Holzinger, "Usability engineering methods for software developers," in Communications of the ACM 48, no. 1 (2005): pp. 71-74.
[4] J. Nielsen, "The usability engineering life cycle," IEEE Computer 25(3), pp. 12-22, 1992.
[5] J. Nielsen, "Iterative user-interface design," IEEE Computer 26(11), pp. 32-41, 1993.
[6] J. Nielsen, "10 usability heuristics for user interface design," Nielsen Norman Group 1, no. 1 (1995).
[7] J. Nielsen, "Usability inspection methods," in Conference companion on Human factors in computing systems: pp. 413-414. ACM, 1994.
[8] J. Nielsen, "Finding usability problems through heuristic evaluation," in Proceedings of the SIGCHI conference on Human factors in computing systems: pp. 373-380. ACM, 1992.