Projects:2017s1-103 Improving Usability and User Interaction with KALDI Open- Source Speech Recogniser

From Projects
Revision as of 23:27, 29 October 2017 by A1669391 (talk | contribs)
Jump to: navigation, search

Project Team

Students

  • George Mao (usability design, implementation and evaluation)
  • Vinil Chukkapally (live decoding investigation and integration)

Supervisors

  • Dr. Said Al-Sarawi
  • Dr. Ahmad Hashemi-Sakhtsari (DST Group)

Project Context

The user interface produced by the 2015 project. It is not designed for usability and has limited live functionality.

This project follows from a 2015 project which examined the performance of the KALDI speech recognition toolkit, with the support of DST Group.

What is KALDI?

KALDI is a free and open-source software toolkit for automatic speech recognition. It is designed for speech recognition researchers [1], and so requires speech recognition knowledge and familiarity with scripting to operate. As such, it is difficult for those without such knowledge or familiarity to use.

Previous Work

While the 2015 project did produce a user interface, it was primarily proof-of-concept. As such, it was not created with usability in mind and so is not particularly user friendly. The "on-line" (live) decoding function featured by the interface is also limited in functionality, running for a fixed duration of approximately 60 seconds and giving no choice of models.

Aim of the Project

The aim of this project is to enable users to access functionalities of KALDI without the knowledge of scripting, a language like Bash, or detailed knowledge of the internal algorithms of KALDI by the design and development of a graphical user interface (GUI).

Furthermore, attempts will be made to transcribe live audio speech continuously, with functionality operable from the interface.

Background

Acoustic and Language Models

It can be said that spoken speech is language that is "coded" into sounds, and thus the transcription of verbal speech can be considered as "decoding" the audio. In speech recognition, such "code" is characterised by acoustic and language models. Acoustic models describe the words or sounds of a language by their acoustic features (properties). Language models define the structure of a language, using statistics of the probability of a sequence of words forming a valid sentence. By using appropriate acoustic and language models, a speech recogniser can, in theory, decode for any combination of language and speaker.

Usability

Usability is commonly described as how easy a system is to use by its target demographic in its intended operating environment [3]. In general, usability engineering models describe measures of usability in terms of [3][4][5]:

  • Learnability (ease of learning).
  • Efficiency of use after learning.
  • Ability for infrequent use without needing to relearn.
  • The frequency and severity of user errors.
  • Subjective user satisfaction.

It then follows that a system with high usability is one that is easy to learn, efficiently used after learning, intuitive, prevents user errors and subjectively satisfying to the user.

Principles for developing usable systems are well-established, and the project design is informed by the following usability principles [6]:

  • The interface should give the user visual feedback on the system state in reasonable time.
  • The interface should provide a parallel between the system and real world by presenting information in a natural and logical order.
  • Language within the interface should be consistent, and platform conventions followed.
  • The interface should minimise the cognitive burden on the user by encouraging recognition rather than recollection.
  • The interface should not present irrelevant information.
  • The interface should assist the user in error recognition and recovery by using familiar and constructive language, and by indicating problems precisely.

Live Decoding

The approach towards live decoding is to use an audio input buffer to capture incoming data from the active audio recording device. Using voice activity detection, periods of silence are checked for and a long enough silence is treated as the end of a sentence. Audio is sent per-sentence to KALDI for decoding. Live decoding diagram.png

System Approach

The KALDI user interface system.

The approach used towards the development of the user interface is to have the interface act as a configuration "front-end" for the user. The interface constructs an appropriate command to execute KALDI scripts based on user input.

This approach offers the advantage of cleanly separated layers of abstraction, where the interface would be concerned with the interface-scripting level and live decoding efforts with the script-KALDI level. It also allows operational functionality to be changed independently of the interface and reduces the required learning efforts in design and implementation. Furthermore, as the interface would be behaving the same way as a technical user would in operation, there is a stronger parallel between the system and the use case.

Usability Tests

State of the Interface

The interface during usability appeared as shown below. It was capable of decoding audio files from a directory containing pre-recorded audio. Live decoding functionality was not implemented into software at that stage and visible elements were placeholders. Interface prototype.png

Methodology

Usability of the interface was evaluated through the use of feature inspection, cognitive walkthrough and heuristic analysis.

  • Feature inspection is performed by listing the sequence of features of a system used to perform typical tasks. Long, cumbersome or unnatural steps are checked for. [7]
  • In a cognitive walkthrough, the problem solving process of a user is simulated and checked to see if it can be assumed to lead to the next correct action at each step. [2][7]
  • Heuristic analysis involves evaluating the usability of a system using a set of guidelines. [2][7][8]

Feature Inspection: Features

The features used to accomplish typical tasks were identified as:

  • Line edit elements paired with browse buttons to add KALDI and decoding directories
  • Radio buttons to select decoding method
  • Combo boxes paired with buttons to add language/acoustic models
  • The decoding button and microphone on/off controls

Cognitive Walkthrough: Simulated Use Case

The cognitive walkthrough was performed as a run-through of a typical use case with the client. The steps presented during the walkthrough, as a simulation of the use case for decoding from a directory, were:

  1. The user specifies a directory for the KALDI installation by using the browse function.
  2. The user selects a decoding method (decoding from files).
    1. The user switches between decoding methods.
  3. The user adds a language model using the browse function.
  4. The user selects an acoustic model that has been added previously.
  5. The user engages the decoding process by pressing the decode button.
  6. The user edits the transcript presented by the interface after decoding has completed.

Heuristic Analysis: Guidelines

The guidelines used for heuristic analysis were defined based on the outlined usability principles:

  1. Provide relevant visual feedback, where possible.
    1. Interface controls should respond to user interaction in an expected manner.
    2. The user should be aware of the actions of the system.
  2. Present information to the user where applicable to enable learning of interface usage.
    1. The functionality of interface elements should be unambiguous.
  3. The user should be prevented from performing actions that are not relevant to the system.

Summary of Results

The results of usability testing are summarised in the table below showing the identified usability issues, the method that identified it, a description of the issue and the expected behaviour. The expected behaviours were each implemented into the final interface, in at least one of the suggested forms.

Issue Testing Method Problem Description Expected Behaviour
Repetitive usage in common interface elements Feature Inspection Items added using browse buttons are not retained when the interface is closed, requiring the user to add them to the interface manually each time. This makes usage cumbersome to the user, especially when certain values are not expected to change between sessions. The interface should retain certain values (KALDI directory, LM/AM) so that the user does not have to repeat the process every time the interface is started.
Potential confusion when changing decoding methods Cognitive Walkthrough The interface hides irrelevant elements when the user changes between decoding methods. Due to the arrangement of visual elements, common elements visibly move in the hiding process, possibly misleading the user to believe that they have also been hidden. Dynamic interface elements should be moved, such that there is less visible change when the user changes the decoding method.
Potential ambiguity in microphone mute/unmute Cognitive Walkthrough The button for microphone mute/unmute function changes icons to indicate the current status. These icons are black-and-white, which may cause difficulty to the user on discerning the state of the microphone. Use colour indication for the icons (e.g. green for unmuted, red for muted), or use an interface element that is less ambiguous.
Lack of signifiers for adding language and acoustic models Cognitive Walkthrough, Heuristic Analysis The buttons used to add language and acoustic models to the interface do not give strong enough indication of their purpose. Add tooltips to elements of the interface, so that the interface provides more information for the user.
Lack of visibility for user edits in the transcript Cognitive Walkthrough When editing the transcript, user input is inserted in the same format as the produced text. The user may find it difficult to see parts that were edited. Make user edited text a different colour, such as green, to increase the visibility of edits.
Lack of access to decoded files Cognitive Walkthrough The user may want to play back audio files that have been transcribed to verify the accuracy of the transcription and make corrections. There is no method within the interface to enable convenient access to the files. Make file names in the transcript clickable links, such that they may open the file of choice from within the interface, or provide an interface element to open the directory of files so that the user may locate and play back the file in the native file browser.
Visual feedback when decoding from file depends on the user manually switching tabs Heuristic Analysis When the user initiates decoding from file, script output is sent to the console tab. The interface does not change tabs, so the user may not be able to see the response. The interface should switch to the console tab when decoding is initiated. The user is able to see the output of the decoding script. Once decoding completes, the interface should switch back to the transcript.
Interface options can be changed while decoding is in progress Heuristic Analysis While the software is decoding, the user is able to interact with interface options, such as decoding mode and decoding directory. This may cause confusion during use, as interface elements change based on decoding type. The user should not be able to modify the interface options when decoding is in progress. Visual options should be kept consistent with the current decoding job.

References

[1] Kaldi, "About the Kaldi project." [Online]. Available: http://kaldi-asr.org/doc/about.html [Accessed: 13 March 2017]

[2] L. R. Rabiner, B. H. Juang, B. Keith, "Speech recognition: Statistical methods", Encyclopedia of Language & Linguistics, Elsevier, pp. 1-18, 2006.

[3] A. Holzinger, “Usability engineering methods for software developers,” in Communications of the ACM 48, no. 1 (2005): pp. 71-74.

[4] J. Nielsen, “The usability engineering life cycle,” IEEE Computer 25(3), pp.12-22, 1992.

[5] J. Nielsen, “Iterative user-interface design,” IEEE Computer 26(11), pp. 32-41, 1993.

[6] J. Nielsen, "10 usability heuristics for user interface design," Nielsen Norman Group 1, no. 1 (1995).

[7] J. Nielsen, “Usability inspection methods,” in Conference companion on Human factors in computing systems: pp. 413-414. ACM, 1994.

[8] J. Nielsen, “Finding usability problems through heuristic evaluation,” in Proceedings of the SIGCHI conference on Human factors in computing systems: pp. 373-380. ACM, 1992