Projects:2015s1-05 Multi-Profile Parallel Transcriber

Team

Students

  • Dakshitha Narendra Kirana Kankanamage
  • Siyuan Ma

Supervisors

  • Dr. Said Al-Sarawi
  • Dr. Ahmad Hashemi-Sakhtsari

Introduction

The field of Automatic Speech Recognition (ASR) has been a major area of research interest over the past few decades. Although it was long expected to play a major role in both Human-Machine and Human-Human interactions [1] [2] [3], slow progress in the field kept the technology out of reach of everyday users. Today, with the rise of portable electronic devices and more powerful computers, Speech Recognition (SR) technology has advanced rapidly, giving the everyday user a better Human-Machine experience and giving researchers the means to overcome obstacles that were previously out of reach.

Motivation and Significance

Aims

The aims of this project are to:

  1. Set up a working copy of the Multi-Profile Parallel Transcriber on the provided computer and review its functional integrity.
  2. Set up evaluation standards and evaluate the current state of the system (see the word error rate sketch after this list).
  3. Upgrade the software to embed the latest version of the speech engine.
  4. Re-run the evaluation phase on the upgraded system.
  5. Extend its functionality.
  6. Re-run the evaluation phase on the extended system.
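
The standard metric for transcription accuracy is the Word Error Rate (WER), which the NIST sclite tool listed under Resources computes from aligned reference and hypothesis transcripts. As a minimal illustration of the metric (not the project's actual evaluation pipeline), the following sketch computes WER with a word-level Levenshtein alignment:

  def word_error_rate(reference: str, hypothesis: str) -> float:
      """WER = (substitutions + deletions + insertions) / reference word count."""
      ref, hyp = reference.split(), hypothesis.split()
      # Dynamic-programming edit-distance table over words.
      d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
      for i in range(len(ref) + 1):
          d[i][0] = i                               # delete every reference word
      for j in range(len(hyp) + 1):
          d[0][j] = j                               # insert every hypothesis word
      for i in range(1, len(ref) + 1):
          for j in range(1, len(hyp) + 1):
              sub = 0 if ref[i - 1] == hyp[j - 1] else 1
              d[i][j] = min(d[i - 1][j] + 1,        # deletion
                            d[i][j - 1] + 1,        # insertion
                            d[i - 1][j - 1] + sub)  # substitution or match
      return d[len(ref)][len(hyp)] / len(ref)

  # 1 substitution + 1 insertion over a 4-word reference gives WER = 0.5.
  print(word_error_rate("the quick brown fox", "the quick brawn box fox"))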

Requirements

System Overview

The typical architecture of an ASR system consists of four main elements: a Signal Processing module, an Acoustic Model, a Language Model and a Hypothesis Search module [1] [3], as shown in Figure 1.

  • The Signal Processing component takes the input audio signal and enhances the speech information in it by reducing noise and converting the signal from the time domain to the frequency domain for easier processing.
  • The Acoustic Model takes the frequency-domain feature vectors and assigns an acoustic score to each one. The model captures information about acoustics, phonetics, environmental variability, gender and other factors that affect a person's speech.
  • The Language Model estimates the probability of a word occurring in a sequence of spoken words; these probabilities are learned from a database of text (a corpus).
  • The last stage of the process, the Hypothesis Search, combines the Acoustic Model and Language Model scores and returns the word sequence with the highest combined score, together with a confidence score, as the recognised result [1] [3]. Because each speaker's speech style is unique, adapting the default Acoustic Model or Language Model to the speaker's speech style before transcription improves both the accuracy and the performance of the transcription. Adapting these models to a particular user, known as creating a profile, is done either by having the speaker read a known passage of text or by manually transcribing a few minutes of a recording of the speaker's voice. A sketch of the score combination performed by the Hypothesis Search follows this list.
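
To make the score combination concrete, here is a minimal sketch (not DNS's internal search, which is proprietary) of how a hypothesis search could rescore a set of candidate word sequences: acoustic and language model probabilities are combined in the log domain, and the winner's share of the total probability mass serves as a simple confidence score. The candidate list and the weight value are illustrative assumptions:

  import math

  def best_hypothesis(candidates, lm_weight=0.8):
      """Return the highest-scoring word sequence and a simple confidence score.

      candidates: (words, acoustic_prob, lm_prob) tuples, standing in for the
      outputs of the Acoustic Model and Language Model stages.
      """
      def combined(c):
          _, p_acoustic, p_lm = c
          # Log domain avoids underflow; lm_weight balances the two models.
          return math.log(p_acoustic) + lm_weight * math.log(p_lm)

      best = max(candidates, key=combined)
      # Confidence: the winner's share of the total probability mass.
      total = sum(math.exp(combined(c)) for c in candidates)
      confidence = math.exp(combined(best)) / total
      return best[0], confidence

  # Toy scores: the acoustically similar second hypothesis loses on its language model score.
  hyps = [("recognise speech", 0.010, 0.200),
          ("wreck a nice beach", 0.012, 0.002)]
  print(best_hypothesis(hyps))  # ('recognise speech', ~0.97)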


To transcribe an incoming audio signal, the Multi-Profile Parallel Transcriber consists of two main components: a Streaming Host program and a Streaming Guest program. The Streaming Host resides on the operating system of the host computer and controls multiple Streaming Guests via a shared directory, with a subdirectory for each Guest instance in use. The Streaming Guest program is integrated with an instance of the Dragon NaturallySpeaking (DNS) speech engine, which is done by installing DNS and then setting up the Streaming Host program. Since DNS can only be instantiated once on a computer, each guest has to run inside its own Virtual Machine (VM) so that multiple users can be transcribed simultaneously, as shown in Figure 2. DNS exposes ActiveX functions that let the Guest program communicate easily with the guest profile and the DNS transcriber. A hypothetical sketch of the shared-directory exchange between Host and Guests follows.
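
The exact exchange over the shared directory is not documented on this page, so the sketch below is a hypothetical illustration of one way such a file-based handshake could work: the Host copies an audio file into a Guest's subdirectory and polls until the Guest's DNS instance writes a transcript back. The directory layout, file-naming convention and polling interval are all assumptions:

  import shutil
  import time
  from pathlib import Path

  SHARED = Path("shared")  # assumed directory visible to the host and all guest VMs

  def dispatch_to_guest(guest: str, audio: Path) -> str:
      """Hand an audio file to one Streaming Guest and wait for its transcript."""
      inbox = SHARED / guest                      # one subdirectory per Guest instance
      inbox.mkdir(parents=True, exist_ok=True)
      shutil.copy(audio, inbox / audio.name)      # the Guest picks this file up
      transcript = inbox / (audio.stem + ".txt")  # assumed naming convention
      while not transcript.exists():              # poll until the Guest's DNS
          time.sleep(0.5)                         # instance finishes transcribing
      return transcript.read_text()

  # Dispatching to several guests is what enables parallel multi-profile transcription.
  for guest in ("guest1", "guest2"):
      print(guest, dispatch_to_guest(guest, Path("meeting.wav")))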

Project Approach

Knowledge Gaps and Technical Challenges

Project Status

Resources

  • Delphi XE3
  • Dragon NaturallySpeaking (version: 9, 10, 10.1, 11, 12, 13)
  • Dragon NaturallySpeaking SDK (version: 10, 11, 12)
  • VMware (Virtual Machines)
  • sclite by NIST (National Institute of Standards and Technology)

References

[1] D. Yu and L. Deng, Automatic Speech Recognition: A Deep Learning Approach, 1st ed. London: Springer London, 2015.

[2] P. Kitzing, A. Maier, and V. Lyberg Åhlander, "Automatic speech recognition (ASR) and its use as a tool for assessment or therapy of voice, speech, and language disorders," Logopedics Phoniatrics Vocology, vol. 34, pp. 91-96, 2009.

[3] X. Huang and L. Deng, "An Overview of Modern Speech Recognition," in Handbook of Natural Language Processing, 2nd ed., N. Indurkhya and F. J. Damerau, Eds. Boca Raton, FL: Chapman & Hall/CRC, 2010.