Projects:2015s1-05 Multi-Profile Parallel Transcriber


Revision as of 00:13, 28 August 2015

Team

Students

  • Dakshitha Narendra Kirana Kankanamage
  • Siyuan Ma

Supervisors

  • Dr. Said Al-Sarawi
  • Dr. Ahmad Hashemi-Sakhtsari

Introduction

The field of Automatic Speech Recognition (ASR) has been a major area of interest for researchers over the past few decades. Although it was strongly believed that ASR would play a major role in Human-Machine as well as Human-Human interactions [1] [2] [3], real-world users have been unable to take advantage of it because of the slow pace of advancement in the field. Today, with the rise of portable electronic devices and more powerful computers, Speech Recognition (SR) technology has experienced a great push forward, giving the everyday user a better Human-Machine experience and giving researchers the power to overcome obstacles that could not be overcome before.

Motivation and Significance

The Multi-Profile Parallel Transcriber prototype is one such example of an ASR system. It was custom made as a Speech-To-Text (STT) transcriber in the hope of demonstrating speaker-independent transcription built on a speaker-dependent speech recognizer, Dragon NaturallySpeaking (DNS). While its main aim is to transcribe the speech of unknown speakers, it also retains vital information about the transcription, such as time stamps and the speech of each speaker in a separate audio file, while providing the flexibility to edit errors in the transcripts. Because of its ability to identify and retain vital information from a conversation, it was found to play an important part in settings that conventionally require a human transcriber, including but not limited to meetings, brainstorming sessions and interviews.

Aims

This is a DSTO-sponsored project that aims to evaluate, upgrade and extend the functionality of the speech-to-text software system known as Multi-Profile Parallel Transcriber. The stages of this project can be summarized as follows:

  1. In the first stage, the software will be run through an evaluation phase in which it is tested against various benchmark databases.
  2. The core software, Dragon NaturallySpeaking, will then be upgraded to the latest version.
  3. Following the upgrade, the same evaluation as in the first stage will be carried out again.
  4. At the end of the evaluation phase, the system will have its functionality extended. This includes increasing the efficiency of the software and implementing a different approach to how the software works.

Requirements

The core requirements of the project are:

  1. To evaluate the performance of the current version of the Multi-Profile Parallel Transcriber prototype.
  2. To upgrade the current version of the Multi-Profile Parallel Transcriber prototype from DNS 10 to DNS 13 with the help of the source code provided.
  3. To test the upgraded software using formal methods of evaluation with the help of tools such as SCLite. If the results of the system need to be improved, this shall be done.
  4. Finally, if time permits, to implement a solution for overlapping speech between speakers and to find an alternative to the use of VMs.

System Overview

The typical architecture of an ASR system consists of 4 main elements: Signal processing module, Acoustic Model, Language Model and a Hypothesis Search module [1] [3], as shown in Figure 1.

  • The signal processing component takes an input audio signal and enhances the speech information by reducing noise and converting the signal from time-domain to frequency-domain for easier processing.
  • The Acoustic Model takes the frequency-domain signal and generates an Acoustic Model score for the input feature sequence. An Acoustic Model holds information about acoustics, phonetics, environmental variability, gender and other aspects that affect a person’s speech.
  • The Language Model estimates the probability of a word occurring in a sequence of spoken words. This is achieved with the help of a database of text (corpora).
  • The last stage of the process, the Hypothesis Search, combines the Acoustic and Language Model scores and outputs the sequence of words with the highest score as the recognised result. The score attached to this result is known as a confidence score [3] [1, 4].

As each speaker’s speech style is unique, adapting the default Acoustic Model or Language Model to the speaker’s speech style prior to transcription improves the accuracy and performance of the transcription. Adapting these models to a user, known as creating a profile, can be achieved by having the speaker read a known passage of text or by manually transcribing a voice recording of the speaker for a few minutes.
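The Hypothesis Search step described above can be illustrated with a minimal sketch. It assumes that each candidate word sequence already carries an Acoustic Model score and a Language Model score as log-probabilities, and that the two are combined by a weighted sum. The field names, the `lm_weight` parameter and the example scores are hypothetical and do not come from DNS or from the prototype.

```python
def best_hypothesis(candidates, lm_weight=1.0):
    """Return the candidate with the highest combined score.

    Each candidate is a dict with hypothetical keys:
      "words"      - the candidate word sequence
      "am_logprob" - Acoustic Model score (log-probability)
      "lm_logprob" - Language Model score (log-probability)
    """
    def combined(candidate):
        # Weighted sum of the two log-scores (an assumed combination rule).
        return candidate["am_logprob"] + lm_weight * candidate["lm_logprob"]

    return max(candidates, key=combined)


# Illustrative, made-up scores for two competing hypotheses.
candidates = [
    {"words": "recognise speech", "am_logprob": -12.0, "lm_logprob": -3.0},
    {"words": "wreck a nice beach", "am_logprob": -11.5, "lm_logprob": -7.0},
]

print(best_hypothesis(candidates)["words"])
```

In this example the second hypothesis has a slightly better acoustic score, but the Language Model score makes the first hypothesis the overall winner, which is the role the Language Model plays in the search.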

To transcribe an incoming signal, the Multi-Profile Parallel Transcriber consists of two main components: a Streaming Host program and a Streaming Guest program. The Streaming Host resides on the operating system of the host computer and controls multiple Streaming Guests via a shared directory, with a subdirectory for each Guest instance in use. The Streaming Guest program needs to be integrated with an instance of the DNS speech engine, which is done by installing DNS followed by the set-up of the Streaming Host program. Since DNS can only be instantiated once on a computer, each guest instance has to run in its own Virtual Machine (VM) in order to transcribe multiple users simultaneously (as shown in Figure 2). DNS provides ActiveX functions that allow the guest program to communicate with the DNS transcriber with ease.

The prototype follows the structure of an ASR system: the Streaming Host receives a digital audio input signal, which is then distributed to all Streaming Guest instances. Each Guest instance transcribes the signal against a speech profile allocated by the Streaming Host and returns the transcribed information to the Streaming Host with a confidence score for each utterance. The Host then selects the result with the highest confidence score as the best match for the speaker. DNS is a speaker-dependent speech transcriber, so having a speech profile assigned to each guest instance plays a vital part in the project.
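The selection step performed by the Host can be sketched as follows. This is only an illustration of choosing the result with the highest confidence score for one utterance; the `GuestResult` type, its fields and the example values are hypothetical and are not taken from the prototype's Delphi source.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GuestResult:
    """One Guest's transcription of a single utterance (hypothetical fields)."""
    profile: str       # speech profile the Guest transcribed against
    text: str          # transcribed text
    confidence: float  # confidence score reported for the utterance


def select_best(results: List[GuestResult]) -> Optional[GuestResult]:
    """Return the result with the highest confidence, or None if there are none."""
    if not results:
        return None
    return max(results, key=lambda r: r.confidence)


# Made-up results from three Guest instances for the same utterance.
results = [
    GuestResult("profile_a", "good morning every one", 0.62),
    GuestResult("profile_b", "good morning everyone", 0.91),
    GuestResult("profile_c", "could morning everyone", 0.47),
]

best = select_best(results)
print(best.profile, best.text)
```

Because the winning result also identifies the profile it was transcribed against, the same step gives the Host its best guess at which speaker profile matches the unknown speaker.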

Apart from being the mediator, the Host machine can also change the settings of each guest instance. These include changing the speaker profiles, correcting errors in the transcription and changing parameters of the guest VMs such as the interval between utterances.

This document will not go into further detail on Acoustic Models, Language Models and Hypothesis Search, since these are handled by the DNS speech recognition engine.

Project Approach

Knowledge Gaps and Technical Challenges

The knowledge gaps related to the tasks of the project are described below:

  1. As the main workload of this project is evaluating the performance of the Multi-Profile Parallel Transcriber, formal evaluation tools and standards need to be learned and used at some stage of the project. Before and after the upgrades are implemented, the system should be tested and validated with Speech Recognition scoring tools such as SCLite. This requires an overall knowledge of how this software is used and of the important standards that need to be met [7-9].
  2. Another aspect of this project is to make improvements to the Multi-Profile Parallel Transcriber. The first improvement is to upgrade from DNS version 10 to DNS version 13. As the source code for DNS is written in Delphi, the more straightforward approach is to continue using Delphi. Since Delphi is not covered by the general learning content at the university, learning basic Delphi programming is the first knowledge gap that needs to be filled.
  3. The DNS Software Development Kit (SDK) is also required for the upgrade from DNS 10 to DNS 13, so prior knowledge of how to use the SDK will also be useful.
  4. If time permits, further upgrades of the system shall be made. These would include having the system work with overlapping speech and finding an alternative to the VM-based implementation.
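Scoring tools such as SCLite report, among other figures, the word error rate (WER) of a hypothesis transcript against a reference transcript. The sketch below shows the underlying calculation as a word-level edit distance divided by the number of reference words. It is a simplified illustration written for this page, not SCLite's implementation, and it ignores details such as text normalisation and transcript file formats.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of three reference words.
print(word_error_rate("the cat sat", "the bat sat"))
```

Running the same calculation on transcripts produced before and after the DNS upgrade would give directly comparable figures, which is the purpose of the two evaluation phases.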

The possible technical challenges are closely connected to the knowledge gaps listed above, so overcoming these knowledge gaps is considered the main technical challenge the group will face.

Project Status

Resources

  • Delphi XE3
  • Dragon NaturallySpeaking (version: 9, 10, 10.1, 11, 12, 13)
  • Dragon NaturallySpeaking SDK (version: 10, 11, 12)
  • VMware (Virtual Machines)
  • SCLite by NIST (National Institute of Standards and Technology)

References

[1] D. Yu and L. Deng, Automatic Speech Recognition: A Deep Learning Approach, 1st ed. London: Springer London, 2015.

[2] P. Kitzing, A. Maier, and V. L. Åhlander, "Automatic speech recognition (ASR) and its use as a tool for assessment or therapy of voice, speech, and language disorders," Logopedics Phoniatrics Vocology, vol. 34, pp. 91-96, 2009.

[3] X. Huang and L. Deng, "An Overview of Modern Speech Recognition," in Handbook of Natural Language Processing, Second Edition, N. Indurkhya and F. J. Damerau, Eds. Boca Raton, FL: Chapman & Hall/CRC, 2010.