Projects:2017s2-205 Multi-Profile Parallel Speech-to-Text Transcriber


Summary

The aim of this project is to produce a speech transcriber prototype using Dragon NaturallySpeaking (DNS) that can transcribe live recordings from a single microphone and recognize multiple voices. The prototype identifies the speaker of each utterance by comparing the confidence scores that DNS generates for that utterance under each user profile. The confidence score is also used as a measure of transcription accuracy. The main deliverables of this project are to perform transcription for multiple speakers and to evaluate the transcription accuracy. Users must create and train their profiles by dictating and making corrections, which enables DNS to analyze acoustic data such as accent, speech pattern and other variables. The results from several experiments have shown that sufficient profile training is necessary to achieve high transcription accuracy. Future work on this project would be to conduct more experiments that consider different types of acoustic variability in order to validate the reliability of the prototype.

Aims

The aim is to produce a custom-made speech-to-text transcriber that can transcribe live recordings from a single microphone and recognize multiple voices.

The final aim is to produce a robust and reliable speech transcriber prototype that produces accurate transcription results.

Motivation

The prototype Multi-Profile Parallel Transcriber was designed to transcribe live recordings from a single microphone and identify the voices of multiple speakers. DNS allows only one profile to be assigned at a time, and one operating system (OS) can have only one DNS installation. The motivation of this project is therefore to implement a system that runs several DNS instances in parallel and assigns one profile to each, so that the voices of multiple speakers can be transcribed and identified.

System Structure

System Structure.PNG

The prototype system consists of two separate programs, the StreamingHost and StreamingGuest programs.

StreamingHost and the shared folder reside in the host OS, while StreamingGuest and DNS reside in the guest OS of each virtual machine (VM). Multiple VMs are used so that multiple StreamingGuest instances can run simultaneously. Each VM needs its own DNS installation, since DNS allows only one user profile to be assigned at a time and one OS can have only one DNS. Thus, each StreamingGuest is assigned one profile.

StreamingHost receives the audio and writes it to the shared folder. StreamingGuest reads the audio from the shared folder and passes it to DNS. DNS transcribes the audio and returns the transcription result to StreamingGuest. StreamingGuest writes the result back to the shared folder, and StreamingHost displays it. The audio is split into utterances, so this process is repeated for each utterance.
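The speaker identification step described in the Summary (comparing the confidence scores that each profile's DNS instance reports for the same utterance) can be illustrated with a minimal Python sketch. It is not the project's Delphi source code. The `Transcription` record, its field names, the score scale, and the rule that the highest score wins are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Transcription:
    """One result for one utterance (hypothetical record, not a DNS type)."""
    profile: str       # user profile assigned to the VM that produced the result
    text: str          # transcribed text for the utterance
    confidence: float  # confidence score reported for the utterance (scale assumed)


def pick_speaker(results):
    """Return the result with the highest confidence score for one utterance.

    Assumption: the profile whose transcription scores highest belongs
    to the person who actually spoke the utterance.
    """
    if not results:
        raise ValueError("no transcription results for this utterance")
    return max(results, key=lambda r: r.confidence)


def label_utterances(per_utterance_results):
    """Attribute each utterance to its best-scoring profile.

    per_utterance_results holds one list of Transcription objects per
    utterance, with one entry from each VM.
    """
    transcript = []
    for results in per_utterance_results:
        best = pick_speaker(results)
        transcript.append((best.profile, best.text))
    return transcript
```

With two profiles, an utterance transcribed with scores 0.91 and 0.42 would be attributed to the profile that scored 0.91, and its text would be the one displayed.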

Shared Folder.PNG

The shared folder consists of three folders (Data Exchange, Profiles, Output) and one XML file, as shown in Figure 5. The Data Exchange folder stores the input audio utterances.

Profile data is exported from DNS and stored in the Profiles folder.

The Output folder stores the transcription results. Its subfolders are named after the machine names of the VMs, so each VM stores its transcription results in its own subfolder.

The XML file contains data for each VM, such as the machine name, output directory, assigned user profile and profile ID.
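As an illustration of how such a per-VM configuration might be read, the Python sketch below parses a small XML document with the four fields listed above. The actual schema of the project's XML file is not given here, so every element name (`VirtualMachines`, `VM`, `MachineName`, `OutputDirectory`, `Profile`, `ProfileID`) and every sample value is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: the element names and values are assumptions,
# not the project's real file.
SAMPLE_CONFIG = """\
<VirtualMachines>
  <VM>
    <MachineName>VM-01</MachineName>
    <OutputDirectory>Output/VM-01</OutputDirectory>
    <Profile>SpeakerA</Profile>
    <ProfileID>1</ProfileID>
  </VM>
  <VM>
    <MachineName>VM-02</MachineName>
    <OutputDirectory>Output/VM-02</OutputDirectory>
    <Profile>SpeakerB</Profile>
    <ProfileID>2</ProfileID>
  </VM>
</VirtualMachines>
"""


def load_vm_config(xml_text):
    """Map each machine name to its output directory, profile and profile ID."""
    root = ET.fromstring(xml_text)
    vms = {}
    for vm in root.findall("VM"):
        name = vm.findtext("MachineName")
        vms[name] = {
            "output_dir": vm.findtext("OutputDirectory"),
            "profile": vm.findtext("Profile"),
            "profile_id": vm.findtext("ProfileID"),
        }
    return vms


config = load_vm_config(SAMPLE_CONFIG)
```

Keying the result by machine name mirrors the naming of the Output subfolders, so a host-side program could look up which profile produced the files found in a given subfolder.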

Profile Training

When creating a DNS profile, the user must choose the accent region that best matches their own accent to increase the accuracy of the transcription.

In previous DNS versions, the user could read sample text for several minutes to train the user profile, but this option is no longer available in version 15. Nevertheless, in version 15 the user can improve accuracy by dictating for several minutes, making corrections, and then running Accuracy Tuning. Accuracy Tuning updates the user profile based on the acoustic data and the language model.

Deliverables

Successfully perform transcription for multiple speakers and conduct experiments to evaluate transcription accuracy.

Software

Dragon Professional Group v15: Create user profiles and transcribe speech

Embarcadero Delphi XE3: Compile source code

VMware Workstation 11.0: Create multiple virtual machines

NSIS (Nullsoft Scriptable Install System): Build the installers for StreamingHost and StreamingGuest

SCLITE: Evaluate the accuracy of the transcription

WavePad Audio Editor: Edit audio files
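SCLITE scores a transcription by aligning it against a reference transcript and counting word substitutions, deletions and insertions. The core metric, word error rate (WER), can be sketched in Python as below. This is a simplified stand-in rather than SCLITE itself: it splits on whitespace, applies no text normalization, and the function name is chosen for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / reference word count.

    Uses word-level edit distance with unit costs, computed one row at a time.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, a six-word reference transcribed with one word dropped gives a WER of 1/6. Word accuracy is commonly reported as 1 minus WER.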

Achievements

The first achievement is a functional system that can assign a profile to the DNS in each VM. Each VM needs only the StreamingGuest program, DNS and access to the data exchange path to perform transcription.

The second achievement is the production of an accurate speech transcriber. The results from several experiments have shown that DNS can achieve high transcription accuracy and identify multiple speakers if the profiles have had sufficient training. Training a profile helps DNS become accustomed to the speaker’s accent, speech pattern and other variabilities.

Conclusion

The main finding is that a user profile requires sufficient training to achieve accurate transcription. By analyzing the acoustic data and the language model, DNS becomes accustomed to the user’s accent, speech pattern and other variabilities, which improves the accuracy of the transcription.

The objective is to test whether the prototype can differentiate multiple speakers in terms of gender, accent, speech pattern and other variabilities. Future work on this project would therefore be to conduct more experiments that consider different types of acoustic variability in order to validate the reliability of the prototype.

Exploring the fundamentals of the acoustic model (AM) and language model (LM) provides knowledge of how the speech recognizer works, and therefore explains why DNS requires sufficient profile training to achieve high transcription accuracy.

Project Team

Er Win Ng

Boqi Hu

Supervisors

Dr. Said Al-Sarawi

Dr. Ahmad Hashemi-Sakhtsari

Sponsor

Defence Science and Technology Organization

DSTO.png