Projects:2015s1-05 Multi-Profile Parallel Transcriber
Team
Students
- Dakshitha Narendra Kirana Kankanamage
- Siyuan Ma
Supervisors
- Dr. Said Al-Sarawi
- Dr. Ahmad Hashemi-Sakhtsari (DSTO)
Introduction
Automatic Speech Recognition (ASR) has been a major field of interest for researchers over the past few decades. Although it was long expected to play a major role in Human-Machine as well as Human-Human interaction [1] [2] [3], real-world users have been unable to benefit from it because of the slow pace of advancement in the field. Today, with the rise of portable electronic devices and more powerful computers, Speech Recognition (SR) technology has taken a great step forward, giving the everyday user a better Human-Machine experience and giving researchers the means to overcome obstacles that were previously out of reach.
Motivation and Significance
The Multi-Profile Parallel Transcriber prototype is one example of a speaker-independent ASR system. It is a custom-made Speech-To-Text (STT) transcriber intended to demonstrate speaker-independent transcription built on top of a speaker-dependent speech recognizer, Dragon NaturallySpeaking (DNS). While its main aim is to transcribe the speech of unknown speakers, it also retains vital information about the transcription, such as time stamps and the speech of each speaker in a separate audio file, and provides the flexibility to correct errors in the transcripts. Because it can identify and retain vital information from a conversation, it has been found to play an important part in settings that conventionally require a human transcriber, including but not limited to meetings, brainstorming sessions and interviews.
One significant advantage of the Multi-Profile Parallel Transcriber is its ability to transcribe long recordings with satisfactory accuracy. Another is that, unlike Siri and some other applications, it transcribes off-line, which helps keep the information secure.
Aims
This is a DSTO-sponsored project that aims to evaluate, upgrade and extend the functionality of the speech-to-text software system known as the Multi-Profile Parallel Transcriber. The stages of the project can be summarized as follows:
- In the first stage, the Multi-Profile Parallel Transcriber will be set up following the instructions in the provided documents.
- The software will then be run through an evaluation phase in which it is tested against various benchmark databases.
- The core software, Dragon NaturallySpeaking, will then be upgraded to the latest version.
- Following the upgrade, the same evaluation phase will be carried out again.
- At the end of the evaluation phase, the functionality of the system will be extended. This includes increasing the efficiency of the software and implementing a different approach to how the software works.
Requirements
The core requirements for addressing the proposed questions of the project are as follows:
- The first task is to set up a working copy of the current version of the Multi-Profile Parallel Transcriber prototype and evaluate its performance.
- The prototype, which currently uses DNS version 10, then needs to be upgraded to DNS version 13, the latest version of DNS, with the help of the provided source code and the API information in the provided DNS SDKs.
- The upgraded software shall be tested using formal methods of evaluation with the help of the SCLite tool. If the results of the system need to be improved, this shall be done.
- Finally, if time permits, a solution for overlapping speech and an alternative to running VMs are to be implemented.
System Overview
The typical architecture of an ASR system consists of four main elements: a Signal Processing module, an Acoustic Model, a Language Model and a Hypothesis Search module [1] [3], as shown in the figure on the right.
- The Signal Processing module takes an input audio signal and enhances the speech information by reducing noise and converting the signal from the time domain to the frequency domain for easier processing.
- The Acoustic Model takes the frequency-domain signal and generates an Acoustic Model score for the input feature sequence. An Acoustic Model incorporates knowledge about acoustics, phonetics, environmental variability, gender and other aspects that affect a person’s speech.
- The Language Model estimates the probability of a word occurring in a sequence of spoken words; this is achieved with the help of a database of text (a corpus).
- The last stage of the process, the Hypothesis Search, combines the Acoustic Model and Language Model scores and returns the word sequence with the highest score as the recognised result; the associated score is known as a confidence score [1] [3] [4].
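The score combination in the Hypothesis Search stage can be illustrated with a minimal Python sketch. It assumes that each candidate word sequence already carries an acoustic log-probability and a language-model log-probability, and that the two are combined by a weighted sum. The function name, the `lm_weight` parameter and the example values are illustrative assumptions, not part of DNS or of the prototype. A real decoder searches a very large hypothesis space rather than a short list.

```python
import math


def best_hypothesis(hypotheses, lm_weight=1.0):
    """Return the (words, combined_score) pair with the highest combined score.

    Each hypothesis is a tuple (words, acoustic_logprob, language_logprob).
    The combined score is acoustic_logprob + lm_weight * language_logprob
    (an assumed, simplified combination rule).
    """
    if not hypotheses:
        raise ValueError("at least one hypothesis is required")
    scored = [(words, am + lm_weight * lm) for words, am, lm in hypotheses]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical candidates: the second sounds slightly more plausible
# acoustically, but the language model considers it far less likely.
candidates = [
    (["recognise", "speech"], math.log(0.30), math.log(0.20)),
    (["wreck", "a", "nice", "beach"], math.log(0.35), math.log(0.01)),
]
words, score = best_hypothesis(candidates)
print(words, round(score, 3))
```

In this toy case the language-model score outweighs the small acoustic advantage of the second candidate, so the first candidate is returned.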
As each speaker’s speech style is unique, the default Acoustic Model or Language Model is adapted to the speaker’s speech style prior to transcription to improve accuracy and performance. Adapting these models to a user is known as training. Training creates a profile for the user, and it can be done either by having the speaker read a known passage of text for a few minutes or by supplying a few minutes of a voice recording of the speaker together with a manually produced, valid transcript.
For this project, with the aim of transcribing an incoming signal, the Multi-Profile Parallel Transcriber consists of two main components: a Streaming Host program and several Streaming Guest programs. The Streaming Host resides on the operating system of the host computer. It controls multiple Streaming Guests and has access to a shared directory, with a sub-directory for each Guest instance, which stores all vital information about the transcriptions. Every Streaming Guest needs to be integrated with an instance of the DNS speech recognition engine, but since DNS can only be instantiated once on a computer, each Guest has to run inside its own Virtual Machine (VM) so that multiple Guests can run simultaneously, as shown in the following figure. DNS is a speaker-dependent speech transcriber, so assigning a speech profile to each Guest instance is a vital part of the project.
The prototype takes on the structure of an ASR system by having the Streaming Host receive a digital audio input signal, which is then distributed to all Streaming Guests. Each Guest instance transcribes the signal against a speech profile allocated by the Streaming Host and returns the transcribed information to the Streaming Host with a confidence score for each utterance. The Host then selects the transcription with the best confidence score as the best match for the speaker.
Apart from acting as the mediator, the Host can also change the settings of each Guest. These include changing the speaker profile, correcting errors in the transcription, and changing parameters of the Guest VMs such as the interval between utterances.
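The selection step performed by the Host can be sketched in Python as follows. This is only an illustration of the idea, not the prototype's Delphi implementation: the `GuestResult` type, the profile names and the confidence values are all hypothetical, and the sketch assumes that a higher confidence score means a better match.

```python
from dataclasses import dataclass


@dataclass
class GuestResult:
    """One Guest's transcription of a single utterance (hypothetical type)."""
    profile: str
    text: str
    confidence: float


def select_best(results):
    """Return the Guest result with the highest confidence score."""
    if not results:
        raise ValueError("no Guest results for this utterance")
    return max(results, key=lambda r: r.confidence)


def assemble_transcript(utterances):
    """Pick the best Guest result for every utterance, in order."""
    return [select_best(results) for results in utterances]


# Two utterances, each transcribed by two Guests with different profiles.
utterances = [
    [GuestResult("profile_a", "good morning every one", 410.0),
     GuestResult("profile_b", "good morning everyone", 650.0)],
    [GuestResult("profile_a", "shall we begin", 720.0),
     GuestResult("profile_b", "shell we be gin", 300.0)],
]
transcript = assemble_transcript(utterances)
for best in transcript:
    print(best.profile, best.text)
```

Because the winning profile can differ from one utterance to the next, the same selection also indicates which speaker profile best matches each utterance.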
Knowledge Gaps and Technical Challenges
The knowledge gaps related to the tasks of the project are described below:
- As the main workload of this project is evaluating the performance of the Multi-Profile Parallel Transcriber, formal evaluation tools and standards need to be learned and used at some stages of the project. Before and after the upgrades are implemented, the system should be tested and validated with Speech Recognition scoring tools such as SCLite. This requires an overall knowledge of how this software is used and of the important standards that need to be met [7-9].
- Another aspect of this project is to improve the Multi-Profile Parallel Transcriber. The first improvement is to upgrade from DNS version 10 to DNS version 13. As the provided source code is written in Delphi, continuing in Delphi is the more straightforward approach. Since Delphi is not covered by the general course content at the university, learning basic Delphi programming is the first knowledge gap that needs to be filled.
- The DNS Software Development Kit (SDK) is also required to make the upgrade from DNS 10 to 13, so prior knowledge of how to use the SDK will also be useful.
- If time permits, further upgrades of the system shall be made, including having the system handle overlapping speech and finding an alternative to the VM implementation.
The possible technical challenges are closely connected to the knowledge gaps listed above, so overcoming these gaps is considered the main technical challenge the group will face.
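To make the evaluation gap more concrete, the sketch below computes a word error rate (WER), the kind of measure that scoring tools such as SCLite report. It is a simplified stand-in written for illustration: it does not reproduce SCLite's alignment options, file formats or reports, and the function name is our own. It counts the minimum number of word substitutions, deletions and insertions needed to turn the hypothesis into the reference, divided by the number of reference words.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between two strings, divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER = {wer:.3f}")
```

Running the same measure on the transcripts produced before and after the DNS upgrade would give a like-for-like comparison, provided the same reference transcripts are used in both cases.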
Outcomes
New Approach to Setup the System
My partner made two attempts and showed that neither approach works. However, the results gave us sufficient information to come up with a new approach.
In the first attempt, both the Streaming Host program and the Streaming Guest program were installed on the Host machine. In other words, the Host machine was treated as if it were the ‘Virtual Machine’, and in this case the communication worked perfectly. He then created a real virtual machine, and the test showed that communication between the VM and the Host machine could not be achieved.
In the second attempt, he chose a virtual machine inside VMware Workstation as the ‘Host Machine’ and installed the Streaming Host program and the Streaming Guest program on it. He then completed the setup in each virtual machine and found that the communication worked perfectly, so the system can be set up successfully without the host machine. He also found that if the owner of the shared folder is a VM, communication between the VM and the Host machine can be achieved.
To summarize the results of these two attempts, communication between a VM and the Host machine can be achieved when the owner of the shared folder is the VM, and communication among several VMs can be achieved as well. My idea was therefore to use a VM, instead of the Host machine, as the ‘bridge’ that links all the computers together.
Using this new arrangement, we were finally able to rebuild the system with the host machine involved. After some basic function tests, we concluded that this approach works. Its advantage is that it avoids any influence from the VM tools, which may change when the upgrade is made. Its weakness is that the connection created is not stable enough, which sometimes causes a runtime error. More details about this error are covered in the following section.
Upgrade DNS from version.10 to version.13
As the version of the core software, Dragon NaturallySpeaking, currently in use is 10.1 and the Multi-Profile Transcriber was developed years ago, the next essential task is to upgrade to the latest version of DNS. Two main tasks need to be completed in preparation for the upgrade: checking the API functions and upgrading the provided profiles. Their outcomes are as follows:
- From reading the source code, we understand that the Streaming Guest software calls API functions to start DNS and use its functionality. To upgrade DNS from version 10.1 to version 13, we therefore have to make sure that none of these functions have changed; if any have, we need to make the corresponding changes in the source code and recompile it to get a new working copy. The information about the API functions is found in the DNS SDKs. After comparing the function information in SDK 10 and SDK 11, we concluded that all the functions match in terms of their descriptions and function names. DNS can therefore be safely upgraded from version 10.1 to version 11.
- The next step is to check whether any changes were made between DNS version 11 and DNS version 13. To do this, I used the same approach and gathered the information about the API functions from DNS SDK version 11 and DNS SDK version 12.5 (the latest version of the DNS SDK). The information has been summarized in the following table. After comparing the function information in DNS SDK 11 and DNS SDK 12.5, I concluded that all the functions again match in terms of their descriptions and function names. DNS can therefore be safely upgraded from version 11 to version 13.
- Apart from the API function check, upgrading the provided profiles is another significant part of the preparation for the DNS upgrade. The DNS SDK includes a specific tool that can be used to upgrade profiles to the new versions. We have successfully upgraded all the provided profiles to version 11 and version 12.5.
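The API comparison described above was done by reading the SDK documentation, but the same check can be expressed as a small Python sketch. It assumes the function names and descriptions of two SDK versions have already been collected into dictionaries. The names `FunctionA`, `FunctionB` and `FunctionC` and their descriptions are placeholders, not real DNS SDK functions.

```python
def compare_api(old, new):
    """Compare two {function_name: description} listings.

    Returns the names that were removed, added, or whose description changed.
    """
    old_names, new_names = set(old), set(new)
    return {
        "removed": sorted(old_names - new_names),
        "added": sorted(new_names - old_names),
        "changed": sorted(
            name for name in old_names & new_names if old[name] != new[name]
        ),
    }


def is_safe_upgrade(old, new):
    """An upgrade is treated as safe if no used function was removed or changed."""
    diff = compare_api(old, new)
    return not diff["removed"] and not diff["changed"]


# Placeholder listings standing in for two SDK versions.
sdk_old = {"FunctionA": "Starts the engine", "FunctionB": "Loads a profile"}
sdk_new = {
    "FunctionA": "Starts the engine",
    "FunctionB": "Loads a profile",
    "FunctionC": "New in the later version",
}
print(compare_api(sdk_old, sdk_new))
print(is_safe_upgrade(sdk_old, sdk_new))
```

Added functions are not treated as a risk here, because existing source code cannot depend on functions that did not exist in the older version. Matching names and descriptions do not by themselves guarantee identical behaviour, so testing after the upgrade is still needed.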
Test and Evaluation
After setting up a working copy with an upgraded DNS, we tested the basic functionality. In general, the performance is satisfactory. The outcomes and the errors we observed are as follows:
- Some runtime errors occur during transcription; we have traced them and identified their causes.
- Another performance issue is that, for half of the upgraded profiles, the confidence score returned after transcribing audio files or live speech is always 500, both for the utterances and for every word in them. We also tried creating a new profile using DNS version 11.5 and repeating the tests, but this did not solve the issue. This is a problem because the whole idea of the Multi-Profile Transcriber is to transcribe the same utterances using various profiles, obtain higher or lower confidence scores, and use them to select the best transcripts. If all the confidence scores are 500, the final transcripts are no longer reliable. This problem seriously affects the evaluation and scoring. Nevertheless, we have completed all the preparation for scoring.
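A simple way to spot the affected profiles is to check whether a profile ever returns more than one distinct confidence value. The Python sketch below shows this check under the assumption that the confidence scores have been collected per profile; the function name, profile names and sample values are hypothetical.

```python
def profiles_with_constant_confidence(scores_by_profile, min_samples=2):
    """Return {profile: value} for profiles whose scores never vary.

    Profiles with fewer than min_samples scores are skipped, because a
    single score cannot show whether the value is stuck.
    """
    flagged = {}
    for profile, scores in scores_by_profile.items():
        if len(scores) >= min_samples and len(set(scores)) == 1:
            flagged[profile] = scores[0]
    return flagged


# Hypothetical scores collected over several utterances.
collected = {
    "profile_a": [500, 500, 500, 500],
    "profile_b": [412, 655, 580, 701],
    "profile_c": [500],
}
print(profiles_with_constant_confidence(collected))
```

Profiles flagged this way could be excluded from the Host's best-score selection, or investigated separately, so that a stuck value does not silently decide which transcript is chosen.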
Justification
The decisions made by the group and the related considerations are listed as follows.
- The preliminary plan for this project was more extensive than the current plan and included extra tasks such as voice separation and an approach to run the system without the VMs. However, the project had a relatively late start and some time was lost clarifying the real needs of the client. Additionally, an extra task, coming up with a new arrangement to set up the system, arose unexpectedly during the process. The time remaining was treated as a significant consideration. The group members and supervisors therefore decided to focus on upgrading the software rather than on the additional requirements, to make sure that the basic needs of the project could be met.
- When working on milestone 1, both group members decided to divide the workload to keep the project flowing smoothly. Once my partner’s work was done, my part could continue on the basis of his work.
- Since it was pointed out during the previous meetings that the evaluation study is not a problem we should have been thinking about, the group members and supervisors made the final decision to work on task 1 and prepare for task 2 simultaneously. The available time was also a vital factor.
- Another decision made by the group members and supervisors was to change our target to DNS version 12.5 instead of 13. Since the problem is caused by a bug inside DNS that even Nuance has not solved, continuing to work on it would waste more of our time. The time remaining was again treated as a significant consideration, as was concern about the final outcome of the project.
- Similarly, the group members and supervisors later decided to change our target to DNS version 11.5 instead of 12.5. The time remaining and the final outcome of the project were again treated as the significant considerations.
Critical Evaluation
Measured against the originally proposed requirements of the project, the final outcomes have met around 80% of them. Some adjustments were made to ensure that the basic requirements could be achieved by the end of the project.
The solutions we planned for meeting the requirements were relatively efficient. However, misleading and incomplete information, together with some unexpected issues, led to this incomplete result. Some of our time was wasted, and this is the main reason the requirements were not fully achieved.
Even though the outcomes of this project are incomplete, all of the attempts we made are still valuable. Besides the errors and problems we have encountered and solved, we have made sufficient preparation for the future, so that other project groups who continue this work will be able to spend more time on the actual content and work more efficiently.
Conclusion
This document has presented an overall outline of the project and reported its final outcomes. Although the outcomes do not fully meet the requirements, we have done all we can to contribute to the project and improve the Multi-Profile Transcriber.
Finally, some suggestions are summarized as follows:
- One of our tasks in the initial stage was to learn Delphi and understand the source code, and we found it very difficult for a beginner without sufficient programming background to become familiar with Delphi. Delphi also has an obvious weakness: it is not commonly used by programmers, which makes it harder to extend the functionality of the software. In the future, a better choice would be to have three or more students with a good programming background rewrite the program in Java or C++. Since the algorithm of the program stays similar, the workload should not be very large.
- Before rewriting the program, the first task could be to work out how to solve the problem of the confidence score always being 500. For the evaluation and scoring, the only task left is to obtain the transcripts, so once valid transcripts are available the evaluation can easily be done.
- When rewriting the program, those runtime errors need to be taken into consideration to see whether they can be avoided in the new code.
Resources
- Delphi XE3
- Dragon NaturallySpeaking (version: 9, 10, 10.1, 11, 11.5, 12, 12.5, 13)
- Dragon NaturallySpeaking SDK (version: 10, 11, 12.5)
- VMware Workstation (Virtual Machines)
- SCLite by NIST (National Institute of Standards and Technology)
References
- [1] D. Yu and L. Deng, Automatic Speech Recognition: A Deep Learning Approach, 1st ed. London: Springer London, 2015.
- [2] P. Kitzing, A. Maier, and V. L. Ahlander, "Automatic speech recognition (ASR) and its use as a tool for assessment or therapy of voice, speech, and language disorders," Logopedics Phoniatrics Vocology, vol. 34, pp. 91-96, 2009.
- [3] X. Huang and L. Deng, "An Overview of Modern Speech Recognition," in Handbook of Natural Language Processing, Second Edition, N. Indurkhya and F. J. Damerau, Eds. Boca Raton, FL: Chapman & Hall/CRC, 2010.
- [4] N. Desai, K. Dhameliya, and V. Desai, "Feature Extraction and Classification Techniques for Speech Recognition: A Review," International Journal of Emerging Technology and Advanced Engineering, vol. 3, pp. 367-371, December 2013.
- [5] A. Z. J. S. L. M. B. B. D. a. A. Hashemi-Sakhtsari, "Transcription of multiple speakers using speaker dependent speech recognition," Technical Report DSTO-TR-1498, 2003.
- [6] S. S. Graham, J, "An automatic transcriber of meetings utilising speech recognition technology," Client Report DSTO-CR-0355, March 2004.
- [7] (2015). NIST Multimodal Information Group Website. Available: http://www.itl.nist.gov/iad/mig/tools/
- [8] (2015). NIST SCLITE Scoring Package Version 1.5. Available: http://www1.icsi.berkeley.edu/Speech/docs/sctk-1.2/sclite.htm#sclite_name_0
- [9] (2015). SCLITE Command Line Options. Available: http://www1.icsi.berkeley.edu/Speech/docs/sctk-1.2/options.htm