
基于混淆网络和辅助信息的语音识别技术研究

Research on Confusion Network and Side Information for Speech Recognition

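【Author】 Wang Huanliang (王欢良)

【Supervisor】 Han Jiqing (韩纪庆)

【Author information】 Harbin Institute of Technology, Computer Application Technology, 2007, PhD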

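【Abstract (translated from the Chinese)】 Communicating freely with machines through speech has been a dream for many years. After decades of sustained effort, speech recognition technology has made great progress, but it still falls short of the needs of practical applications. How to further improve recognition performance and robustness has become the bottleneck in the development of speech recognition technology.

Humans implicitly draw on many information sources when recognizing speech, whereas current computer-based speech recognition systems typically use only very limited acoustic and linguistic information, such as spectral features of speech and N-gram statistical language models. For a task as complex as speech recognition, this information is far from sufficient. Effectively modeling and applying other side information will help improve recognition performance. A confusion network is a compact representation of multiple recognition candidates, and decoding on a confusion network can minimize word error rate. Decoding on a confusion network while integrating side information is therefore an effective way to improve recognition performance.

This thesis studies methods for improving speech recognition performance from two angles: confusion networks and side information. For confusion networks, it studies efficient construction methods and decoding methods that integrate side information. For side information, it studies effective modeling and application methods for several important kinds of side information. The main research content and contributions are as follows:

1. Two fast methods for constructing high-quality confusion networks are proposed. The first reduces the computational scale of confusion network construction by segmenting the lattice structure, which speeds up generation with only a slight loss in quality. The second uses the transition arc with the maximum posterior probability to guide the construction of confusion sets, which reduces the algorithm's complexity to linear. To improve the quality of the generated confusion networks, an arc similarity measure based on K-L divergence is proposed. Finally, two new confusion network structures are given for Chinese speech recognition: the character confusion network and the logical confusion network.

2. Modeling methods for two types of side information, and decoding methods that apply them on confusion networks, are proposed. To exploit long-distance dependencies between words, a confusion network decoding method based on a trigger language model over semantic class pairs is proposed. To exploit more sources of side information, a confusion network decoding method based on fusing the results of multiple systems is proposed. Experimental results show that the two methods reduce character error rate by 7.9% and 10.7% relative, respectively.

3. A method is proposed for using tone as side information to improve Chinese speech recognition. In the acoustic decoding stage, a hidden Markov model based on multi-space distributions is adopted to model tone, which solves the problem that tone features are discontinuous. Within a two-stream hidden Markov model framework, spectral features and fundamental frequency features are decoded synchronously, which reduces character error rate by 15.9% relative. In the second decoding pass, an independent tone modeling method based on Supra-tone units is proposed. Confusion network decoding with the Supra-tone tone model reduces character error rate by a further 8.0% relative.

4. A Chinese speech input system with fast online correction of input errors is developed. Using the character confusion network, sentence-level candidates are decomposed into character-level candidates, so that users can quickly correct nearly half of the recognition errors from the candidates. To input new characters quickly and reliably, an isolated-character speech input method assisted by handwriting information is proposed. This method is faster than handwriting input and more reliable than speech input alone.

In summary, this thesis improves the performance and practicality of speech recognition through its study of confusion networks and side information. Efficient confusion network generation will also be of great help to other tasks, such as spoken document retrieval. The confusion network decoding methods based on the trigger language model and on multi-system result combination offer a useful reference for exploiting other types of side information. The study of tone as side information is a good start toward making full use of acoustic side information such as stress and intonation. Using confusion networks and handwriting information makes the correction of speech input errors faster and more reliable, which is a successful application of side information and confusion networks to the speech recognition task.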
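The abstract's central idea, decoding on a confusion network while folding in side information, can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the data layout (a list of word-to-posterior dictionaries), the `<eps>` skip symbol, the `side_score` callback, and the log-linear `weight` are all assumptions made for illustration, and every posterior is assumed to be strictly positive.

```python
import math

EPS = "<eps>"  # assumed symbol for the "no word here" hypothesis


def decode_confusion_network(network, side_score=None, weight=0.0):
    """Pick one word per confusion set, optionally rescored by side information.

    network: list of confusion sets, each a dict {word: posterior > 0}.
    side_score: optional callable (position, word) -> log-score from a
        side-information model (hypothetical interface).
    weight: scale applied to the side-information log-score.
    """
    decoded = []
    for position, confusion_set in enumerate(network):
        def total(word):
            # Log posterior from the first pass, plus weighted side information.
            score = math.log(confusion_set[word])
            if side_score is not None:
                score += weight * side_score(position, word)
            return score

        best = max(confusion_set, key=total)
        if best != EPS:  # the skip hypothesis emits nothing
            decoded.append(best)
    return decoded


# Toy network with three confusion sets (made-up posteriors).
network = [
    {"I": 0.7, "eye": 0.3},
    {"see": 0.55, "sea": 0.45},
    {EPS: 0.6, "the": 0.4},
]


def prefer_sea(position, word):
    # Hypothetical side-information model that favors "sea".
    return 0.0 if word == "sea" else -2.0


print(decode_confusion_network(network))
print(decode_confusion_network(network, prefer_sea, weight=1.0))
```

Without side information the decoder simply takes the highest-posterior entry in each confusion set, which is what makes confusion network decoding a word-level (rather than sentence-level) error criterion. With the callback, a competing candidate can overtake the first-pass winner when the side-information model favors it strongly enough.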

【Abstract】 Communicating freely with computers through speech has been a dream for many years. Although decades of sustained effort have brought great progress in speech recognition, the technology still falls short of the needs of practical applications. How to further improve performance and robustness has become the bottleneck of speech recognition.

Automatic speech recognition systems are known to use very limited acoustic and linguistic knowledge, e.g. spectral features of the speech signal and N-gram based statistical language models. This information is far from enough for a task as complicated as speech recognition, since humans implicitly draw on a large amount of information during speech perception.

Recognition performance can be improved by modeling and applying other side information more effectively. A confusion network is a compact representation of multiple recognition candidates, and word error rate can be minimized by performing second-pass decoding on it. Using the confusion network as a decoding platform on which various kinds of side information can be integrated is therefore an effective way to improve recognition performance.

Accordingly, this thesis studies two subjects: confusion networks and side information. The aim is to reduce character error rate by performing confusion network decoding with various side information. For confusion networks, efficient approaches to generation and decoding are studied. For side information, effective methods for modeling and applying it are investigated. The major original contributions are as follows:

1. Two approaches to efficiently generating confusion networks are proposed. In the first, the scale of the problem is reduced by segmenting the original lattice into multiple sublattices, which improves generation speed at the cost of a slight decline in quality. In the second, the construction of each confusion set is guided by the arc with the maximum posterior probability, which reduces the complexity of the generation algorithm to linear. Moreover, K-L divergence is introduced to measure the similarity between two arcs, which increases the quality of the confusion network. Finally, two new confusion network structures are introduced for Chinese speech recognition: the character-based confusion network and the logical confusion network.

2. Decoding methods that integrate two types of side information on the confusion network are studied. A trigger language model based on semantic class pairs is proposed to model long-distance dependencies between words, and it is integrated into the confusion network decoding process. Different speech recognition systems use different knowledge sources and modeling methods, so their error patterns also differ. A decoding method is therefore proposed to combine the results of multiple recognition systems on the confusion network. Experimental results show that the two methods reduce character error rate by 7.9% and 10.7% relative, respectively.

3. The use of tone information to improve Chinese speech recognition is investigated. In the acoustic decoding stage, a multi-space probability distribution based HMM (MSD-HMM) is adopted to model tone patterns, which resolves the problem that the tone feature is discontinuous over the utterance. Within a two-stream HMM framework, spectral and pitch features are decoded synchronously. In the second pass, tone information over a longer time span (Supra-tone units) is used to build explicit tone models, which are applied to decoding on the confusion network generated in the first pass. Experimental results show a 15.9% relative reduction in character error rate in first-pass decoding and an additional 8.0% relative reduction from second-pass decoding.

4. A reliable speech input system with fast correction of input errors is developed. The character-based confusion network is used to decompose sentence-level hypotheses into character-level ones, which allows the user to correct about half of the recognition errors quickly and conveniently. To speed up the input of new characters, a speech recognition method assisted by handwriting information is proposed. It is faster than handwriting input alone and more reliable than speech recognition alone.

In summary, this thesis investigates the generation of confusion networks, decoding methods that integrate side information, and the modeling and application of side information, and it achieves improved speech recognition performance. Efficient construction of high-quality confusion networks is the basis of decoding, and it matters not only for speech recognition but also for other tasks built on confusion networks, such as spoken document retrieval. The confusion network decoding methods studied here, which integrate a trigger language model based on semantic class pairs and the results of multi-system combination, also provide a useful reference for exploiting other types of side information. The application of tone information markedly improves recognition performance and is a good start toward better use of other acoustic side information, such as stress and intonation. With confusion networks and handwriting information, the speech input system becomes more reliable and its error correction more convenient and efficient. This is a successful application of side information and confusion networks in speech recognition.
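The tone results are reported as two successive relative reductions, and relative reductions compound multiplicatively rather than add. The short Python sketch below works through that arithmetic. It assumes the additional 8.0% is measured against the error rate that remains after the first pass, which the abstract suggests but does not state explicitly. The function name is illustrative, and no absolute error rate is computed because the abstract gives none.

```python
def combined_relative_reduction(reductions):
    """Combine successive relative error reductions into one overall figure.

    Each reduction is a fraction (0.159 means 15.9%) that is assumed to apply
    to the error rate left over by the previous stage.
    """
    remaining = 1.0  # fraction of the original error rate still present
    for reduction in reductions:
        remaining *= 1.0 - reduction
    return 1.0 - remaining


# First pass with tone (15.9%), then second-pass tone decoding (8.0%).
overall = combined_relative_reduction([0.159, 0.080])
print(f"{overall:.1%}")  # prints 22.6%
```

Under that assumption the two passes together remove about 22.6% of the baseline character errors, slightly less than the 23.9% obtained by simply adding the two figures.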
