基于特征补偿的自动语音识别的研究

Feature Compensation for Automatic Speech Recognition

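【Author】 Yang Zhao (杨钊)

【Supervisors】 Liu Qingfeng (刘庆峰); Dai Lirong (戴礼荣)

【Author information】 University of Science and Technology of China (中国科学技术大学), Signal and Information Processing, 2010, Master's thesis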

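【Abstract (translated from the Chinese)】 This thesis studies front-end noise robustness in automatic speech recognition (ASR). The fundamental goal of speech recognition is to enable machines to understand human language. Under laboratory conditions, many recognition systems already perform very well. In real environments, however, complex and changing noise and other unknown factors often cause performance to drop so sharply that the system falls far short of practical use. Noise robustness has therefore long been an important aspect of speech recognition research. The root of the noise robustness problem is the mismatch between the training and test environments. In practice, this mismatch is caused by the speech acquisition environment (additive noise, channel distortion, etc.) and by the speaker (speaking style, accent, etc.); all of it can also be regarded as the effect of noise. To keep a recognition system performing well across noise environments, various methods are needed to strengthen its robustness. These methods are diverse, but can generally be divided into two classes: front-end methods and back-end methods. Front-end methods process the speech signal itself or the speech features in order to remove or suppress the effect of noise as far as possible. Back-end methods mainly increase the tolerance and adaptability of the acoustic model, so that the model can tolerate some degree of noise, or adjust the model parameters to follow changes in the noise environment. This thesis studies front-end methods, improving some existing methods and proposing some new ones. Chapter 1 briefly reviews the development of speech recognition technology and introduces the main components of an ASR system built on the statistical modeling framework. Because noise in practice is so varied, many robustness methods have emerged, each with its own characteristics and range of applicability. Chapter 2 therefore gives a fairly comprehensive introduction and summary of the noise robustness problem from four perspectives: robust feature extraction, speech enhancement, feature compensation/enhancement, and model compensation. Chapter 3 first introduces the offline feature compensation algorithm based on a first-order vector Taylor series (VTS) with an explicit model. The offline algorithm is not ideal in practice: its main drawback is a heavy computational load that greatly reduces processing efficiency. Building on the offline algorithm, we propose a practical first-order VTS feature compensation algorithm that retains the performance of the offline algorithm while greatly improving real-time processing. Although the practical algorithm achieves good results, like the offline algorithm it models the noise with a single Gaussian. Real noise is complex and varied, and a single Gaussian may not describe the distribution of the noise parameters well, so the clean speech is estimated inaccurately and recognition performance ultimately suffers. To address this, Chapter 4 proposes a first-order VTS feature compensation algorithm that models the noise with multiple Gaussians. Experimental results show that multi-Gaussian noise modeling improves recognition performance to some extent.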

【Abstract】 This thesis addresses noise-robust front ends for automatic speech recognition (ASR). The ultimate purpose of speech recognition is to make computers understand spontaneous human language. Many mature systems now achieve fairly high recognition rates in the laboratory. In real environments, however, performance degrades too much for practical use because of various noises and unknown factors. Noise robustness is therefore a very important part of speech recognition research. The noise robustness problem comes down to the mismatch between the training and testing environments. In the real world, this mismatch is caused by the speech collection environment (additive noise, convolutional noise, etc.) and by the speaker (speaking style, accent, etc.); the mismatch as a whole can also be regarded as the influence of noise. To maintain good performance under these noise conditions, various methods must be used to enhance the robustness of the system. Noise-robust methods are varied and can be roughly classified into two categories: front-end methods and back-end methods. Front-end methods mitigate the effect of noise by processing the speech signal or the speech features, while back-end methods adjust the models to follow changes in the environment, so that models and real environments match. This thesis focuses on front-end noise-robust methods: some existing algorithms are implemented and several new methods are proposed. Chapter 1 gives an overview of the development history of ASR and highlights the main components of an ASR system based on statistical modeling. Because noises are diverse, there are many kinds of noise-robust methods, each with its own characteristics and range of applicability. Chapter 2 therefore introduces and summarizes them from four aspects: robust feature extraction, speech enhancement, feature compensation/enhancement, and model adaptation. Chapter 3 first introduces offline feature compensation based on a first-order Vector Taylor Series (VTS) approximation using an explicit model of environmental distortion. The offline algorithm is not ideal in practice: its biggest disadvantage is its heavy computation, which reduces processing efficiency. A practical first-order VTS approximation is therefore proposed; it keeps performance comparable to the offline algorithm while greatly increasing efficiency. Although the practical first-order VTS algorithm achieves good performance, like the offline algorithm it assumes that, for each sentence, the noise feature vector in the cepstral domain follows a single Gaussian PDF (probability density function). Given the diversity and complexity of noise, this may not describe the noise distribution well, so the clean speech is estimated inaccurately, which ultimately affects recognition performance. Chapter 4 therefore proposes a first-order VTS approximation that assumes the noise feature vector in the cepstral domain follows a multi-Gaussian PDF. The results show that this method improves system performance to some extent.
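
The abstract does not give the equations, so the following is only a minimal sketch of first-order VTS feature compensation as it is commonly formulated, not the thesis's own algorithm. It assumes a single scalar log-spectral channel rather than the cepstral vectors used in the thesis, the widely used distortion model y = x + h + log(1 + exp(n - x - h)), a Gaussian mixture for clean speech, and an MMSE estimate of the clean feature. All function and parameter names (`mismatch`, `vts_gaussian`, `compensate`, `speech_gmm`, `noise_gmm`) are hypothetical. Passing one noise component corresponds to the single-Gaussian noise assumption; passing several corresponds to multi-Gaussian noise modeling.

```python
import math

def mismatch(x, n, h=0.0):
    """Assumed distortion model for one log-spectral channel:
    y = x + h + log(1 + exp(n - x - h)), with clean speech x,
    additive noise n and channel term h."""
    return x + h + math.log1p(math.exp(n - x - h))

def vts_gaussian(mu_x, var_x, mu_n, var_n, h=0.0):
    """First-order Taylor expansion of the model around (mu_x, mu_n, h).
    Returns the approximate mean and variance of the noisy feature."""
    g = 1.0 / (1.0 + math.exp(mu_n - mu_x - h))  # dy/dx; dy/dn = 1 - g
    mu_y = mismatch(mu_x, mu_n, h)
    var_y = g * g * var_x + (1.0 - g) ** 2 * var_n
    return mu_y, var_y

def gauss_pdf(v, mu, var):
    """Density of a univariate Gaussian."""
    return math.exp(-0.5 * (v - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def compensate(y, speech_gmm, noise_gmm, h=0.0):
    """MMSE-style estimate of the clean feature from a noisy feature y.
    speech_gmm and noise_gmm are lists of (weight, mean, variance).
    Each (speech, noise) component pair contributes the shift
    mu_y - mu_x, weighted by its posterior probability given y.
    Sketch only: no guard against all likelihoods underflowing to zero."""
    num = 0.0
    den = 0.0
    for wx, mx, vx in speech_gmm:
        for wn, mn, vn in noise_gmm:
            mu_y, var_y = vts_gaussian(mx, vx, mn, vn, h)
            lik = wx * wn * gauss_pdf(y, mu_y, var_y)
            num += lik * (mu_y - mx)
            den += lik
    return y - num / den

# Illustrative values only, not taken from the thesis.
speech = [(0.5, 2.0, 1.0), (0.5, 6.0, 1.0)]
single_noise = [(1.0, 1.0, 0.25)]
multi_noise = [(0.6, 0.5, 0.25), (0.4, 2.5, 0.5)]
clean_single = compensate(3.0, speech, single_noise)
clean_multi = compensate(3.0, speech, multi_noise)
```

In this sketch the cost of `compensate` grows with the product of the speech and noise mixture sizes, which illustrates why a richer noise model trades computation for a potentially better fit to the noise distribution.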

  • 【Classification code】TN912.34
  • 【Citations】4
  • 【Downloads】239