Research on Robust Algorithms in Continuous Speech Recognition (连续语音识别的稳健性技术研究)

【Author】 Xu Wang (徐望)

【Advisor】 Wang Bingxi (王炳锡)

【Author information】 PLA Information Engineering University, Military Intelligence Studies, 2006, PhD

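【Summary】 Speaker differences, channel distortion, and background noise cause a mismatch between the training and test environments, which severely degrades the performance of speaker-independent continuous speech recognition systems. To improve the robustness and adaptability of Chinese continuous speech recognition systems, this dissertation studies key techniques including speaker normalization, speech enhancement, endpoint detection, feature compensation, and uncertainty decoding from three perspectives: signal space, feature space, and model space. It proposes several new ideas and methods and supports them with extensive experiments. The main work is as follows:

1. The bilinear frequency warping method is introduced into vocal tract length normalization. Traditional frequency warping methods assume an overly simple vocal tract model and change the spectral bandwidth of the transformed signal. Based on the mapping formula for the low-pass filter cutoff frequency in the bilinear transform, this dissertation derives the frequency warping factor that aligns the third formant of different speakers or speaker groups. With this warping factor, a bilinear transform is applied to the positions and widths of the Mel filterbank, yielding vocal-tract-length-normalized feature vectors. The method avoids a linear search for the warping factor and exploits the fact that the bilinear transform keeps the warped spectrum continuous with no change in bandwidth. Experiments show that the method is fast and particularly suited to unsupervised operation. After vocal tract length normalization of the speech features, in isolated-word recognition the baseline system trained on adult male data improves its recognition rate on adult female data from 71.50% to 91.00% and on children's data from 71.00% to 84.00%; in continuous speech recognition, the HMM acoustic model set trained on male data improves its recognition rate on female data from 13.91% to 50.56%.

2. A Gaussian mixture model (GMM) classifier is used to classify the channel environment of test utterances. In multi-channel speech recognition, a baseline system whose channel environment matches that of the test utterance achieves a clearly higher recognition rate than a baseline system trained on data from a single channel or on mixed data from several channels. If a separate GMM is built from the data of each channel, the differences between channels are reflected in the differences between the GMMs, which are separable. This dissertation trains a GMM channel model and an HMM acoustic model from the training data of each telephone channel; at recognition time, the test utterance is classified by channel and the HMM acoustic model of the corresponding channel is selected to recognize it. Experimental results show that the method effectively improves the recognition rate in multi-channel environments.

3. A subspace noise-reduction algorithm based on the discrete cosine transform and the auditory masking effect is derived. The discrete cosine transform is used to approximate the Karhunen-Loeve transform in the eigendecomposition, and a perceptual filter based on the Johnston masking model post-filters the denoised speech. Using a fast eigendecomposition algorithm based on the discrete cosine transform, the method reduces the computational complexity from O(N^3) to O(N^2) while effectively suppressing residual noise.

4. A definition of eigenspace energy entropy is proposed. When the background noise is colored or its energy varies, traditional speech endpoint detection methods often fail. The noisy speech space can be divided into two orthogonal subspaces: a signal-plus-noise subspace and a noise subspace. Because speech is generated by a deterministic nonlinear dynamical system, its energy concentrates in the signal-plus-noise subspace, whereas the energy of random noise is distributed approximately uniformly over the whole noisy speech space. Speech and noise therefore have different spatial energy distributions and different spatial energy entropies. This dissertation performs an eigendecomposition of the covariance matrix of the speech signal, derives the probability distribution of the signal's energy in the eigenspace from the eigenvalues, and proposes the eigenspace energy entropy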
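The closed-form warping factor described in item 1 can be illustrated with a minimal Python sketch. It assumes the standard first-order all-pass (bilinear) low-pass-to-low-pass mapping; the function names, the 16 kHz sampling rate, and the third-formant values (2900 Hz and 2500 Hz) are hypothetical choices for illustration and are not taken from the dissertation, whose exact formulation may differ.

```python
import math

def alpha_from_alignment(f_src, f_tgt, fs):
    """Warping factor that maps f_src onto f_tgt (both in Hz).

    Assumption: uses the textbook low-pass-to-low-pass cutoff mapping
    alpha = sin((theta - omega) / 2) / sin((theta + omega) / 2).
    """
    omega = 2.0 * math.pi * f_src / fs   # frequency to be moved (rad/sample)
    theta = 2.0 * math.pi * f_tgt / fs   # frequency it should land on
    return math.sin((theta - omega) / 2.0) / math.sin((theta + omega) / 2.0)

def bilinear_warp(f, alpha, fs):
    """Warp a frequency in Hz with the first-order all-pass phase function."""
    omega = 2.0 * math.pi * f / fs
    theta = omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                     1.0 - alpha * math.cos(omega))
    return theta * fs / (2.0 * math.pi)

# Hypothetical example: align a test speaker's F3 (2900 Hz) with the
# training population's F3 (2500 Hz) at a 16 kHz sampling rate.
fs = 16000.0
alpha = alpha_from_alignment(2900.0, 2500.0, fs)

# Warp illustrative filterbank centre frequencies (not real Mel centres).
centres = [500.0, 1000.0, 2000.0, 2900.0, 4000.0, 8000.0]
warped = [bilinear_warp(f, alpha, fs) for f in centres]
```

Because the mapping fixes both 0 Hz and the Nyquist frequency, the warped axis covers the same band as the original one, which is the "no bandwidth change" property mentioned above. No search over candidate factors is needed: one formant estimate per speaker or speaker group gives the factor directly.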

【Abstract】 Inter-speaker variation, channel distortion, and background noise cause a mismatch between the training and testing conditions. This mismatch significantly degrades the performance of speaker-independent continuous speech recognition systems. To increase the robustness and adaptability of Chinese continuous speech recognition, this dissertation studies speaker normalization, speech enhancement, endpoint detection, feature compensation, and uncertainty decoding in detail, from the viewpoints of signal space, feature space, and model space. Several new methods are proposed and verified by extensive experiments. The main contributions of the dissertation are as follows:

1. A vocal tract length normalization method based on bilinear frequency warping is proposed. Traditional frequency warping methods have two faults: the vocal tract model is too simple, and the bandwidth of the transformed signal differs from that of the original. We compute the frequency warping factor from the cutoff-frequency mapping between the prototype low-pass filter and the desired low-pass filter. The Mel filterbanks are then adjusted by bilinear frequency warping to obtain vocal-tract-normalized MFCCs. The method avoids an exhaustive search for the frequency warping factor and warps the spectrum continuously without changing the bandwidth. It is shown to be a fast adaptation technique that is especially suitable for unsupervised adaptation. The effectiveness of the method is examined on isolated and continuous speech recognition. The baseline isolated digit recognizer is trained on adult male data, and the baseline continuous speech recognizer is trained on male data. After vocal tract normalization, the isolated digit recognition accuracy improves from 71.50% to 91.00% for adult female speakers and from 71.00% to 84.00% for children. In continuous speech recognition, the recognition accuracy for female speakers improves from 13.91% to 50.56%.

2. To increase the robustness of speech recognition in multi-channel environments, a Gaussian mixture model (GMM) based channel classifier is used. If the speech signals that pass through a given channel are modeled by a GMM, the differences between channels are characterized by the GMMs, and the GMMs of different channels are discriminable. The GMM-based channel classifier selects the most likely HMM from HMMs pre-trained for each specific telephone channel environment. The selected HMM is then used as the reference HMM to recognize the utterance. Results on Mandarin continuous speech recognition show that the proposed scheme is an efficient framework for enhancing the robustness of speech recognition in multi-channel environments.

3. A speech enhancement algorithm based on the discrete cosine transform and auditory masking properties is derived. The discrete cosine transform is used to approximate the Karhunen-Loeve transform (KLT) in subspace-based speech enhancement, which reduces the computation of eigenvalues of an N×N symmetric Toeplitz matrix
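The channel-selection step in contribution 2 can be sketched in Python as follows. This is a minimal illustration, assuming diagonal-covariance GMMs whose parameters are already trained; the channel names, toy parameters, and function names are hypothetical, and the per-channel HMM is represented only by a placeholder string.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def gmm_log_likelihood(frames, gmm):
    """Total log-likelihood of an utterance (list of frames) under a GMM.

    gmm is a list of (weight, mean, variance) components.
    """
    total = 0.0
    for x in frames:
        comps = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
        top = max(comps)  # log-sum-exp for numerical stability
        total += top + math.log(sum(math.exp(c - top) for c in comps))
    return total

def classify_channel(frames, channel_gmms):
    """Return the channel whose GMM scores the utterance highest."""
    return max(channel_gmms,
               key=lambda name: gmm_log_likelihood(frames, channel_gmms[name]))

# Hypothetical two-dimensional toy models for two telephone channels.
channel_gmms = {
    "landline": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    "mobile": [(1.0, [5.0, 5.0], [1.0, 1.0])],
}
# Placeholder for the per-channel HMM acoustic models.
channel_hmms = {"landline": "hmm_landline", "mobile": "hmm_mobile"}

utterance = [[4.8, 5.1], [5.2, 4.9]]          # toy feature frames
channel = classify_channel(utterance, channel_gmms)
selected_hmm = channel_hmms[channel]           # model used for decoding
```

Scoring the whole utterance rather than single frames lets the decision accumulate evidence over time, and the chosen label is used only to pick which pre-trained acoustic model decodes the utterance.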
