Research on the Application of Lip-Movement Information in Speech Recognition under Audio-Noise Environments
Research on Noise Treatment of Speech Recognition with Lip-movement Information
【Author】 奉小慧 (Feng Xiaohui)
【Supervisor】 贺前华 (He Qianhua)
【Author Information】 South China University of Technology, Communication and Information Systems, 2010, PhD
【Abstract (translated from Chinese)】 Traditional speech recognition uses only acoustic information, whereas audio-visual bimodal speech recognition takes the speaker's lip-movement information together with the acoustic speech information as feature parameters and performs recognition jointly, offering a new way to improve the robustness and noise immunity of speech recognition systems. This thesis focuses on practical problems in audio-visual speech recognition: front-end processing of the video images, visual feature extraction, and audio-visual information fusion. The main work is as follows:
1) A Chinese sentence-level bimodal speech database (BiModal Speech Database, BiMoSp) was built for an in-vehicle control system, comprising data from 26 speakers (14 male, 12 female). A questionnaire survey of drivers yielded the 68 most frequently used vehicle-device control commands as the corpus, and each speaker provided 4 audio-visual samples for every control sentence.
2) A lip-region localization algorithm based on multiple color spaces is proposed. The algorithm combines the color edge detection result in the RGB space with the hue and saturation components of the HSV space, adjusts the baseline of the lip region according to the positional characteristics of the mouth, determines the lip boundary points by projection, and finally localizes the lip region in the binary image (a minimal sketch of this step follows the abstract). To improve the robustness of the video processing, images from other databases were also included in the experiments; the localization accuracy was 98.25%, 3.37% higher than a PCA-based localization algorithm.
3) To improve the accuracy and speed of contour extraction, an improved geometric active contour (GAC) model using multi-directional gradient information and prior knowledge is proposed. The multi-directional gradient information and the prior knowledge that the mouth is approximately elliptical (prior shape) are introduced into the level-set energy function, avoiding the shortcomings of the traditional GAC model in lip contour extraction. Compared with the traditional GAC, the model raises the accuracy of lip contour extraction by 8.38%.
4) A dynamic feature extraction method based on inter-frame distance and linear discriminant analysis (LDA) projection is proposed, compensating for the defects of difference (delta) features; the resulting features both embed prior knowledge of the speech classes and capture texture-change information in the visual features (a second sketch follows the abstract). Experiments show that applying the inter-frame distance operation to static features derived from the DTCWT lowers the recognition error rate by a relative 3.25%, while the LDA transform of the same static features lowers it by a relative 6.50%. Combining the LDA-transformed features with first- and second-order difference features further lowers the error rate, relative to the static features, by 9.44% and 15.43% respectively. The final dynamic features, combining the inter-frame distance and the LDA differences, lower the error rate by 20.12% relative to the static features.
5) A dual training model is proposed to improve the recognition performance of audio-visual feature fusion. Considering the noise effects caused by the mismatch between training and test data, and without affecting recognition speed, a noise model and a baseline model jointly perform audio-visual feature-fusion speech recognition. Experiments in noisy conditions on an English audio-visual database (AMP-AVSp) and the Chinese bimodal database (BiMoSp) show that the dual training model greatly improves recognition under heavy noise: at SNR = -5 dB, the error rates on AMP-AVSp and BiMoSp drop by 45.27% and 37.24% respectively compared with the baseline model alone.
6) A decision-fusion method based on integer linear programming (ILP) is proposed for selecting the optimal stream exponents. Exploiting the linear combination of log-likelihoods in decision fusion, a stream-exponent selection model is built with the proposed maximum log-likelihood distance (MLLD) criterion. In the experiments, stream exponents chosen by exhaustive search with a step of 0.05 served as the reference; the stream weights and recognition results of the two methods are very close. Since exhaustive search generally attains the optimum, this agreement indicates that the ILP model can select optimal stream exponents for audio-visual decision fusion and achieve the best recognition performance.
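As a concrete illustration of the multi-color-space localization in item 2, here is a minimal Python sketch assuming OpenCV and NumPy; the hue/saturation thresholds, the lower-face search window, and the function name `locate_mouth` are illustrative assumptions, not values from the thesis.

```python
import cv2
import numpy as np

def locate_mouth(bgr_face):
    """Return an (x, y, w, h) box around the lips in a cropped face image."""
    h = bgr_face.shape[0]
    # Search only the lower third of the face, where the mouth lies.
    top = 2 * h // 3
    roi = bgr_face[top:, :]

    # HSV hue/saturation cue: lip pixels are reddish and fairly saturated.
    hsv = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
    hue, sat = hsv[..., 0], hsv[..., 1]
    color_mask = (((hue < 10) | (hue > 170)) & (sat > 60)).astype(np.uint8) * 255

    # RGB-space edge cue: strong gradients along the lip boundary.
    gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
    edges = cv2.dilate(cv2.Canny(gray, 50, 150), np.ones((5, 5), np.uint8))

    # Fuse the two cues into one binary image, then project it onto the
    # axes; the nonzero extents give the lip bounding box.
    mask = cv2.bitwise_and(color_mask, edges)
    xs = np.flatnonzero(mask.sum(axis=0))
    ys = np.flatnonzero(mask.sum(axis=1))
    if xs.size == 0 or ys.size == 0:
        return None  # no lip-colored edges found
    return int(xs[0]), int(top + ys[0]), int(xs[-1] - xs[0]), int(ys[-1] - ys[0])
```

The thesis additionally adjusts the mouth baseline from the face geometry before projecting; that refinement is omitted here.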
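Likewise, a minimal sketch of the inter-frame-distance and LDA dynamic features of item 4, assuming scikit-learn and NumPy; the window offset `k`, per-frame class labels, and the feature-stacking layout are assumptions for illustration (the thesis derives its static features from the DTCWT).

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def frame_distance(static, k=2):
    """Per-frame Euclidean distance to the frame k steps ahead (edge-padded)."""
    padded = np.vstack([static, np.repeat(static[-1:], k, axis=0)])
    return np.linalg.norm(padded[k:] - static, axis=1, keepdims=True)

def deltas(feats):
    """First-order frame differences, a simple stand-in for regression deltas."""
    return np.diff(feats, axis=0, prepend=feats[:1])

def dynamic_features(static, labels, n_components=10):
    """Stack static, frame-distance, LDA-projected, and delta/delta-delta streams.

    static: (T, D) per-frame static visual features (DTCWT-based in the thesis).
    labels: (T,) per-frame class labels; n_components must stay below the
    number of distinct classes for LDA.
    """
    lda = LinearDiscriminantAnalysis(n_components=n_components)
    projected = lda.fit_transform(static, labels)  # embeds class knowledge
    fd = frame_distance(static)                    # inter-frame texture change
    d1 = deltas(projected)                         # first-order dynamics
    d2 = deltas(d1)                                # second-order dynamics
    return np.hstack([static, fd, projected, d1, d2])
```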
【Abstract】 Audio-visual speech recognition (AVSR), also known as bimodal speech recognition, has become a promising way to significantly improve the robustness of automatic speech recognition (ASR). Motivated by the bimodal nature of human speech perception, work in this field aims to improve ASR by exploiting the visual modality of the speaker's mouth region in addition to the traditional audio modality. This thesis addresses several key issues in AVSR, namely lip contour extraction, visual feature extraction, and audio-visual fusion. The main contributions are:
1) An audio-visual bimodal continuous speech database for vehicular voice control was collected. It covers 26 speakers (14 male, 12 female), each speaking every continuous sentence 4 times; its 68 sentences were drawn from the conclusions of a driver survey.
2) An adaptive mouth-region detection algorithm based on multiple color spaces is presented. The algorithm combines color edge detection in the RGB color space with threshold segmentation in the HSV color space. According to the position of the mouth within the face, an adaptive lip localization method detects the mouth baseline automatically, and the rectangular mouth region is then found by projection. Experimental results show that the proposed algorithm locates the mouth region quickly, accurately, and robustly, with a correct rate of 98.25%, a 3.37% improvement over Principal Component Analysis (PCA).
3) To increase the accuracy and speed of lip contour extraction, an improved Geometric Active Contour (GAC) model based on a Prior Shape (PS) and multi-directional gradient information is proposed. The multi-directional gradient information and the lip prior shape are introduced into the level-set energy function, so the improved model avoids the failure modes of the traditional GAC model in lip contour extraction. Experiments show that the accuracy of lip contour detection with the PS-level-set model is 8.38% higher than with the GAC model.
4) A dynamic visual feature extraction method based on frame distance and LDA is proposed. The resulting feature not only captures important lip-motion information but also embodies prior speech-classification knowledge. Evaluation experiments demonstrate that the static DTCWT feature with frame distance yields a 3.25% improvement, and with LDA a 6.50% improvement. With further delta and delta-delta augmentation, recognition improves by 9.44% and 15.43% respectively, and the final dynamic feature improves on the static feature by 20.12%.
5) A dual training model is proposed to improve the recognition rate of audio-visual feature fusion. Considering the noise caused by the mismatch between training and test data, as well as recognition speed, a noise-trained model and a baseline model are used together for audio-visual feature-fusion speech recognition. Experiments on two audio-visual speech databases, the English AMP-AVSp and the Mandarin BiMoSp, show that the dual training model improves recognition accuracy on both: at SNR = -5 dB, the improvements are 45.27% and 37.24% respectively.
6) A new weight-estimation method based on Integer Linear Programming (ILP) is developed to estimate the optimal exponent weights for combining the audio (speech) and visual (mouth) streams in audio-visual decision fusion (see the sketch after this abstract). Based on the linear combination of the two streams' log-likelihoods and the Maximum Log-Likelihood Distance (MLLD) criterion, an ILP model is built. In the experiments, exhaustive search (ES) and frame dispersion (FD) of hypotheses serve as reference methods; the ILP results are close to those of ES and superior to those of FD. Since ES finds the optimal result, this indicates that ILP can also obtain the optimal stream weights for audio-visual decision-fusion speech recognition.
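A minimal sketch of the exponent-weighted decision fusion and the 0.05-step exhaustive search used as the reference in item 6, assuming NumPy; the ILP/MLLD model itself is not reproduced, and the array shapes and function names are illustrative assumptions.

```python
import numpy as np

def fuse(ll_audio, ll_video, lam):
    """Exponent-weighted fusion: lam * audio + (1 - lam) * video log-likelihoods.

    ll_audio, ll_video: (N, C) per-utterance, per-class log-likelihoods.
    """
    return lam * ll_audio + (1.0 - lam) * ll_video

def exhaustive_stream_weight(ll_audio, ll_video, labels, step=0.05):
    """Grid-search the stream weight that maximizes accuracy on held-out data."""
    best_lam, best_acc = 0.0, -1.0
    for lam in np.arange(0.0, 1.0 + step / 2, step):
        pred = fuse(ll_audio, ll_video, lam).argmax(axis=1)
        acc = float((pred == labels).mean())
        if acc > best_acc:
            best_lam, best_acc = lam, acc
    return best_lam, best_acc
```

In the thesis, the ILP model replaces this grid search with a constrained optimization under the MLLD criterion; the exhaustive search above is only the reference baseline against which the ILP solution is compared.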
【Key words】 audio-visual speech recognition; lip movement; contour extraction; dynamic feature; audio-visual fusion