
Research on a Lipreading System Based on the Visual Channel Only

(Original title: 基于单视觉通道唇读系统的研究)

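【Author】 Liang Yaling (梁亚玲)

【Advisor】 Du Minghui (杜明辉)

【Author information】 South China University of Technology, Communication and Information Systems, 2011, Ph.D.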

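【Abstract, translated from the Chinese】 Lipreading (speechreading) is a new research direction that has grown out of the combined development of artificial intelligence, image processing, pattern recognition, and related fields. It is widely used to raise the recognition rate of automatic speech recognition in noisy environments, and also for identity authentication in security systems, long-distance semantic recognition, language learning for people with hearing impairments, lip-based semantic learning for the elderly, and lip-command recognition in assistive systems for people with disabilities. Current lipreading research concentrates on using the video channel as a supplement to the audio channel in order to improve speech recognition. In a truly high-noise environment, however, the information carried by the speech channel drops sharply and the recognition rate of the system depends mainly on the visual channel, so semantic recognition based on the visual channel alone is very important. Research on visual-only lipreading is still at a fairly early stage: it targets small vocabularies, and its recognition rates are relatively low. The goal of this thesis is to extend the vocabulary to a larger one and to raise the recognition rate of visual-only lipreading. The thesis studies several key problems of visual-only lipreading systems systematically, in depth, and broadly. The main work and results are as follows.

(1) Databases from China and abroad were surveyed. In view of the research subject, the HITBICAVDatabase of Harbin Institute of Technology was adopted as the main database, and characters with different phonetic transcriptions were selected from it to build a sub-database suited to this research, database9603. The region of interest was extracted from every image in this sub-database, producing a database that can be used directly for feature extraction and recognition. A small bimodal lipreading database was also built and preprocessed in the same way.

(2) For the extraction of the lip region of interest (ROI), a method based on facial structure and gray-level information is proposed. Analysis of a large number of faces shows that the width of the mouth is comparable to the distance between the eyes, so the two pupils are used to locate the left and right boundaries of the lip region, to scale the lip image, and to adjust its horizontal alignment. Gray-level projection is used to detect the lip corners and locate the vertical position of the lips. Images extracted this way share a relatively fixed reference and therefore reflect the true size and shape of the lips; the method is robust to camera zoom and head tilt. For lip extraction (segmentation), a method based on the a component of the LAB color space is proposed. The separability of the components of several color spaces was studied, and the Fisher criterion identified the color component 'a' as one that effectively separates lip from non-lip regions (skin, teeth, beard, and so on). The method extracts the lips well and generates its threshold automatically from image features, which makes automatic lip extraction easier. For contour-based lip extraction, a manifold-based lip contour extraction method is proposed; experiments show that the contours it produces are closer to the true lip contour. The 'a'-component method and the manifold method are also combined, and experiments show that combining chromaticity and contour information extracts the lips better than either alone.

(3) Lip feature representation was studied. A DT-CWT+PCA lip feature extraction method is proposed. DT-CWT has approximate shift invariance and good directional selectivity, so it captures the edge and frequency-domain information of the lip ROI well and compensates for displacement introduced during ROI extraction. Experiments show that this feature extraction method raises the recognition rate. Because rearranging the DT-CWT magnitude coefficients in the DT-CWT+PCA method discards the geometric information of the data, a combined space-frequency feature extraction method based on DT-CWT+LBP+PCA is also proposed. Its features reflect both frequency-domain and spatial-domain information and both local and global information, and they are invariant to shift and rotation. Experiments show that the DT-CWT and LBP space-frequency features raise the lipreading recognition rate considerably.

(4) Effective dimensionality reduction of lip features was studied. A DCT+ONPP feature extraction method is proposed; orthogonal neighborhood preserving projections (ONPP) preserve the geometric structure of the data while reducing its dimension. Experiments show that the method raises the recognition rate. Among supervised learning methods, locality sensitive discriminant analysis (LSDA) is proposed for extracting features from lip images. LSDA combines the advantages of LDA and LPP and fully reflects the local geometric characteristics of the lips. Experiments show that it achieves a higher recognition rate than LDA and traditional methods, and a higher rate than unsupervised dimensionality reduction methods.

(5) To handle the fact that samples in a lipreading system contain different numbers of frames, the concept of the lip gray energy image is proposed, together with corresponding feature extraction methods. The lip gray energy image is obtained by superposing and averaging the gray-level lip images of a sequence, and the feature dimension of the samples is normalized in the projection process. It keeps the static features of the lip images while also reflecting their dynamic features, removes the correlation between per-frame features that arises when traditional methods extract features frame by frame, greatly reduces the feature dimension, shortens recognition time, and raises the recognition rate. With the lip gray energy image, feature extraction methods from face recognition and supervised feature extraction methods can be ported to lip feature extraction very easily. On this basis, the DT-CWT+LBP feature representation and LDA dimensionality reduction are applied to the lip gray energy image. Experiments show that traditional feature representation and dimensionality reduction methods remain applicable to the lip gray energy image, and that the energy-image approach achieves a higher recognition rate than traditional methods.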
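The component selection in contribution (2) can be illustrated with the two-class Fisher criterion for a single scalar feature, (difference of class means)² / (sum of class variances). The following is only a sketch under assumptions: the record does not give the thesis's exact formulation, the function names `fisher_score` and `most_separable_component` are hypothetical, and the pixel values are made-up toy numbers rather than measurements from any database.

```python
from statistics import fmean, pvariance


def fisher_score(lip_values, non_lip_values):
    """Two-class Fisher criterion for one scalar feature.

    Larger scores mean the two classes are better separated along
    this feature. (Assumed formulation, not taken from the thesis.)
    """
    mean_gap = fmean(lip_values) - fmean(non_lip_values)
    spread = pvariance(lip_values) + pvariance(non_lip_values)
    if spread == 0:
        # Degenerate case: both classes are constant.
        return float("inf") if mean_gap != 0 else 0.0
    return mean_gap ** 2 / spread


def most_separable_component(samples):
    """Pick the color component with the highest Fisher score.

    `samples` maps a component name to a pair
    (lip pixel values, non-lip pixel values) from labeled pixels.
    """
    scores = {
        name: fisher_score(lip, non_lip)
        for name, (lip, non_lip) in samples.items()
    }
    return max(scores, key=scores.get), scores


# Made-up toy values for three components, for illustration only.
toy_samples = {
    "L": ([50, 55, 60], [52, 58, 61]),
    "a": ([30, 32, 34], [10, 12, 14]),
    "b": ([20, 25, 30], [18, 22, 28]),
}
best, scores = most_separable_component(toy_samples)
print(best, scores)
```

With these toy values the 'a' entry wins because its class means are far apart relative to the within-class spread; real lip and skin pixels would have to be labeled and converted to each candidate color space before such a comparison means anything.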

【Abstract】 Lip-reading is a new research direction arising from joint developments in artificial intelligence, image processing, pattern recognition, and related fields. It has been studied as a complement that improves speech recognition in noisy environments, and it is also used for speaker identification in security systems, for semantic recognition at a distance, for language learning by hearing-impaired people, for recognizing the lip movements of older people, and as an assistive system for people with disabilities. Most research to date still treats lip-reading as a complement to automatic speech recognition in noise. In real environments, however, the quality of the audio channel drops dramatically under noise, and for hearing-impaired people the voice channel conveys no information at all, so lip-reading based on the visual channel alone is very important. Visual-only systems are still at an early stage: they are limited to small vocabularies, and their recognition rates are relatively low. The aim of this thesis is therefore to extend the vocabulary to medium and large sizes and to improve the recognition rate of visual-only systems. Several key issues in visual-only lip-reading systems are studied. The main work and contributions of the thesis are as follows.

(1) The available databases were surveyed and the HITBICAVDatabase was chosen as the main database. One word was chosen for each pronunciation to build a sub-database, database9603, for this research. The database was preprocessed, including lip location and normalization, so that it can be used directly for feature extraction. In addition, a small lip-reading database was set up. It contains videos of 10 male and 10 female speakers saying the ten digits 0 to 9, with each digit spoken 10 times.

(2) Analysis of the structure of a large number of faces shows that the width of the lips is slightly smaller than the distance between the two eyes. A lip location method based on facial structure and luminance is therefore proposed. It uses the distance between the two pupils as the reference for the borders of the lip region of interest (ROI), and uses the line through the two pupils to rotate the lips to the horizontal and scale them to a specified size. Because the proposed ROI segmentation method has a fixed reference, it reflects the real size and shape of the lips and is robust to zoom and to inclination of the face. Based on the separability of lip and non-lip pixels in the components of different color spaces, a lip extraction method based on the 'a' component of the LAB color space is proposed. The method performs very well and generates its threshold automatically, which is very useful for automating lip-reading. For lip contour extraction, a manifold-based method is proposed, and the experimental results show that its contours are closer to the real ones. A method combining the 'a' component and the manifold is also proposed, and the experimental results show that it performs better than either method alone.

(3) For feature representation, a method based on DT-CWT+PCA is proposed. The approximate shift invariance of DT-CWT makes it well suited to overcoming shifts of the ROI, and its directional selectivity makes it well suited to extracting the edges of the lips. The experimental results show that it performs better than DCT+PCA. Because rearranging the DT-CWT coefficients discards geometric information, and local information is very important, a method based on DT-CWT+LBP+PCA is also proposed. Its hybrid features reflect both frequency-domain and spatial-domain characteristics, as well as both global and local properties. The experimental results show that it improves the recognition rate greatly.

(4) For dimensionality reduction, a DCT+ONPP method is proposed. ONPP is a manifold-based method that preserves the neighborhood geometry of the data while also reflecting its global properties. The experimental results show that it is better than DCT+PCA and more suitable for lip-reading. For supervised learning, a DCT+LSDA feature extraction method is proposed. LSDA combines LDA and LPP, and the experiments show that it is also effective for lip-reading systems.

(5) To solve the problem that different samples have different numbers of frames, the Lip Gray Energy Image (LGEI) is proposed, which normalizes the feature dimension. It keeps both the static and the dynamic features of the lip sequence. Compared with traditional methods that extract features from each frame separately, feature extraction based on the LGEI reduces the feature dimension, shortens computation time, and improves the recognition rate. The LGEI concept also makes it easier to apply methods from face recognition to lip-reading. Based on the LGEI, DT-CWT+LBP is used to represent the lips and LDA is used to reduce the feature dimension. The experimental results show that the proposed method performs better and improves the recognition rate greatly.
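The Lip Gray Energy Image from contribution (5) can be sketched as a pixel-wise mean over the grayscale ROI frames of one utterance, following the Chinese abstract's description of superposition and averaging (叠加平均). This is a minimal illustration under assumptions, not the thesis's implementation: the function name `lip_gray_energy_image` is hypothetical, NumPy is an arbitrary choice, and all frames are assumed to be already cropped and scaled to the same ROI size.

```python
import numpy as np


def lip_gray_energy_image(frames):
    """Average a sequence of same-sized grayscale lip ROI frames.

    The result has the shape of a single frame, so utterances with
    different numbers of frames map to features of the same dimension.
    (Assumed definition: plain pixel-wise mean over time.)
    """
    # np.stack raises ValueError for an empty list or mismatched shapes.
    stack = np.stack([np.asarray(f, dtype=np.float64) for f in frames])
    if stack.ndim != 3:
        raise ValueError("expected a sequence of 2-D grayscale frames")
    return stack.mean(axis=0)


# Two toy "utterances" with 3 and 5 constant-valued frames of size 2x3.
short_utterance = [np.full((2, 3), v) for v in (0, 10, 20)]
long_utterance = [np.full((2, 3), v) for v in (4, 8, 12, 16, 20)]

lgei_short = lip_gray_energy_image(short_utterance)
lgei_long = lip_gray_energy_image(long_utterance)

# Both energy images are 2x3 regardless of the frame count.
print(lgei_short.shape, lgei_long.shape)
```

Because the output is a single image, any single-image feature pipeline, such as the DT-CWT+LBP representation followed by LDA mentioned in the abstract, could in principle take it as input without per-frame handling.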
