

Detecting and Processing Visual Information in a Speech Synthesis System Driven by Visual Speech

【Author】 王蒙军

【Advisor】 李刚

【Author information】 Tianjin University, Biomedical Engineering, 2007, PhD

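【Abstract (translated from Chinese)】 To restore spoken communication for people who acquired a speech impairment later in life but can still form correct lip shapes, and to explore building a lip-reading speech synthesis system based on visual information, this study analyzes and recognizes the visual information extracted from lip image sequences as a special kind of language. It examines in depth several basic problems in detecting and processing visual information: the correspondence between visual and speech information, the amount of information carried by the lip region and the lip contour, the choice of the best feature vector for a lip-reading system, and methods for extracting and recognizing lip features automatically and effectively. The main research content of the dissertation includes:
1. By analyzing face images in frontal and profile views, a new asymmetric lip contour model is built. It covers lip height and width as well as lip protrusion, and the time derivative of each parameter is computed to capture the dynamics of the lip contour. Different combinations of feature parameters are compared to analyze how parameter selection affects recognition. Experiments on isolated Chinese character pronunciations show that the model improves recognition performance by more than 25% on average. Based on this model, a bimodal Chinese speech database of commonly used characters, aimed at people with disabilities, was designed and built.
2. Grayscale images of lip-movement sequences are processed with motion detection and mathematical morphology to obtain the lip region and the lip contour. From the lip region, the projection of the lip width W, the height of the outer lip contour H, and the projection of the lip protrusion F are extracted. Their derivatives with respect to time give the difference features dW/dt, dH/dt and dF/dt, and the six parameters together form a 6-dimensional geometric feature vector.
3. The discrete Fourier transform (DFT) and the discrete cosine transform (DCT) are used to obtain Fourier descriptors and DCT descriptors of the lip contour. Both kinds of descriptors are used as lip contour feature vectors, a hidden Markov model (HMM) is used for training and recognition, and the ability of each kind of descriptor to characterize the lip contour is analyzed.
4. Feature fusion is used to improve the discriminative power of the feature vector. With weighted concatenation, the geometric feature vector of the lip region and the lip contour feature vector expressed by DCT descriptors are fused into a new feature vector. An HMM is trained and tested on it, and recognition performance is analyzed for different weighting factors.
5. A second-order HMM is used to train on and recognize sequences of lip feature parameters. It exploits the contextual correlation between the lip feature vectors of successive frames, which better matches the way Chinese is pronounced. Experiments compare the recognition ability of first-order and second-order HMMs on the same feature vectors.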
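The 6-dimensional geometric feature vector described in item 2 can be illustrated with a minimal sketch. The abstract does not say how the time derivatives are discretized, so the central-difference scheme below is an assumption, and the names `finite_difference` and `geometric_features` are hypothetical, not taken from the dissertation.

```python
from typing import List, Sequence, Tuple


def finite_difference(values: Sequence[float], dt: float = 1.0) -> List[float]:
    """Approximate the time derivative of a per-frame measurement.

    Assumed scheme: central differences for interior frames and
    one-sided differences for the first and last frame.
    """
    n = len(values)
    if n < 2:
        return [0.0] * n
    derivative = []
    for i in range(n):
        if i == 0:
            derivative.append((values[1] - values[0]) / dt)
        elif i == n - 1:
            derivative.append((values[-1] - values[-2]) / dt)
        else:
            derivative.append((values[i + 1] - values[i - 1]) / (2.0 * dt))
    return derivative


def geometric_features(
    width: Sequence[float],
    height: Sequence[float],
    protrusion: Sequence[float],
    dt: float = 1.0,
) -> List[Tuple[float, ...]]:
    """Build one (W, H, F, dW/dt, dH/dt, dF/dt) vector per frame."""
    if not (len(width) == len(height) == len(protrusion)):
        raise ValueError("W, H and F must have one value per frame")
    d_width = finite_difference(width, dt)
    d_height = finite_difference(height, dt)
    d_protrusion = finite_difference(protrusion, dt)
    return list(zip(width, height, protrusion, d_width, d_height, d_protrusion))
```

Here `dt` is the frame interval in whatever time unit the derivatives should use; with the default `dt=1.0` the derivatives are expressed per frame.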

【Abstract】 To develop a means of communication for voice-impaired people, a speech synthesis system driven by visual speech is explored. The system treats the visual information of lip movement in the mouth region as a special language. The research addresses several fundamental problems: how visual information correlates with sound information, how much information can be extracted from the lip region and the lip contours, how much the lip feature parameters contribute to a robust speechreading system, and what procedure is effective for extracting lip parameters automatically. The main research content of the dissertation includes:
1. Based on an analysis of frontal-view and profile-view face images, a new asymmetric lip contour model is presented from which the degree of pouting can be extracted. The time derivatives of the parameters are also calculated to describe the dynamic characteristics of the lip contour. Experimental results on a small database of Chinese words show that the parameters from the asymmetric lip contour model improve the recognition rate by more than 25%. Using this model, a Mandarin Chinese visual-speech database is designed for voice-impaired people.
2. Motion detection and morphological processing are used to extract the mouth area and lip contours from the image sequences. The lip features are then extracted from the mouth region: the projection of the width of the outer lip contour, W; the height of the outer lip contour, H; and the projection of the pouting, F. The differences of these parameters over time, dW/dt, dH/dt and dF/dt, are calculated as additional parameters that describe the dynamic information of the lip.
3. The Discrete Fourier Transform and the Discrete Cosine Transform are used to obtain descriptors of the lip contours in the asymmetric lip contour model automatically. Hidden Markov Models (HMMs) are trained with the two kinds of descriptors as feature vectors of the lip contours, and their recognition ability is tested.
4. Feature fusion is used to improve classification power. To balance the contributions of the two parts, a weighted combination is used: the geometric features of the lip region and the Discrete Cosine Transform descriptors of the lip contours are combined into a new feature vector. An HMM is trained and tested with this new vector, and the recognition rate is analyzed for different weighting factors.
5. A second-order Hidden Markov Model is implemented to train and test on the lip feature sequences. It captures more context information from the sequences and suits the pronunciation of Chinese. The recognition rates of the second-order and first-order Hidden Markov Models are compared on the same lip feature sequences.
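Items 3 and 4 can be sketched as follows. The abstract does not specify how a lip contour is sampled before the transform, how many coefficients are kept, or how the weighting factor is applied, so everything below is an assumption: the contour is taken to be a 1-D signature (for example, radial distances from the lip centre at equally spaced angles), the transform is an unnormalized DCT-II, and a single weight `weight` in [0, 1] scales the geometric part while `1 - weight` scales the contour part. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def dct2(signal: Sequence[float]) -> List[float]:
    """Unnormalized DCT-II: X[k] = sum_n x[n] * cos(pi / N * (n + 0.5) * k)."""
    n_samples = len(signal)
    return [
        sum(
            x * math.cos(math.pi / n_samples * (n + 0.5) * k)
            for n, x in enumerate(signal)
        )
        for k in range(n_samples)
    ]


def dct_descriptor(signature: Sequence[float], n_coeffs: int) -> List[float]:
    """Keep the first n_coeffs DCT coefficients as a compact contour descriptor."""
    if not 0 < n_coeffs <= len(signature):
        raise ValueError("n_coeffs must be between 1 and len(signature)")
    return dct2(signature)[:n_coeffs]


def fuse_features(
    geometric: Sequence[float], contour: Sequence[float], weight: float
) -> List[float]:
    """Weighted concatenation of two feature vectors (assumed weighting scheme)."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [weight * g for g in geometric] + [(1.0 - weight) * c for c in contour]
```

In this sketch, one fused vector per frame would then form the observation sequence passed to an HMM; sweeping `weight` over several values mirrors the comparison of weighting factors mentioned in item 4.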

  • 【Online publication contributor】 Tianjin University
  • 【Online publication year/issue】 2009, Issue 08

