The Study on Key Technologies of Realistic Chinese Visual Speech Synthesis (真实感汉语可视语音合成关键技术研究)

【Author】 Zhao Hui (赵晖)

【Advisor】 Tang Chaojing (唐朝京)

【Author information】 National University of Defense Technology, Information and Communication Engineering, 2010, PhD

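【Abstract (translated from the Chinese)】 Visual speech synthesis, also called speech animation synthesis, generates a sequence of face images that matches a given text or speech signal and so helps people understand the spoken content. The technology has important applications in human-computer interaction, film and entertainment, and information countermeasures. This dissertation proposes a design for a large-scale Chinese bimodal corpus and a lip extraction method for noisy color images. On that basis it proposes several realistic Chinese visual speech synthesis methods and designs and implements a demonstration system built around visual speech synthesis. Experimental results show that the proposed methods can achieve goals such as information deception in real time, accurately, and effectively. The work comprises the following.

For noisy color face images, a lip extraction method based on peak-tendency-detection segmentation is proposed. The segmentation consists of a parallel-line projection segmentation algorithm and a histogram-based weighted fuzzy clustering segmentation algorithm. The core idea of the parallel-line projection algorithm is to convert a two-dimensional histogram into a one-dimensional histogram according to a mapping rule, combining the accuracy of two-dimensional histogram segmentation with the real-time performance of one-dimensional histogram segmentation. Experimental results show that the lip extraction method is highly accurate, provides precise lip coordinates for realistic Chinese visual speech synthesis, and can be used to select lip material for the corpus.

A design for the large-scale Chinese bimodal corpus Bi-VSSDatabase is proposed. Principles for selecting the raw corpus and naming rules for its component files are established; a mouth-shape feature parameter clustering method based on artificial immune hybrid clustering is proposed; a tri-viseme model that reflects coarticulation in Chinese is built, and a bimodal corpus selection algorithm is derived from it; and methods for annotating and segmenting the bimodal corpus are designed. Statistical indicators such as coverage rate and coverage efficiency are computed, and the results show that Bi-VSSDatabase provides authentic, accurate, and broadly representative bimodal material for realistic Chinese visual speech synthesis.

Three speech-driven visual speech synthesis methods are proposed: HMM state synthesis, mixed-parameter synthesis, and two-layer HMM synthesis. Two text-driven methods are also proposed, one based on HMMs and one based on unit concatenation, for which a unit search procedure is designed and concatenation rules are defined. Chinese tri-visemes and Chinese dynamic visemes are each used as the basic unit for training and synthesis. The subjective satisfaction and objective evaluation results for the tri-viseme-based sequences are both rated good or better, showing that the proposed methods can synthesize smooth, continuous, and satisfactory mouth-shape sequences. For the problem of stitching the mouth sequence into the background video, a lip-region inpainting method based on the fast marching algorithm is proposed, which yields complete, natural, and fluent talking-head video.

An objective quality evaluation method for visual speech based on an improved product HMM is proposed. It can simulate human audio-visual perception of talking-head video and gives an evaluation result from an objective standpoint. The evaluation compares the quality of the synthesis methods proposed in this dissertation and shows that visual speech synthesis can greatly improve people's ability to understand speech content, especially for hearing-impaired listeners.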
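The abstract says that the parallel-line projection step turns a two-dimensional histogram into a one-dimensional one "according to a mapping rule" but does not state the rule. The sketch below is only an illustration under assumptions: it assumes a square 2D histogram (for example pixel value against neighborhood mean) and sums the bins along lines parallel to the anti-diagonal, so that bin (i, j) maps to 1D bin i + j. The threshold is then picked with Otsu's method, which is swapped in here as a standard 1D criterion and is not the dissertation's peak-tendency test or weighted fuzzy clustering. The function names are hypothetical.

```python
from typing import List, Sequence


def project_2d_histogram(hist2d: Sequence[Sequence[int]]) -> List[int]:
    """Collapse a square 2D histogram to 1D (assumed mapping: (i, j) -> i + j).

    All bins lying on one line parallel to the anti-diagonal are summed,
    so an n x n histogram becomes a 1D histogram with 2n - 1 bins.
    """
    n = len(hist2d)
    hist1d = [0] * (2 * n - 1)
    for i, row in enumerate(hist2d):
        for j, count in enumerate(row):
            hist1d[i + j] += count
    return hist1d


def otsu_threshold(hist: Sequence[int]) -> int:
    """Return the bin t that maximises between-class variance.

    Class 0 holds bins 0..t, class 1 holds the remaining bins.
    Ties are resolved in favour of the smallest t.
    """
    total = sum(hist)
    total_sum = sum(k * c for k, c in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0      # number of samples in class 0
    sum0 = 0    # sum of bin indices (weighted by count) in class 0
    for t, c in enumerate(hist):
        w0 += c
        sum0 += t * c
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        mu0 = sum0 / w0
        mu1 = (total_sum - sum0) / w1
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t
```

Working on the projected histogram keeps the search one-dimensional (2n - 1 candidate thresholds instead of n² threshold pairs), which is the kind of speed-for-accuracy trade the abstract attributes to the projection step.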

【Abstract】 Visual speech synthesis, also called speech animation, generates a sequence of face images from a given text or speech signal and so deepens people's comprehension of the spoken content. It plays an important role in human-computer interaction, film and entertainment, information countermeasures, and other domains. This dissertation designs a large-scale Chinese bimodal database and proposes a mouth segmentation approach for noisy color images. Building on both, it proposes several realistic Chinese visual speech synthesis approaches and designs a demonstration system in which visual speech synthesis is the key technology. The experimental results show that, for goals such as information spoofing, the proposed approaches are fast, accurate, and effective. The main contents of the dissertation are summarized as follows.

To extract the mouth area from noisy color images, a thresholding segmentation algorithm based on a peak clustering tendency test is proposed. It is composed of two algorithms: a parallel projection segmentation algorithm and a histogram-based weighted fuzzy c-means clustering algorithm. The parallel projection segmentation algorithm projects a two-dimensional histogram onto a one-dimensional histogram according to a mapping rule, and it is shown to combine the accuracy of two-dimensional histogram segmentation with the real-time performance of one-dimensional histogram segmentation. The experimental results show that the mouth segmentation is highly accurate and provides precise mouth coordinate information. The approach can also be used to select mouth material for the bimodal database.

A large-scale Chinese bimodal database, Bi-VSSDatabase, is designed. Rules for selecting the original corpus and for naming its component files are defined. A mouth feature parameter clustering approach based on an artificial immune system is proposed. A Chinese tri-viseme (visual triphone) model that reflects Chinese coarticulation characteristics is built, and a bimodal corpus selection algorithm is proposed on top of it. A bimodal corpus annotation and segmentation approach is then designed. Several statistical indicators, such as coverage rate and coverage efficiency, are calculated; the results show that Bi-VSSDatabase provides authentic, accurate, and representative bimodal material for realistic Chinese visual speech synthesis.

Three speech-driven visual speech synthesis approaches are proposed: a hidden Markov model (HMM) state synthesis approach, a mixed-parameter synthesis approach, and a two-layer HMM synthesis approach. Two text-driven approaches are also proposed, one based on HMMs and one based on unit concatenation. For the unit concatenation approach, a unit search procedure is designed and concatenation rules are defined. Chinese tri-visemes and Chinese dynamic visemes are each used as the basic unit for training and synthesis. Both the subjective and the objective assessment scores of the tri-viseme-based mouth sequences are rated good or better, which shows that the proposed approaches can synthesize smooth, continuous, and satisfactory mouth sequences. To stitch the mouth sequence into the background video, a mouth-area inpainting approach based on the fast marching method is proposed. With this inpainting step, a complete, natural, and fluent talking-head video is synthesized.

An objective quality assessment approach for visual speech, based on an improved product HMM, is proposed. It can simulate people's visual and auditory perception of the speaker and provides an objective assessment result. In the assessment, all the proposed visual speech synthesis approaches are compared. The comparison shows that visual speech synthesis can greatly enhance people's speech comprehension, especially for people with impaired hearing.
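The abstract mentions a corpus selection algorithm built on context-dependent viseme units and a coverage rate statistic, without giving either definition. As a rough illustration only, the sketch below assumes that a unit is a (left, centre, right) viseme triple, that coverage rate is the fraction of distinct units in the candidate pool that the selected sentences contain, and that selection is a plain greedy set cover. The viseme labels, the `sil` padding symbol, and all function names are hypothetical, and the dissertation's actual algorithm may differ.

```python
from typing import Dict, List, Set, Tuple

Unit = Tuple[str, str, str]  # (left context, centre, right context)


def tri_units(visemes: List[str]) -> List[Unit]:
    """Context-dependent units of one sentence; 'sil' pads both ends."""
    padded = ["sil"] + visemes + ["sil"]
    return [(padded[k - 1], padded[k], padded[k + 1])
            for k in range(1, len(padded) - 1)]


def coverage_rate(selected: List[str], sentences: Dict[str, List[str]]) -> float:
    """Distinct units in the selected sentences / distinct units in the pool."""
    all_units: Set[Unit] = set()
    for visemes in sentences.values():
        all_units.update(tri_units(visemes))
    covered: Set[Unit] = set()
    for sid in selected:
        covered.update(tri_units(sentences[sid]))
    return len(covered) / len(all_units) if all_units else 0.0


def greedy_select(sentences: Dict[str, List[str]],
                  target_rate: float = 1.0) -> List[str]:
    """Greedy set cover: repeatedly take the sentence adding the most new units.

    Stops once the assumed coverage rate reaches target_rate or no sentence
    adds anything new. Ties go to the smallest sentence id.
    """
    units = {sid: set(tri_units(v)) for sid, v in sentences.items()}
    all_units: Set[Unit] = set().union(*units.values())
    covered: Set[Unit] = set()
    chosen: List[str] = []
    while all_units and len(covered) / len(all_units) < target_rate:
        best = max(sorted(units), key=lambda sid: len(units[sid] - covered))
        if not units[best] - covered:
            break  # nothing left to gain
        chosen.append(best)
        covered |= units[best]
    return chosen
```

Calling `greedy_select(pool)` on a dictionary that maps sentence ids to viseme label lists returns the ids in the order they were picked, and `coverage_rate(ids, pool)` reports how much of the pool's unit inventory those sentences cover.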
