

Research on Prosodic Structure Prediction Based on Statistical Models

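【Author】 包森成

【Advisor】 董远

【Author Information】 Beijing University of Posts and Telecommunications, Signal and Information Processing, 2009, Master's thesis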

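【Summary】 With advances in computer technology and related disciplines, speech synthesis has developed rapidly over the past few decades, and many new theories and techniques have emerged. At present, speech synthesis research centers on text-to-speech (TTS) systems, which convert input text into speech output. A TTS system generally consists of three modules: text analysis, prosody processing, and speech synthesis. These modules are not isolated from one another, and the performance of each strongly affects the quality of the final output speech. Synthesized speech can be evaluated in many respects, but evaluation concentrates on intelligibility and naturalness. The output of current TTS systems has reached a fairly high level of intelligibility, whereas overall naturalness still needs improvement; the root problem is that the prosody of natural speech flow cannot yet be simulated effectively. Research on prosody processing covers prosody prediction, prosody rules, prosody description, and prosody modeling. This thesis studies the prosodic structure prediction module, with the aim of improving the naturalness of synthesized speech by studying and improving that module. Prosody prediction is closely tied to text analysis: because the input to a TTS system is unrestricted text, determining pronunciation alone is far from sufficient. To improve naturalness, more prosody-related information must be extracted from the text, including its prosodic structure, stress, and intonation. Studies have shown that introducing a prosodic hierarchy into a TTS system can significantly improve the quality of synthesized speech, especially its naturalness. How to raise the accuracy of prosodic structure prediction is the focus of this thesis. Starting from the acoustic and prosodic characteristics of Chinese, the thesis analyzes the relationships among Chinese prosodic features, pauses, stress, and prosodic boundaries, analyzes and compares Chinese prosodic hierarchies, and analyzes the acoustic characteristics of prosodic boundaries. It surveys and compares traditional prosodic structure prediction methods, points out their strengths and weaknesses, and then concentrates on prediction based on statistical machine learning, in particular the application of conditional random fields (CRFs) and the maximum entropy (ME) model. For the CRF-based prediction system, the thesis details on the theoretical side the definition, conditional distribution, and parameter estimation of CRFs; on the applied side it focuses on CRF feature templates and discusses the choice of window length and the role of compound features. For the ME-based prediction system, the thesis details on the theoretical side the definition, conditional distribution, and parameter estimation of the ME model; on the applied side it focuses on ME feature templates and discusses the choice of window length and the role of dynamic features. In addition, the thesis proposes a multi-pass prosodic structure prediction system based on the ME model and compares and analyzes its performance against the CRF-based system. For prosodic phrase prediction, the former outperforms the latter.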

【Abstract】 During the past few decades, with the development of computer technology and other related subjects, speech synthesis has progressed considerably. Text-to-speech (TTS) is a technique that converts input text into speech output. Generally speaking, a TTS system consists of three modules: text analysis, prosody processing, and speech synthesis. The three modules are not independent, and the quality of the output speech is greatly affected by every one of them. Output speech can be evaluated in many respects, but mainly in terms of intelligibility and naturalness. At present, the intelligibility of TTS has reached a high level, but the naturalness still needs to be improved. There are four areas in prosody processing research: prosody prediction, prosody rules, prosody description, and prosody modeling. This paper mainly studies prosodic structure prediction, with the aim of improving this module and thereby the naturalness of synthesized speech. There is a close relationship between prosody prediction and text analysis. Because the input to a TTS system is unrestricted text, determining the pronunciation from the text is far from sufficient. To improve the naturalness of speech, it is necessary to extract more prosodic information from the text, including prosodic structure, accent, and intonation. Studies have shown that a prosodic hierarchy can significantly improve the quality of synthesized speech, especially its naturalness. This paper focuses on how to improve the accuracy of prosodic structure prediction. It analyzes the relationships among Chinese prosodic features, pauses, accent, and prosodic boundaries, analyzes and compares Chinese prosodic hierarchies, and analyzes the acoustic characteristics of prosodic boundaries.
The paper reviews and compares traditional prosodic structure prediction methods, points out their advantages and disadvantages, and then focuses on prosodic structure prediction based on statistical machine learning, especially conditional random fields (CRFs) and the maximum entropy (ME) model. In the study of the CRF-based prosodic structure prediction system, the paper describes the definition, conditional distribution, and parameter estimation of CRFs. It then focuses on the CRF feature template and discusses the selection of the feature window and the role of combined features. In the study of the ME-based prosodic structure prediction system, the paper describes the definition, conditional distribution, and parameter estimation of the ME model. It then focuses on the ME feature template and discusses the selection of the feature window and the role of dynamic features. In addition, the paper proposes a multi-pass prosodic structure prediction system based on the ME model and compares it with the CRF-based prediction system. In prosodic phrase prediction, the former performs better than the latter.
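
The abstract refers to feature templates, a feature window, and combined features without showing what they look like. The following is only a minimal sketch of the general idea and is not taken from the thesis: the function name `window_features`, the padding symbol, the part-of-speech tags, and the choice of part of speech and word length as atomic features are all assumptions made for illustration. For the candidate boundary after word `i`, it collects features from a symmetric window of words and adds one combined feature that joins the tags on either side of the boundary.

```python
from typing import Dict, List, Tuple

PAD = "<PAD>"  # hypothetical symbol for positions outside the sentence


def window_features(tokens: List[Tuple[str, str]], i: int, window: int = 2) -> Dict[str, str]:
    """Build features for the candidate prosodic boundary after token i.

    tokens: (word, part-of-speech tag) pairs for one sentence.
    window: number of context tokens taken on each side of token i.
    The feature set is illustrative, not the one used in the thesis.
    """
    def token_at(j: int) -> Tuple[str, str]:
        # Positions outside the sentence are padded.
        return tokens[j] if 0 <= j < len(tokens) else (PAD, PAD)

    feats: Dict[str, str] = {}
    for offset in range(-window, window + 1):
        word, pos = token_at(i + offset)
        feats[f"pos[{offset}]"] = pos
        feats[f"len[{offset}]"] = "0" if word == PAD else str(len(word))

    # Combined feature: the tags immediately left and right of the boundary.
    feats["pos[0]|pos[1]"] = token_at(i)[1] + "|" + token_at(i + 1)[1]
    return feats


# Usage: a four-word sentence with assumed tags (r = pronoun, v = verb, n = noun).
sentence = [("我们", "r"), ("研究", "v"), ("韵律", "n"), ("结构", "n")]
features = window_features(sentence, 1, window=1)
print(features["pos[0]|pos[1]"])
```

A larger `window` gives the model more context at the cost of sparser features, which is the kind of trade-off a window-length study would examine. Each feature dictionary could then be passed to a CRF or maximum entropy trainer together with the boundary label for that position.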


