基于统计声学建模的语音合成技术研究

Research on Statistical Acoustic Model Based Speech Synthesis

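【Author】 Ling Zhenhua (凌震华)

【Advisor】 Wang Renhua (王仁华)

【Author information】 University of Science and Technology of China, Signal and Information Processing, 2008, PhD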

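【Abstract (translated from the Chinese)】 Over the past decade or so, as statistical modeling methods for speech signals have matured and the performance of parametric synthesizers has steadily improved, the idea of statistical parametric speech synthesis has been proposed and has attracted growing attention from researchers. Its representative is the hidden Markov model (HMM) based parametric speech synthesis method, which has gradually become a mainstream approach alongside corpus-based unit selection and waveform concatenation synthesis. Compared with traditional unit selection and waveform concatenation, HMM-based parametric synthesis offers fluent and robust synthesized speech, fast and highly automated system construction, a small system footprint, and high flexibility.

This dissertation focuses on the application of statistical acoustic models in speech synthesis and, beyond the original HMM-based parametric synthesis method, proposes two new speech synthesis methods based on statistical acoustic modeling. The first is HMM-based unit selection and waveform concatenation synthesis: we combine the acoustic parameter modeling used in HMM-based parametric synthesis with the traditional unit selection and waveform concatenation method, using probabilistic criteria to guide the search for the optimal units and generating the final speech by concatenating waveforms, in order to overcome the speech quality limitations of parametric synthesis and improve the naturalness of the synthesized speech. The second is modeling and synthesis that integrate acoustic parameters with articulatory features: in addition to acoustic parameters, we introduce articulatory features, which are more closely tied to the mechanism of speech production, and modify the original HMM structure to model and generate the two kinds of parameters jointly, thereby improving the accuracy and flexibility of acoustic parameter prediction at synthesis time.

The dissertation is organized as follows.

Chapter 1 is the introduction. It reviews the history of speech synthesis and briefly introduces several common speech synthesis methods.

Chapter 2 describes the HMM-based parametric speech synthesis method in detail, including the basic principles of HMMs, the system framework, and the key techniques. By analyzing the characteristics of this method, it explains the motivation and starting point for our research on new speech synthesis methods.

Chapter 3 focuses on the HMM-based unit selection and waveform concatenation synthesis algorithm. First, we propose two different ways of using HMMs for unit selection: one uses frames as the concatenation unit and searches for units under the maximum likelihood criterion; the other uses two levels of concatenation units, phones and frames, and selects units by combining a likelihood criterion with the Kullback-Leibler divergence (KLD). Next, we derive a unified algorithmic framework for HMM-based unit selection synthesis and demonstrate its effectiveness through tests on Chinese and English synthesis systems. Finally, we propose the Minimum Unit Selection Error (MUSE) criterion to replace the maximum likelihood criterion used in the original HMM training, which makes system construction fully automatic and further improves the naturalness of the synthesized speech.

Chapter 4 describes statistical modeling and synthesis that integrate articulatory features with acoustic parameters. Here, "articulatory features" are quantitative descriptions of the positions and movements of the speaker's articulators, such as the tongue, lips, and jaw, during speech production. After explaining why articulatory features are introduced and briefly reviewing the original system framework, we propose an overall approach to the joint modeling and parameter generation of acoustic parameters and articulatory features, and discuss several possible model structures in terms of three aspects: the model clustering strategy, the state synchrony assumption, and the independence assumption between features. A series of objective and subjective evaluations then demonstrates that building the system with articulatory features is effective in improving the accuracy and flexibility of acoustic parameter prediction.

Chapter 5 summarizes the dissertation.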

【Abstract】 With the development of statistical modeling techniques for speech signals and the improving performance of parametric speech synthesizers, statistical parametric speech synthesis methods have been proposed and have made significant progress over the last decade. A representative approach is Hidden Markov Model (HMM) based parametric synthesis, which has become a mainstream speech synthesis approach alongside unit selection and waveform concatenation. Compared with conventional unit selection synthesis, this method offers many advantages, such as high smoothness, robustness and flexibility, fast and automatic system construction, and a small system footprint.

This dissertation focuses on the application of statistical acoustic models to speech synthesis. Besides the original HMM-based parametric synthesis approach, two novel methods are proposed. The first is HMM-based unit selection and waveform concatenation synthesis. We apply the statistical ideas of HMM-based parametric synthesis to a unit selection and waveform concatenation system, in order to overcome the speech quality shortcomings of parametric synthesis and improve the naturalness of the synthesized speech. The second is parametric synthesis with integrated acoustic and articulatory features. Because articulatory features give a better representation of the speech production mechanism, we integrate them into the HMM-based parametric synthesis system to improve the accuracy and flexibility of acoustic parameter generation through simultaneous modeling and generation of acoustic and articulatory features.

The dissertation is organized as follows.

Chapter 1 is the introduction. It reviews the history of speech synthesis research and briefly introduces the most common speech synthesis techniques.

Chapter 2 introduces the HMM-based parametric synthesis method in detail, including the fundamental principles of HMMs, the system framework, and some key techniques in the system. Based on an analysis of the characteristics of this method, the motivation for our research is stated.

Chapter 3 focuses on the HMM-based unit selection synthesis method. First, two different HMM-based unit selection systems are introduced. The first system adopts frame-sized units and the maximum likelihood criterion for unit selection; the second system uses hierarchical units and combines the Kullback-Leibler divergence with a likelihood criterion to select the optimal unit sequence. Then, a unified framework for HMM-based unit selection speech synthesis is proposed. Our evaluations on Chinese and English systems demonstrate the effectiveness of the proposed method. Finally, the Minimum Unit Selection Error (MUSE) criterion is proposed for model training in the HMM-based unit selection system, to achieve fully automatic system construction and improve the naturalness of the synthesized speech.

Chapter 4 presents a method that integrates articulatory features into the original HMM-based parametric synthesis system, where only acoustic features are used. Here, "articulatory features" refers to the quantitative positions and continuous movements of a group of articulators, including the tongue, jaw, lips, velum, and so on. After a brief introduction to the original system, the modeling and parameter generation methods for unified acoustic and articulatory features are proposed. Different model structures are explored to allow the articulatory features to influence acoustic modeling, in terms of model clustering, state synchrony, and cross-stream feature dependency. The results of objective and subjective evaluations show that the proposed method effectively improves the accuracy and flexibility of acoustic parameter prediction.

Chapter 5 concludes the dissertation.
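
The Chapter 3 summary names two quantities used to score candidate units: a likelihood under the statistical model and the Kullback-Leibler divergence. The abstract does not give the exact formulation, so the following is only a minimal sketch of those two quantities, assuming diagonal-covariance Gaussians and a toy per-frame choice. The function names (`diag_gauss_loglik`, `diag_gauss_kld`, `select_frame`) are hypothetical, and sequence-level search and concatenation smoothness are ignored here.

```python
import math


def diag_gauss_loglik(x, mean, var):
    """Log-density of vector x under a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def diag_gauss_kld(mean_p, var_p, mean_q, var_q):
    """Closed-form KL(p || q) between two diagonal-covariance Gaussians."""
    return sum(
        0.5 * (math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0)
        for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q)
    )


def select_frame(candidates, mean, var):
    """Index of the candidate frame with the highest log-likelihood.

    Assumption: each target frame is scored independently against one
    Gaussian; a real system would search over whole unit sequences.
    """
    return max(
        range(len(candidates)),
        key=lambda i: diag_gauss_loglik(candidates[i], mean, var),
    )
```

For example, `select_frame([[0.0, 0.0], [1.0, 1.0], [0.9, 1.2]], [1.0, 1.0], [1.0, 1.0])` picks the candidate that coincides with the model mean, and `diag_gauss_kld` is zero when the two distributions are identical.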

  • 【Classification No.】 TN912.33
  • 【Citation count】 11
  • 【Download count】 773