基于韵律的蒙古语语音合成研究

Research on the Mongolian Speech Synthesis Based on Prosody

【Author】 Ao Min (敖敏)

【Advisors】 Baiyinmende (白音门德); Xiong Ziyu (熊子瑜)

【Author information】 Inner Mongolia University, Chinese Minority Languages and Literature, 2012, PhD

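【Abstract (translated from the Chinese)】 Drawing on a large-scale speech database and phonetic experiments, this study examines the prosodic problems encountered in Mongolian speech synthesis. The work has three parts. First, it builds basic resources for Mongolian synthesis, including a large-scale speech database and an electronic dictionary. Second, it describes in detail the changes in syllable structure that occur in continuous Mongolian speech, including segment insertion and deletion and the resyllabification they cause, and on that basis discusses the syllable correspondence between spoken and written Mongolian and the relation between segment insertion and deletion and the prosodic structure of utterances. Third, it examines the prosodic organization of Mongolian read speech and, starting from basic acoustic parameters such as pitch and duration, analyses how prosodic words and prosodic phrases are actually realized in read utterances, showing that the pitch contour plays an important role in the analysis of prosodic phrases.

The main conclusions are as follows.

1. The study develops a grapheme-to-phoneme transcription symbol system for Mongolian speech synthesis. The system contains 50 vowel symbols (long vowels, short vowels, and diphthongs) occurring in word-initial, word-medial, and word-final position and 27 consonant symbols (basic consonants and loanword consonants), and each phoneme is described and distinguished in terms of distinctive features. Synthesis results indicate that this description of the Mongolian segment system is both effective and necessary: a systematic, fine-grained phoneme classification improves the intelligibility of synthesized speech to some extent.

2. In continuous speech, the multiple pronunciations of a word differ at the semantic, grammatical, and pragmatic levels. In a specific context, each word with multiple pronunciations has a unique reading, and this property allows such words to be disambiguated effectively during transcription. Multiple pronunciations that neither distinguish meaning nor carry grammatical or pragmatic features are a matter of pronunciation standardization and need to be further sorted and merged.

3. At the word level, the syllable structure of monosyllabic Mongolian words is essentially the same in the spoken and written language. For disyllabic words, whose written and spoken pronunciations do not correspond, the study derives 12 rules of syllable structure change. In polysyllabic words, syllable structure change in the spoken language starts from the word-final syllable and proceeds toward the front of the word, following the rules for disyllabic words. In spoken Mongolian, resyllabification is related to the vowel type of the syllable and to the position of the syllable in the word: syllables with short vowels change structure easily, whereas syllables with long vowels or diphthongs are relatively stable; word-initial syllables (excluding monosyllabic words) are relatively stable, whereas word-medial and word-final syllables are more prone to resyllabification. Accordingly, the study divides the syllables of spoken Mongolian into stable and variable syllables and holds that variable syllables are the focus and the main difficulty of transcription for Mongolian speech synthesis.

4. At the level of continuous speech, the main factors causing syllable structure change are noun suffixes and affix-like function words. These elements are written separately from other elements but often cannot form an independent syllable when spoken. The resyllabification rules that apply when they are joined in connected speech are essentially the same as the rules for syllable change within words. When the syllable type of a noun suffix is V, C, or VLC, it needs the consonant of the preceding syllable in order to form an independent syllable. When the syllable type of the suffix is CVL, it is relatively stable and can form the word-final syllable on its own in connected speech. Segment deletion and insertion and resyllabification in spoken Mongolian are related to the prosodic structure of the utterance: the prosodic word is the domain of these sound changes, and the resyllabification and segment insertion and deletion between a noun and its suffix usually occur inside a prosodic word. Noun suffixes can therefore be treated as useful cues for predicting prosodic word boundaries. The results also show that affix-like function words differ in prosodic domain: the domain of the function word "(?)" is the prosodic word, that of the function word "(?)" is the prosodic phrase, and that of the function words "(?)" and "(?)" is the intonation phrase.

5. In declarative sentences read at a normal rate, each prosodic phrase generally contains one relatively independent and complete pitch contour with exactly one pitch peak. Pitch rises before the peak and falls after it, and the fall generally continues to the end of the prosodic phrase. This low-high-low pattern forms a series of relatively independent pitch contours, each starting at the beginning of a prosodic phrase and ending at its end. On this basis, the study holds that when a sentence contains neither punctuation nor a marked pause, the pitch movement can be used, to some extent, to help locate its internal prosodic phrase boundaries: a boundary tends to lie where two pitch contours meet. The statistics also show that the syllable before a prosodic phrase boundary is lengthened to some extent. In addition, a word-final weak short vowel is an important phonetic event for predicting prosodic phrase boundaries.

6. At prosodic word boundaries there is no clearly perceptible pause and no clear lengthening. The duration of each syllable inside a prosodic word is related to its position in the prosodic word: final syllable > initial syllable > medial syllable. The position of a prosodic word in the prosodic phrase affects its length: prosodic words at a prosodic phrase boundary are usually slightly longer than those in phrase-medial position. The position of a prosodic word in the prosodic phrase also affects its pitch properties. According to the statistics, prosodic words are formed in four main ways: (1) a single grammatical word of 1 to 5 syllables; (2) two monosyllabic grammatical words in a coordinate relation; (3) a grammatical word of 1 to 4 syllables combined with a monosyllabic function word; (4) a monosyllabic grammatical word or function word at a prosodic phrase boundary.

7. Synthesis results indicate that adding prosodic phrase and prosodic word segmentation information improves the naturalness of synthesized speech to some extent. Because the prosodically segmented corpus currently available for training is still relatively small, the improvement in naturalness is not yet marked. The author believes that, as research on Mongolian prosody deepens and more prosodically segmented material is added to training, it should become possible to synthesize high-quality, highly natural Mongolian speech.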

【Abstract】 Based on a large-scale speech corpus and phonetic experiments, this thesis examines prosodic issues in Mongolian speech synthesis. It consists of three parts. The first part covers the construction of resources for Mongolian speech synthesis, namely a large-scale speech corpus and an electronic dictionary. The second part examines changes in syllable structure in continuous Mongolian speech, such as segment insertion, segment deletion, and resyllabification, and explores the syllable correspondence between spoken and written Mongolian as well as the relation between segment insertion and deletion and the prosodic structure of discourse. The third part studies the prosodic structure of read Mongolian. By examining basic acoustic parameters such as pitch and duration, it gives a detailed account of prosodic words and prosodic phrases and proposes that the pitch contour plays an important role in delimiting prosodic phrases. The main conclusions are as follows.

A. We propose a set of phonetic transcription symbols for Mongolian speech synthesis, comprising 50 vowels (long, short, and diphthong) in word-initial, word-medial, and word-final positions and 27 consonants (basic consonants and loanword consonants), each described and differentiated in terms of distinctive features. Synthesis results indicate that these descriptions and distinctions are effective and improve the intelligibility of synthesized speech.

B. In continuous speech, the multiple pronunciations of a word differ in semantics, grammar, and pragmatics. In a specific context, however, a word with multiple pronunciations has a single reading, which can be used to disambiguate such words. Multiple pronunciations that carry no contrast in meaning and no grammatical or pragmatic features are a matter of pronunciation standardization and should be merged.

C. At the word level, the syllable structures of monosyllabic words are almost the same in spoken and written Mongolian. For disyllabic words, there are 12 rules of syllable structure change between written and spoken Mongolian. In polysyllabic words, syllable structure changes in spoken Mongolian proceed from the final syllable toward the initial syllable and follow the same rules as disyllabic words. In spoken Mongolian, the structure of syllables with short vowels is variable, whereas syllables with long vowels and diphthongs are stable. Word-initial syllables are stable. Based on these findings, the syllables of spoken Mongolian can be divided into stable and variable syllables. In Mongolian synthesis, the transcription of variable syllables is the key difficulty.

D. In continuous speech, the primary factors causing syllable structure change are noun suffixes and affix-like function words, which often cannot form an independent syllable in spoken Mongolian. The resyllabification rules across words in connected speech are essentially the same as those within words. When the syllable type of a noun suffix is V, C, or VLC, it needs the consonant of the preceding syllable to form an independent syllable. A suffix of type CVL is stable and can form the word-final syllable on its own. Segment deletion and insertion and resyllabification in spoken Mongolian are related to the prosodic structure of discourse: resyllabification and the insertion and deletion of segments generally take place within prosodic words. Noun suffixes are therefore useful cues for predicting prosodic word boundaries. The prosodic domains of four function words differ: for the function word "uAE(?)u", it is the prosodic phrase; for the function words "(?)" and "(?)", it is the intonation phrase; for the function word "(?)", it is the prosodic word.

E. In declarative sentences read at a normal rate, every prosodic phrase generally has one complete pitch contour with exactly one pitch peak. Pitch rises before the peak and falls after it, forming an L-H-L pattern that begins at the start of the prosodic phrase and ends at its end. The thesis therefore concludes that when a sentence has neither punctuation nor an evident pause, pitch movement can help locate prosodic phrase boundaries, which tend to lie at the junction of two pitch contours. Statistics show that the syllable before a prosodic phrase boundary is lengthened to some extent. In addition, a word-final weak short vowel (schwa) is an important cue for predicting prosodic phrase boundaries.

F. Prosodic word boundaries show neither an evident pause nor evident lengthening. Within a prosodic word, syllable duration is related to syllable position: duration of final syllable > duration of initial syllable > duration of medial syllable. Prosodic words at a prosodic phrase boundary are slightly longer than those in phrase-medial position. The position of a prosodic word within the prosodic phrase also affects its pitch pattern. According to the statistics, there are four types of prosodic word: 1) a grammatical word of one to five syllables; 2) two coordinated monosyllabic grammatical words; 3) a grammatical word of one to four syllables plus a monosyllabic function word; 4) a monosyllabic grammatical word or function word at a prosodic phrase boundary.

G. Synthesis results show that adding prosodic phrase and prosodic word segmentation cues improves the naturalness of synthesized speech to some extent. However, because the prosodically annotated corpus is small, the improvement in naturalness is limited. We believe that, as research on Mongolian prosody advances and more prosodically annotated speech data become available, high-quality and highly natural synthesized Mongolian speech can be achieved.
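The pitch-contour observation in the abstract (one rise-fall arc per prosodic phrase, with boundaries where two arcs meet) can be illustrated with a minimal sketch. The thesis abstract does not give an algorithm, so everything below is an assumption made for illustration: the per-syllable F0 values are invented, the function names `find_peaks` and `boundary_candidates` are hypothetical, and a real system would also need F0 smoothing and handling of unvoiced segments.

```python
from typing import List


def find_peaks(f0: List[float]) -> List[int]:
    """Return indices of local maxima: higher than the previous value
    and not lower than the next one (so a plateau counts once)."""
    return [
        i for i in range(1, len(f0) - 1)
        if f0[i - 1] < f0[i] >= f0[i + 1]
    ]


def boundary_candidates(f0: List[float]) -> List[int]:
    """For each pair of adjacent pitch peaks, return the index of the
    lowest F0 value between them. Under the L-H-L assumption, a
    prosodic phrase boundary is expected near that valley."""
    peaks = find_peaks(f0)
    boundaries = []
    for left, right in zip(peaks, peaks[1:]):
        segment = f0[left:right + 1]
        boundaries.append(left + segment.index(min(segment)))
    return boundaries


# Invented per-syllable mean F0 values (Hz) for one pause-free sentence.
syllable_f0 = [180.0, 210.0, 240.0, 220.0, 190.0, 170.0,
               200.0, 235.0, 215.0, 185.0, 160.0]

print(boundary_candidates(syllable_f0))
```

With these made-up values there are two arcs, and the single valley between their peaks marks the candidate boundary region. Such a valley is only a cue: per the abstract, pre-boundary lengthening and word-final weak short vowels would be weighed alongside it.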

【Keywords (translated from the Chinese)】 Mongolian; speech synthesis; prosodic structure; syllable reorganization
【Key words】 Mongolian; speech synthesis; prosody; syllable re-structure
  • 【Online publication submitter】 Inner Mongolia University
  • 【Online publication issue】 2012, Issue 11