

【Author】 Lin Lei

【Supervisor】 Wan Jiancheng

【Author information】 Shandong University, Computer Software and Theory, 2005, Master's degree

【Abstract】 Text-to-speech (TTS) technology converts text, whether generated by the computer itself or supplied from outside (for example, the contents of a text file or a Word document), into a speech signal according to speech-processing rules, so that the computer reads the text aloud fluently and people can understand the information just by listening. With the rapid development of computer and communication technology, TTS has been applied in spoken dialogue systems, voice call centers, voice-enabled websites, e-mail services and many other fields, where it has already proven highly effective. However, existing TTS systems still fall far short of what users require in naturalness and intelligibility. No TTS system can yet truly replace a human reader, and this limits wider use of the technology.

The first difficulty in speech synthesis is deriving prosodic annotation from text. In natural language, speech features vary enormously, and the data itself carries implicit knowledge. People can perceive this knowledge, but our understanding and description of it remain far from adequate. In the automatic conversion from text to a prosodic symbol description, insufficient natural language understanding has long been the bottleneck. At present, the conversion usually relies only on basic grammatical information (such as part of speech) to delimit intonation phrases or assign ordinary sentence stress; it cannot yet perform deeper processing based on sentence semantics, such as setting different expressions or emotional coloring. The second difficulty is at the acoustic level: the acoustic parameters that correspond to prosodic features are not fully understood and lack a complete description, so they can only be set by experience. This further hinders rendering the prosodic information annotated in the text as natural synthesized speech with a sense of prosody and stress.

This thesis builds on our laboratory's earlier results in natural language understanding, namely binary semantic relation analysis. We establish a set of description symbols for text-to-speech synthesis that conforms to the XML standard, together with rules that convert semantic annotation into prosodic annotation, so that the semantic description is converted automatically into a description of prosodic information. We also address the pronunciation of polyphonic characters, numbers, symbols and letters in the text, and define a series of pronunciation descriptions for these cases.

For prosodic speech synthesis, we collected 1248 single Chinese characters and more than 8000 frequently used two-, three- and four-character words, as well as common personal and place names. After sorting and numbering this corpus, we recorded it manually using a speech database maintenance program developed specifically for this system. After segmentation and pitch-period analysis, the recordings were stored in a speech database and a retrieval index database, which together form the basic speech data the system requires.

The speech synthesis module contains a speech-rate modification unit, a tone modification unit, a stress modification unit, a silence generation unit and others. Each unit is implemented as a module that exposes an interface for the speech synthesis module to call.

In this system, the prosodic markup rests on a deeper understanding of natural language: semantic markup is transformed into prosodic markup on the basis of binary semantic relation analysis, so the markup is more advanced and comes closer to the prosodic intent of a human speaker. In the synthesis stage, based on the PSOLA algorithm and an extensive speech database, we implement simple prosodic control of the voice, which makes the synthesized speech clear and natural and markedly improves its intelligibility and naturalness.

Future work includes: further research on semantic markup and prosody, so that more semantic information can be converted into prosodic information; building a more extensive speech database, so that the corpus contains not only sentences but also paragraphs of text; and creating more prosodic control units, so that prosody can be controlled not only within sentences but also between sentences and paragraphs.
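
The abstract describes an XML-conformant markup that drives rate, stress and silence units during synthesis, but it does not list the actual tag set. The following is therefore only a minimal sketch under stated assumptions: the element and attribute names (`speak`, `phrase`, `stress`, `break`, `rate`, `ms`) and the function `flatten_markup` are hypothetical and are not taken from the thesis. It shows how such markup could be flattened into an ordered list of steps that prosody-control units might consume.

```python
import xml.etree.ElementTree as ET


def flatten_markup(markup: str) -> list:
    """Flatten hypothetical prosody markup into ordered synthesis steps.

    Assumed (not from the thesis): <phrase rate="..."> scales speech rate,
    <stress> marks stressed text, <break ms="..."/> requests silence.
    """
    steps = []

    def emit(text, rate, stressed):
        # Skip empty or whitespace-only text between elements.
        if text and text.strip():
            steps.append({"op": "speak", "text": text.strip(),
                          "rate": rate, "stress": stressed})

    def walk(node, rate, stressed):
        if node.tag == "break":
            steps.append({"op": "silence", "ms": int(node.get("ms", "200"))})
            return
        # An element inherits the enclosing rate unless it overrides it.
        rate = float(node.get("rate", rate))
        stressed = stressed or node.tag == "stress"
        emit(node.text, rate, stressed)
        for child in node:
            walk(child, rate, stressed)
            # Text after a child belongs to the current element's context.
            emit(child.tail, rate, stressed)

    walk(ET.fromstring(markup), 1.0, False)
    return steps


example = ('<speak><phrase rate="0.9">今天<stress>天气</stress>很好</phrase>'
           '<break ms="300"/><phrase>我们出发</phrase></speak>')
steps = flatten_markup(example)
for step in steps:
    print(step)
```

Each resulting step could then be dispatched to the corresponding unit (rate, stress or silence generation); how the thesis system actually performs that dispatch is not stated in the abstract.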

【Key words】 Speech synthesis; TTS; Prosodic tagging; PSOLA; Speech database
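  • 【Online publication contributor】 Shandong University
  • 【Online publication year/issue】 2005, Issue 08
  • 【Classification number】 TP391.42
  • 【Times cited】 1
  • 【Downloads】 155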

