
The Study of Non-stationary Language Modeling Techniques and Its Practices (original title: 非时齐语言建模技术研究及实践)

【Author】 Xiao Jinghui (肖镜辉)

【Advisor】 Wang Xiaolong (王晓龙)

【Author information】 Harbin Institute of Technology, Computer Application Technology, 2007, PhD

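【Abstract (Chinese version)】 A language model is a mathematical description of natural language: an abstract formal system built to explain and exploit the regularities of natural language. Research on language models is fundamental to natural language processing. Its results apply directly to the Chinese pinyin-to-character conversion task and are widely used in speech recognition, handwriting recognition, printed character recognition, machine translation, information retrieval, multi-level corpus processing, and many other natural language applications.

With the rapid growth of information on the Internet, massive amounts of electronic text are no longer hard to obtain, and probabilistic and statistical methods, with their high accuracy and strong robustness, have become the main approach to language modeling. Statistical language models are now the mainstream. However, they take a purely statistical view, treating natural language as a random sequence of language elements and ignoring the regularities and characteristics of language itself. How to exploit linguistic knowledge in statistical language models is one of the open problems in language modeling. Combining linguistic knowledge directly with statistical language modeling faces two difficulties: 1. linguistic knowledge is hard to acquire automatically with precision; 2. linguistic knowledge is hard to integrate with existing statistical modeling techniques.

To address these problems, this thesis proposes to capture the syntactic and semantic information of natural language indirectly, by studying the positional information and regularities of language units in natural language sequences. Because language units differ in their syntactic and semantic properties, they can serve as different constituents and play different roles in sentences and discourse, so the positions and ranges in which a unit appears in text show a certain regularity. This regularity reflects the syntactic and semantic regularities of natural language. Building on the theory of stochastic processes, the thesis extends the stationarity (time-homogeneity) assumption and proposes the non-stationary language modeling hypothesis: the occurrence probability of the current language unit depends on its position in the natural language sequence. On this basis, the thesis studies the theory, techniques, methods, and related issues of non-stationary language modeling and applies them to Chinese pinyin-to-character conversion in order to improve the performance of Chinese keyboard input systems. The research covers four main areas.

First, the thesis prepares the resources for language modeling research and proposes an automatic lexicon generation algorithm for Chinese language modeling. It combines lexicon generation with Chinese language modeling in a unified iterative framework that improves the performance of an existing language model by building an optimized lexicon. Within this framework, the lexicon generation strategy combines statistical features with word-formation features to improve the performance of the generation algorithm. Finally, two heuristic methods are proposed that let the system adapt automatically to the domain of the training corpus.

Second, the thesis studies the theory and methods of non-stationary language modeling. It first discusses how to represent the non-stationary property of language units quantitatively and, on that basis, analyzes the statistical regularities of this property. It then combines these regularities with existing language modeling techniques, proposing a non-stationary Ngram model and a non-stationary maximum entropy Markov model, and discusses model construction, training methods, parameter smoothing, and model complexity. Finally, the two models are validated on the pinyin-to-character conversion task and the part-of-speech tagging task respectively.

Third, to address the data sparseness problem in language models, the thesis proposes semantic-based smoothing algorithms. It extracts Chinese semantic information from linguistic resources such as Hownet and Tongyici Cilin (同义词词林) and combines it with back-off smoothing and interpolation smoothing respectively, yielding semantic-based back-off and interpolation smoothing algorithms that improve the performance of the smoothed language model. In addition, an iterative parameter optimization method is designed to tune the parameters of the smoothing algorithms automatically.

Fourth, the thesis applies language modeling techniques to the Chinese keyboard input task. For pinyin-based Chinese input methods on mobile phones and other mobile devices, it first poses the key-to-pinyin conversion problem, gives two solutions, and verifies them experimentally. It then proposes using the pinyin information entered by the user to improve the performance of pinyin-to-character conversion systems. A class-based maximum entropy Markov model is used to build the conversion system efficiently, so that it exploits both the pinyin information entered by the user and the constraints between Chinese characters. Experiments show that pinyin information effectively improves the performance of Chinese pinyin-to-character conversion.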
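As an illustration of the non-stationary hypothesis described above, the following Python sketch conditions a bigram probability on a coarse position bucket and interpolates it with the ordinary (stationary) bigram estimate. It is a minimal sketch under stated assumptions, not the model proposed in the thesis: the bucketing function `position_bucket`, the class `NonStationaryBigram`, the number of buckets, and the interpolation weight `lam` are all hypothetical names and choices introduced here for illustration.

```python
from collections import Counter


def position_bucket(index, length, n_buckets=3):
    """Map a 0-based token index to a coarse relative-position bucket.

    Hypothetical quantization: the sentence is split into n_buckets equal parts.
    """
    return min(index * n_buckets // max(length, 1), n_buckets - 1)


class NonStationaryBigram:
    """Bigram model whose estimate also depends on the position bucket (illustrative)."""

    def __init__(self, n_buckets=3, lam=0.5):
        self.n_buckets = n_buckets
        self.lam = lam              # weight of the position-dependent estimate
        self.pos_pair = Counter()   # counts of (bucket, prev, word)
        self.pos_ctx = Counter()    # counts of (bucket, prev)
        self.pair = Counter()       # counts of (prev, word)
        self.ctx = Counter()        # counts of prev

    def train(self, sentences):
        for sent in sentences:
            tokens = ["<s>"] + list(sent)
            for i in range(1, len(tokens)):
                # i - 1 is the 0-based index of the current word in the sentence.
                b = position_bucket(i - 1, len(sent), self.n_buckets)
                prev, word = tokens[i - 1], tokens[i]
                self.pos_pair[(b, prev, word)] += 1
                self.pos_ctx[(b, prev)] += 1
                self.pair[(prev, word)] += 1
                self.ctx[prev] += 1

    def prob(self, word, prev, index, length):
        """Interpolate P(word | prev, bucket) with the stationary P(word | prev)."""
        b = position_bucket(index, length, self.n_buckets)

        def ratio(num, den):
            return num / den if den else 0.0

        p_pos = ratio(self.pos_pair[(b, prev, word)], self.pos_ctx[(b, prev)])
        p_stat = ratio(self.pair[(prev, word)], self.ctx[prev])
        return self.lam * p_pos + (1 - self.lam) * p_stat


model = NonStationaryBigram(n_buckets=2, lam=0.5)
model.train([["x", "a", "x", "b"]])
early = model.prob("a", "x", 1, 4)  # "a" after "x" in the first half
late = model.prob("a", "x", 3, 4)   # "a" after "x" in the second half
```

With this single training sentence and two buckets, `early` evaluates to 0.75 and `late` to 0.25, whereas a purely stationary bigram would assign 0.5 in both positions.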

【Abstract】 A language model is a mathematical description of natural language, usually presented as a formal system that explains and exploits the regularities of language. The study of language models is fundamental to natural language processing. Its results apply directly to the Chinese Pinyin-to-Character conversion task and benefit many other natural language processing tasks, including speech recognition, handwriting recognition, optical character recognition, machine translation, information retrieval, and multi-level corpus processing.

The quantity of digital text on the Internet is increasing rapidly, and stochastic techniques have become the main approach to language modeling because of their high accuracy and strong robustness. The stochastic language model is now the most prevalent kind of language model. However, it treats natural language purely statistically, as a stochastic chain, and ignores the characteristics of language itself. Incorporating linguistic knowledge into stochastic language models is therefore one of the challenges in the field. Two problems arise when linguistic knowledge is combined directly with current stochastic language models: 1. it is difficult to acquire precise linguistic knowledge automatically; 2. it is hard to integrate linguistic knowledge into the current language modeling framework.

To address these problems, this thesis formally represents the positional information of language elements and exploits its regularities in language modeling. Concretely, a language element plays different roles in different portions of a sentence because of its syntactic and semantic properties, so the probability of a language element is related to its position. To exploit this positional information, the stationary hypothesis of traditional language modeling is relaxed and a non-stationary hypothesis is made: the occurrence of the current language element is determined partially by its position in the sequence of language elements. Based on this hypothesis, the thesis studies the theory, techniques, methods, and related issues of non-stationary language modeling. Finally, these techniques are applied to the Chinese Pinyin-to-Character conversion task to improve its performance. The thesis consists of four main parts.

Firstly, the thesis prepares the resources and proposes a Chinese lexicon construction algorithm for language modeling. It combines Chinese lexicon construction with language modeling in a unified iterative framework, in which the performance of the current language model is improved by optimizing the lexicon. Under this framework, a multi-feature lexicon construction algorithm is proposed that exploits both statistical and lexical features. Finally, two heuristic methods are proposed that let the system adapt automatically to the domain of the training corpus.

Secondly, the thesis studies the theory and techniques of non-stationary language modeling. It first provides a formal representation of the positional information of language elements, from which the regularities of their non-stationary property are induced. These regularities are then incorporated into language modeling. Two non-stationary language models are proposed: the non-stationary Ngram model and the non-stationary Maximum Entropy Markov model. Several related issues are discussed, including model construction, the training algorithm, the smoothing technique, and model complexity. Finally, the models are verified on the Pinyin-to-Character conversion task and the POS tagging task respectively.

Thirdly, the thesis proposes a semantic-based smoothing technique to address the data sparseness problem of language models. It acquires semantic information from Hownet and Tongyici Cilin and combines it with traditional smoothing techniques. Iterative algorithms are designed to optimize the parameters automatically.

Fourthly, the thesis applies language modeling techniques to Chinese keyboard input methods. It first poses the Key-to-Pinyin conversion task for the numeric keypads of mobile devices. Two solutions are provided and verified experimentally. It then improves the performance of the current Pinyin-to-Character conversion system by exploiting the pinyin constraints entered by users. A class-based Maximum Entropy Markov model is proposed to describe both the constraints from pinyin and those between characters. The experimental results show that the pinyin constraints effectively improve the performance of the Pinyin-to-Character conversion task.
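The Key-to-Pinyin conversion task mentioned above can be pictured with a small Python sketch: each digit on a numeric keypad stands for several letters, so one key sequence may correspond to several pinyin syllables. This is only an illustration of the ambiguity, not either of the two solutions evaluated in the thesis. The keypad layout is the common 12-key phone layout; the syllable inventory `SYLLABLES` is a toy assumption, and `to_keys` and `key_to_pinyin` are hypothetical helper names.

```python
# Common 12-key phone keypad layout (digits 2-9 carry letters).
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}

# Reverse lookup: letter -> digit key.
LETTER_TO_KEY = {ch: key for key, letters in KEYPAD.items() for ch in letters}

# Toy syllable inventory (assumption); a real system would list all valid pinyin syllables.
SYLLABLES = {"ni", "mi", "hao", "gao", "wo", "guo", "zhong"}


def to_keys(syllable):
    """Return the digit sequence a user would press to type the syllable."""
    return "".join(LETTER_TO_KEY[ch] for ch in syllable)


def key_to_pinyin(digits, syllables=SYLLABLES):
    """Return every known syllable that matches the pressed digit sequence."""
    return sorted(s for s in syllables if to_keys(s) == digits)


candidates = key_to_pinyin("64")  # both "mi" and "ni" are typed as 6-4
```

Here `candidates` contains both "mi" and "ni"; choosing between such candidates is where a language model over the surrounding context would come in.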
