
The Research of Chinese Word Segmentation Algorithm Based on Dictionary and Probability Statistics
(Chinese title: 基于词典和概率统计的中文分词算法研究)

【Author】 何爱元

【Supervisor】 李晓光

【Author Information】 辽宁大学 (Liaoning University), Computer Software and Theory, 2011, master's thesis

【Abstract】 For Chinese natural language processing, automatic word segmentation is the first step of text analysis. Current Chinese segmentation methods fall into three categories: dictionary-based, statistics-based, and understanding-based. Understanding-based methods are still immature, so the most popular approach today combines a dictionary with statistical methods. The two hard problems in Chinese segmentation are unknown-word (out-of-vocabulary) recognition and ambiguity resolution. Most segmentation systems developed in recent years handle unknown words with a separate recognition module built on hand-crafted rules. Such systems recognize proper names, such as person, place, and organization names, fairly well, but they can hardly recognize new web words that follow no special rules, which significantly hurts segmentation accuracy. Ambiguity resolution has likewise improved in recent years, yet it remains a problem that urgently needs solving. Character-tagging approaches to segmentation have achieved good results lately; however, their performance is limited by the type and scale of the training corpus, so although they are the current research mainstream, they run counter to the needs of practical segmentation. This thesis therefore adopts a segmentation method based on a dictionary and probability statistics to improve the practicality of the segmentation system and to address the pressing problems of unknown-word recognition and ambiguous segmentation.

The thesis makes two improvements. First, it approaches new-word recognition from a different angle than previous work: large numbers of web pages from different domains are collected from the Internet at regular intervals, and the recognition strategy proposed in the thesis is applied to them. In recognizing new words, the thesis analyzes words enclosed in special punctuation, article keywords, hyperlink terms, and similar cues, and the recognized new words are added to the segmentation dictionary to expand its vocabulary. This is very effective against the unknown-word problem and ultimately improves the segmentation system's precision and recall. Second, building on the standard n-gram language model, the thesis proposes a reverse n-gram language model and shows by analysis that n = 3 gives the best model performance. This yields a Chinese segmentation method based on a bidirectional trigram language model, into which word-position information is then added. The resulting algorithm, a bidirectional trigram model with word-position information, handles ambiguity in Chinese segmentation better. Finally, experimental comparison shows that the segmentation system achieves good results in both speed and accuracy.
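The disambiguation idea described in the abstract, generating candidate segmentations from a dictionary and scoring them with forward and reverse trigram models, can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the recursive candidate generator, the tiny toy probability tables, the floor value for unseen trigrams, and the simple averaging of the two directions are all assumptions made for the example.

```python
import math

def dict_candidates(text, dictionary, max_len=4):
    """Enumerate all segmentations of `text` into dictionary words,
    allowing single characters as a fallback for uncovered spans."""
    if not text:
        return [[]]
    results = []
    for length in range(1, min(max_len, len(text)) + 1):
        head = text[:length]
        if length == 1 or head in dictionary:
            for tail in dict_candidates(text[length:], dictionary, max_len):
                results.append([head] + tail)
    return results

def trigram_logprob(words, probs, floor=1e-6):
    """Log-probability under a trigram model with <s> padding;
    unseen trigrams fall back to a small floor probability."""
    padded = ["<s>", "<s>"] + list(words)
    return sum(math.log(probs.get(tuple(padded[i - 2:i + 1]), floor))
               for i in range(2, len(padded)))

def bidirectional_score(words, fwd, rev):
    """Average the forward model's score with the reverse model's
    score of the reversed sequence -- a two-way trigram combination."""
    return 0.5 * (trigram_logprob(words, fwd)
                  + trigram_logprob(list(reversed(words)), rev))

# Toy ambiguity: 研究生命 can split as 研究/生命 ("research / life")
# or 研究生/命 ("graduate student / fate").
dictionary = {"研究", "生命", "研究生", "命"}
fwd = {("<s>", "<s>", "研究"): 0.2, ("<s>", "研究", "生命"): 0.3}
rev = {("<s>", "<s>", "生命"): 0.2, ("<s>", "生命", "研究"): 0.3}

cands = dict_candidates("研究生命", dictionary)
best = max(cands, key=lambda c: bidirectional_score(c, fwd, rev))
print(best)  # the reading 研究/生命 wins under these toy probabilities
```

In a real system the trigram probabilities would be estimated from a large corpus and properly smoothed rather than floored, and the thesis additionally folds word-position information into the model; the sketch only shows how two directions of context can jointly break a segmentation ambiguity.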

  • 【Online Publication Contributor】 辽宁大学
  • 【Online Publication Year/Issue】 2012, Issue 01