Research on Chinese Automatic Word Segmentation Based on Rules and Statistics (基于规则与统计的汉语自动分词研究)

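【Author】 李丹 (Li Dan)

【Advisor】 赵伟 (Zhao Wei)

【Author Information】 长春工业大学 (Changchun University of Technology), Computer Software and Theory, 2010, Master's thesis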

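【Abstract, translated from the Chinese】 With the growth of the internet, the volume of digital information is increasing rapidly and the processing of Chinese-language information is drawing ever more attention, so research on processing modern Chinese has become especially important. Chinese automatic word segmentation and named entity recognition are foundational research topics in Chinese information processing, and their study and implementation have significant theoretical and practical value. Because their results directly affect research in machine translation, syntactic analysis, semantic analysis, speech recognition, information retrieval, information filtering, and other fields, the demand for segmentation and named entity recognition has grown increasingly urgent and has drawn sustained attention. Compared with other languages, Chinese poses its own particular difficulties for automatic word segmentation and named entity recognition. We identify two factors that affect segmentation accuracy: (1) segmentation ambiguity, and (2) proper nouns such as Chinese personal names, place names, and organization names. At present, the results of Chinese automatic word segmentation and named entity recognition still leave room for improvement. This thesis studies two problems separately: Chinese automatic word segmentation, and Chinese personal name recognition as a subproblem of named entity recognition. It proposes a mechanical matching algorithm combined with word frequency, and a Chinese personal name recognition algorithm that combines SVM with error-driven learning. Chinese automatic word segmentation is an important step in Chinese information processing and the basis of many Chinese information applications. Current segmentation methods fall mainly into rule-based, statistical, and understanding-based methods. After an in-depth analysis of existing segmentation algorithms, this thesis concentrates on rule-based and statistical segmentation and proposes a mechanical matching algorithm combined with word frequency. The method first segments text by giving priority to word length while also giving priority to word frequency; it then segments any unmatched strings using improved forward maximum matching and reverse maximum matching combined with entropy rate. Experimental results show that this algorithm further improves segmentation accuracy. Chinese personal name recognition is an important part of unknown word recognition in Chinese word segmentation, and handling personal names well is bound to improve the precision of unknown word recognition. This thesis proposes a Chinese personal name recognition method that combines support vector machines with transformation-based error-driven learning. Transformation-based error-driven learning is used to correct the recognition results of the SVM; the transformation rules handle special cases of language use well and further improve the SVM results. Experimental results show that, compared with personal name recognition using the SVM model alone, adding error-driven learning improves the precision, recall, and F-measure of Chinese personal name recognition.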

【Abstract】 With the development of the internet, digital information is increasing rapidly and people are paying more and more attention to Chinese information processing. At the same time, the processing and study of modern Chinese have become increasingly significant. Automatic Chinese word segmentation and named entity recognition are basic research topics in natural language processing and computational linguistics, and their research and application have great theoretical and practical significance. Research on automatic Chinese word segmentation and named entity recognition benefits many applied areas, such as machine translation, semantic analysis, parsing, speech recognition, information retrieval, and information filtering, so the demand for automatic natural language processing has become pressing. Compared with other languages, Chinese presents its own difficulties for automatic word segmentation and named entity recognition. We consider that two factors affect the accuracy of automatic word segmentation: (1) segmentation ambiguity; (2) proper nouns such as Chinese personal names, place names, and organization names. At present, the results of automatic Chinese word segmentation and named entity recognition are still not fully satisfactory. In this thesis, Chinese word segmentation and Chinese personal name recognition are studied separately. The thesis presents a Chinese word segmentation algorithm combined with word frequency, and a method of Chinese name recognition based on Support Vector Machines (SVM) and transformation-based error-driven learning. Automatic Chinese word segmentation is an important step in Chinese information processing and the foundation of many Chinese information applications. At present, three main kinds of methods are used for automatic Chinese word segmentation: rule-based methods, statistical methods, and understanding-based methods. After analyzing existing automatic segmentation methods, this thesis concentrates on rule-based and statistical methods and presents a Chinese word segmentation algorithm combined with word frequency. The method first segments short sentences by giving priority to word length combined with word frequency. If any unmatched strings remain in a short sentence, the improved forward maximum matching method and reverse maximum matching method, combined with entropy rate, are applied to segment them. Experimental results show that the algorithm improves the accuracy of word segmentation. Recognition of Chinese personal names is both a focus and a difficulty of unknown word recognition; solving it effectively improves the precision of unknown word recognition. The thesis presents a method of Chinese name recognition based on SVM and transformation-based error-driven learning. The transformation-based learning approach is used to correct the identification results of the SVM. The transformation rules deal effectively with special cases of language use and improve the performance of the SVM. Experiments show that the method is effective in identifying personal names in Chinese texts. In the open test, precision, recall, and F-measure are all improved.
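
The abstract names forward and reverse maximum matching but gives no implementation details, so the following is only a minimal Python sketch of the two textbook procedures, not the thesis's improved variants. The toy lexicon, its frequency counts, and the function names are all hypothetical. Where the two directions disagree, the sketch prefers the result with fewer words and then the higher total word frequency; this tie-break is an assumption standing in for the entropy-rate criterion mentioned in the abstract, which is not specified there.

```python
from typing import Dict, List

# Hypothetical toy lexicon: word -> frequency count (made-up numbers).
TOY_LEXICON: Dict[str, int] = {
    "研究": 500, "研究生": 100, "生命": 400,
    "命": 50, "的": 1000, "起源": 80,
}


def forward_max_match(text: str, lexicon: Dict[str, int], max_len: int) -> List[str]:
    """Scan left to right, always taking the longest lexicon word available."""
    words: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def backward_max_match(text: str, lexicon: Dict[str, int], max_len: int) -> List[str]:
    """Scan right to left, always taking the longest lexicon word available."""
    words: List[str] = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = text[j - size:j]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                j -= size
                break
    words.reverse()  # words were collected from the end of the text
    return words


def segment(text: str, lexicon: Dict[str, int]) -> List[str]:
    """Run both directions; on disagreement, prefer fewer words, then
    higher total frequency (an assumed stand-in for entropy rate)."""
    max_len = max(map(len, lexicon), default=1)
    fwd = forward_max_match(text, lexicon, max_len)
    bwd = backward_max_match(text, lexicon, max_len)
    if fwd == bwd:
        return fwd

    def score(words: List[str]):
        return (-len(words), sum(lexicon.get(w, 0) for w in words))

    return max((fwd, bwd), key=score)


print(segment("研究生命的起源", TOY_LEXICON))  # ['研究', '生命', '的', '起源']
```

On this input, forward matching greedily takes 研究生 and is left with the stray character 命, while reverse matching yields 研究 / 生命. Both results contain four words, so the frequency total decides in favor of the reverse result. This is the kind of segmentation ambiguity the abstract lists as the first factor affecting accuracy.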
