一个改进的中文分词算法及其在Lucene中的应用
Study on an Improved Chinese Segmentation Algorithm and Its Application in Lucene
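【Author】 Fu Min (付敏);
【Advisor】 Chen Chuanbo (陈传波);
【Author information】 Huazhong University of Science and Technology, Software Engineering, 2010, Master's thesis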
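【Abstract (translated from the Chinese)】 Chinese word segmentation is one of the core problems of Chinese information processing. An algorithm that combines string matching with statistics can segment Chinese text well. The algorithm first uses punctuation marks as cut points, dividing the text to be segmented into short clauses that each carry a complete meaning, which raises the accuracy of string matching. Each clause is then scanned and segmented twice, once by forward maximum matching and once by reverse minimum matching. During each scan the result is optimized according to semantic and language rules: Chinese characters, English letters, and digits are split apart, which strengthens the algorithm's handling of different types of text. Finally, ambiguity is resolved using the minimal segmentation principle and a statistical method. Chinese segmentation algorithms are usually divided into three kinds: string-matching-based, statistics-based, and understanding-based. Each has its strengths and weaknesses. The improved algorithm keeps the simple implementation and high efficiency of string matching and adds basic language rules to raise the accuracy of the initial segmentation. In the implementation, the two scans use forward maximum matching and reverse minimum matching, chosen because forward maximum matching produces fewer fragments and reverse minimum matching resolves polysemous ambiguity well. The rule-based optimization separates Chinese characters, letters, and digits during the scan, and further handles Chinese numerals and measure words and Roman numerals written in English letters, which largely solves segmentation of mixed-type text. Disambiguation compares the results of the two scans. If they are identical, the result is output directly. If they differ, an ambiguous field is judged to be present and is handled as follows: if the fragment counts differ, the result with fewer fragments is output, following the minimal segmentation principle; if the fragment counts are equal, a statistical method is applied, using the word frequencies in the dictionary to decide which result to output. A further improvement is in the dictionary's storage structure, which hashes the first two characters of a word and keeps the trailing characters in a linked list sorted by word frequency; this also improves segmentation efficiency to some extent. The whole algorithm can be used in Lucene as a component of a Chinese information retrieval system. In the experiments, its accuracy is considerably higher than that of Lucene's built-in analyzer.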
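The pre-processing described in the abstract (cutting the text at punctuation, then separating Chinese characters, letters, and digits) can be sketched as below. This is a minimal Python illustration, not the thesis's implementation. The punctuation set, the function names `split_clauses` and `split_by_type`, and the use of the range U+4E00 to U+9FFF to recognize Chinese characters are all assumptions made for the example. The special handling of numerals, measure words, and Roman numerals is omitted.

```python
import re

# Assumption: a small illustrative set of full-width and ASCII punctuation marks.
PUNCTUATION = ",。!?;:、,.!?;:"


def split_clauses(text):
    """Cut the text at punctuation or whitespace and return the non-empty clauses."""
    pattern = "[" + re.escape(PUNCTUATION) + r"\s]+"
    return [part for part in re.split(pattern, text) if part]


def split_by_type(clause):
    """Split a clause into runs of Chinese characters, Latin letters, or digits.

    Characters outside these three classes are dropped in this sketch.
    """
    return re.findall(r"[\u4e00-\u9fff]+|[A-Za-z]+|[0-9]+", clause)


if __name__ == "__main__":
    for clause in split_clauses("我们在学习Lucene,版本是3。"):
        print(clause, split_by_type(clause))
```

Each run of Chinese characters returned by `split_by_type` would then be passed to the dictionary-based scans, while letter and digit runs can be emitted as tokens directly.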
【Abstract】 Chinese word segmentation is one of the core problems in Chinese information processing. An algorithm that combines string matching with statistical methods can segment Chinese text well. The algorithm first cuts the text at punctuation marks, producing short clauses that each carry a complete meaning, which improves the accuracy of string matching. Each clause is then scanned and segmented twice, once with forward maximum matching and once with reverse minimum matching. During each scan, the result is optimized using language rules that separate Chinese characters, English letters, and digits, which strengthens the algorithm's handling of different types of text. Finally, ambiguity is resolved using the minimal segmentation principle and a statistical method. Chinese segmentation algorithms generally fall into three classes: string-matching-based, statistics-based, and understanding-based. Each has its own strengths and weaknesses. The improved algorithm keeps the simple implementation and high efficiency of string matching and adds basic language rules to raise the accuracy of the initial segmentation. In practice, the two scans use forward maximum matching and reverse minimum matching, exploiting the fact that forward maximum matching yields fewer fragments and that reverse minimum matching handles polysemous ambiguity well. Chinese characters, letters, and digits are separated by language rules during the scan. Chinese numerals and measure words, and Roman numerals written in English letters, are then processed separately, which addresses the segmentation of mixed-type text.
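The two scans can be illustrated with a short Python sketch. It is a sketch under stated assumptions, not the thesis's code: the function names, the `max_len` limit of four characters, the toy word set, and the choice to let reverse minimum matching consider only words of at least two characters (falling back to a single character when none matches) are all assumptions, since the abstract does not give these details.

```python
def forward_max_match(clause, words, max_len=4):
    """Scan left to right, always taking the longest dictionary word available."""
    tokens, i = [], 0
    while i < len(clause):
        for size in range(min(max_len, len(clause) - i), 0, -1):
            piece = clause[i:i + size]
            # A single character is always accepted as a fallback token.
            if size == 1 or piece in words:
                tokens.append(piece)
                i += size
                break
    return tokens


def reverse_min_match(clause, words, max_len=4):
    """Scan right to left, taking the shortest dictionary word of length >= 2."""
    tokens, j = [], len(clause)
    while j > 0:
        size = 1  # fallback: emit one character if no word ends at position j
        for candidate in range(2, min(max_len, j) + 1):
            if clause[j - candidate:j] in words:
                size = candidate
                break
        tokens.append(clause[j - size:j])
        j -= size
    tokens.reverse()
    return tokens


if __name__ == "__main__":
    # Assumption: a toy dictionary chosen to produce two different segmentations.
    words = {"研究", "研究生", "生命", "起源"}
    print(forward_max_match("研究生命起源", words))  # ['研究生', '命', '起源']
    print(reverse_min_match("研究生命起源", words))  # ['研究', '生命', '起源']
```

With this toy dictionary the two scans disagree on the same clause while producing the same number of fragments, which is the situation in which the frequency-based tie-break applies.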
The disambiguation step of the improved algorithm compares the results of the two scans. If they are identical, the result is output directly. If they differ, an ambiguous field is judged to be present and is handled as follows: when the fragment counts differ, the result with fewer fragments is output, following the minimal segmentation principle; when the fragment counts are equal, a statistical method is used, with the word frequencies in the dictionary deciding which result to output. Another improvement lies in the dictionary's storage structure: the first two characters of a word are stored in a hash table and the remaining characters are stored in a linked list sorted by word frequency, which improves segmentation efficiency to some extent. The whole algorithm can be used in Lucene as a component of a Chinese information retrieval system. The experimental results show that its accuracy is considerably higher than that of the analyzer provided with Lucene.
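The dictionary layout and the disambiguation rule can be sketched together in Python. Several choices here are assumptions rather than details from the thesis: the class and function names, the use of a Python list in place of a linked list, scoring a segmentation by the sum of its word frequencies, preferring the forward result on a tie, and the frequency values in the example, which are invented for illustration.

```python
class TwoCharHashDictionary:
    """Words indexed by their first two characters; tails sorted by frequency."""

    def __init__(self, word_freqs):
        self.buckets = {}
        for word, freq in word_freqs.items():
            if len(word) < 2:
                continue  # assumption: single characters are not stored
            self.buckets.setdefault(word[:2], []).append((word[2:], freq))
        for tails in self.buckets.values():
            # Most frequent tails first, so common words are found sooner.
            tails.sort(key=lambda item: item[1], reverse=True)

    def frequency(self, word):
        """Return the stored frequency of a word, or 0 if it is unknown."""
        for tail, freq in self.buckets.get(word[:2], []):
            if tail == word[2:]:
                return freq
        return 0


def resolve(forward, backward, dictionary):
    """Pick one of two segmentations using the rules stated in the abstract."""
    if forward == backward:
        return forward  # no ambiguity
    if len(forward) != len(backward):
        return min(forward, backward, key=len)  # minimal segmentation principle
    # Equal fragment counts: compare summed word frequencies (an assumption).
    forward_score = sum(dictionary.frequency(token) for token in forward)
    backward_score = sum(dictionary.frequency(token) for token in backward)
    return forward if forward_score >= backward_score else backward


if __name__ == "__main__":
    # Assumption: invented frequencies, for illustration only.
    dictionary = TwoCharHashDictionary(
        {"研究": 500, "研究生": 100, "生命": 300, "起源": 200}
    )
    print(resolve(["研究生", "命", "起源"], ["研究", "生命", "起源"], dictionary))
```

With these invented frequencies the forward result scores 300 and the reverse result scores 1000, so the sketch outputs the reverse segmentation `['研究', '生命', '起源']`.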
【Key words】 Chinese word segmentation; two-pass scanning; disambiguation; hash table; Lucene;
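- 【Online publication contributor】 Huazhong University of Science and Technology 【Online publication issue】 2012, No. 02
- 【Classification code】 TP391.1
- 【Times cited】 9
- 【Downloads】 340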