Automatic Identification of Chinese Organization Names Based on SVM and Maximum Entropy (SVM和最大熵相结合的中文机构名自动识别)

【Author】 Yang Delai (杨德来)

【Advisor】 Huang Degen (黄德根)

【Author information】 Dalian University of Technology, Computer Software and Theory, 2006, Master's

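【Abstract (translated from Chinese)】 Unknown-word recognition is one of the main difficulties in Chinese automatic word segmentation, and Chinese organization names are an important class of unknown words: they cover many domains, come in many types and forms, and the vast majority are not listed in dictionaries. Automatic recognition of Chinese organization names is therefore important for improving the precision of both Chinese word segmentation and syntactic parsing.

This thesis proposes a method for automatic Chinese organization name recognition that combines a Support Vector Machine (SVM) with maximum entropy. Recognition is restricted to complete organization names that end in an organization-name feature word. Based on the characteristics of organization names, recognition is divided into two parts: back-boundary decision and front tagging. For each word in the text that appears in the feature-word dictionary, an SVM decides whether it is an organization-name feature word (back-boundary decision). Starting from the word preceding a recognized feature word, a maximum entropy model tags words moving backwards through the text until it reaches a component that is not part of the organization name (front tagging). The process then repeats over the rest of the text.

To make back-boundary decision more efficient, a driven recognition method is proposed: back-boundary decision is applied only to words in the text that are listed in the feature-word dictionary, to determine whether each is an organization-name feature word, and front tagging starts from each recognized feature word. Back-boundary decision is thus a binary classification problem, and since SVM is an excellent binary classifier, an SVM-based back-boundary model can solve feature-word recognition effectively. The SVM-based back-boundary decision model is built from a statistical analysis of organization-name feature words and their grammatical features.

The words that precede the feature word in an organization name have a fairly complex composition. Maximum entropy can flexibly combine many scattered, fragmentary pieces of knowledge, works well on complex problems, and handles multi-class classification with good efficiency, so the maximum entropy front-tagging model effectively solves the fairly complex problem of recognizing the front words of Chinese organization names. Based on the features of these front words and the results of statistical analysis, maximum entropy feature templates are designed, a feature set is constructed, and parameters are estimated to obtain the maximum entropy front-tagging model.

Experiments show that the combined SVM and maximum entropy method is effective: in the open test, the system reaches a recall of 91.05% and a precision of 93.59%, with an F-measure of 92.84%. Compared with similar work in the current literature, the system achieves relatively good recognition results. The proposed method also generalizes well and can be used to recognize other unknown words such as person names and place names.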

【Abstract】 Chinese organization name recognition is a named entity recognition task and a basic research problem in Chinese lexical analysis. Unknown Chinese organization names in a text reduce the accuracy of segmentation and lexical analysis, so a segmentation system needs to be able to recognize them. This thesis proposes an automatic recognition method for Chinese organization names that combines SVM and Maximum Entropy. For each word that appears in the feature-word dictionary, an SVM decides whether it is an organization-name feature word (back-boundary decision). A Maximum Entropy model then tags words backwards, starting from the word before the feature word, until it encounters a component that is not part of an organization name (front tagging); the process is then repeated over the rest of the text. To improve the efficiency of back-boundary decision, a driven recognition method is proposed: the back-boundary decision is made only for words in the text that are listed in the feature-word dictionary, and the front part of the organization name is tagged from each recognized feature word. Back-boundary decision is a binary classification problem, and SVM can solve the recognition of organization-name feature words effectively. The words in the front part of an organization name have a complex composition; Maximum Entropy combines different kinds of textual information and so handles the more complex problem of recognizing these front words. Based on the features of the front words and an analysis of the statistical results, we design the Maximum Entropy feature templates, build the feature set, and estimate the parameters to obtain the Maximum Entropy front-tagging model. The results show that combining SVM and Maximum Entropy for Chinese organization name recognition is effective: in the open test, recall, precision, and F-measure are 91.05%, 93.59%, and 92.84% respectively. Compared with similar work in the current literature, the recognition system obtains relatively good results. The method can also be used to recognize other named entities, such as person names and place names.
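The two-stage procedure described in the abstract can be sketched as a short control loop. This is a minimal illustration only, not the thesis's implementation. The names `FEATURE_WORDS`, `recognize_org_names`, `toy_suffix_classifier`, and `toy_front_tagger` are hypothetical, the dictionary entries are made up, and the two toy functions merely stand in for the trained SVM and maximum entropy models, whose features are not given here.

```python
from typing import Callable, List, Tuple

# Hypothetical feature-word dictionary; the thesis's actual dictionary is not shown.
FEATURE_WORDS = {"大学", "公司", "银行"}


def recognize_org_names(
    words: List[str],
    is_org_suffix: Callable[[List[str], int], bool],
    is_org_part: Callable[[List[str], int, int], bool],
) -> List[Tuple[int, int]]:
    """Return (start, end) word-index spans, end inclusive, of recognized names."""
    spans = []
    for i, word in enumerate(words):
        # Driven recognition: only dictionary words reach the boundary classifier.
        if word not in FEATURE_WORDS:
            continue
        # Back-boundary decision (an SVM in the thesis; any binary classifier here).
        if not is_org_suffix(words, i):
            continue
        # Front tagging: move leftwards until the tagger rejects a word.
        start = i
        while start > 0 and is_org_part(words, start - 1, i):
            start -= 1
        spans.append((start, i))
    return spans


def toy_suffix_classifier(words: List[str], i: int) -> bool:
    # Stand-in for the SVM: accept a feature word unless it starts the sentence.
    return i > 0


def toy_front_tagger(words: List[str], j: int, end: int) -> bool:
    # Stand-in for the maximum entropy tagger: reject a few non-name words.
    return words[j] not in {"在", "的", "了", "他", "工作"}


words = ["他", "在", "大连", "理工", "大学", "工作"]
spans = recognize_org_names(words, toy_suffix_classifier, toy_front_tagger)
print(["".join(words[s:e + 1]) for s, e in spans])  # prints ['大连理工大学']
```

Passing the two models in as callables keeps the control flow separate from the classifiers, so trained models could replace the toy functions without changing the loop. The sketch does not resolve overlapping spans, which can arise when one feature word occurs inside a longer organization name.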

  • 【Classification number】TP391.4
  • 【Citation count】9
  • 【Download count】378