

SVM and TSVM Based Chinese Entity Relation Extraction

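【Author】 Xu Fen (徐芬)

【Advisor】 Wang Ting (王挺)

【Author Information】 National University of Defense Technology, Computer Science and Technology, 2007, Master's thesis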

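【Summary】 Information extraction automatically converts unstructured text into structured text. It can stand alone as a system that meets a pressing need, and it is also an important foundation for other applications such as information retrieval, text classification, and automatic question answering. Entity relation extraction is a key step in information extraction and is becoming an increasingly active research topic. Work on Chinese entity relation extraction is still at an early stage, and much remains to be done. Addressing the characteristics of Chinese entity relations, this thesis designs a series of features, including words, part-of-speech tags, entity attributes and mention information, overlap relations between entities, and concept information provided by HowNet, to form a context feature vector for the relation between entities, and uses an SVM classifier to perform Chinese entity relation extraction. With the ACE 2004 training corpus as experimental data, good recognition performance is obtained. Based on the results of level-by-level experiments, the thesis also examines in detail how the various feature sets and different numbers of training examples affect Chinese entity relation extraction performance. The experimental results show that tasks of different granularity should use combinations of feature sets at different levels of abstraction. The part-of-speech feature set is better suited to relation detection; the HowNet concept feature set is better suited to relation type and subtype recognition; the word feature set is the most basic feature set; and the inter-entity overlap feature set contributes most to extraction performance. Increasing the size of the training corpus improves recognition performance, so developing a fairly large training corpus is necessary when using an SVM classifier; once the corpus reaches a certain size, however, further growth has a weaker effect on performance, and attention should then turn mainly to feature set construction. Building on this work, and in view of the SVM's dependence on a large training corpus, the semi-supervised learning method TSVM is introduced into Chinese entity relation extraction. The experiments show that TSVM far outperforms SVM when the number of training vectors is very small, but performs worse than SVM once the number of training vectors is large. On a relatively simple problem such as relation detection, a TSVM classifier achieves good performance using only a small amount of labeled data and a large amount of unlabeled data, which lowers the cost of the extraction system and improves its portability; on the more complex problem of relation type recognition, TSVM performance is still unsatisfactory, and other semi-supervised learning methods should be considered. The thesis also studies and implements the construction of a multi-class TSVM classifier. Further work has two parts: first, improving the existing feature sets, for example by adding features such as chunking and HowNet concept structure, to raise relation extraction performance, and selecting parameters more precisely; second, quantitatively studying how the choice of labeled data affects performance and how much labeled data SVM and TSVM require.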

【Abstract】 Information extraction automatically transforms unstructured text into structured text. It can form a system in its own right to meet a strong demand, and it also provides a basis for other applications such as information retrieval, text categorization, and question answering. Entity relation extraction is an important part of information extraction and is attracting growing interest from researchers. Chinese entity relation extraction is still at an early stage, and a great deal of work remains. This thesis presents work on Chinese entity relation extraction. We design a context vector using several features, including words, part-of-speech tags, entity and mention information, overlap between entities, and HowNet concepts. Based on this context information, we apply an SVM classifier to detect and classify the relations between entities. We use the ACE 2004 training data as our experimental data and obtain encouraging results. The experimental results are analyzed in detail to investigate the impact of the various feature sets and of the number of training examples on extraction performance. The results indicate that different features should be chosen for different extraction tasks. The part-of-speech features are suitable for the relation detection task, while the HowNet concept features are appropriate for the relation type and subtype characterization tasks. The word features are the most basic feature set, and the overlap features contribute most. Performance rises as the number of training examples increases, so a large corpus is necessary when using an SVM classifier. Once the corpus reaches a certain size, however, the gain from adding more training examples becomes so small that other ways of improving extraction performance are needed, such as developing more features. To address the SVM's dependence on a large-scale corpus, we introduce the semi-supervised learning method TSVM (transductive SVM) into relation extraction, to see whether it can improve extraction performance by using both labeled and unlabeled data. The experiments show that TSVM performs much better than SVM in the same setting when labeled examples are very few, while SVM performs somewhat better than TSVM when there are many labeled examples. TSVM performs well on the relation detection task, which makes it practical for this kind of task. On the relation type recognition task, however, TSVM does not perform well, which suggests looking at other semi-supervised learning methods. A multi-class TSVM classifier is also constructed. Future work includes developing more features, such as chunking information and HowNet concept structure, to improve extraction performance; choosing classifier parameters more precisely; and investigating how the choice of labeled data affects performance and how many labeled examples SVM and TSVM require.
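To make the feature design described in the abstract more concrete, the following is a minimal sketch of how a sparse context feature vector for one entity pair might be assembled from the five feature families it names (words, part-of-speech tags, entity type and mention information, overlap, and HowNet concepts). It is an illustration only, not the thesis's implementation: the function names, the feature-name format, the window size, the entity dictionary layout, and the `TOY_HOWNET` table (a stand-in for a real HowNet lookup) are all assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for a HowNet lookup (word -> concept label).
TOY_HOWNET: Dict[str, str] = {"总裁": "human|人", "公司": "InstitutePlace|场所"}


def overlap_type(span1: Tuple[int, int], span2: Tuple[int, int]) -> str:
    """Describe how two half-open token spans [start, end) relate."""
    (s1, e1), (s2, e2) = span1, span2
    if s1 <= s2 and e2 <= e1:
        return "e1_contains_e2"
    if s2 <= s1 and e1 <= e2:
        return "e2_contains_e1"
    if e1 <= s2 or e2 <= s1:
        return "separate"
    return "partial_overlap"


def context_features(tokens: List[str], pos_tags: List[str],
                     e1: dict, e2: dict, window: int = 1) -> Dict[str, int]:
    """Build a sparse binary feature dict for the entity pair (e1, e2).

    Each entity is assumed to be a dict with keys "span", "type", "mention".
    """
    feats: Dict[str, int] = {}
    for name, ent in (("e1", e1), ("e2", e2)):
        start, end = ent["span"]
        # Entity attribute and mention features.
        feats[f"{name}_type={ent['type']}"] = 1
        feats[f"{name}_mention={ent['mention']}"] = 1
        lo, hi = max(0, start - window), min(len(tokens), end + window)
        for i in range(lo, hi):
            where = "in" if start <= i < end else "ctx"
            # Word and part-of-speech features, inside and around the entity.
            feats[f"{name}_{where}_word={tokens[i]}"] = 1
            feats[f"{name}_{where}_pos={pos_tags[i]}"] = 1
            # Concept feature, only when the toy lexicon knows the word.
            concept = TOY_HOWNET.get(tokens[i])
            if concept is not None:
                feats[f"{name}_{where}_concept={concept}"] = 1
    # Overlap feature between the two entity spans.
    feats[f"overlap={overlap_type(e1['span'], e2['span'])}"] = 1
    return feats


tokens = ["微软", "公司", "总裁", "访问", "北京"]
pos_tags = ["nr", "n", "n", "v", "ns"]
person = {"span": (0, 3), "type": "PER", "mention": "NOM"}  # 微软 公司 总裁
company = {"span": (0, 2), "type": "ORG", "mention": "NAM"}  # 微软 公司
feats = context_features(tokens, pos_tags, person, company)
```

A dictionary of this kind would still need to be mapped to numeric indices before it could be passed to an SVM or TSVM trainer; that step, and the choice of trainer, are left out here.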


