Comparative Study of Automatic Entity Relation Extraction (实体关系自动抽取技术的比较研究)

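【Author】 Ning Haiyan (宁海燕)

【Advisor】 Wang Xiaolong (王晓龙)

【Author information】 Harbin Institute of Technology, Computer Science and Technology, 2010, Master's thesis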

【Abstract】 With the development of computer and network technology, a large amount of information has become available in the form of electronic documents, and extracting useful information from these natural-language texts is attracting growing attention. Information extraction technology emerged in response, and relation extraction is one of its subtasks. Specific factual information in text is represented as entities, and determining the relationships between these entities is called entity relation extraction. Entity relation extraction plays an important role in constructing ontologies and improving information retrieval. This thesis studies several issues in entity relation extraction.

First, the thesis extracts domain-specific terms: words that carry important semantic relations but fall outside traditional named entities. Because evaluation data for domain-specific terms are not standardized and evaluating such terms is difficult, several mainstream statistical methods for automatic Chinese term extraction are compared and analyzed in detail. Both an objective evaluation based on a professional computer dictionary and a subjective evaluation based on human judgment are used, with precision, recall and F-measure as metrics. The thesis also proposes a term extraction method based on the weights of a linear support vector machine; experimental results show that this method extracts domain-specific terms effectively.

Second, to meet the requirements of different applications, a unified corpus is used to compare feature-based supervised, semi-supervised and unsupervised entity relation extraction methods.

For supervised entity relation extraction, previous studies did not consider how features affect the no-relation case, in which two entities have no relation. The thesis therefore compares in detail how general features (words around the entities, entity type and subtype, entity position, and dependency parsing of entity head words and content) affect real relations and no-relation. It also proposes a new feature, the position information of feature words; experiments show that this feature effectively improves the precision of entity relation extraction.

For semi-supervised entity relation extraction, Bootstrapping is used in several comparison experiments: the effect of entity features and seed-set size on extraction performance, and the performance of semi-supervised versus supervised extraction under the same conditions. The experimental results indicate that semi-supervised entity relation extraction can improve the precision of entity relation extraction.

Unsupervised entity relation extraction mainly relies on clustering, so the thesis studies the effect of clustering algorithms and merging strategies on entity relation extraction. Three clustering algorithms (K-means, Self-Organizing Map (SOM) and Affinity Propagation) and two merging strategies (DCM and Cosine) are compared and analyzed. Affinity Propagation achieves the best precision in the experiments, while SOM has the advantage in running time.
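The dictionary-based evaluation mentioned in the abstract can be illustrated with a minimal sketch. It assumes that extracted candidates are scored by exact string match against a reference term dictionary; the function name `evaluate_terms` and the toy term lists are hypothetical and are not taken from the thesis.

```python
def evaluate_terms(extracted, reference):
    """Return (precision, recall, f_measure) for extracted terms.

    Assumption: a candidate counts as correct only if it matches a
    reference dictionary entry exactly; duplicates are ignored.
    """
    extracted_set = set(extracted)
    reference_set = set(reference)
    correct = len(extracted_set & reference_set)

    precision = correct / len(extracted_set) if extracted_set else 0.0
    recall = correct / len(reference_set) if reference_set else 0.0
    if precision + recall == 0:
        f_measure = 0.0
    else:
        # Balanced F-measure: harmonic mean of precision and recall.
        f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical toy data: 3 of 4 candidates appear in a 5-entry dictionary.
candidates = ["support vector machine", "clustering", "the method", "bootstrapping"]
dictionary = ["support vector machine", "clustering", "bootstrapping",
              "named entity", "dependency parsing"]
p, r, f = evaluate_terms(candidates, dictionary)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

A manual (human-judged) evaluation would replace the dictionary lookup with annotator decisions, but the three metrics are computed the same way.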
