节点文献

单类中心学习及其在二元关系抽取中的应用

The Learning of Single Class Center and the Application in Binary Relation Extraction

【作者】 李志圣

【导师】 何丕廉;

【作者基本信息】 天津大学 , 计算机应用技术, 2008, 博士

【摘要】 在互联网上进行二元关系抽取,是当前信息抽取的重要研究方向。为利用互联网的大量未标定语料,许多文献提出了基于self-training机制的学习方法:即在小标注集上训练初始系统,然后在系统运行过程中,自动标定可靠候选,重新训练,以改进系统性能。实践证明:上述方法在二元关系抽取中是行之有效的,但已有文献缺乏对学习过程的理论分析。本文首先将在二元关系抽取中的模式学习问题转化为单类文本中心的学习问题。在文本向量空间中,当初始中心被给定后,可将其足够小邻域内的文本向量作为自动标定数据。本文要解决的核心问题是:当数据集具有何种特性时,利用自动标定数据能确定地改进对单类中心的学习?为解决该问题,本文研究文本向量空间的分布特性。为克服高斯混合模型在描述具有硬聚类特性的数据分布时的缺点,本文提出了基于k-means算法划分区域的TGMK模型,并揭示了TGMK模型与k-means算法、高斯混合模型的密切联系。实验结果表明:TGMK模型适合描述多类文本数据。本文在k-means算法基础上提出了single-mean算法。文中证明:当多类数据集适合被1-TGMK的泛化模型—1-TGMR模型所描述时,新算法从目标类的初始中心出发,将收敛到实际中心。至此,完成了对核心问题的解答。实验表明了新算法在文本数据上的有效性,从而说明了self-training机制在二元关系抽取中的有效性。本文为二元关系抽取工作建立了基于single-mean算法的形式化学习模型,并针对在互联网上进行二元关系抽取的特殊性,提出了新的候选评分方法和自动标定方法。本文将学习模型应用到中文问答对和中英文术语对的抽取中。与前人工作不同的是:本文将self-training机制引入中文问答模式和中英文术语模式的学习中,使得系统对人工标定语料的依赖度减到最小;本文利用启发规则,改进模式和候选的评分方法。实验表明:与同类系统相比,新系统能在更小的标注集上,实现更优的性能。

【Abstract】 To extract the binary relation from web is an important research direction in the field of information extraction. Many literatures had presented learning methods based on self-training mechanism. In these methods, an initial system is trained on a small labeled data set. Then the system labels the reliable candidate data to re-train itself for better performance.These Literatures show that the above methods are efficient in extraction of binary relation. But no literature tries to analyze the methods strictly.This paper transforms the pattern learning in the extraction of binary relation into the learning of centre of single text class. In text vector space, the vectors in the small neighborhood region of the initial centre could be labeled as the reliable data. This paper aims to answer the key problem: what nature the data set should owns so that the self-labeled data can definitely improve the learning of single class centre.This paper solves the key problem through the study on the nature of text vector space. For conquer the defects in the description for distribution of“hard”data set by Gaussian mixture model, this paper presents a new model: TGMK model based on the partitions acquired by k-means algorithm, and exposes the relations among the TGMK model, k-means algorithm and Gaussian mixture model. The experiment result shows that TGMK model is suitable as the description for the text data set of multiple classes.Based on k-means algorithm, this paper presents a new algorithm: single-mean algorithm. This paper proves that if the data set of multiple classes is suitable to be described by 1-TGMR model which is the generalization version of TGMK model, the output centre of single-mean algorithm will definitely converge to the actual centre of data set from the initial centre. The above researches solve the key problem perfectly. The experiment shows that single-mean algorithm is efficient on the text data set of multiple classes, which also shows that the learning method based on self-training mechanism is efficient in the extraction of binary relation.This paper creates a formal learning model for the extraction of binary relation based on single-mean algorithm, and presents a new score method for candidates and a new self-labeled method against the particularity of the extraction of binary relation from web. This paper uses the formal model to acquire Chinese Q-A patterns and Chinese-English terminology pairs. Differently with the previous work, this paper presents new methods to learn Chinese Q-A patterns and Chinese-English terminology patterns based on self-training mechanism, which reduces the dependence of labeled data set to the maximum extent. This paper also utilizes the heuristic rules to improve the scoring methods for the patterns and candidates. The experiment results show that compared with the same kind of systems, our systems have better performances on smaller labeled data set.

  • 【网络出版投稿人】 天津大学
  • 【网络出版年期】2009年 08期
节点文献中: