面向自由文本的细粒度关系抽取的关键技术研究

Research on Key Technology of Free Text Oriented Fine-grained Relation Extraction

【Author】 Zhu Qian (朱倩)

【Advisor】 Cheng Xianyi (程显毅)

【Author Information】 Jiangsu University, Computer Application Technology, 2011, Ph.D.

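【Abstract (translated from the Chinese)】 Information extraction (IE) is an important research direction in information processing that has drawn wide attention, following information retrieval and machine translation. The goal of IE is to extract specified events, facts and similar information and store them in a database for users to query; the database can be populated correctly only when the relations among entities are identified correctly. Entity relation extraction has therefore become a key technology that determines the quality of an IE system, and it has broad applications. With the rapid development of the Internet, the rapid growth of online information, and the continuing maturation of natural language processing and machine learning, extracting useful structured information from free text has become feasible. Research on entity relation extraction has produced many results and is increasingly present in everyday life, for example in the Powerset semantic search engine and the Apache Software Foundation's Lucene full-text search engine architecture. However, because such systems rely on shallow text features and on small amounts of training text from a few specific domains, their results are often unsatisfactory, and entity relation extraction still faces many difficulties.

This thesis takes triples <entity, attribute, value> (Entity-Attribute-Value, EAV) as its research object, referred to here as fine-grained relations or EAV relations. Building on the theory of the Hierarchical Network of Concepts (HNC), description logic and semi-supervised learning, it studies the key technologies for extracting fine-grained relations at the semantic level (relations between entity and attribute, entity and attribute value, attribute and attribute, and attribute and attribute value). The main contributions are as follows.

1. A logic system, ALCIQ(EAV), is constructed for describing fine-grained relation ontologies (Section 3.5). Under traditional knowledge management, information resources lack a unified semantic description, so users find it hard to fuse related resources semantically; ontology technology is an important means of overcoming this difficulty. For people or heterogeneous systems that need to exchange and share information, building an ontology helps remove disagreements over concepts and terminology, establishes a shared understanding of the concepts of a domain, and provides the semantic basis for mutual understanding between humans and machines and among machines. Based on ontology technology, the thesis gives a description logic for EAV modeling, ALCIQ(EAV). Using the ALCIQ(EAV) reasoning algorithm, it formalizes EAV ontology dependency, EAV role dependency, EAV external dependency and EAV integrity, which effectively settles how the scope of fine-grained relations is delimited.

2. An HNC-based method for computing the semantic association degree between words is proposed (Section 4.3.4). In fine-grained relation extraction, association degree computation can uncover inherent and implicit connections between words and can retrieve, for an isolated word, its associated words (similar words, antonyms, collocates, co-occurring words and so on); it extends both word semantic similarity and word semantic relatedness. Through HNC, the thesis treats the whole world as an organic, universally connected whole and assumes that words are likewise interconnected: words form an undirected weighted graph (network) in which an edge joins two associated words and the edge weight is their association degree. The inherent and implicit connections between two words are computed by finding paths between them in the concept network. Using the HNC association mechanism, the middle-level expressions of HNC symbols are computed to realize word association. This solves word association degree computation at the semantic level, extends the notions of semantic similarity and semantic relatedness, and provides the basis for extracting entities, attributes and attribute values. Experimental results show that attributes and attribute values extracted via semantic association degree reflect the actual fine-grained semantic relations more objectively.

3. A semi-supervised algorithm for fine-grained relation extraction with undefined relation types is proposed (Section 5.3). Extracting relations whose types are not predefined is the core problem of fine-grained relation extraction. To address the limitations of applications restricted to predefined relation types, the thesis gives a semi-supervised clustering algorithm for undefined relation types. It comprises a learning algorithm based on positive examples and unlabeled data, a relation pattern generalization algorithm, and a relation pattern confidence computation algorithm. A fine-grained relation extraction experiment on Wikipedia is presented; with little training data, the results are still acceptable.

4. An application case of fine-grained relation extraction is given: the analysis of Chinese scientific and technical terms (Section 6.2). Such analysis helps determine the meaning and classification of Chinese scientific and technical terms, delimit and judge new terms, and identify the development priorities and directions of the fields to which the terms belong. To validate the approach, the fine-grained relation extraction method of this thesis is applied to Chinese scientific and technical term analysis. First, ALCIQ(EAV) is used to model the terms and delimit the textual scope of Chinese scientific and technical terms; next, the "term-attribute-attribute value" association degree is computed to extract the attributes of the terms and their corresponding values; finally, the semi-supervised algorithm for undefined relation types is used to cluster the terms.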
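The undirected weighted word graph described in contribution 2 can be illustrated with a small Python sketch. This is not the thesis's HNC-based method: the abstract does not give the formula, so the sketch assumes that edge weights lie in (0, 1], that a path is scored by the product of its edge weights, and that the association degree of two words is the score of the strongest path between them. The function names and the toy graph are hypothetical.

```python
import heapq


def build_graph(edges):
    """Build an undirected weighted graph as a dict of dicts."""
    graph = {}
    for word_a, word_b, weight in edges:
        graph.setdefault(word_a, {})[word_b] = weight
        graph.setdefault(word_b, {})[word_a] = weight
    return graph


def association_degree(graph, source, target):
    """Score of the strongest path between two words.

    Assumption (not from the thesis): weights are in (0, 1] and a path's
    score is the product of its edge weights, so scores never increase
    along a path and a Dijkstra-style search is valid.
    """
    if source not in graph or target not in graph:
        return 0.0
    best = {source: 1.0}
    heap = [(-1.0, source)]  # max-heap emulated with negated scores
    while heap:
        negated, word = heapq.heappop(heap)
        score = -negated
        if word == target:
            return score
        if score < best.get(word, 0.0):
            continue  # stale heap entry, a better score was already found
        for neighbour, weight in graph[word].items():
            candidate = score * weight
            if candidate > best.get(neighbour, 0.0):
                best[neighbour] = candidate
                heapq.heappush(heap, (-candidate, neighbour))
    return 0.0  # the two words are not connected


# Toy weights invented purely for illustration.
EDGES = [
    ("computer", "CPU", 0.9),
    ("CPU", "clock speed", 0.8),
    ("computer", "clock speed", 0.5),
    ("computer", "keyboard", 0.7),
]
GRAPH = build_graph(EDGES)

if __name__ == "__main__":
    print(association_degree(GRAPH, "computer", "clock speed"))
```

In the toy graph, the indirect path through "CPU" scores 0.9 × 0.8 = 0.72 and therefore wins over the direct edge of weight 0.5, which shows how a path search can surface an implicit connection that a single edge understates.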

【Abstract】 Information extraction (IE) is an important research direction in information processing, following information retrieval and machine translation. The purpose of IE is to extract specified events or facts and store them in a database for users to query; the database can be filled correctly only when the relations between entities are correct. Relation extraction has therefore become a key technology affecting the performance of IE systems, and it has a wide range of applications. With the rapid development of the Internet, the rapid growth of online information, and the maturing of natural language processing and machine learning techniques, it has become possible to extract useful structured information from free text. Relation extraction research has achieved a great deal and is increasingly part of daily life, as in the Powerset semantic search engine and the Apache Software Foundation's Lucene full-text search engine architecture. However, because these systems use shallow text features and depend on training text from a few specific domains, their performance is unsatisfactory, and relation extraction still faces many difficulties.

The research object of this thesis is the Entity-Attribute-Value (EAV) triple. Drawing on the theory of the Hierarchical Network of Concepts (HNC), description logics and semi-supervised learning, the thesis studies the key technologies of semantic-level fine-grained relation extraction (the relations Entity-Attribute, Entity-Value, Attribute-Attribute and Attribute-Value). Its main contributions are:

1. ALCIQ(EAV) is constructed to describe fine-grained relation ontologies (Section 3.5). Under the traditional knowledge management pattern, information lacks a uniform semantic description, so it is hard for users to achieve semantic fusion of related information resources. Ontology technology is an important means of resolving this difficulty. For people and heterogeneous systems that want to exchange or share information, establishing an ontology helps clear up divergences in concepts and terminology and reach a consensus on the concepts of the field; it is the semantic basis for mutual understanding between machines and between people and machines. Based on ontology technology, the thesis presents ALCIQ(EAV) for EAV modeling. Using the ALCIQ(EAV) reasoning algorithm, it formalizes EAV ontology dependency, EAV role dependency, EAV external dependency and EAV integrity, which effectively solves the problem of defining the scope of fine-grained relations.

2. A semantic association degree algorithm based on HNC is presented (Section 4.3.4). When fine-grained relations are extracted, association degree calculation can find inherent links and implicit relationships between words and can associate an isolated word with its related words (similar words, antonyms, collocating words, co-occurring words, etc.); it extends both semantic similarity and semantic correlation. With HNC, the world is treated as a universally connected organic whole, and words are assumed to be connected with each other. The words thus form an undirected weighted graph in which associated words are joined by an edge whose weight is the association degree of the two words, so inherent links and implicit relationships between words can be obtained by searching for paths between two words in the concept network. Word association is realized by computing the middle-level expressions of HNC symbols with HNC's association mechanism. Solving word association degree computation and extending semantic similarity and semantic correlation form the basis for extracting entities, attributes and attribute values. The experimental results show that the attributes and attribute values extracted by semantic association degree represent the actual fine-grained semantic relations more objectively.

3. A type-undefined fine-grained relation extraction algorithm based on semi-supervised learning is proposed (Section 5.3). Type-undefined relation extraction is the key problem of fine-grained relation extraction. To overcome the limitations of applications built on predefined relation types, the thesis gives a type-undefined relation clustering algorithm based on semi-supervised learning. The algorithm consists of a learning algorithm based on positive examples and unlabeled data, a relation pattern generalization algorithm, and a relation pattern confidence computation algorithm. A fine-grained relation extraction experiment is carried out on Wikipedia, and the result is acceptable even though the training data are relatively few.

4. An application of fine-grained relation extraction is shown: Chinese technical term analysis (Section 6.2). Chinese technical term analysis helps determine the connotation and class of Chinese technical terms, define and judge new terms, and grasp the development focus and direction of the fields to which the terms belong. To validate the effect of fine-grained relation extraction, the extraction method presented in the thesis is applied to Chinese technical term analysis. First, Chinese technical terms are modeled with ALCIQ(EAV) and the textual scope of the terms is determined. Second, the "term-attribute-value" association degree is computed, and the attributes of the Chinese technical terms and their values are extracted. Finally, the semi-supervised type-undefined relation algorithm is used to cluster the Chinese technical terms.
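Two of the components named in contribution 3, relation pattern generalization and relation pattern confidence, can be sketched in Python as follows. The abstract does not specify either algorithm, so everything here is an assumption made for illustration: patterns are tuples of the tokens between an entity and a value, generalization replaces differing tokens with a wildcard, and confidence is the share of matched pairs that are known positive seeds. Unlabeled pairs count against a pattern, which is a deliberate simplification of learning from positive and unlabeled data. All names and the toy data are hypothetical.

```python
WILDCARD = "*"


def generalize(pattern_a, pattern_b):
    """Merge two equal-length token patterns, wildcarding positions that differ.

    Returns None when the lengths differ (assumed rule, not from the thesis).
    """
    if len(pattern_a) != len(pattern_b):
        return None
    return tuple(a if a == b else WILDCARD for a, b in zip(pattern_a, pattern_b))


def matches(pattern, tokens):
    """True if every pattern token equals the context token or is a wildcard."""
    return len(pattern) == len(tokens) and all(
        p == WILDCARD or p == t for p, t in zip(pattern, tokens)
    )


def confidence(pattern, contexts, seeds):
    """Fraction of matched (entity, value) pairs that are known positive seeds."""
    matched = [pair for pair, tokens in contexts if matches(pattern, tokens)]
    if not matched:
        return 0.0
    return sum(pair in seeds for pair in matched) / len(matched)


# Toy data invented for illustration: ((entity, value), tokens between them).
SEEDS = {("Python", "1991"), ("Java", "1995")}
CONTEXTS = [
    (("Python", "1991"), ("was", "released", "in")),
    (("Java", "1995"), ("was", "introduced", "in")),
    (("Rust", "2015"), ("was", "stabilized", "in")),
    (("Go", "Google"), ("was", "designed", "at")),
]

PATTERN = generalize(("was", "released", "in"), ("was", "introduced", "in"))

if __name__ == "__main__":
    print(PATTERN, confidence(PATTERN, CONTEXTS, SEEDS))
```

Here the two seed contexts generalize to the pattern ("was", "*", "in"), which matches three of the four contexts; two of those three pairs are seeds, so the confidence is 2/3. A bootstrapping loop could keep only patterns above a chosen threshold and treat the unlabeled pairs they match as new candidate relations.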

  • 【Online Publication Contributor】 Jiangsu University
  • 【Online Publication Year/Issue】 2012, Issue 06