
汉语专业领域命名实体语义关系自动抽取研究

A Research for Semantic Relation Automatic Extraction among Named Entities in Chinese Professional Domain

【Author】 Zhao Junzhe (赵君喆)

【Advisor】 He Tingting (何婷婷)

【Author information】 Central China Normal University (华中师范大学), Computer Software and Theory, 2007, Master's thesis

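【Abstract (translated from the Chinese)】 We live in an era of information explosion, and Chinese-language information on the Internet is growing rapidly. Using information extraction technology to automatically find the information users need in this vast body of Chinese text has therefore become crucial. Semantic relation extraction among named entities is one of the main tasks of information extraction, so in recent years it has become a hot topic in natural language processing research in China. Current research on Chinese named-entity semantic relation extraction relies mainly on supervised or weakly supervised methods and mostly targets general-domain corpora. These methods are time-consuming and labor-intensive in annotating training corpora, writing relation extraction rules, and selecting initial relation seeds; moreover, relation extraction methods suited to general-domain corpora struggle to meet the needs of some professional domains. This thesis therefore proposes an unsupervised scheme for extracting semantic relations among named entities from professional-domain corpora and implements the system. It also attempts to use the system's extraction results to construct relation templates and relation seeds. Targeting the characteristics of professional-domain corpora, the study uses language resource tools to improve and optimize the vector space model (VSM), addressing the problem of blurred features in such corpora; based on the distribution of latent relation information, it designs a method for constructing entity-relation networks in professional-domain corpora; using the community property of networks from complex network theory, it achieves automatic discovery of relation categories; and, by analyzing the importance of words in context, it adopts a relation description method that takes the word with the highest importance weight as the relation descriptor. The system was tested on a professional-domain corpus platform and evaluated with authoritative evaluation measures; a supervised relation extraction system was also built to verify the relations obtained by the experimental system. The final results show that, in professional-domain corpora, the system not only discovers almost all known relation types but also finds some previously unknown ones, and that without supervision it obtains descriptions of the relations among named entities quickly and fairly accurately. The experiments confirm that the system performs well on professional-domain corpora without supervision, and that the results of unsupervised relation extraction can assist a supervised relation extraction system. In addition, the relation descriptions extracted by the system can provide a basis for building relation ontologies in professional domains.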

【Abstract】 We live in an era of information explosion, and Chinese information on the Internet is growing rapidly. It is crucial to use information extraction technology to automatically collect the information users need from this large body of Chinese text. Semantic relation extraction among named entities is one of the major tasks in information extraction, and in recent years it has become a hot topic in natural language processing research in China.

Most current methods for Chinese relation extraction are supervised or weakly supervised, and they mostly target general-domain corpora. These methods are time-consuming and laborious: they require tagging training corpora, writing relation extraction rules, and selecting initial relation seeds. In addition, they are sometimes not applicable to professional corpora. This thesis therefore presents an unsupervised method for discovering semantic relations among named entities in professional corpora and implements it as a system. We also attempt to use the system's extraction results to construct relation templates and relation seeds.

Based on the characteristics of professional-domain corpora, we optimized the vector space model (VSM) with linguistic tools to overcome the blurred features of such corpora. We then proposed a method for constructing an entity-relation network based on the distribution of latent relation information, and discovered relation categories automatically by exploiting the community property of complex networks. Finally, by analyzing the importance of words in context, we use the highest-weighted words as keywords to describe relations.

We tested the system on a professional-domain corpus and evaluated it with standard methods. We also built a supervised relation extraction system to verify its results. The results indicate that, without supervision, the system obtains relation descriptions among named entities rapidly and fairly accurately, and that it recovers almost all known relation types as well as some previously unknown ones.

The experiments show that the system performs well on professional-domain corpora without supervision, and that the results of unsupervised relation extraction can assist supervised methods. In addition, the relation descriptions it produces can provide a basis for constructing ontologies in professional domains.
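The four steps named in the abstract (context vectors, an entity-relation network, community-based relation discovery, and a highest-weight descriptor word) can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract gives no formulas, so plain smoothed TF-IDF stands in for the improved VSM, connected components of a similarity graph stand in for community detection, and the toy contexts, the threshold of 0.3, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def tfidf_vectors(contexts):
    """Map each entity pair to a TF-IDF weighted bag of its context words."""
    n = len(contexts)
    df = Counter(w for words in contexts.values() for w in set(words))
    vectors = {}
    for pair, words in contexts.items():
        tf = Counter(words)
        # Smoothed IDF (assumption): keeps every weight strictly positive.
        vectors[pair] = {w: c * (math.log((1 + n) / (1 + df[w])) + 1)
                         for w, c in tf.items()}
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def communities(vectors, threshold):
    """Link entity pairs with similar contexts; return connected components.

    Connected components are a simple stand-in for community detection.
    """
    pairs = list(vectors)
    adj = defaultdict(set)
    for i, a in enumerate(pairs):
        for b in pairs[i + 1:]:
            if cosine(vectors[a], vectors[b]) >= threshold:
                adj[a].add(b)
                adj[b].add(a)
    seen, groups = set(), []
    for start in pairs:
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:
            cur = stack.pop()
            group.append(cur)
            for nb in adj[cur]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        groups.append(sorted(group))
    return groups

def describe(group, vectors):
    """Pick the word with the highest summed weight as the relation label."""
    total = Counter()
    for pair in group:
        total.update(vectors[pair])  # adds the weights word by word
    return max(sorted(total), key=total.get)

# Hypothetical toy data: context words observed between each entity pair.
contexts = {
    ("aspirin", "headache"): ["treats", "relieves", "treats"],
    ("ibuprofen", "fever"): ["treats", "reduces"],
    ("virus", "influenza"): ["causes", "triggers", "causes"],
    ("bacteria", "pneumonia"): ["causes"],
}
vectors = tfidf_vectors(contexts)
groups = communities(vectors, threshold=0.3)
labels = [describe(g, vectors) for g in groups]
for label, group in zip(labels, groups):
    print(label, group)
```

On this toy input the entity pairs fall into two groups, labelled `treats` and `causes`. A real system would need the domain-specific weighting and network construction the thesis describes, and a proper community detection algorithm in place of connected components.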

  • 【Classification code】TP391.1
  • 【Citation count】2
  • 【Download count】210