面向市场情报分析的Web实体事件融合问题研究

Research on Web Entity Event Fusion for Market Intelligence Analysis

【Author】 Sun Tao (孙涛)

【Advisor】 Wang Xinjun (王新军)

【Author information】 Shandong University, Computer Software and Theory, 2014, PhD

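【Abstract, translated from Chinese】 With the rapid development of the Internet, the Web has become an open, widely distributed global information service center. Enterprises hope to obtain valuable market intelligence through big data analysis and thereby gain an edge in fierce market competition. On the Web, enterprises care most about event information describing entities with which they have a business interest (including companies, products, and people). These events describe the activities the entities engage in or their latest status, and provide first-hand material for mining market intelligence. Large amounts of event information flood the Web in the form of news, reviews, and messages; this information is highly redundant, of poor accuracy, and scattered, which makes market intelligence analysis very inconvenient. Eliminating redundancy and discovering event relations, so that event information can be integrated effectively, is therefore a prerequisite for obtaining accurate market intelligence. As an important step in market intelligence analysis, Web entity event fusion can supply high-quality data and comprehensive, truthful, and reliable data support, and it has attracted the attention of a growing number of researchers. However, because event information on the Web mostly appears in unstructured forms such as news, with free expression, varied form, and casual publication, Web entity event fusion still faces the following open problems: (1) different websites describe the same event in very different ways, so event fusion must first identify the different mentions that describe the same event; (2) because events keep developing, websites differ in how much detail they give, websites have their own preferences, and editors make mistakes, event information on the Web can be incomplete, outdated, erroneous, or false, so to ensure the accuracy of the data used for market intelligence analysis, Web entity event fusion must resolve event conflicts; (3) the description of a single event rarely reveals the full picture of the event or how it came about and unfolded, so to give market intelligence analysis a comprehensive description of entity events, Web entity event fusion must establish the associations between different entity events, laying the groundwork for mining implicit associations between events. Web entity event fusion guarantees data quality and is a prerequisite for market intelligence analysis. This thesis studies several key problems in event fusion, and its main work and contributions can be summarized as follows: (1) To identify the many different event mentions on the Web, this thesis proposes a Web entity coreferent event identification method based on heterogeneous information networks, which effectively improves the accuracy of identifying different mentions of an event. The method uses a collective coreferent event identification algorithm based on hierarchical clustering and exploits the mutual influence between matching decisions to identify coreferent events iteratively. For event similarity measurement, the method makes combined use of the relations among entities, events, documents, and data sources, applying multiple features to obtain a more accurate overall similarity between event mentions. Experiments on a company event data set, a person event data set, and a product event data set show that the proposed algorithm completes the Web entity coreferent event identification task effectively, with good recall and precision. (2) To address the incomplete, outdated, contradictory, and erroneous event information supplied by different event mentions, this thesis proposes an event conflict resolution method based on D-S evidence theory that effectively resolves conflicts between event mentions. The method applies a conflict resolution strategy targeted to the type of event conflict and uses the combination rule of D-S evidence theory, which effectively improves the accuracy of event conflict resolution. To compute the credibility of event attributes, it uses factors such as how often an event attribute fact appears, its position in the document, and the quality of the data source, and computes the credibility of each event attribute fact with a semi-supervised learning method. To address the combination rule paradox in traditional D-S evidence theory, the theory is extended, which improves the accuracy of event conflict resolution and allows new strategies and features to be added, so the method is highly adaptable. (3) Because a single event description does not reveal an event's cause, development, and direction, this thesis proposes a method for building an entity event association graph from event relations and entity relations, which effectively establishes the associations between entity events. The method uses the five most basic relation patterns between events to reconstruct how events occur and develop and, with the help of entity relation discovery, describes the complex associations between entity events as a graph, laying the groundwork for mining implicit relations between events. For event relations, a method for constructing an event association graph is proposed based on existing event relation patterns; experiments verify that the proposed method effectively establishes associations between entity events with high accuracy. By studying Web entity event fusion, this work addresses the data quality problems facing market intelligence analysis and lays a foundation for large-scale intelligence analysis, so the research has positive significance. Event relation detection, event pattern discovery, and new mechanisms for representing event relations are directions for future work.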

【Abstract】 With the rapid progress of the Internet, the Web has become an open, global information center. Companies want to obtain valuable market intelligence through big data analysis and so seize opportunities in fierce market competition. On the Web, companies are concerned with events involving entities related to them (including companies, products, and people). These events describe the entities' activities and latest status, and provide first-hand information for mining market intelligence. A large amount of event information exists on the Web in the form of news, reviews, and messages. It is highly redundant, inaccurate, and scattered, which greatly hinders market intelligence analysis. Eliminating redundancy, discovering relations between events, and integrating event information are therefore preconditions for obtaining accurate market intelligence. As an important step in market intelligence analysis, Web entity event fusion can provide high-quality, comprehensive, truthful, and reliable data for that analysis, and it has therefore attracted more and more researchers. However, event information appears on the Web in unstructured forms such as news, which are freely expressed, varied in form, and casually published. Web entity event fusion therefore has to solve the following problems: (1) different websites describe the same event very differently, so the first problem to solve is event coreference resolution; (2) because events progress, sites provide event mentions at different levels of detail, websites have their own preferences, and editors make errors, information on the Web can be incomplete, outdated, erroneous, or false, so to ensure that market intelligence analysis has accurate data, Web entity event fusion needs event conflict resolution; (3) it is difficult to see the whole picture of an event, or its ins and outs, from a single event mention, so to provide a panorama of entities and events, Web entity event fusion needs to establish the correlations between entities and events. Web entity event fusion is a prerequisite for high-quality data and for market intelligence analysis. The main work and contributions of this thesis are summarized as follows: (1) To identify the many different event mentions on the Web, we present a method of Web entity event coreference resolution based on heterogeneous information networks, which effectively improves the accuracy of event coreference resolution. The method adopts a hierarchical clustering algorithm for event coreference resolution and uses the interaction between matching decisions to resolve coreference iteratively. For event similarity measurement, the method uses the relations among entities, events, documents, and data sources, measuring similarity with multiple features to obtain a more accurate overall similarity between event mentions. Experiments on company, person, and product event data sets show that the proposed algorithm accomplishes the event coreference resolution task with good recall and precision. (2) Because different event mentions provide incomplete, outdated, contradictory, and erroneous data, we put forward an event conflict resolution method based on D-S evidence theory, which effectively resolves conflicts between event mentions. The method adopts a resolution strategy suited to the type of event conflict and uses the combination rule of D-S evidence theory, which effectively improves the accuracy of event conflict resolution. To calculate the credibility of event attributes, the method uses the frequency of an event attribute fact, its location in the document, the quality of the data source, and other factors, and computes the credibility of each event attribute fact with a semi-supervised machine learning method. To address the combination rule paradox in traditional D-S evidence theory, we extend the theory, which increases the accuracy of event conflict resolution and allows new strategies and features to be added, so the method is highly adaptable. (3) Because the cause and progress of an event cannot be described from a single event mention, we present a method to construct a panorama based on entity relations and event relations. The method uses five basic event relations together with entity relations to describe the complex relations among entities and events, and lays the foundation for mining implicit relationships between events. For event relations, we put forward a method to construct an event relation graph according to the event relation types; we then use entity relationships to link event relation graphs into a panorama. According to the experimental results, the proposed method effectively establishes correlations between entities and events with high accuracy. The research on Web entity event fusion addresses the data quality problems of market intelligence analysis and lays the foundation for large-scale intelligence analysis, so this research is significant. Event relation detection, event pattern discovery, and new mechanisms for representing event relations are directions for future research.
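The abstract does not spell out the combination rule or the extension the thesis makes to it. As background only, the classical Dempster combination rule and the well-known paradox it produces under high conflict can be sketched as follows. This is a minimal illustration of the textbook rule, not the thesis's extended method; the function name `dempster_combine`, the hypothesis labels, and all mass values are hypothetical.

```python
from itertools import product


def dempster_combine(m1, m2):
    """Combine two mass functions with the classical Dempster rule.

    Each mass function maps a frozenset of hypotheses to its mass,
    and the masses of one function are assumed to sum to 1.
    Returns (combined mass function, conflict coefficient K).
    """
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            # Agreeing evidence: the product supports the intersection.
            combined[common] = combined.get(common, 0.0) + mass_a * mass_b
        else:
            # Disjoint focal elements: the product counts as conflict.
            conflict += mass_a * mass_b
    norm = 1.0 - conflict
    if norm < 1e-12:
        raise ValueError("sources are in total conflict; rule is undefined")
    # Normalise by 1 - K so the combined masses sum to 1 again.
    return {h: m / norm for h, m in combined.items()}, conflict


# Hypothetical example: two sites report the value of one event attribute.
A, B, C = frozenset({"A"}), frozenset({"B"}), frozenset({"C"})
m_site1 = {A: 0.99, B: 0.01}  # site 1 is almost sure the value is A
m_site2 = {C: 0.99, B: 0.01}  # site 2 is almost sure the value is C

fused, k = dempster_combine(m_site1, m_site2)
# K is close to 1, yet after normalisation essentially all mass lands
# on B, the value that both sites considered very unlikely.
print(fused, k)
```

With these inputs the two sources disagree almost completely, and the normalisation step hands nearly all belief to the one value both sources rated at 0.01. This is the kind of counterintuitive result usually meant by the combination rule paradox, and it is why highly conflicting Web sources call for a modified rule.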

  • 【Online publication contributor】 Shandong University
  • 【Online publication year and issue】 2014, Issue 10