Research on Entity Retrieval and Mining with the Web (基于Web的实体信息搜索与挖掘研究)

【Author】 Bao Shenghua (包胜华)

【Advisor】 Yu Yong (俞勇)

【Author Information】 Shanghai Jiao Tong University, Computer Application Technology, 2008, Ph.D.

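【Abstract, translated from the Chinese】 With the rapid development of network technology, today's World Wide Web has entered a new phase in which several generations coexist and develop together. The traditional Web (Web 1.0) forms the main body of today's Web. The social Web (Web 2.0) has grown rapidly in recent years and has become an emerging force. Meanwhile, so that machines can understand and process web data as people do, researchers are actively advancing Semantic Web technology, which is expected to become the mainstream carrier of the next-generation Web (Web 3.0). Applications keep emerging on all of these generations of the Web, and descriptions of all kinds of entities are scattered among them. This brings users convenience but also a key problem: information overload. How to effectively find the entity information that users need in this huge and complex information space has become an active research topic in recent years. To meet this need, this thesis analyzes the characteristics of each generation of the Web, introduces the concept of entity information retrieval and mining in Web 1.0, 2.0 and 3.0, carries out systematic theoretical research for each generation, and proposes a series of mining algorithms.

In the traditional Web (Web 1.0), most research aims to provide users with the most relevant web pages, yet in practice more and more users care about the information contained within the pages rather than the pages themselves. To address this need, the first part of this thesis proposes the following algorithms for mining entity information from web pages: 1) Expert search: a probability-based, fine-grained expert search model. 2) Implicit expert-expertise association mining: a multi-typed separable mixture model for efficiently mining the implicit associations between experts and their areas of expertise. 3) Competitor mining: a novel algorithm (CoMiner) that automatically mines domain-independent competitor information from the Web. 4) Time-associated event mining: a new algorithm (TESer) that mines event information from the Web and integrates it chronologically.

The rapid development of Web 2.0 has brought a large volume of public annotations on entities such as web pages, pictures, papers and experts, for example on the Del.icio.us bookmarking site and the Flickr photo-sharing site. The second part of this thesis analyzes the characteristics of Web 2.0, mines the entity relations it contains, and uses the mined information to improve existing applications: 1) Social search: two new algorithms that improve, respectively, the dynamic ranking and the static ranking of web search. 2) Social language model: a language annotation model that further improves the retrieval effectiveness of language models. 3) Social browsing: an improved web browsing algorithm that makes full use of the semantic associations among web annotations and their implicit hierarchical information.

The Semantic Web was proposed so that machines, too, can understand web information. It is currently at an early stage of development. Because it is the natural next extension of the existing Web, this thesis refers to it as Web 3.0. The third part of this thesis presents exploratory research on the construction of Web 3.0 and its applications: 1) Semantic emergence: the Semantic Web is usually built from ontologies defined by experts, whereas this thesis proposes an algorithm that automatically derives hierarchical semantics from social annotations. 2) Semantic application: this thesis further applies semantic information to Web service composition and proposes a new algorithm for discovering and composing semantic services.

The results show that entity mining in Web 1.0, 2.0 and 3.0 environments can greatly reduce the time users need to obtain their target information and helps them better understand their search targets.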

【Abstract】 With the rapid development of Web technologies, the World Wide Web has entered a new phase in which multiple generations coexist, each still developing quickly. The traditional Web (Web 1.0) remains the principal part of the current Web. The social Web (Web 2.0) has grown rapidly in recent years and has become a notable rising part of today's Web. At the same time, many people are working on the Semantic Web, in which machines can understand and process web data as human beings do; it is expected to become the mainstream of the next generation of the Web (Web 3.0). Applications keep emerging in all of these generations. They bring web users great convenience but also a key problem: information overload. How to effectively find the information a user wants in such a huge and complex information space has become an active research topic in recent years. In this thesis, we propose to mine entity information in Web 1.0, 2.0 and 3.0. For each generation, we analyze the properties of the Web and propose a series of mining algorithms, as follows.

In the traditional Web (Web 1.0), most work aims to provide the user with the most relevant web pages. In reality, more and more users are concerned with the information about entities scattered within a web page rather than with the page itself. Motivated by this, the first part of this thesis proposes the following algorithms for entity mining. 1) Expert search: We propose a new algorithm, the fine-grained model, to address the problem. 2) Expert-expertise mining: We propose a new typed separable mixture model to mine the latent associations between experts and expertise effectively. 3) Competitor mining: We propose a new algorithm, CoMiner, to mine competitors automatically in a domain-independent manner. 4) Temporal event mining: We propose a new algorithm, TESer, to mine events chronologically.

With the boom of Web 2.0, more and more web resources, such as web pages and pictures, are annotated by web users with different backgrounds, for example through services provided by Del.icio.us, Flickr and others. The second part of this thesis analyzes the properties of Web 2.0 and mines the entity relations in it. 1) Social search: We propose two new algorithms to improve the similarity ranking and the static ranking of web pages, respectively. 2) Social language model: We propose a new algorithm that smooths the estimation of the language model with social annotations. 3) Social browsing: We propose an effective algorithm that uses semantic associations and hierarchical information to improve the social browsing experience.

To make machines understand web information, researchers proposed the Semantic Web, which defines the semantics of web resources explicitly. The Semantic Web is at an early stage of rapid development. As a natural extension of the current Web, the Semantic Web (referred to here as Web 3.0) is expected to be the next generation of the Web. The third part of this thesis makes a first attempt at mining the semantic information of Web 3.0. 1) Emergent semantics: We propose an effective algorithm for deriving hierarchical semantics from social annotations. 2) Semantic web service composition: We propose a semantic rewriting approach for semantic web service composition based on query rewriting.

The experimental results show that mining entities in Web 1.0, 2.0 and 3.0 benefits web users considerably by saving the time needed to find the target information and by making the target entities easier to understand.
