
维基百科大数据的知识挖掘与管理方法研究

Research on Knowledge Extraction and Management of the Big Data in Wikipedia

【Author】 Xiao Kui

【Advisor】 Li Bing

【Author Information】 Wuhan University, Computer Software and Theory, 2013, PhD

【Abstract (Chinese)】 We have now entered the age of big data; production, daily life, scientific research, and services are all being transformed by it. At the same time, the traditional knowledge-formation and decision-making process of "data → information → knowledge → wisdom → decision" faces severe challenges from big data's enormous volume, diverse modalities, uncertain veracity, and rapid change. Only by transforming vast and complex big data into information and knowledge can we make sound choices. Practice has shown that collective-intelligence approaches based on large-scale collaboration, which are nonlinear, decentralized, and bottom-up, are an effective way to separate the wheat from the chaff in big data. Wikipedia is the most typical platform for producing knowledge through mass collaboration, and it is also a typical example of big data. The main goal of this dissertation is to mine high-quality domain knowledge from Wikipedia's big data and to manage that knowledge effectively. Around this goal, the main work is as follows: (1) The characteristics of Wikipedia's mass-collaboration environment are summarized, including the methods of collaborative article editing, the article quality rating levels, and the rules for selecting high-quality articles. (2) The influence of editors' collaborative behavior on article quality is studied. An editor network is built from user talk pages, and the effects of the proportion of conversational editors and the clustering coefficient of the editor network on the speed of article quality promotion are analyzed, laying the groundwork for the subsequent article quality detection. (3) A knowledge quality management method for Wikipedia is proposed that uses both article attributes and editor attributes to assess articles of all quality levels. Because these attributes can all be obtained from the Wikipedia database, and the database structure is the same across language editions, the quality detection method can be applied conveniently to articles in any language. (4) Using the above knowledge quality management method, high-quality articles in a given domain are selected from Wikipedia's big data, and their relevance to the domain is further analyzed. Articles that are both high quality and closely related to the domain are extracted as ontology concepts, and the relations among these articles are extracted as ontology relations, yielding a high-quality domain ontology. To validate this ontology construction method, the resulting domain ontology is applied in the O-RGPS domain modeling tool to annotate Role, Goal, Process, and Service domain models, and in the S2R2 Web service registration and management platform to support semantic annotation and semantic search of Web services.
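The abstract describes the editor-network analysis of contribution (2) only at a high level. As an illustration of the kind of computation involved, the following Python sketch (not the dissertation's actual code; the talk-page edge list and editor set are hypothetical placeholders) builds an editor network from user-talk-page interactions and computes the conversational-editor ratio and clustering coefficient mentioned above.

```python
# Minimal sketch, not the dissertation's implementation: build an editor
# network from user-talk-page interactions and compute two of the
# attributes the abstract mentions. Input data here is made up.
import networkx as nx

# Each pair (a, b) means editor `a` posted on editor `b`'s user talk page.
talk_edges = [
    ("alice", "bob"), ("bob", "alice"),
    ("carol", "alice"), ("dave", "bob"),
]
article_editors = {"alice", "bob", "carol", "dave", "erin"}  # editors of one article

# Editor network: undirected graph over the article's editors,
# with an edge whenever two of them exchanged talk-page messages.
g = nx.Graph()
g.add_nodes_from(article_editors)
g.add_edges_from((a, b) for a, b in talk_edges
                 if a in article_editors and b in article_editors)

# Conversational editors: those involved in at least one talk-page exchange.
conversational = {n for n in g if g.degree(n) > 0}
conversational_ratio = len(conversational) / len(article_editors)

# Average clustering coefficient of the editor network.
clustering = nx.average_clustering(g)

print(f"conversational ratio   = {conversational_ratio:.2f}")
print(f"clustering coefficient = {clustering:.2f}")
```

Under the dissertation's approach, attributes like these would then be correlated with how quickly an article is promoted through the quality grades.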

【Abstract】 At present, we have entered the age of big data. Manufacturing, daily life, research, and services are all being changed by big data. At the same time, the process of knowledge creation and the model of decision making, "data → information → knowledge → wisdom → decision", face severe challenges: big data is extremely large, comes in many modalities, is hard to verify, and changes rapidly. Only by transforming such large and complex data sets into information and knowledge can we make the right choices. Practice shows that methods based on collective intelligence, such as mass collaboration that is nonlinear and decentralized, can help people find valuable knowledge. Wikipedia is a typical platform that creates knowledge through mass collaboration, as well as a typical example of big data. On such a mass-collaboration platform, knowledge quality is always uneven. The main goals of this dissertation are to extract high-quality domain knowledge from Wikipedia and to manage that knowledge. The contributions are as follows: (1) The characteristics of Wikipedia's mass-collaboration environment are summarized, including article editing tasks, the article quality rating system, and the voting process for high-quality articles. (2) The impact of mass-collaboration behavior on article quality is analyzed. Editor networks are built from User Talk pages, and the effects of attributes such as the ratio of conversational editors and the clustering coefficient of the editor network on the speed of quality promotion are clarified. This is the groundwork for the knowledge quality management task. (3) A new method of knowledge quality management in Wikipedia is proposed. The method employs both article attributes and editor attributes and can assess articles of all quality levels. Because all the attribute values can be extracted from the Wikipedia database, the method can be used to detect article quality in any language edition. (4) High-quality articles of a specific domain are extracted from Wikipedia using the quality detection method, and the domain relevancy of each article is analyzed. Closely related articles are used as ontology concepts, and the relations between these concepts are extracted to build domain ontologies. The domain ontologies are used in the O-RGPS domain modeling tool to annotate the Role, Goal, Process, and Service domain models, and in the S2R2 platform to support semantic annotation and semantic search of Web services.
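The quality detection method of contribution (3) is likewise described only in outline. As a rough illustration of assessing article quality from combined article attributes and editor attributes, here is a minimal Python sketch; the feature set, training examples, and choice of classifier are assumptions made for illustration, not the method actually used in the dissertation.

```python
# Illustrative sketch only: predict an article's quality grade from a
# feature vector that mixes article attributes with editor attributes.
# All feature values and labels below are invented placeholders.
from sklearn.ensemble import RandomForestClassifier

# Per-article features:
# [article length, #references, #images, #editors,
#  conversational-editor ratio, editor-network clustering coefficient]
X_train = [
    [52000, 180, 12, 240, 0.61, 0.34],   # e.g. a Featured-class article
    [8000,  15,  2,  30,  0.20, 0.05],   # e.g. a Start-class article
    [21000, 60,  5,  90,  0.45, 0.18],   # e.g. a B-class article
]
y_train = ["FA", "Start", "B"]            # Wikipedia quality grades

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Predict the grade of an unseen article from the same attributes.
print(clf.predict([[30000, 95, 7, 120, 0.50, 0.22]]))
```

Any supervised classifier could stand in here; the point is only that article-level features (length, references, images) and editor-network features (conversational ratio, clustering coefficient) are combined in a single feature vector, which is what allows the method to cover all quality grades rather than only the featured ones.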

  • 【Online Publication Contributor】 Wuhan University
  • 【Online Publication Year/Issue】 2014, Issue 05