
Research on Key Issues of Constructing the Web of Linked Data (链接数据网构建的关键问题研究)

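【Author】 Zhang Xiaohui (张晓辉)

【Advisor】 Di Ruihua (邸瑞华)

【Author information】 Beijing University of Technology, Computer Application Technology, 2013, Ph.D.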

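【Abstract, translated from Chinese】 Linked data is a method for publishing and sharing data on the Internet based on semantic technologies. The Semantic Web does not merely express data on the Internet in a machine-understandable form; it also requires that data be linked, building a very large and richly linked Web of Data so that computer-assisted acquisition of information and knowledge becomes more intelligent and fine-grained. What most distinguishes the Web of Data from the traditional Web is the objects being linked and the types of links. In the Web of Data, the linked object changes from an HTML document to a URI that denotes a specific thing, and as the granularity of linked objects shrinks, the links between objects change from traditional hypertext links to RDF links that carry explicit semantics. Linked data can address the overly coarse granularity and the missing semantics of information sharing on the existing Internet, and can promote the evolution of the traditional web of linked documents into a web of data.

Although webs of linked data have already been built in several domains, the further development and application of linked data still face many problems and bottlenecks. First, a shortage of data sources slows the growth of the Web of Data. Second, object coreference, which is widespread across heterogeneous semantic datasets, hinders the automated construction of rich links between datasets. This thesis studies these problems from two angles, the application model and coreference resolution techniques, proposes corresponding solutions, and finally studies and implements a coreference resolution system for cloud computing environments. The main work and results are as follows:

(1) Cloud computing is introduced into the construction of the Web of Data. An application model for linked data based on cloud computing is proposed, and the architecture of a linked data cloud platform supporting this model is designed. The platform provides the services needed for sharing linked data, effectively lowers the technical threshold for ordinary data users to take part in building the Web of Data, encourages data owners to merge their data into the Web of Data, and supports the building of Internet-wide communities for sharing linked data.

(2) Coreference resolution based on a similarity model is studied. To address the weaknesses of traditional methods in computing property weights and handling multi-valued properties, Renyi entropy is used to describe the distribution of property values, a corresponding weight calculation model is designed, and the similarity calculation for the values of multi-valued properties is improved. Experiments on open-source semantic datasets show that the proposed similarity-model-based method achieves higher accuracy.

(3) A coreference resolution method based on Markov logic networks is proposed, which solves the problem of combining property-value similarity information with semantic constraint information in linked data coreference resolution. A model for converting the schema of semantic data into a Markov logic network is designed, together with a method for constructing the corresponding ground Markov logic network. In addition, because large datasets cannot be used directly to construct a ground Markov logic network, an optimized pre-matching method is designed. Pre-matching greatly reduces the size of the ground Markov logic network and speeds up coreference resolution. Experiments show that the Markov-logic-network-based method finds coreference relations more completely when the datasets contain rich semantic constraint information.

(4) The elastic scaling of resources in cloud computing environments is studied, and an elastic coreference resolution engine for cloud computing environments is designed and implemented on top of the two methods above, as the core functional component of the linked data cloud platform. The system automatically selects a suitable coreference resolution method according to whether the dataset contains semantic constraint information, and it parallelizes coreference resolution jobs. A dynamic resource scheduling model is designed on the basis of dynamic clusters and a buffer pool, together with the corresponding resource scaling strategy and job scheduling algorithm, which ensures the elasticity of the system. Finally, the engine is deployed on OpenStack, an open-source cloud platform management software, and its performance is tested and validated.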
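The difference between an untyped hypertext link and a typed RDF link, and the coreference problem described above, can be illustrated with a minimal Python sketch. It is not taken from the thesis: the `example.org` and `example.net` URIs, the vocabulary properties, and the helper `coreferent_groups` are hypothetical, and only the `owl:sameAs` URI is the standard W3C identifier. Triples are modeled as plain tuples, so no RDF library is assumed.

```python
# Illustrative only: a triple is a (subject, predicate, object) tuple.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"  # standard OWL property URI

# Two hypothetical datasets describe the same city under different URIs.
triples = [
    ("http://example.org/a/Beijing", "http://example.org/vocab/name", "Beijing"),
    ("http://example.net/b/city42", "http://example.net/vocab/label", "Beijing"),
    # A typed RDF link: the predicate states *how* the two URIs are related.
    ("http://example.org/a/Beijing", OWL_SAME_AS, "http://example.net/b/city42"),
]


def coreferent_groups(triples):
    """Group URIs that are connected by owl:sameAs links (simple union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, p, o in triples:
        if p == OWL_SAME_AS:
            parent[find(s)] = find(o)

    groups = {}
    for uri in list(parent):
        groups.setdefault(find(uri), set()).add(uri)
    return [g for g in groups.values() if len(g) > 1]
```

In this toy setting the `owl:sameAs` link is given by hand; the coreference resolution methods summarized in the abstract aim to discover such links automatically.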

【Abstract】 Linked data is a method for publishing and sharing data on the Internet based on semantic technology. The Semantic Web not only expresses data on the Internet in a machine-understandable way, but also links the data to construct a huge web of data with rich links. The web of data enables people to obtain knowledge and information from the Internet in a more intelligent and fine-grained way. The biggest differences between the web of data and the traditional Internet are the objects being linked and the types of links. In the web of data, the object being linked changes from an HTML document to a URI referring to a specific thing, and hypertext links become typed RDF links containing explicit semantics. In information sharing on the traditional Internet, the granularity is too coarse and the semantics of the data are missing. Linked data addresses both problems and will promote the evolution of the traditional Internet of linked documents into the web of data.

Although webs of data have already been constructed in multiple domains using linked data technology, the in-depth development and application of linked data still face many problems and bottlenecks. First, the lack of data sources leads to slow growth in data scale. Second, object coreference, which is widespread in heterogeneous semantic datasets, hinders the automated building of rich links between datasets. This thesis focuses on the application model of linked data and the technology of coreference resolution. The main work and research results include:

(1) By introducing cloud computing into the building of the web of data, an application model of linked data based on cloud computing is proposed, and the architecture of a cloud-based linked data platform is designed to support this model. The platform supplies the services needed for sharing linked data, which effectively lowers the technical threshold for ordinary data owners to share data as linked data and supports the building of linked data sharing communities across the Internet.

(2) A study of coreference resolution based on a similarity model shows that traditional methods are deficient in computing property weights and in processing multi-valued properties. A new weight calculation method based on the distribution characteristics of property values, described by Renyi entropy, is proposed, and the similarity calculation between the values of multi-valued properties is improved. Experiments on open-source RDF datasets demonstrate the advantage of the proposed method.

(3) A coreference resolution method based on Markov logic networks is proposed. A conversion model from the schema of semantic data to a Markov logic network and the corresponding grounding method are designed. In addition, some datasets are too large to be used directly to construct the ground Markov logic network, so an optimized pre-matching method is presented to narrow the matching range. The experiments show that the Markov-logic-based method performs better when processing datasets containing rich semantic constraints.

(4) By studying the elastic scaling of resources in cloud computing environments, an elastic coreference resolution system for cloud computing environments is designed based on the two methods above. The system automatically selects the appropriate coreference resolution method according to the characteristics of the dataset. Jobs in the system are parallelized to make full use of the computing resources. A dynamic resource scheduling model is designed based on dynamic clusters and a buffer pool, and the corresponding resource scaling strategy and job scheduling algorithm are also presented. Finally, the system is deployed on OpenStack, an open-source management software for cloud computing, and its performance is validated through tests.
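The abstract does not give the weight calculation model from item (2), so the following Python sketch only illustrates the general idea under stated assumptions. The Renyi entropy formula itself is standard, H_α = log(Σ p_i^α) / (1 − α). The function `property_weights`, the choice α = 2, and the normalization of entropies into weights are hypothetical and are not the thesis's model; the assumed intuition is that a property whose values are more spread out discriminates better between objects.

```python
import math
from collections import Counter


def renyi_entropy(values, alpha=2.0):
    """Renyi entropy of the empirical value distribution, in nats."""
    if alpha <= 0 or alpha == 1.0:
        raise ValueError("alpha must be positive and different from 1")
    counts = Counter(values)
    if not counts:
        raise ValueError("values must not be empty")
    total = sum(counts.values())
    power_sum = sum((c / total) ** alpha for c in counts.values())
    # Clamp to 0.0 to absorb floating-point noise for constant properties.
    return max(0.0, math.log(power_sum) / (1.0 - alpha))


def property_weights(columns, alpha=2.0):
    """Hypothetical weighting: normalize per-property entropies to sum to 1."""
    entropies = {name: renyi_entropy(vals, alpha) for name, vals in columns.items()}
    total = sum(entropies.values())
    if total == 0:
        # Every property is constant: fall back to uniform weights.
        return {name: 1.0 / len(entropies) for name in entropies}
    return {name: h / total for name, h in entropies.items()}
```

For example, a `name` property with four distinct values and a `type` property that is identical for every object would receive all and none of the weight, respectively, under this assumed normalization.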
