Research on Key Technologies of RDF Graph Data Management (RDF图数据管理的关键技术研究)

【Author】 Wu Gang (吴刚)

【Advisors】 Wang Kehong (王克宏); Li Juanzi (李涓子)

【Author information】 Tsinghua University, Computer Science and Technology, 2008, PhD

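【Abstract (translated from Chinese)】 The Semantic Web makes it possible to share and reuse data across applications, enterprises, and communities. RDF is the foundation of the Semantic Web, and its data model is the RDF graph. Unlike existing data models, an RDF graph is a directed hypergraph that can express implicit semantics, is rich in textual information, and is very large in scale. These characteristics cause several problems in RDF graph data management: storage is hard to design, query processing is complex and inefficient, and query results are difficult to rank. To address these problems, this thesis studies several key technologies in RDF graph data management. First, the thesis studies the computation of the reflexive transitive closure when querying implicit data and proposes PLSD, a method based on a prime-number labeling scheme for directed graphs. PLSD converts the computation of reachability between nodes of an arbitrary directed graph (the reflexivity and transitivity of properties) into the computation of divisibility between the integers in their labels. Compared with traditional inference based on forward chaining and backward chaining, PLSD computes the reflexive transitive closure in RDF graphs more efficiently. Experiments show that PLSD outperforms other labeling schemes of the same kind. Second, in view of the directed-hypergraph nature of RDF graphs, the thesis proposes PI, a native RDF graph storage method. This method avoids the conversion overhead caused by inconsistent data models. It also reduces storage space overhead, makes graph algorithms easy to implement, and stores the directed edges of an RDF graph in clusters. In experiments on the LUBM benchmark, a semantic query system that combines PI storage with inference strategies such as PLSD scores higher on the combined performance metric than the systems it was compared with. For the textual information in RDF graphs, the thesis proposes a fine-grained keyword query method that uses the resource document as the basic unit of indexing and querying. This overcomes the difficulty of combining coarse-grained keyword query methods, which take the RDF document as the unit, with semantic queries, and it improves the combined recall and precision of semantic and keyword queries. Finally, for query result ranking, the thesis proposes CARRank, a method for ranking the importance of concepts and relations at the ontology level. Based on CARRank, it implements a ranking of the global importance of resources at the instance data level and a combined ranking that integrates query result similarity with the global importance of resources. The CARRank algorithm computes concept importance and relation weights iteratively, with the concepts and relations in the ontology reinforcing each other, which avoids dependence on statistical information about resources. A theoretical proof and an experimental check of its convergence are given. Experiments verify that the concept importance ranking and relation weights produced by CARRank are reasonable. The successful application of the prototype system in domains such as Chinese news confirms the value and significance of this work.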

【Abstract】 Based on the Resource Description Framework (RDF), the Semantic Web provides the ability to share and reuse data across application, enterprise, and community boundaries. The RDF graph is the fundamental data model of RDF, and it is quite different from traditional data models. Consequently, it presents new challenges to traditional data management approaches. First, an RDF graph is a hypergraph that requires a more complex storage scheme. Second, implicit semantic information and full-text information in an RDF graph complicate the process of query evaluation. Third, since web-scale RDF graphs are very common, an effective ranking scheme is indispensable. In this thesis, we address the above problems through the following work.

We study the computation of the reflexive and transitive closure in the inference engine, and propose a prime number labeling scheme, called PLSD, for directed graphs. PLSD translates the reachability between nodes in a directed graph, i.e. reflexivity and transitivity, into the divisibility between integers in their labels. In comparison with the conventional forward and backward chaining approaches, PLSD can compute the reflexive and transitive closure more efficiently. The experimental results also show that PLSD performs better than other labeling schemes.

In terms of the hypergraph property of the RDF graph model, we propose a native RDF graph storage approach called PI. It avoids the "impedance mismatch" that arises in the transformation between two inconsistent data models. PI has several advantages: 1) it reduces the cost of space; 2) it makes it easier to implement different graph-based algorithms; and 3) it clusters the directed edges in an RDF graph. We implemented a semantic query system based on this storage approach and the PLSD inference approach described above. Experimental results using the LUBM benchmark show that the proposed approaches outperform the existing approaches with respect to the combined metric.

In terms of the large-scale full-text information in an RDF graph, we propose a fine-grained keyword search approach which takes the RDF resource as the unit of indexing and retrieval. In this way, keyword search and semantic query can be combined seamlessly.

For query result ranking, we propose three levels of ranking on the RDF graph model: 1) ranking the importance of concepts and relations on the level of the ontology; 2) ranking the global importance of resources based on the results of ranking concepts and relations; and 3) ranking based on keyword search similarity and the global importance of resources. The algorithm for ranking on the level of the ontology is named CARRank. It mutually reinforces the importance of concepts and the weights of relations in the iteration process. We present a proof of the convergence of CARRank. Experiments and evaluations indicate the effectiveness of the proposed ranking algorithms.

Finally, the proposed approaches and algorithms have been applied to a prototype system. The system has been successfully used for managing semantic data in the field of Chinese news, which also indicates the practical significance of the research work in this thesis.
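
【Illustrative note】 The abstract states only that PLSD turns reachability between nodes into divisibility between integer labels; it does not give the labeling procedure. The following is a minimal sketch of that general idea, not the thesis's algorithm. It assumes an acyclic graph (the thesis covers arbitrary directed graphs), gives each node its own prime, and sets a node's label to the least common multiple of that prime and the labels of its parents. The names `primes`, `label_dag`, and `reaches`, and the example class hierarchy, are hypothetical.

```python
from itertools import count
from math import gcd


def primes():
    """Yield 2, 3, 5, ... by trial division (adequate for small examples)."""
    found = []
    for n in count(2):
        if all(n % p for p in found):
            found.append(n)
            yield n


def label_dag(edges):
    """Label the nodes of a DAG given as (source, target) pairs.

    Assumption of this sketch: the graph has no cycles.
    A node's label is its own prime times the primes of all its ancestors.
    """
    nodes = sorted({n for edge in edges for n in edge})
    parents = {n: [] for n in nodes}
    for source, target in edges:
        parents[target].append(source)

    gen = primes()
    own_prime = {n: next(gen) for n in nodes}
    labels = {}

    def label(n):
        if n not in labels:
            acc = own_prime[n]
            for p in parents[n]:
                parent_label = label(p)
                acc = acc * parent_label // gcd(acc, parent_label)  # lcm
            labels[n] = acc
        return labels[n]

    for n in nodes:
        label(n)
    return labels


def reaches(labels, source, target):
    """True if target is reachable from source (reflexively and transitively)."""
    return labels[target] % labels[source] == 0


# Hypothetical hierarchy; edges point from a class to one of its subclasses.
labels = label_dag([
    ("Person", "Student"),
    ("Person", "Employee"),
    ("Student", "TeachingAssistant"),
    ("Employee", "TeachingAssistant"),
])
print(reaches(labels, "Person", "TeachingAssistant"))
print(reaches(labels, "Student", "Employee"))
```

In this sketch a reachability test is a single modulo operation on precomputed labels, which is the kind of saving the abstract contrasts with forward and backward chaining. The labels grow quickly with graph depth, so a practical scheme would need to control label size.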

【Key words】 RDF Graph; Data Management; Semantic Web; Ontology
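  • 【Online publication contributor】 Tsinghua University
  • 【Online publication issue】 2009, Issue 08
  • 【Classification number】 TP393.092
  • 【Citation count】 14
  • 【Download count】 770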