节点文献

图数据库查询处理技术的研究

Research on Techniques for Query Processing on Graph Databases

【作者】 张硕

【导师】 李建中;

【作者基本信息】 哈尔滨工业大学 , 计算机软件与理论, 2010, 博士

【摘要】 作为一种通用的数据结构,图可以用来表示数据对象之间的复杂联系。例如:图可以表示化合物的分子结构,蛋白质交互网络,社会网络等。随着科学与工程领域中图数据的大量出现和累积,图数据管理已成为数据管理领域一个重要和热点研究的子领域。图数据库查询处理是其中最重要的研究分支之一,其对图相关的绝大部分处理和应用(例如:图挖掘、化学数据库PubChem)起着基础支撑作用。本文主要对图数据库中的查询处理技术进行深入研究,归纳总结了现有研究成果的主要思想和优缺点,提出了一些新的图数据库查询处理方法,主要研究成果如下:1.提出一种图数据库中高效处理超图包含查询的新方法。新方法综合的从图数据库的压缩组织、构造有效的特征索引以及基于压缩组织来处理查询三个方面着手考虑问题。(1)在图数据库的压缩组织方面,提出图数据库的有效组织方法,以提高整体查询处理效率。现有的采用过滤-验证机制的方法将图数据库中的图逐个的独立存放。提出方法将图数据库中图结构化的压缩组织起来。通过压缩组织方法,产生一个逻辑数据结构GPTree,其中记录了数据库中图的公共子图的信息。为了优化的构造GPTree,形式化定义了最优诱导子图选择问题;证明了其是一个NP难问题,并提出了一个近似比为2的近似算法。(2)在构造有效的特征索引方面,提出高效而不依赖于历史查询的子图索引特征生成方法,以及两种索引结构CRGraph和FGPForest。首先基于分析,给出索引特征的显著性度量。提出了找出所有显著性不小于用户需求的索引特征的方法,即精确索引特征生成方法。为了适应需要更加快速的生成索引的应用场景,提出了特征索引构造的一个近似方法。这两种方法都是基于图模式挖掘的方法。为了高效使用索引特征,对索引特征进行排序;并且基于理论分析给出了求解其最优排序的算法。(3)在基于压缩组织来处理查询方面,提出从多个图到一个图的子图同构检测的新方法,称为GPTreeTest。现有方法逐个的考察每个图对进行检测,新方法能够利用压缩组织中公共子图的信息,显著减少对多个图的子图同构检测的总时间。最后,在真实数据集和合成数据集上的实验结果表明,提出方法比目前最好方法高效1至2个数据量级。2.提出不确定图数据库上概率top-k子图匹配查询的新问题、以及一种查询处理方法。首先给出不确定图数据模型,结合现实需求提出概率top-k子图匹配查询问题。一个顶点的邻居子图是由其距离不大于给定阈值内的所有顶点和边构成的子图。基于图结构空间相关性的特点,以附带概率信息的邻居子图为基础,设计一种有效的索引结构NG-Index。NG-Index索引可以很容易实现于成熟的关系数据库中,具有强健壮性。提出一种高效的基于搜索树的算法来进行查询处理。其中运用了一种概率剪枝技术来提高性能。最后通过实验考察并证实提出方法具有良好的效率和可扩展性。3.提出结合概念分层的图统计信息定义以及查询处理方法。具体地说,给出了结合顶点关联的概念分层,根据用户指定的搜索兴趣来高效地计算数据图中统计信息的方法。首先提出一种结合概念分层的图统计分布表示。本文将用户搜索兴趣建模为概念图,并以用户概念图的子图匹配计数为基础来表示图统计信息。其次,为了高效计算此统计分布信息,设计了一种基于子图密度的索引结构并提出两阶段的计算方法: (1)先基于索引快速地去除数据图中的不相关边并将数据图打散划分为若干小尺寸的连通图;(2)再对这些连通小图分别计算统计信息,最后合并得出结果。在连通小图上计算统计信息的核心是概念图的子图匹配计数问题。文中针对这个子问题着重提出两种高效算法:前向计算算法和后向计算算法。这种在精确计算之前将数据大图快速打散为多个小图的分治思想是总体效率提升的关键所在。最后,在真实数据集上的实验结果表明所提出方法具有良好的效率和可扩展性。4.提出了一种较大尺寸的标签图子图同构检测方法及其应用方法。所提出的检测方法是一种基于搜索的方法。本文从标签图的特性出发,以标签信息和图拓扑结构相结合的方式来缩减搜索空间。首先,将标签按照出现的频率比转换为数值。然后,将标签信息与结构相结合,来构造多组细粒度的顶点不变量。顶点不变量是关于顶点的固有属性,其在同构映射下保持不变。借助于所构造的细粒度的顶点不变量,将标签信息沿图拓扑结构传播开来,并缩减匹配顶点候选集来减小搜索空间。再次,基于顶点不变量生成了细粒度的剪枝条件。由于结合标签信息和拓扑结构,这些条件具有更强的剪枝能力。另外,将提出检测方法中的技术细节应用到第2章提出的GPTree结构上,来显示其可用来优化已有方法的适用性。最后实验结果表明,提出方法具有良好的高效性,同时应用新技术的GPTreeTest*算法效率优于原始方法GPTreeTest。

【Abstract】 Graph is a general data structure in computer science, and can be used for modelingcomplex relationships between data objects. For example, graphs can model chemicalcompound structures, protein interaction networks and social networks etc.. With therapid proliferation of graph data in various domains, graph data management has becomean important research subfield in the data management domain. Query processing ongraph databases is one of the most important and fundamental research topics in graphdata management, as a large number of graph-related tasks and many applications, suchas graph mining and the PubChem chemical database, rely on processing queries in anefficient way. In this dissertation, we focus on the techniques for processing querieson graphs databases, review the current research results, formulate several new researchproblems and propose some novel approaches for efficiently query processing. The maincontributions of the dissertation are as follows:1. We propose a novel approach for efficiently processing supergraph queries ongraph databases. The proposed new approach involves the techniques of (1) compactlyorganizing graph databases, (2) constructing index by using significant subgraph fea-tures and (3) processing queries based on the compact organization and the index. (1)For graph query processing, many existing methods use the filtering-and-verification (orF&V) methodology, i.e. first fast filter some negative results to obtain a candidate set andthen verify each candidate. For processing supergraph queries, the existing F&V-basedmethod arranges the graphs in databases one by one separately. We propose to mergesome common subgraphs shared by multiple graphs and compactly organizing graphdatabases, which derives a logical data structure, namely GPTree. In order to form a goodGPTree, an optimization problem of optimal induced subgraph selecting is formulated.After proving it is NP-hard, we propose an approximation algorithm with ratio bound 2.(2) For constructing effective indices, two algorithms for generating significant indexedsubgraph features, which are independent of historical queries, are proposed. Based onthe generated indexed subgraph features, two indices CRGraph and FGPForest are de-vised. In particular, by analyzing the effectiveness of a set of subgraphs, a significancemeasure w.r.t. subgraph is defined. An algorithm for finding all the subgraphs with no less than a user-specified significance threshold is presented. Also, another algorithm forextracting an approximate set of the significant subgraphs is proposed, for the purpose offast index creating that is required by some real-life applications. In order to upgrade theeffectiveness of index, based on the mathematical analysis, we propose a method for op-timally ordering the extracted indexed subgraphs. (3) An algorithm, called GPTreeTest,for testing subgraph isomorphisms from multiple graphs in a GPTree to a query graph isproposed. This new algorithm can utilize the information of common subgraphs recordedin a GPTree and reduce the overall computational cost for subgraph isomorphism test-ing. The experimental results on both real and synthetic datasets show the new approachoutperform the existing counterpart by one to two orders of magnitude.2. We present a new problem of probabilistic top-k subgraph matching query onuncertain graph databases, and propose a query processing method. Firstly, a uncertaingraph data model is presented. Based on the data model, the formulation of probabilis-tic top-k subgraph matching query is introduced. The neighborhood subgraph w.r.t. avertex in a graph is the subgraph consisting of all the vertices with the distance no morethan a threshold and all the edges between these vertices. By using the characteristics ofspace-locality of structures in a graph, based on neighborhood subgraphs, we propose aneffective index structure, namely NG-Index, with probabilistic information in a data un-certain graph. NG-Index can be easily implemented in relational databases. In addition,based on NG-Index, we present an efficient search-tree-based algorithm with probabilis-tic pruning technique to search large uncertain graphs. Experiment results validate theefficiency and scalability of the proposed techniques.3. We present a new problem of querying graph statistics in the presence of concepthierarchy. We formulate the user search interest by a concept graph. Given a concepthierarchy, a user-issued concept graph, a solution to the problem aims at efficiently evalu-ating the statistics of a given data graph. Firstly, we present a definition of graph statistics,which is based on the number of subgraph matchings of user-issued concept graphs. Then,we devise an index structure based on subgraph density, and propose a two-phase evalu-ation method. In the first phase, we fast break apart the original data graph into a set ofsmall-size connected graphs; in the second phase, each small graph is evaluated and thenthe results are merged together. The core sub-problem involved in the second phase iscomputing the number of subgraph matchings of concept graphs. For this sub-problem, we propose two algorithms, i.e. forward computation and backward computation. At last,the experimental results on real datasets show the elegance of the proposed method.4. We propose a new search-based method for testing subgraph isomorphisms formoderate-size labeled graphs; and after that we apply the proposed new techniques inthe GPTree data structure and propose GPTreeTest* algorithm. We analyze and utilizethe characteristics of labeled graphs, and reduce the search space by integratedly usingthe label and structure information. Firstly, labels are converted into numbers based onlabel frequency. Secondly, we record a set of fine-grained vertex invariants. Vertex invari-ants are some inherent properties of the vertices that do not change across isomorphismmappings. With the help of these fine-grained vertex invariants, label information is prop-agated through the structure in graphs, which reduces the overall search space. Thirdly,we derive pruning conditions from the fine-grained vertex invariants. These conditionsare quite strong since the label and structure information are combined together. Throughapplying the proposed new techniques in GPTree, we show the proposed new techniquesare able to improve a number of tasks. The experimental results show the high efficiencyof the proposed testing method and show that GPTreeTest* outperforms the original GP-TreeTest algorithm.

节点文献中: 

本文链接的文献网络图示:

本文的引文网络