节点文献

动态图数据挖掘与查询算法的研究

Research on Mining and Query Processing Algorithms on Dynamic Graphs

【作者】 杨雅君

【导师】 高宏;

【作者基本信息】 哈尔滨工业大学 , 计算机软件与理论, 2013, 博士

【摘要】 随着信息科技的高速发展,各个应用领域的的信息量都呈现了爆炸性的增长趋势。因此,不同应用领域所关注的实体对象之间的关系也变得愈加庞大和复杂。在这些复杂关系的背后,往往蕴含着巨大的科学价值和商业价值。近年来,许多领域的研究者都开始专注于实体对象之间关系的研究。“图”作为计算机科学中一般性的数据结构,它可以很好的反映数据对象之间的复杂关系。因此,图数据被广泛地用来刻画现实世界中各种各样实体间的复杂关系。然而,在现实世界中,实体对象间的关系每时每刻都在发生着变化。例如:在蛋白质交互网络中,各类蛋白质分子相互作用有着先后顺序,这使得整个蛋白质交互过程要随着时间顺序进行;在社会关系网络中,各类人和各类社会团体之间的交互关系会随着时间发生变化;在交通网络中,通过每条公路的时间和费用因交通拥塞的发生而随着时间发生变化。因此,描述实体间关系的图数据也会随着时间发生变化。动态图是指会随时间发生变化的图数据。动态图可以根据其变化的形式分为两类:(1)图中拓扑结构关系发生变化;(2)图中顶点和边所代表数据对象内容、或者图中某一特定对象的评价方式发生变化。我们分别称之为图的结构变化和图的内容变化。由于动态图在现实世界中广泛存在,因此对动态图上各类问题的研究就有着十分重要的意义。然而,传统静态图数据上解决各类问题的算法无法用于解决动态图数据上的各类问题,主要有以下几个原因:第一,传统的静态图模型无法描述图数据随时间发生演绎与进化的情况;第二,传统的静态图模型上的各类问题的定义在动态图模型上不再适用;第三,图数据的动态性使得各类问题的计算复杂性大大增加,原有的静态图上的各类方法将无法有效地解决这些问题。此外,现有的动态数据管理的研究主要关注于数据流和传统关系数据的动态维护,这些方法无法处理结构复杂的图数据。本文运用计算复杂性和算法学的理论和知识,分别对基于结构变化和基于内容变化的动态图上的相关问题开展了研究工作,主要研究成果如下:(1)本文研究了动态图上结构变化最频繁子图挖掘问题,提出了一个基于“累积连通度变化”概念的评分函数来评价子图的结构变化的频繁程度。本文提出一个两段式算法来计算动态图中任意一对顶点之间的累积连通度变化。进一步地,本文提出一个基于“划分树”结构的算法挖掘目标子图,该算法逐步地将不属于目标子图中的顶点从全图中移除。本文分析了算法的时间复杂度和空间复杂度。真实数据上的实验结果表明,本文算法所挖掘到的子图在整个动态图中变化最频繁。同时,本文算法具有很高的的时间效率。(2)本文研究了动态图上连通度变化最大的k-顶点集挖掘问题,提出了一种名为“连通等价类树”的结构,利用该结构可以高效地计算图中所有顶点对之间的连通度变化。本文提出了一个高效的连通等价类树构建算法和更新算法。其中,更新算法是一个逐边增量式算法,当动态图发生变化时,该算法只需要更新图中一小部分顶点之间的连通度即可完成连通等价类树的更新。本文证明,在相同的空间复杂度下,本文提出连通等价类树更新算法的时间复杂度远远小于现有最快的连通度更新算法。进一步地,本文提出一种分支界限算法和三种高效的剪枝策略用于挖掘连通度变化最大的k-顶点集。真实数据上的实验结果表明,本文算法所挖掘到的顶点子集可以良好的反映整张动态图中连通度变化情况,因此,其可以作为动态图连通关系变化的一个概括。同时,实验结果表明,本文中算法具有很高的时间效率。(3)本文研究了多维代价图上动态最优路径查询问题。本文证明了多维代价图上动态最优路径查询问题是一个NP-难问题,并提出了一个分支界限算法和三种高效的剪枝策略用于计算动态最优路径。进一步地,本文提出了一种新的索引结构,k-簇索引,使得本文算法在大规模图上是时间高效和空间高效的。k-簇索引利用“contour skyline”加速查询进程。本文证明计算“contourskyline”是一个NP-难的问题,并提出一个2-近似的算法。本文证明了不存在近似比小于2的多项式时间算法。同时,本文提出一个顶点过滤算法,该算法可将图中大部分不属于最优路径的顶点过滤掉,进一步提高了算法效率。真实数据上的实验结果表明,本文算法具有非常高的时间效率和空间效率。(4)本文研究了时间依赖图上满足时间限制的费用代价最优路径查询问题,提出了一个两阶段算法计算满足时间限制的费用代价最优路径。算法在第一阶段计算了起点到终点满足时间限制的最小费用代价,在第二阶段根据最小费用代价计算了起点到终点的最优路径。本文分析了查询算法的时间复杂度和空间复杂度。真实数据上的实验结果表明,本文提出的查询算法具有非常高的时间效率和空间效率。

【Abstract】 With the development of information technology, the amount of information in eachapplication field presents explosive growth trend. The relationships among various enti-ties become more and more complex and there are the huge scientific value and commer-cial value behind them. Recently, people study the relationships among entities. Graph,as a general data structure in computer science, is a good way to describe the relationship-s among entities. Graphs have been widely used to model complex relationships amongvarious entities in real applications.However, the relationships among entities often evolve over time. For example, inprotein interaction networks, proteins interact with each other in order, thus the relation-ships of interaction evolve over time. In social networks, the communication relationshipsamong people and communities also change with time. In addition, in road networks, thetime and toll fee through a road are time dependent. Therefore, graphs often evolve overtime.Dynamic graphs are the graphs that evolve over time. They can be divided into t-wo categories:(1) the topologies of graphs change with time; and (2) the data contentrepresented by vertices or edges changes with time or the evaluation model for a specif-ic data object in graph changes with time. We call them topology change and contentchange respectively. Since there are many dynamic graphs in real applications, it is veryimportant to study the problems on dynamic graphs. However, the existing methods forstatic graphs cannot solve the problems on dynamic graphs. The reasons are as follows.First, the static graph model cannot describe the changes in dynamic graphs. Second,the problem definitions on static graphs are not suitable for dynamic graphs. Third, theinherent computational complexity of problems on dynamic graphs is extremely high, theexisting methods on static graphs cannot solve these problems efficiently and effectively.In addition, the existing methods about dynamic data focus on data stream and relationaldata maintenance, these methods cannot handle the complex graph data.This thesis investigates the problems for topology dynamic graphs and content dy-namic graphs respectively, utilizing the knowledge of computational complexity theoryand algorithmic technology. The main contributions of this thesis are as follows. (1) This thesis studies the problem of mining most frequently change componentfrom evolving graphs. An objective function based on “cumulated connectivity change”is proposed to measure the change frequency for a subgraph. A two-way algorithm isproposed to compute the cumulated connectivity change when graph evolves from a snap-shot to the next snapshot. Furthermore, an algorithm based on partition tree is proposedto mine the objective subgraph by removing the vertices not in this subgraph one by one.This thesis analyzes the time and space complexity for algorithms. The experimental re-sults on real-life datasets state the objective subgraph found by algorithm changes mostfrequently in the whole evolving graph and the algorithm is efficient and effective.(2) This thesis studies the problem of mining k-vertex set with the maximum sumof connectivity change among vertices in this set. A new structure named “connectivitytree” is proposed, it can maintain the connectivity information efficiently for every pairof vertices in graphs. An algorithm is proposed to construct connectivity tree. Further-more, an efficient algorithm is proposed to update connectivity tree. This algorithm is anedge incremental algorithm. When graphs are evolving, by this algorithm, there is onlya small fraction of vertices among which the connectivity needs to be updated. A branchand bound algorithm with three efficient pruning rules is proposed to mine k-vertex set.The experimental results on real-life datasets confirm the k-vertex set can reflect and cap-ture the connectivity change in the whole evolving graphs effectively. Therefore, it canserve as a summary of connectivity change of dynamic graphs. The experimental resultsconfirm the efficiency and effectiveness of algorithms.(3) This thesis studies the problem of dynamic optimal path over multi-cost graphs.The problem is proved to be NP-hard in this thesis and it cannot be solved by the exist-ing shortest path algorithms. A branch and bound algorithm with three efficient pruningrules is proposed to compute the optimal path. Furthermore, a novel “k-cluster index” isproposed to make our algorithm more efficient on large graphs. k-cluster index utilizes“contour skyline” to facilitate the optimal path query processing. The problem to com-pute contour skyline is proved to be NP-hard in this thesis. A2-approximate algorithm isproposed to compute contour skyline and we show there is no (2)-algorithm in polyno-mial time. Finally, a vertex-filter algorithm is proposed to enhance the efficiency, whichcan filter a large fraction of vertices not in the optimal path. The experimental results onreal-life datasets confirm the efficiency of algorithm.(4) This thesis studies the problem of the cost-optimal path with time constraint over time-dependent graphs. A two-step algorithm is proposed to find the cost optimalpath with time constraint. In the first step, algorithm computes the minimum cost fromsource to destination. In the second step, algorithm finds the optimal path from sourceto destination. This thesis analyzes the time and space complexity for algorithm. Theexperimental results on real-life datasets confirm the efficiency of algorithm.

节点文献中: