节点文献

基于图的数据挖掘算法研究

Research on Algorithm of Graph-Based Data Mining

【作者】 唐德权

【导师】 夏幼明;

【作者基本信息】 云南师范大学 , 计算机软件与理论, 2007, 硕士

【摘要】 与一般的数据比较,图能够表达更加丰富的语义,在科学研究和许多商业领域有着更为广泛的应用,如它可以描述世界万物之间错综复杂的联系,在社会网络分析中,人与人之间、人与物之间的联系是复杂的。通过抽象方法,可以将整个社会变成一个网络拓扑图,其中每个人可以是图中的节点,而人与人之间的联系则可以看作为图中的边。对社会人群的分析,自然就可以转化为对社会网络结构的挖掘。在生物技术领域,在生物学家发现蛋白质基因结构配位实验上频繁子图挖掘可以减轻结构匹配实验的代价。在Web挖掘、空间数据挖掘、药物分子式设计及其功能预测等领域也都有广泛应用。因此,对频繁子图的挖掘算法的研究有着重要的理论意义和应用价值。同时,这种丰富的语义也增加了数据结构的复杂性和挖掘令人感兴趣的图的子结构的难度。因此,需要综合应用图论知识与数据挖掘的各种技术。本文工作主要包括以下几部分:(1)在分析当前频繁子图挖掘定义的基础上结合支持度能反映数据库元素的共性和频繁度揭示了元素的个性特点,给出了基于支持度和频繁度的频繁子图挖掘定义;(2)基于子图同构理论和判断两个图同构的思想:如果两个图同构,那么它们的规范化编码一定相等。提出了一种有效进行图的规范化编码算法,从而避免子图同构NP完全问题带来的困难;(3)基于目前产生许多冗余候选子图的技术,提出了一种新的候选子图产生方法,通过连接和扩展操作产生所有候选子图,并且无冗余候选子图产生,从而可以正确计算候选子图的支持度,也减少了一些无效子图匹配问题;(4)引用嵌入集概念,基于候选子图的规范化邻接矩阵(CAM)在某个图的规范化邻接矩阵(CAM)的嵌入特征有效地计算候出选子图的支持度和频繁度;(5)基于图的规范化理论、候选子图的产生技术和候选子图支持度的计算方法,提出了频繁子图挖掘算法FSubgraphM,它能有效地从图数据库中挖掘频繁导出子图。实验研究表明,FSubgraphM能有效地从数据库carcinogen中挖掘其中的频繁导出子图结构,并根据频繁结构集提取有趣的关联知识,有着重要的理论指导意义和应用价值。本文解决了频繁子图挖掘算法中三个关键问题:(1)提出新的规范编码解决了判断一个图是否与另一个图同构,即子图同构问题;(2)提出新的连接和扩展操作算子有效解决了生成候选子图问题;(3)引入嵌入集概念,巧妙地结合连接和扩展操作计算嵌入集,解决了计算频繁子图问题。

【Abstract】 Graph can express more semantence compared with other data. It can be intuitively presented and has a wide variety of applications both in research and in business. Such as it may describe relations in the world intricate things. Because the relation of between person and person ,and the person and thing relation is complex in the social network. Through the abstract method, may turn the entire society into network topology graph and each person may be the node of graph, but between the person and person’s relation may regard as for the edge of graph. Therefore, analysis to the social crowd’s may transform to the mining of society network structure. In the biological technology domain, the biologist discovered frequent subgraph mining may reduce the price of structure match experiment in the protein gene structure match experiment. Frequent subgraph mining has the widespread application in the Web mining the space data mining, the medicine molecular formula design and it’s the function forecast and so on. Therefore, research on the frequent subgraph mining has the important significance of theory and the application value. Simultaneously, The rich semanteme enhance the complexity of data structure and increase the difficulty of mining interested sub-graph. So, we must imply the graph knowledge and data mining techology to solove this problem.Our contribution in this paper includes: (1) according to the particular that support reflects the common characteristic of the elements in a database, then frequent reveals the personality of the elements, we propose a novel definition of frequent subgraph mining based on the support and frequency subgraph mining definition, (2) base on the theory of subgraph isomorphism and the idea of determine two graph isomorphism: if two graph isomorphism, then canonical code must be equal, we propose a novel graph canonical form to determining graph isomorphism and avoid the NP-Complete problem of subgraph isomorphism, (3) base on the subgraph mining technology which generate redundancy subgraph, we propose two efficient candidate generation operations: FSubgraphM-Join and FSubgraphM-Extension that can significantly reduce the invalid graph matchings during the counting,(4) Can efficient enumerated candidate subgraph’s supporty and frequency by maintaining an embedding set, which base on the features that one CAM is another graph’s CAM;(5) base on the theory of canonical graph, candidate subgraph generate techlogy and method of support, we propose algorithm FSubgraphM which employs the above techniques to mine frequent induced subgraph from a graph data sets.Performance study indicates that FSubgraphM can effectively discovery the frequent induced subgraph from the database carcinogen, it also can form some interesting association rules from the frequent subgraphs which has some theoretical and practical significance.In this paper solve three key problem that frequent subgraph mining: (1) solve subgrph isomorphism problem by novel canonical coding to determine one graph whether isomorphism to another graph; (2) solve generate candidate subgraph problem by propose novel join and extension operator; (3) solve counting frequent subgraph problem by embedding sets, apply skill of combine join and extension calculate embedding sets.

  • 【分类号】TP311.13
  • 【被引频次】1
  • 【下载频次】244
节点文献中: 

本文链接的文献网络图示:

本文的引文网络