节点文献

聚类CLIQUE算法及其并行化研究

CLIQUE Algorithm and Parallelize

【作者】 祖峰

【导师】 熊忠阳;

【作者基本信息】 重庆大学 , 计算机系统结构, 2003, 硕士

【摘要】 数据挖掘是帮助人们在海量数据中发现信息和知识的工具。近年来数据挖掘技术成了商业智能的核心技术,被广泛应用到了诸多领域,引起了学术界极大的关注。聚类分析是数据挖掘中的一个重要研究领域,它从数据库中寻找数据间的相似性,从而优化大规模数据库的查询和发现数据中隐含的有用信息或知识。如何进行快速聚类以及如何取得更好的聚类结果成了聚类数据挖掘算法研究的重点和难点。CLIQUE算法综合了基于密度和基于网格的聚类方法,它有着速度快的优点。但是由于方法太简化,可能会降低聚类结果的精确性。通过深入的研究和分析,发现由于CLIQUE算法没有考虑到如何利用当前挖掘数据的特性,而是进行一种硬性的网格划分,因此增加了计算复杂程度,而为了降低计算的复杂程度就只能降低聚类结果的精确性。针对上述问题论文引入了自适应的网格划分方法,通过在一维的情况下预先分割区间,然后找出密集分割区间并对分界进行调整来得到密集区间,最后把这些密集区间作为划分网格的依据。这种划分网格的方法很好地利用了当前要挖掘的数据的特性,同时减少了网格的数量以及密集单元候选集的数目,大幅度减少了计算的复杂程度,从而使得在每个子空间进行计算成为了现实,也大大提高了聚类结果的精确性,但算法的时间复杂度仍是指数级的。只是这个指数是维数,使得算法的时间复杂度比起很多聚类算法的仍然简单很多。为了进一步提高算法的执行效率,论文还对并行CLIQUE算法进行了研究。选用通过商用网络连接起来的PC机,以及并行虚拟机PVM和分布式操作系统LINUX,共同构成了一个机群系统作为并行计算平台。在并行程序的模型上选用了Master/Slave模型。该并行算法将数据集分配到各个节点机上实现了数据并行,在数据并行的基础上,当生成密集单元候选集以及验证密集单元的时候又采取了任务并行的方法。由于主体是数据并行,因此达到了接近线性的加速比。每个节点计算任务的时间复杂度由两部分构成,一部分是指数级的验证密集单元的时间复杂度,另一部分是线性的通信时间复杂度。最后,通过实验验证了并行CLIQUE算法的可行性,从实验中得到的并行算法的加速比与理论分析结果一致。实验表明,并行CLIQUE算法在提高了聚类挖掘结果精确度的同时达到了较高的效率,同时由于算法是基于PVM的机群系统开发的,因此算法的通用性较强。

【Abstract】 Data mining technology is used to help people finding the information and knowledge in the data. It has become the core technology of the intelligence commerce. It has been widely used in many areas and drawn the attention of the whole academe. Clustering is one of the most important areas in data mining Clustering finds the similarity among the data and use it to optimal the query of the large scale databases and find the hidden useful information and knowledge. How to make the clustering faster and the result of the clustering more accurate is of the most importance and hardness.CLIQUE is integrated density-based and grid-based method. It has the advantage of faster speed. But due to simplify the procedure, the accuracy of the clustering may be degraded. After deeply investigate and analysis, we found the drawback of CLIQUE lies in its inconsideration of the characteristic of the data being processed. It grid the data into a predefined grid and this adds up to the complexity of the computation. Then it has to degrade the accuracy of the result to degrade the complexity of computation,. We introduce adaptive-grid method to settle this problem. We divide each dimension into a fix interval and join the dense interval to dense part. At the boundary of the each dense part, boundary is adjusted by dividing a smaller interval. Finally the adaptive-grid is produced according to the dense part. This method makes full use of the characteristic of the data being processed. The number of dense unit and candidate dense unit is great reduced. At the same time the complexity of the computation is greatly decreased. So, computation in each dimension is feasible. This make the accuracy of cluster upgraded. But the computation complexity of the algorithm is still exponential. Due to the fact the exponent is dimension, the complexity of algorithm is still less than other clustering algorithms.To make the algorithms more efficient, it was parallelized. The hardware platform is PC connected with LAN. The software platform is PVM and LINUX. They construct the whole PC-cluster system. The parallel program model is master/slave model. The algorithm assign data set to each node realizes the data-parallel. When produce dense unit, task-parallel is used. Due to the fact the algorithm is complete data-parallel; the speedup of the algorithm is nearly liner. The time complexity of the each node is composes of exponential computation time and liner communication time. At last, the experiment proves the feasibility of the algorithm and the speedup gets <WP=6>from the experiment is in accord with theoretical one. The experiment also proves the parallel algorithm upgrade the accuracy of the clustering result combined with more efficient. Because the algorithm is based on PVM cluster, it is more popular.

【关键词】 数据挖掘聚类并行算法工作站机群
【Key words】 Data miningclusteringparallel algorithmNOW
  • 【网络出版投稿人】 重庆大学
  • 【网络出版年期】2004年 01期
  • 【分类号】TP311.13
  • 【被引频次】3
  • 【下载频次】382
节点文献中: 

本文链接的文献网络图示:

本文的引文网络