节点文献
一种并行分层聚类算法的研究和实现
The Research and Realizing of Parallel Cluster Analysis Algorithm
【作者】 李朝健;
【导师】 肖建华;
【作者基本信息】 湘潭大学 , 计算机技术, 2007, 硕士
【摘要】 聚类分析是数据挖掘的重要研究领域之一,在工程、商业、生命科学、社会科学以及其他许多领域得到了广泛的应用。但由于聚类对象在高维特征空间分布的复杂性,聚类效果评价的不确定性和灵活性,以及聚类作为一个优化问题求解的高计算复杂性,聚类算法仍然面临着众多的问题和挑战。目前,研究者提出了大量的聚类算法。其中层次聚类算法是其中的主要方法之一,受到了大量学者的密切关注。目前最好的串行算法的时间复杂性可达到O(n2),但依然难于处理生物信息学或入侵检测中的海量数据;并行算法目前多基于CREW-PRAM或CRCW-PRAM模型,其运行成本不低于O(n2)。这些算法多使用随机或概率算法,而且算法中的处理器数目无法根据运行环境改变,也没有考虑各并行处理器对共享存储器的存储冲突。本文通过利用完全图求欧几里德最小生成树算法和无存储冲突的连通分支确定算法,提出一种基于EREW-SIMD共享存储模型的无存储冲突并行层次聚类算法,其成本为O(n2)。通过与其他算法性能比较,比较结果说明本文提出的算法在保证存储无冲突的前提下,能以最优的成本在最弱的PRAM—EREW模型运行,且处理器可根据实际可用的计算资源动态调整。为了验证本文算法的性能,利用基准测试数据集在高性能计算中心的IBM P690机器上进行了计算实验。实验结果表明:本文算法在计算时间和空间上具有一定的比较优势,对大规模数据集具有较强的可扩展性。
【Abstract】 Clustering analysis is one of the important areas in the data mining research. Especially, it is applicable in many fields, such as engineering, business, biology, social sciences and others. However, because of the complexity of the distribution of clustering object at high dimensional feature space, the uncertainty and the flexibility of the evaluation of clustering results and high computing complexity of cluster problem as an optimized problem, clustering algorithm still are confronted with lots of problems and challenges.At present, researchers put forward lots of clustering algorithms. Therein, hierarchical clustering algorithms as a main kind, are paid much attention. So far time complexity of the best sequential algorithms is up to O(n2), but huge data in biological information or intrusion detection is hard for them to process; most of the parallel algorithms use the CRCW-PRAM or CREW-PRAM models of computing, the cost of these algorithms are O(n2) or larger, most are randomized or probability algorithms, unable to automatically adapt to the number of processors, and not take account of sharing memory conflicts. This paper proposed a new parallel adaptive deterministic algorithm without memory conflicts based on EREW-SIMD shared memory model by use of complete graph and Euclidean minimum spanning tree, with cost of O(n2) . Through performance comparison with other algorithms, it is demonstrated that on guarantee of no memory conflicts, the algorithm can run on the weakest PRAM-EREW model at an optimal cost, and automatically adapt to the change of the number of processors.To validate the performance and extensibility of our algorithm in practice, we made several experiments and simulations based on the synthetic and benchmark data sets from Internet on IBM P690 of high performance computing centre. The results showed that our algorithm has a great superiority in both computing time and space, and consequently a stronger adaptability and operability for large scale data sets.
【Key words】 Hierarchical clustering; Memory conflicts; Parallel algorithms; EREW-SIMD; High performance computing;