节点文献

分布式环境下谱聚类算法研究

Spectral Clustering in Distributed Systems

【作者】 孟超

【导师】 崔晓燕;

【作者基本信息】 北京邮电大学 , 检测技术与自动化装置, 2013, 硕士

【摘要】 聚类分析是人类一项最基本的认识活动,是机器学习中的经典问题。所谓聚类就是按照事物的某些属性,把不同的事物聚集成类,使类间的相似性尽可能小,类内的相似性尽可能大。κ-means聚类算法作为一种基于中心的聚类算法,是一种比较经典和普遍的算法。当数据集为凸球型分布时,κ-means算法有很好的性能,能够得到较好的聚类结果。但是当样本空间不为凸时,κ-means算法往往会失效,而且算法利用迭代最优化方法求解最优解,因此算法会陷入局部最优解的情况。为了能在任意形状的样本空间上聚类,且能够收敛于全局最优解,近几年新出现了一种无监督的聚类算法即谱聚类算法克服了κ-means算法陷入局部最优解的缺点。该算法具有识别任意形状样本空间的能力,不会陷入局部最优解,能够很好的应用在实际问题中。但是应用在海量数据样本空间时,谱聚类算法会受到计算机存储空间不足和运行时间的限制,为了使算法能够在海量数据情况下使用,我们可以将该算法移植到分布式环境中,同时使用两种不同的方法将矩阵稀疏化,减小对内存空间的使用。本文重点是如何实现基于分布式环境下的高效谱聚类算法,具体内容包括:1.对相似矩阵进行稀疏化,同时比较两种不同的稀疏化方法的优劣。这里我们采用的两种稀疏化的方法有t最近邻方法和Nystrom方法,为了选择一种较优的方法,这里对两种方法从不同角度进行比较。最后通过实验验证我们发现使用t最近邻方法能够得到更好的聚类结果。2.利用以上由t最近邻来实现相似矩阵的稀疏化的方法,我们可以使用MPI模型和谷歌的Map/Reduce系统来搭建我们需求的分布式环境。之后将谱聚类算法移植到分布式平台上进行验证,结果表明,算法能够充分的利用各节点的资源,提高算法的运行效率,具有良好的扩展性,为海量数据的处理提供了很好的解决方案。

【Abstract】 Cluster analysis is a basic understanding activity in human life and is a classic problem in machine learning. The clustering algorithms gather different things into a class according to certain attribute of things, so that the similarity between the classes as small as possible, within the class as great as possible. K-means clustering algorithm is a classical and popular algorithm as a center-based clustering algorithm. When the data set distributed for the convex spherical, K-means algorithm has good performance and be able to get a better clustering results. However, when the sample space is convex, K-means algorithm often fails, and the algorithm using iterative optimization method for solving the optimal solution, so the algorithm will fall into the local optimal solution.Recent years, An unsupervised clustering algorithm that spectral clustering algorithm emerge for clustering in the sample space of any shape, overcome the K-means algorithm emerging into a local optimum the shortcomings of the solution, and can converge to the global optimal solution. The algorithm has the ability to identify the sample space of arbitrary shape, and will not fall into the local optimal solution. The application in practical problems is well for the algorithm. However, spectral clustring suffers from a problem in both memory use and computational time when the size of data set is large. To perform clustering on large data sets, We can transplanted the algorithm in a distributed environment, at the same time we use different sparse matrix methods to reduce the use of memory space.This article focuses on how to achieve efficient spectral clustering algorithm based on a distributed environment. Specific content including:1. Thinning similarity matrix using two different methods and compare two different sparse methods. The two methods are t nearest neighbor method and Nystrom method, in order to select a more excellent method; the two methods were compared from different angles. Finally, we found that the use of the t nearest neighbor method can get better clustering results verified by experiments.2. By the t nearest neighbor to the similarity matrix sparse, we can use the MPI model and Google’s MapReduce model to build a distributed environment. After that we transplant spectral clustering algorithm to distributed platforms. The results show that the algorithm is able to fully utilize the resources of each node to improve the operating efficiency of the algorithm and has good scalability. It provides a good solution for the application of massive data.

节点文献中: 

本文链接的文献网络图示:

本文的引文网络