节点文献

基于共享近邻的自适应谱聚类算法

Adaptive Spectral Clustering Based on Shared Nearest Neighbors

【作者】 李静伟

【导师】 张宪超;

【作者基本信息】 大连理工大学 , 计算机应用技术, 2010, 硕士

【摘要】 谱聚类作为一种新颖的聚类算法,近年来在模式识别领域受到广泛关注。它不对数据的全局结构作假设,而是通过直接求图的拉普拉斯矩阵的特征分解,获得聚类判据在放松了的连续域上的全局最优解。因此,它能在任意形状的样本空间上聚类,且收敛于全局最优。由于谱聚类算法直接基于相似度矩阵对应的拉普拉斯矩阵进行求解,因此相似度定义对谱聚类算法的性能有至关重要的影响。本文首先介绍了谱聚类算法涉及的数学基础知识,并从图划分和随机游走两个角度阐述了谱聚类算法的基本原理,然后对谱聚类中常用的计算相似度的函数——高斯核函数以及现有的相似度改进算法进行了详细的分析和研究。发现当两对数据点的距离相等,数据点邻域也类似时,同一簇中的两点应该比不同簇中的两点具有更高的相似度。但无论高斯核函数还是自调节谱聚类中使用局部邻域的相似度都不能满足该聚类假设。本文在总结已有相似度优缺点的基础上,提出基于共享近邻的自适应高斯核函数。它用两点的共享近邻表征局部密度,从而获知隐含的簇结构信息,并将这一信息与自调节的高斯核函数相结合,使中间有较多数据分布的两点具有更高的相似度。新的相似度矩阵满足聚类的两条假设,具有明显的块对角性,对应的谱聚类算法称为基于共享近邻的自适应谱聚类算法。最后,在若干具有挑战性的人工数据集和4个UCI真实数据集上将该算法和经典谱聚类算法以及自适应谱聚类算法进行了对比实验。实验结果表明该方法相对于经典谱聚类算法和自适应谱聚类算法,性能有明显提高,能有效识别数据点之间的内在联系,得到正确的聚类结果。

【Abstract】 Spectral clustering has become one of the most popular clustering algorithms in recent years. Traditional clustering methods often suppose the data coincide with certain distribution such as k-means. But spectral clustering method doesn’t make any assumption. It directly computes the leading eigenvectors of Laplacian matrix and gets an approximation to graph partitioning problems, so it is applicable to a wide range especially when the data set is non convex. As spectral clustering bases directly on the corresponding Laplacian matrix of the similarity matrix, similarity measurement is crucial to the performance of spectral clustering.In this paper it firstly gives an introduction to the mathematical objects used by spectral clustering and explains why spectral clustering algorithms work in two different views. One is graph partitioning and the other is from random walk perspective. Then the mostly used similarity measure, the Gaussian kernel function, and other similarity measures used in spectral clustering are analyzed. With the same Euclidean distance, two points in the same cluster should have higher similarity than two points in different clusters, this is the clustering assumption. Neither the traditional Gaussian kernel function nor the measure using the local scale parameter proposed in Self-Tuning satisfies the clustering assumption. Through analyzing of the strength and weakness of these measurements, we propose a novel similarity measure, namely adaptive Gaussian kernel function based on shared nearest neighbors in this paper. It can detect the intrinsic structure of the cluster embedded in the data sets through exploiting the information about local density embedded in the shared nearest neighbors and obtain higher similarity value when there are many common neighbors between the two data points. This measure holds the clustering assumption, and the affinity matrix has clear block structure. The corresponding spectral clustering method is called adaptive spectral clustering based on shared nearest neighbors. Finally, a number of experiments on some synthetic dataset and four real data sets which comes from the UCI data repository are carried out. Experimental results show that the proposed algorithm achieves considerable improvements over the typical spectral clustering algorithm and the Self-Tuning spectral clustering algorithm. It can reveal the relationship between the data points and find the real clusters.

节点文献中: 

本文链接的文献网络图示:

本文的引文网络