高维数据上的聚类方法研究

Research of Clustering Methods on High Dimensional Data

【Author】 任亚洲 (Yazhou Ren)

【Supervisor】 张国基 (Guoji Zhang)

【Author Information】 South China University of Technology, Computer Application Technology, 2014, PhD

【摘要 (Abstract)】 In recent years, with the rapid development of information technology and the continuous growth of dataset sizes, effectively analyzing such massive data and extracting useful information from them has become both a hot topic and a difficult problem. Cluster analysis, an unsupervised machine learning method, has received increasing attention and developed rapidly; it has been widely applied in important fields such as bioinformatics, Internet technology, and image analysis. Ordinary clustering algorithms achieve good results on low-dimensional data but run into the "curse of dimensionality" on high-dimensional data: as dimensionality grows, the data become sparse, distances between samples become nearly indistinguishable, and noisy and redundant features multiply, all of which can greatly reduce the effectiveness of clustering algorithms. Research on clustering algorithms for high-dimensional data has therefore become a key and difficult topic in machine learning. Moreover, many clustering algorithms place strong constraints on the data, such as restrictions on the number and shape of clusters, which often cannot be satisfied in practice, so designing effective "parameter-free" (nonparametric) clustering algorithms is equally important. Taking clustering on high-dimensional data as its main thread, this thesis combines ensemble techniques, Boosting, and related tools to study the problem in depth and proposes several new clustering algorithms. The main contributions are:

(1) Ensemble clustering exploits multiple clustering results to improve the stability and accuracy of the final clustering. Many ensemble clustering algorithms have been proposed in recent years, but the vast majority treat every base clustering, every object, and every cluster equally. Some algorithms use weights of clusters or of base clusterings during the consensus process, yet no prior work has considered object-level weights at this finer granularity. To fill this gap, the thesis proposes Weighted-Object Ensemble Clustering (WOEC). WOEC first uses the co-association matrix to estimate how difficult each object is to cluster and assigns each object a corresponding weight (see the illustrative sketch after this abstract). Three consensus methods are proposed to leverage the object weights, all of which reduce the ensemble clustering problem to a graph partitioning problem. Extensive experiments demonstrate the superiority of WOEC and its robustness to parameter settings.

(2) Mean shift is a "parameter-free" clustering method that requires neither the number of clusters nor their shapes. It performs density estimation for each point and repeatedly moves the point in the direction of greatest density increase within its neighborhood until convergence; all points that converge to the same mode are assigned to the same cluster. In practice, however, the sparsity of high-dimensional data and the presence of noisy features greatly reduce the effectiveness of mean shift. To address this, the thesis proposes a Weighted Adaptive Mean Shift clustering algorithm (WAMS). WAMS first analyzes the subspace in which each sample point lies and injects this information into the mean shift procedure, thereby avoiding distance computations in the original full-dimensional space. WAMS handles high-dimensional data effectively while preserving the nonparametric nature of mean shift, and random sampling can be used to speed it up without sacrificing accuracy. Experiments on a large number of synthetic and real datasets demonstrate the effectiveness of WAMS.

(3) Another drawback of mean shift is its sensitivity to the choice of the bandwidth parameter; it also cannot handle clusters that contain multiple modes. DBSCAN, another popular density-based clustering algorithm, is likewise sensitive to its parameters and tends to merge overlapping clusters. To overcome these drawbacks, the thesis proposes Boosted Mean Shift Clustering (BMSC). BMSC partitions the original data with a grid and runs mean shift locally in each grid cell, so that each cell yields a set of intermediate modes (iModes). A mode-boosting technique is proposed to iteratively select samples in denser regions, and DBSCAN is used to partition all the obtained iModes. Complexity analysis shows that BMSC has the potential to handle large-scale data, and experiments confirm its effectiveness and robustness.
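The object-weighting step of WOEC is described above only at a high level. The Python sketch below illustrates one plausible reading of that step, assuming k-means base clusterings, a simple ambiguity-based weight derived from the co-association matrix, and spectral partitioning of a weight-scaled co-association graph; the weight formula and the consensus step are illustrative stand-ins, not the thesis's actual definitions.

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering

def co_association(base_labels):
    """Co-association matrix: fraction of base clusterings that place i and j together."""
    n_objects = base_labels.shape[1]
    ca = np.zeros((n_objects, n_objects))
    for labels in base_labels:
        ca += (labels[:, None] == labels[None, :]).astype(float)
    return ca / base_labels.shape[0]

def object_weights(ca):
    """Hypothetical difficulty score: co-association values near 0.5 mean the base
    clusterings disagree about a pair, so objects with many such pairs get more weight."""
    ambiguity = 4.0 * ca * (1.0 - ca)            # peaks where ca == 0.5
    w = ambiguity.mean(axis=1) + 1e-12
    return len(w) * w / w.sum()                  # normalize so the average weight is 1

# toy usage: 10 k-means base clusterings of random data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
base = np.array([KMeans(n_clusters=3, n_init=5, random_state=s).fit_predict(X)
                 for s in range(10)])
ca = co_association(base)
w = object_weights(ca)

# one simple consensus: spectral partitioning of the weight-scaled co-association graph
affinity = ca * np.sqrt(np.outer(w, w))
consensus = SpectralClustering(n_clusters=3, affinity="precomputed",
                               random_state=0).fit_predict(affinity)
```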

【Abstract】 In recent years, with the rapid development of information technology and the sharp increase in data sizes, how to effectively analyze these massive amounts of data and extract useful information has become one of the hot and difficult research topics. Cluster analysis, an unsupervised machine learning method, has attracted increasing attention and developed rapidly. It has been used in bioinformatics, Internet technology, image analysis, and other fields. Most clustering algorithms achieve good performance on low-dimensional data but suffer from the "curse of dimensionality" when dealing with high-dimensional data. As dimensionality increases, the data become sparse, distances among objects tend toward the same value, and noisy and redundant features increase as well, thereby reducing the effectiveness of clustering algorithms. Research on clustering methods for high-dimensional data has therefore become a popular and difficult research area. Meanwhile, many clustering algorithms impose constraints on the data, such as restrictions on the number of clusters and on their shapes, which often cannot be satisfied in real-world tasks, so designing effective nonparametric clustering techniques is very important as well. To address the problems associated with clustering high-dimensional data, we make use of ensemble techniques, Boosting, and related tools, and propose several novel clustering methods. The main contributions of the thesis are summarized as follows:

(1) Ensemble clustering aims to generate a stable and robust clustering through the consolidation of multiple base clusterings. In recent years many ensemble clustering methods have been proposed, most of which treat each clustering and each object as equally important. Some approaches make use of weights associated with clusters, or with clusterings, when assembling the different base clusterings. However, not much effort has been put towards incorporating weighted objects into the consensus process. To fill this gap, in this thesis we propose an approach called Weighted-Object Ensemble Clustering (WOEC). We first estimate how difficult it is to cluster an object by constructing the co-association matrix that summarizes the base clustering results, and we then embed the corresponding information as weights associated with objects. We propose three different consensus techniques to leverage the weighted objects, all of which reduce the ensemble clustering problem to a graph partitioning one. We present extensive experimental results which demonstrate that WOEC outperforms state-of-the-art consensus clustering methods and is robust to parameter settings.

(2) The mean shift algorithm is a nonparametric clustering technique that makes no assumptions on the number of clusters or on their shapes. It achieves this by performing kernel density estimation and iteratively locating the local maxima of the kernel mixture. The set of points that converge to the same mode defines a cluster. While appealing, the performance of the mean shift algorithm deteriorates significantly on high-dimensional data due to the sparsity of the input space, and noisy features can also mislead the mean shift procedure. We extend the mean shift algorithm to overcome these limitations while maintaining its desirable properties. To achieve this goal, we first estimate the relevant subspace for each data point and then embed such information within the mean shift algorithm, thus avoiding computing distances in the full-dimensional input space. The resulting approach (WAMS) achieves the best of both worlds: effective handling of high-dimensional data and noisy features while preserving a nonparametric nature. Our approach can also be combined with random sampling to speed up the clustering process on large-scale data without sacrificing accuracy. Extensive experimental results on both synthetic and real-world data demonstrate the effectiveness of the proposed method.
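The abstract says only that WAMS estimates a relevant subspace for every point and embeds it into the mean shift iteration; the estimator itself is not given here. The sketch below shows a generic flat-kernel mean shift in which each point carries its own feature-weight vector, so distances are measured in a weighted (soft-subspace) sense. The inverse-local-variance weighting is an assumption made for illustration, not the WAMS formulation.

```python
import numpy as np

def feature_weights(X, bandwidth):
    """Hypothetical per-point feature weights: features with small local variance
    around a point are treated as more relevant (NOT the thesis's estimator)."""
    n, d = X.shape
    W = np.ones((n, d))
    for i in range(n):
        dist = np.linalg.norm(X - X[i], axis=1)
        nbrs = X[dist < bandwidth]
        if len(nbrs) > 1:
            inv_var = 1.0 / (nbrs.var(axis=0) + 1e-8)
            W[i] = d * inv_var / inv_var.sum()   # weights of each point sum to d
    return W

def weighted_mean_shift(X, bandwidth, W, n_iter=100, tol=1e-4):
    """Flat-kernel mean shift where the neighborhood of point i is defined by the
    weighted distance induced by W[i]; points converging to the same mode form a cluster."""
    modes = X.astype(float).copy()
    for i in range(len(X)):
        m = X[i].astype(float).copy()
        for _ in range(n_iter):
            d2 = ((X - m) ** 2 * W[i]).sum(axis=1)   # weighted squared distances
            nbrs = X[d2 < bandwidth ** 2]
            if len(nbrs) == 0:
                break
            new_m = nbrs.mean(axis=0)
            if np.linalg.norm(new_m - m) < tol:
                m = new_m
                break
            m = new_m
        modes[i] = m
    return modes

# toy usage: three well-separated Gaussian blobs in 20 dimensions
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(60, 20)) for c in (0.0, 3.0, 6.0)])
W = feature_weights(X, bandwidth=3.0)
modes = weighted_mean_shift(X, bandwidth=3.0, W=W)
```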
(3) Mean shift is a nonparametric clustering technique that does not require the number of clusters as input and can find clusters of arbitrary shapes. While appealing, the performance of the mean shift algorithm is sensitive to the selection of the bandwidth, and it can fail to capture the correct clustering structure when multiple modes exist within one cluster. DBSCAN is an efficient density-based clustering algorithm, but it is also sensitive to its parameters and typically merges overlapping clusters. In this thesis we propose Boosted Mean Shift Clustering (BMSC) to address these issues. BMSC partitions the data across a grid and applies mean shift locally on the cells of the grid, each providing a number of intermediate modes (iModes). A mode-boosting technique is proposed to select points in denser regions iteratively, and DBSCAN is utilized to partition the obtained iModes. Complexity analysis shows BMSC's potential to deal with large-scale data, and extensive experimental results on both synthetic and real benchmark data demonstrate its effectiveness and robustness to parameter settings.
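The BMSC pipeline in (3) combines standard building blocks: mean shift run locally per grid cell, then DBSCAN over the collected intermediate modes. The sketch below wires those blocks together with scikit-learn on a 2-D toy dataset; the grid construction, the cell-size threshold, and the final nearest-iMode label assignment are simplifying assumptions, and the mode-boosting iterations are omitted entirely.

```python
import numpy as np
from sklearn.cluster import MeanShift, DBSCAN
from sklearn.datasets import make_blobs

def bmsc_like(X, n_grid=3, bandwidth=None, eps=1.5, min_samples=2):
    """Simplified BMSC-style pipeline (no mode-boosting): run mean shift inside each
    grid cell, collect the intermediate modes (iModes), cluster the iModes with
    DBSCAN, then give every point the label of its nearest iMode."""
    # 1) partition the data with a regular n_grid-per-feature grid (fine for 2-D toys)
    cell_id = np.zeros(len(X), dtype=int)
    for j in range(X.shape[1]):
        edges = np.linspace(X[:, j].min(), X[:, j].max(), n_grid + 1)
        idx = np.clip(np.digitize(X[:, j], edges[1:-1]), 0, n_grid - 1)
        cell_id = cell_id * n_grid + idx

    # 2) run mean shift locally in every sufficiently populated cell, collect iModes
    imodes = []
    for c in np.unique(cell_id):
        pts = X[cell_id == c]
        if len(pts) < 10:                        # skip nearly empty cells
            continue
        ms = MeanShift(bandwidth=bandwidth).fit(pts)
        imodes.append(ms.cluster_centers_)
    imodes = np.vstack(imodes)

    # 3) cluster the iModes with DBSCAN (label -1 marks iModes considered noise)
    imode_labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(imodes)

    # 4) each original point inherits the label of its nearest iMode
    nearest = np.argmin(((X[:, None, :] - imodes[None, :, :]) ** 2).sum(axis=-1), axis=1)
    return imode_labels[nearest]

# toy usage
X, _ = make_blobs(n_samples=600, centers=4, cluster_std=0.8, random_state=0)
labels = bmsc_like(X)
```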
