节点文献

聚类算法的维度分析

Clustering Algorithms Analysis on Data Dimension

【作者】 张强

【导师】 赵政;

【作者基本信息】 天津大学 , 计算机应用技术, 2007, 博士

【摘要】 本文主要研究不同维度条件下聚类分析的特点、需求以及相应的对策和解决方法。针对低维度、高维度全空间和高维度子空间聚类这三个问题分别提出了新的算法。在低维度聚类方面,提出了GDMS聚类算法。算法的主要贡献是:1、提出了探针窗口过滤法来检测数据的分布特性,通过选取不同的滤波函数得到不同密度、不同属性的聚类簇,通过选择探针的不同运动方式,实现精度和效率的统一。2、提出了一个新的数学形态学算子来提取聚类簇,算子的精度优于以往使用的开、闭算子。3、将尺度空间理论和形态学相结合,聚类结果是一个多尺度的、层次化的结构。4、算法支持含障碍物的聚类。算法的特点是:计算复杂度与数据量成线性关系;能够发现任意形状的聚类;对噪声不敏感;算法对网格尺寸有一定的适应性;能够区分不同密度的聚类簇;能够区分特定属性聚类簇;层次化的聚类结果有利于用户的理解、解释。在高维度全空间聚类方面,提出了MDCLUS、IMDCLUS和PMDCLUS算法以提高聚类速度。1、采用蒙特卡络法获取核心对象,降低了聚类的运算量。定量地给出了抽样率的最小估计值,以避免小聚类簇的丢失和大聚类簇的断裂。提出了标签散列法合并聚类簇,合并的计算量与数据量成线性关系。2、实现了增量聚类。3、实现了分布式并行化处理。算法的特点是:能够发现任意形状的聚类簇;对噪声不敏感;与DBSCAN算法相比速度明显提高;运算量与维度成线性关系;能够在局域网中的多台计算机上以分布式方式同时聚类;支持增量聚类,速度相对于重新聚类有大幅度提升。在高维度子空间聚类方面,提出了活跃空间和活跃网格的算法。主要贡献有:1、证明了聚类簇区域的密度、连通性、覆盖度都具有向下封闭性。2、提出了自上而下的搜索方法。3、提出了基于活跃轴数量的噪声过滤法。4、在网格大小固定的基础上扩展为网格大小自适应。5、实现了分布式的并行化聚类。6、提出了以层次化的树形结构组织聚类子空间和聚类簇的方法。算法的主要特点有:既能发现全空间聚类簇也能发现子空间聚类簇;算法的计算量与数据对象个数、数据空间维度数以及聚类簇维度数分别近似成线性关系;算法的抗噪声能力强;能够在多台计算机上分布式地处理聚类;聚类结果有利于用户的理解和解释;算法既能发现相斥型聚类簇,也能发现相容型聚类簇。

【Abstract】 In this paper, the characters and requirements of clustering algorithms on different data dimensions are analyzed. We give out our resolutions and algorithms for low dimensional clustering, high dimensional clustering, and subspace clustering.We design an algorithm called GDMS to resolve low dimensional clustering. A detecting window is introduced to find out the distributions of data sets. A filter function framework is introduced to distinguish clusters of different density or attributes. Different movement styles and their combinations are designed to speedup algorithm with high accuracy. A new mathematics morphological operator is introduced to discover clusters, which is more accurate than the ordinary operators: open and close. The scale space theory is integrated with mathematics morphology to get multi-scale and hierarchical clusters. In addition, GDMS is extended to support obstacle clustering. The advantages of GDMS are: it is very efficient with a complexity of O(N); it is effective in discovering clusters of arbitrary shape; it can distinguish clusters of different density or attributes; it is not sensitive to noise; it is not sensitive to the grid size; its hierarchical result is easy to understand and interpret.We design three algorithms MDCLUS, IMDCLUS, and PMDCLUS to speedup high dimensional clustering. Monte Carlo method is adopted for acquiring core objects to reduce computation complexity. The estimation of minimum sample ratio is analyzed quantitatively to avoid clusters’ losing and breaking. A label hash method is introduced to get a linear computation complexity with object’s number for clusters emerging. We realized increment clustering and distributed parallel clustering. The advantages of our approach are: it is effective in discovering clusters of arbitrary shape; it is not sensitive to noise; the clustering complexity is linear with data dimensions; it is faster than DBSCAN algorithm.We design active spaces and grid algorithms for subspace clustering. We prove the downward closure properties of density, connection, and coverage. A top-down search method is introduced to reduce searching spaces. A new filter method based on active axis numbers is introduced to filter noise objects. The algorithm based on fixed grid is extended to adaptive grid for efficiency. A distributed parallel algorithm is realized for large and high dimensional data sets. We introduce a hierarchical method to arrange clustering result. The advantages of our approach are: it can discover clusters both on entire space and subspace; the computation complexity is proximate linear with object’s number, space dimension, and clusters’ dimension respectively; it is not sensitive to noise; the clustering result is easy to understand and interpret; it can find both disjoint clusters or overlap clusters.

  • 【网络出版投稿人】 天津大学
  • 【网络出版年期】2009年 04期
节点文献中: 

本文链接的文献网络图示:

本文的引文网络