节点文献

高伸缩性聚类分析方法研究

Research on High Scalable Clustering Analysis Method

【作者】 刘晨

【导师】 卢志茂;

【作者基本信息】 哈尔滨工程大学 , 信号与信息处理, 2013, 博士

【摘要】 聚类问题一直是模式识别领域的热点课题,其应用领域广泛,包括统计学、图像处理、医疗诊断、信息检索、生物学以及机器学习等。近年来,很多聚类方法纷纷涌现。这些方法大多受到自身算法的伸缩性限制,在特定数据规模的数据集上表现出优良的性能,但在超出其规定规模的数据集上往往收效甚微,甚至无法运行。随着信息采集与存储技术的飞速发展,数据的多样性越发突出,因此,对于高伸缩性的聚类方法的探索越发成为关注的焦点。本文主要针对聚类算法的伸缩性以及一些聚类算法存在的高昂计算复杂度和巨大内存需求而难以应用于大规模数据集的处理中的问题展开研究和讨论。在此过程中,本文的主要创新体现在以下几方面:(1)许多经典的聚类算法在小数据规模的数据聚类任务中取得了非常优秀的效果,但由于其伸缩能力不强,使得大多数算法在大规模数据的聚类任务中很难胜任或无法完成聚类分析的任务。针对探索高伸缩性聚类方法的问题,使得算法能够适应大幅度的数据规模变化,本文以化整为零的处理思想为基础,对于数据集先进行切分后划分的处理方式进行了深入的研究。提出一种基于这种处理方式的聚类方法—基于数据切分与划分的聚类方法。该方法处理数据不须将数据一次读入主存,可以大幅度的降低了算法对硬件资源的需求,相比于传统迭代产生的质心不易陷入局部最优。(2)DP是一种伸缩性较强的聚类方法,在小数据集合和大数据集合的聚类任务中都表现出了优异的聚类性能,但对于数据规模过大的情况下,其局部特征样本集过大,超出主存要求,仍然存在不足。针对这种情况,本文对于DP理论进行深入分析后提出了逐级压缩的思想,并对DP方法进行了改进,提出了一种基于均值径向压缩的聚类方法(Means Radial Compression,MRC),相比于DP方法,均值径向压缩方法MRC具有更好的伸缩性能,并且其优良的时间复杂度O(n)也使得其应用范围更广。(3)提出一种基于最小距离谱的数据特征聚类特性的可视化分析方法(MinDS)。通常情况下,用于参与聚类分析的数据是经过数据表示后产生的数据特征,应具有内在的联系,使其呈现出分组特性,聚类分析则指按照某种相似性测度找出这种数据分组。因此,数据表示过程、数据特征的选择将直接影响最终聚类结果。MinDS首先定义了最小距离谱模型,通过对最小距离谱特征分析,可以将多维数据间数据关系映射到二维数据空间中,对于直观的评价数据特征聚类特性,聚类方法失效原因等都获得了很好的效果。同时MinDS方法也可用于处理噪声,识别孤立点,寻求数据类别等方面。

【Abstract】 The clustering problem has been a hot topic in the field of pattern recognition, whichhas wide range of applications, including statistics, image processing, medical diagnostics,information retrieval, biology, machine learning and so on. In recent years, many clusteringmethods have emerged. Most of these methods are limitated by the scalability of itselfalgorithm, and show excellent performance on the specific data-scale data sets, but oftenhave little effect on the outside of its scale data sets, and even can not run. With the rapiddevelopment of information collection and storage technologies, the diversity of the data ismore prominent, so the exploration for highly scalable clustering method becomes more andmore popular.The article study and discuss on the clustering algorithm scalability, and the problemthat the clustering algorithms is difficultly applied to the processing of large data sets due tothe high computational complexity and huge memory requirements of the clusteringalgorithm. In this process, the main innovation is reflected in the following aspects:(1) Many classic citrus algorithm get very good results in the small data size of dataclustering task, but the algorithm is not strong because of its scalability, which make themost algorithms is difficult to competent or can not be completed in large-scale dataclustering task. To explore the clustering method for high scalability problem and make themethod adapt to the wide range of data set, this thesis based on the thought of piecesprocessing further researches the data set processing way of first segmentation then partition.And a clustering method based on this approach, clustering method based on the datasegmentation and partition, is proposed in this thesis. The proposed method does not need toread all the data into main memory at the same time and result in the greatly reduceddemand for hardware resource. It is uneasy for the method to fall into local optimumscompared with the traditional method of generating centers iteratively.(2) DP is a strong elasticity clustering method, and have shown excellent clusteringperformance in the clustering task of the large data collection and the small data collection.But the DP method still has the limitation on its application, because when the data scale istoo large and the local characteristics of a sample set is too large to exceed the requirements of the main memory. For this situation, DP which is a kind of clustering method of strongscalability can perform well in clustering small and large data sets. However, DP can notwork well in very large data set because the local characteristics sample set of the data is toolarge resulting in the out of main memory requirements. To solve the shortage, the thesisdesigns the thought of compression step by step after deep analyzing the theory of DP andproposes an improved DP, clustering method based on the Means Radial Compression,MRC. The experimental results show that means radial compression algorithm can makebetter solutions compared with DP with time complexity of O(n).(3) The method of the visual analysis based on the minimum distance spectrum datafeature clustering characteristics is proposed. Usually, after the data representation the dataused to participate in the clustering analysis will generate the data characteristics. Thecharacteristics of the data should have an inherent connection, so make the data present thepacket characteristics, and the cluster analysis is to identify this data packet in accordancewith some similarity measure. Therefore, the process of the data represents and the choiceof the data characteristics will directly affect the final clustering result. MinDS first definethe minimum distance spectrum model, and the cube data can be mapped to thetwo-dimensional data space by the minimum distance spectrum characteristics analysis. Sofor the intuitive evaluation of the features of data characteristics clustering, the failurereasons of clustering method get very good results. At the same time, Minds method canalso be used to deal with the noise, to identify outlier and to seek data categories.

节点文献中: 

本文链接的文献网络图示:

本文的引文网络