
粒度聚类方法研究

Study on Granularity Clustering

【Author】 Zhu Hong (朱红)

【Supervisor】 Ding Shifei (丁世飞)

【Author Information】 China University of Mining and Technology, Computer Application Technology, 2013, Doctoral dissertation

【Abstract】 Clustering analysis is one of the main routes to knowledge discovery in pattern recognition and artificial intelligence, and traditional clustering is a form of hard partitioning. The era of big data brings high-dimensional, massive data that are often incomplete, imprecise and inconsistent, and traditional clustering algorithms can hardly meet the clustering needs of such data. Granular computing is an important tool for processing uncertain information and a new approach in computational intelligence to simulating human thinking and solving complex problems. Its rise extends clustering analysis into the field of soft computing, which further increases the practical value of clustering and brings its theory closer to reality. By changing the granularity, clustering can be performed at different levels and from different angles, so that overlapping ("both this and that") clustering gains a theoretical foundation and practical methods; this makes up for the shortcomings of traditional clustering and helps solve the problem at hand.

This thesis combines granular computing with clustering analysis and studies granularity clustering in depth. The idea of granularity runs through data preprocessing and the whole clustering process; at the same time, clustering itself serves as the means of attribute granulation and sample granulation, and the objective function and parameter values of the clustering describe the angle and level of the granulation. The main contributions are as follows.

1. To address the high time and space complexity of attribute reduction during clustering preprocessing, attributes are granulated in parallel by clustering. The attribute granulation method based on attribute discernibility and AP (affinity propagation) clustering first measures the similarity of attributes by their discernibility, then groups the attributes into several clusters with AP, and finally selects representative attributes within each cluster according to indicators such as information entropy and attribute significance to form the reduced attribute set, completing the coarse granulation of the attributes. For feature reduction on large data sets this is considerably more efficient than traditional attribute reduction, and the advantage is clear when the required granulation precision is not strict. A parallel attribute reduction algorithm based on AP clustering improves the efficiency of traditional attribute reduction while keeping the classification ability unchanged, but because the parallel reduction still relies on the traditional reduction algorithm, its running time remains a limitation on very large data sets.

2. Combining granular computing with clustering usually means no more than applying a granular computing model inside a clustering algorithm, and clustering results still cannot be converted freely between granularities. Since all clustering algorithms can be unified under the idea of granularity, a variable-granularity twice-clustering method based on an aggregation network (VGTC) is proposed, which fuses two clustering algorithms through granular computing. The aim of the first clustering is not to cluster the whole data set but to find a suitable granule layer: it runs at a finer granularity to discover local structure and, according to the coarseness of that granulation, forms one layer of the aggregation network; the second clustering then completes the clustering of the universe on this basis. A variable-granularity adaptive twice-clustering method based on K-means and hierarchical clustering is proposed as an instance of VGTC; it clusters non-spherical data sets correctly and simultaneously avoids the sensitivity of K-means to the initial cluster centers, its inability to recognize arbitrarily shaped data sets, and the low speed of hierarchical clustering. A variable-granularity twice-clustering method based on AP clustering is also proposed: AP performs the first clustering, so the result is stable, the first-pass granulation is finer, the accuracy is high, and the time spent searching for a suitable granularity is shorter than with K-means.

3. To make AP applicable to subspace clustering, two improved algorithms are proposed. The first, an entropy-weighted soft subspace AP clustering algorithm with synchronous granulation of attributes and samples, removes redundant attributes and then adds a step that updates the attribute weights in every iteration of the clustering; when the iteration terminates, an accurate granulation of the attribute set of each interesting subspace is obtained. The second, an AP subspace clustering method with asynchronous granulation of attributes and samples, is an asynchronous soft subspace clustering algorithm: it first builds an attribute relation matrix from the Gini values and joint (two-dimensional) Gini values of the attributes, then turns the search for interesting subspaces into the search for maximal all-1 submatrices of this matrix, which lowers the time complexity, and finally runs AP clustering in each interesting subspace. Both algorithms keep the advantages of AP clustering while overcoming its inability to perform subspace clustering.

4. The granulation of parallel programs is studied. Guided by fine-grained parallelism, a fine-grained parallel AP clustering algorithm based on improved attribute reduction is proposed. The idea of granularity is introduced into parallel computing and the granularity principle of parallel programs is analyzed; the traditional discernibility-matrix-based attribute reduction is improved and parallelized, so that the elements of the discernibility matrix are computed and selected in parallel and the time and space complexity is reduced; the AP algorithm itself is then parallelized at a fine grain to improve efficiency. The whole task is divided among multiple threads that run simultaneously.
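The four contributions above lend themselves to small illustrative sketches; the ones below are written against scikit-learn and are assumptions about concrete realizations, not the thesis's own implementations. The first sketch concerns contribution 1: the attributes (columns) of a data matrix are clustered with affinity propagation and one representative attribute per attribute cluster is kept, chosen here by information entropy. The correlation-based attribute similarity and the helper names (attribute_entropy, granulate_attributes) are illustrative; the thesis derives similarity from attribute discernibility and also considers attribute significance.

```python
# Illustrative sketch (not the thesis's implementation): granulate attributes by clustering
# the columns of X with affinity propagation and keeping one representative per cluster.
import numpy as np
from sklearn.cluster import AffinityPropagation


def attribute_entropy(column, bins=10):
    """Shannon entropy of one attribute, estimated from a histogram."""
    counts, _ = np.histogram(column, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))


def granulate_attributes(X):
    """Return indices of representative attributes, one per attribute cluster."""
    # Attribute-to-attribute similarity: absolute Pearson correlation of the columns.
    # (An assumption; the thesis uses an attribute-discernibility-based similarity.)
    sim = np.abs(np.corrcoef(X, rowvar=False))
    ap = AffinityPropagation(affinity="precomputed", random_state=0).fit(sim)
    representatives = []
    for label in np.unique(ap.labels_):
        members = np.where(ap.labels_ == label)[0]
        # Keep the attribute with the largest entropy as the cluster's representative.
        best = members[np.argmax([attribute_entropy(X[:, j]) for j in members])]
        representatives.append(int(best))
    return sorted(representatives)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.normal(size=(200, 4))
    # Twelve attributes made of three noisy copies of four base attributes.
    X = np.hstack([base + 0.05 * rng.normal(size=(200, 4)) for _ in range(3)])
    print("representative attributes:", granulate_attributes(X))
```

Keeping one exemplar-like attribute per cluster is what turns the AP grouping into a coarser attribute granularity.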
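For contribution 2, a minimal sketch of the variable-granularity twice-clustering (VGTC) idea, assuming K-means for the fine first pass and single-linkage agglomerative clustering for the second pass over the granule centers; the granule count and the two-moons data are illustrative choices rather than the thesis's setup.

```python
# Illustrative sketch of variable-granularity twice clustering (VGTC): a fine K-means pass
# builds small granules, and a hierarchical pass clusters the granule centers.
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_moons


def vgtc(X, n_granules=40, n_clusters=2):
    # First clustering: fine granulation of the sample space into many small granules.
    km = KMeans(n_clusters=n_granules, n_init=10, random_state=0).fit(X)
    # Second clustering: single-linkage hierarchical clustering over the granule centers.
    hc = AgglomerativeClustering(n_clusters=n_clusters, linkage="single").fit(km.cluster_centers_)
    # Each sample inherits the final label of the granule it belongs to.
    return hc.labels_[km.labels_]


if __name__ == "__main__":
    X, _ = make_moons(n_samples=600, noise=0.05, random_state=0)
    labels = vgtc(X)
    print("cluster sizes:", np.bincount(labels))
```

Because the second pass only sees the granule centers, the hierarchical step stays cheap, and single linkage over locally fine granules can follow non-spherical shapes that plain K-means would split.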
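For the synchronous soft-subspace method of contribution 3, a sketch that alternates affinity propagation on attribute-weighted similarities with an entropy-style exponential weight update; the exact update rule (in the spirit of entropy-weighted k-means), the gamma parameter and the scaling by the mean dispersion are assumptions, not the formulas from the thesis.

```python
# Illustrative sketch of the synchronous entropy-weighted soft-subspace idea: alternate
# (i) affinity propagation on attribute-weighted similarities and (ii) an entropy-style
# exponential weight update driven by within-cluster dispersion per attribute.
import numpy as np
from sklearn.cluster import AffinityPropagation


def entropy_weighted_ap(X, gamma=1.0, n_iter=5):
    n, d = X.shape
    w = np.full(d, 1.0 / d)                      # start from uniform attribute weights
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        # (i) cluster with the current weights: similarity = -weighted squared distance.
        diff = X[:, None, :] - X[None, :, :]
        sim = -np.einsum("ijk,k->ij", diff ** 2, w)
        labels = AffinityPropagation(affinity="precomputed", random_state=0).fit(sim).labels_
        # (ii) attributes with small within-cluster dispersion receive larger weights.
        disp = np.zeros(d)
        for c in np.unique(labels):
            members = X[labels == c]
            disp += ((members - members.mean(axis=0)) ** 2).sum(axis=0)
        # Exponential (entropy-regularised) update; scaling by the mean keeps it stable.
        w = np.exp(-disp / (gamma * disp.mean()))
        w /= w.sum()
    return labels, w


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    informative = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(5.0, 0.3, (50, 2))])
    noise = rng.uniform(-1.0, 1.0, (100, 3))     # redundant attributes
    labels, w = entropy_weighted_ap(np.hstack([informative, noise]))
    print("attribute weights:", np.round(w, 3))
```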
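For the asynchronous method of contribution 3, a sketch of the pipeline: discretize the attributes, build a binary relation matrix from joint Gini impurities, read the candidate interesting subspaces off the maximal all-1 principal submatrices (equivalently, maximal cliques of the relation graph, found here with networkx), and run AP inside each candidate subspace. The relatedness test used (joint Gini clearly below its value under independence) and the margin parameter are assumptions made for the sketch.

```python
# Illustrative sketch of the asynchronous route: a binary attribute-relation matrix built
# from joint Gini impurities, maximal all-1 principal submatrices (maximal cliques) as
# candidate interesting subspaces, and AP clustering inside each candidate subspace.
import networkx as nx
import numpy as np
from sklearn.cluster import AffinityPropagation


def gini(p):
    """Gini impurity of a discrete distribution."""
    return 1.0 - np.sum(p ** 2)


def relation_matrix(X, bins=6, margin=0.02):
    n, d = X.shape
    # Discretize every attribute into equal-width bins.
    codes = np.stack([np.digitize(X[:, j], np.histogram_bin_edges(X[:, j], bins)[1:-1])
                      for j in range(d)], axis=1)
    R = np.eye(d, dtype=int)
    for i in range(d):
        for j in range(i + 1, d):
            joint = np.zeros((bins, bins))
            for a, b in zip(codes[:, i], codes[:, j]):
                joint[a, b] += 1.0
            joint /= n
            gi, gj = gini(joint.sum(axis=1)), gini(joint.sum(axis=0))
            indep = 1.0 - (1.0 - gi) * (1.0 - gj)   # joint Gini under independence
            # Relatedness test (an assumption): joint Gini clearly below independence.
            R[i, j] = R[j, i] = int(gini(joint.ravel()) < indep - margin)
    return R


def subspace_clusters(X):
    graph = nx.from_numpy_array(relation_matrix(X))
    graph.remove_edges_from(nx.selfloop_edges(graph))
    results = {}
    for clique in nx.find_cliques(graph):        # maximal all-1 principal submatrices
        if len(clique) < 2:
            continue
        cols = sorted(clique)
        results[tuple(cols)] = AffinityPropagation(random_state=0).fit(X[:, cols]).labels_
    return results


if __name__ == "__main__":
    rng = np.random.default_rng(2)
    informative = np.vstack([rng.normal(0.0, 0.2, (60, 2)), rng.normal(2.0, 0.2, (60, 2))])
    noise = rng.uniform(-3.0, 3.0, (120, 2))
    for cols, labels in subspace_clusters(np.hstack([informative, noise])).items():
        print("subspace", cols, "->", len(set(labels)), "clusters")
```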
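For contribution 4, a sketch of the fine-grained task splitting alone: the similarity matrix needed by AP is filled in row blocks by a pool of worker threads, after which scikit-learn's AffinityPropagation runs on the precomputed matrix. The thesis additionally parallelises the discernibility-matrix attribute reduction and the AP message-passing updates themselves; neither is reproduced here, and the thread-based splitting below is only worthwhile because the NumPy block computations release the interpreter lock.

```python
# Illustrative sketch of fine-grained task splitting for the AP step: worker threads fill
# row blocks of the similarity matrix, then affinity propagation runs on the result.
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from sklearn.cluster import AffinityPropagation


def similarity_block(X, rows):
    """Negative squared Euclidean similarities for one block of rows."""
    diff = X[rows, None, :] - X[None, :, :]
    return rows, -(diff ** 2).sum(axis=-1)


def parallel_ap(X, n_workers=4):
    n = X.shape[0]
    S = np.empty((n, n))
    blocks = np.array_split(np.arange(n), n_workers)   # one small task per row block
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for rows, block in pool.map(lambda r: similarity_block(X, r), blocks):
            S[rows] = block
    return AffinityPropagation(affinity="precomputed", random_state=0).fit(S).labels_


if __name__ == "__main__":
    rng = np.random.default_rng(3)
    X = np.vstack([rng.normal(c, 0.3, (40, 5)) for c in (0.0, 3.0, 6.0)])
    print("clusters found:", len(set(parallel_ap(X))))
```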
