Research on Attribute Weighted and Incomplete Data Fuzzy Clustering Approaches

【Author】 Li Dan (李丹)

【Supervisor】 Gu Hong (顾宏)

【Author information】 Dalian University of Technology, Control Theory and Control Engineering, 2011, PhD

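【Abstract (translated from the Chinese)】 Fuzzy clustering is an active research topic in pattern recognition, used mainly to identify the intrinsic structure of data. The similarity measure is a key issue in fuzzy clustering algorithms. Common choices such as the Euclidean and Hamming distances implicitly assume that every attribute of a sample contributes equally to the clustering, which is a limitation. In addition, restrictions on data acquisition, random noise, and other causes often leave samples with missing attribute values, and most clustering methods cannot be applied directly to such datasets. This dissertation therefore studies fuzzy clustering methods for attribute weighting and for incomplete data. The main work is summarized as follows:

1. For attribute-weighted fuzzy clustering, a fuzzy clustering algorithm supervised by attribute-weight intervals is proposed. It addresses how to determine weights reasonably and thereby improves clustering performance. First, from the standpoint of cognition and of the complexity of the information in a dataset, attribute weights are described by interval numbers, and the contribution of each attribute to the clustering is obtained with the interval analytic hierarchy process; compared with numerical (point-valued) weights, this makes the weight representation more robust. Second, clustering proceeds by iteratively optimizing the attribute weights, the memberships, and the cluster centers. If a computed weight falls outside its interval constraint, it is forced to the interval center before the iteration continues, and a maximum number of such forced resets is set to guarantee convergence. Simulation experiments show that the algorithm keeps the iteration from being trapped in unnecessary local minima and yields more accurate clustering results.

2. For fuzzy clustering of incomplete data, clustering algorithms based on nearest-neighbor intervals are proposed. First, because missing attributes are uncertain, a nearest-neighbor interval description of each missing attribute is built from the neighborhood information of the incomplete sample. Second, two clustering algorithms for incomplete data are built on this description. The first converts the incomplete dataset into an interval-valued dataset and clusters it; the resulting cluster centers are convex hyper-polyhedra in the attribute space, which reflect the shape of the sub-classes to some extent and help produce more realistic clustering results. The second exploits the fact that the nearest-neighbor interval confines the estimate of a missing attribute to a reasonable range: a hybrid framework of a genetic algorithm and fuzzy c-means is proposed, in which the genetic algorithm searches within the intervals for optimized estimates of the missing attributes and fuzzy c-means then clusters the "restored" complete dataset. Given suitable estimates of the missing attributes, this algorithm obtains more satisfactory clustering results.

3. Existing fuzzy clustering algorithms for incomplete data do not consider that the attributes of a sample contribute differently to the clustering. To address this, an attribute-weighted fuzzy clustering algorithm for incomplete data is proposed. First, a classical algorithm clusters the incomplete dataset once, giving reasonably accurate estimates of the missing attributes and the class of each sample. Second, the ReliefF algorithm evaluates the attributes of the "restored" complete dataset. Finally, the attribute weights are introduced into the clustering through a weighted Euclidean distance, so that the missing attributes and the clustering result are solved for jointly. Simulation experiments show that, by emphasizing the important attributes, the proposed algorithm clearly improves the clustering of incomplete data.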
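As a rough illustration of the first contribution, the sketch below shows a weighted squared Euclidean distance, the standard fuzzy c-means membership update computed from it, and a step that forces out-of-range weights to their interval centers. It is a minimal sketch under stated assumptions and not the dissertation's algorithm: the function names, the fuzzifier `m = 2`, and the exact form of the forcing rule are assumptions, and the abstract does not give the weight-update formula or the interval analytic hierarchy process step, so both are omitted.

```python
import numpy as np

def weighted_sq_dist(X, V, w):
    """Squared weighted Euclidean distance between samples X (n, d) and centers V (c, d)."""
    diff = X[:, None, :] - V[None, :, :]            # shape (n, c, d)
    return np.einsum("ncd,d->nc", diff ** 2, w)     # sum_j w_j * (x_ij - v_kj)^2

def fcm_memberships(X, V, w, m=2.0, eps=1e-12):
    """Standard FCM membership update, evaluated with the weighted distance."""
    d2 = weighted_sq_dist(X, V, w) + eps            # eps avoids division by zero
    inv = d2 ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)     # each row sums to 1

def force_into_intervals(w, lo, hi):
    """Assumed forcing rule: a weight outside [lo, hi] is reset to the interval center."""
    center = (lo + hi) / 2.0
    outside = (w < lo) | (w > hi)
    return np.where(outside, center, w)

# Toy data: three samples, two centers, two attributes (all values hypothetical).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 5.0]])
V = np.array([[0.0, 0.0], [0.0, 5.0]])
w = force_into_intervals(np.array([0.9, 0.1]),
                         lo=np.array([0.2, 0.3]),
                         hi=np.array([0.6, 0.7]))
U = fcm_memberships(X, V, w)
```

In a full algorithm the forcing step would sit inside the iteration loop together with a counter that caps the number of forced resets, as the abstract describes.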

【Abstract】 Fuzzy clustering is an active research topic in pattern recognition and is mainly used to identify the internal structure of data. The similarity metric is a key issue in fuzzy clustering. Existing similarity metrics, such as the Euclidean and Hamming distances, are limited because they implicitly assume that every attribute of a sample contributes equally to the clustering. Moreover, attribute values of samples are often missing because of limitations in data collection, random noise, and other causes, and most existing clustering algorithms cannot be applied directly to such incomplete samples. To address these problems, this dissertation studies attribute-weighted fuzzy clustering and fuzzy clustering of incomplete data. The main contributions are summarized as follows:

1. For attribute-weighted clustering, a fuzzy clustering algorithm with interval-supervised attribute weights is presented, which makes the attribute weights more reasonable and improves clustering performance. First, from the viewpoint of cognition and of the information complexity of datasets, attribute weights are represented as intervals, obtained by the interval analytic hierarchy process to describe the different contributions of the attributes; this makes the weight representation more robust than numerical (point-valued) weights. Second, attribute weights, memberships, and cluster prototypes are obtained by iterative optimization. If a weight calculated in some iteration falls outside its interval constraint, it is forced to the corresponding interval center for the subsequent iterations, and a maximum number of such forced resets is set to ensure convergence. Experimental results show that the proposed algorithm avoids being trapped in unnecessary local minima and achieves better clustering performance than existing algorithms.

2. For fuzzy clustering of incomplete data, two algorithms based on nearest-neighbor intervals are presented. First, because missing attributes are uncertain, each missing attribute is represented by a nearest-neighbor interval derived from the nearest-neighbor information of the incomplete sample. Second, two algorithms are built on this representation. The first transforms the incomplete dataset into an interval-valued one and then clusters it with existing algorithms for interval-valued data. Because the cluster prototypes are convex hyper-polyhedra in the attribute space, which reflect the shape of the clusters to some degree, more accurate clustering results can be achieved. The second exploits the fact that the interval representation confines the missing attributes to appropriate ranges: it hybridizes fuzzy c-means with a genetic algorithm. The genetic algorithm searches the nearest-neighbor intervals for optimal imputations of the missing attributes, and fuzzy c-means then obtains compact clusters on the "completed" dataset. Given appropriate imputations of the missing attributes, more satisfactory clustering results can be obtained.

3. Most existing algorithms for incomplete data do not consider that different attributes may contribute differently to the clustering. To address this shortcoming, an attribute-weighted fuzzy clustering algorithm for incomplete data is proposed. First, comparatively accurate imputations of the missing attributes and class labels are obtained with an existing algorithm. Second, each attribute of the "completed" dataset is evaluated with the ReliefF algorithm. Finally, the attribute weights are incorporated into fuzzy clustering through a weighted Euclidean distance, so that the missing attributes and the clustering results are obtained simultaneously. Simulation results show that the algorithm achieves better clustering performance on incomplete datasets by emphasizing the contribution of important attributes.
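The nearest-neighbor interval representation in the second contribution can be illustrated with a short sketch. The abstract does not say how neighbors are found or how the interval bounds are set, so everything specific here is an assumption: neighbors are ranked by a partial distance over the attributes both samples observe, and the interval for a missing attribute is taken as the minimum and maximum of that attribute over the `q` nearest neighbors that observe it. The names `partial_sq_dist`, `nn_intervals`, and `q` are hypothetical.

```python
import numpy as np

def partial_sq_dist(a, b):
    """Squared distance over attributes observed in both samples, rescaled to full dimension."""
    both = ~np.isnan(a) & ~np.isnan(b)
    if not both.any():
        return np.inf                               # no shared attributes: not comparable
    return a.size / both.sum() * np.sum((a[both] - b[both]) ** 2)

def nn_intervals(X, q=2):
    """Return (lower, upper); observed entries get the degenerate interval [x, x]."""
    lower, upper = X.copy(), X.copy()
    n = X.shape[0]
    for i in range(n):
        missing = np.flatnonzero(np.isnan(X[i]))
        if missing.size == 0:
            continue
        dist = np.array([np.inf if k == i else partial_sq_dist(X[i], X[k])
                         for k in range(n)])
        order = np.argsort(dist)
        for j in missing:
            # q nearest comparable samples that actually observe attribute j
            nbrs = [k for k in order
                    if np.isfinite(dist[k]) and not np.isnan(X[k, j])][:q]
            if nbrs:                                # otherwise the entry stays NaN
                vals = X[nbrs, j]
                lower[i, j], upper[i, j] = vals.min(), vals.max()
    return lower, upper

# Toy dataset (hypothetical): the third sample is missing its second attribute.
X = np.array([[1.0, 2.0], [1.1, 2.2], [0.9, np.nan], [5.0, 9.0]])
lower, upper = nn_intervals(X, q=2)
```

The resulting `lower` and `upper` arrays could serve either as an interval-valued dataset or as search bounds for an imputation step, matching the two uses the abstract describes.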
