基于模糊Fisher准则的聚类与特征降维研究
The Study of Fuzzy Fisher Based Clustering and Feature Dimension Reduction
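【Author】 Cao Suqun (曹苏群);
【Advisor】 Wang Shitong (王士同);
【Author information】 Jiangnan University, Light Industry Information Technology and Engineering, 2009, PhD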
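【Abstract (Chinese version, translated)】 Clustering analysis and feature dimension reduction are two important research topics in pattern recognition. As an important unsupervised pattern recognition tool, clustering analysis is used in many fields, such as data mining, biology, computer vision, and document analysis. It aims to group the most similar data into the same cluster and the most dissimilar data into different clusters. Feature dimension reduction, which comprises feature extraction and feature selection, plays a very important role in pattern recognition: it helps remove redundant features and reduce the dimensionality of the original dataset. This thesis studies several problems in fuzzy clustering and feature dimension reduction, including a semi-fuzzy clustering algorithm based on the fuzzy Fisher criterion, unsupervised feature extraction, and feature selection for imbalanced datasets. The main original contributions are:
(1) Fisher linear discriminant (FLD) is extended to fuzzy FLD, and on this basis a new clustering algorithm is proposed, called the fuzzy Fisher criterion based semi-fuzzy clustering algorithm (FBSC). The algorithm introduces the discriminant vector into its iterative update equations, so these equations differ in form from the usual FCM ones. Strictly speaking, the algorithm is based not only on the fuzzy within-class scatter matrix but also on the fuzzy between-class scatter matrix, whereas most FCM-like clustering algorithms are based on the fuzzy within-class scatter matrix alone. In the sense that it takes the fuzzy Fisher criterion as its clustering objective function, FBSC can therefore be regarded as a new fuzzy clustering algorithm. This work also extends the range of application of FLD.
(2) A method is proposed for extending optimal discriminant plane feature extraction to the unsupervised setting. The basic idea is to optimize the defined fuzzy Fisher criterion function to obtain the first optimal discriminant vector and the fuzzy scatter matrices in the unsupervised setting. Based on these, a second discriminant vector is found that maximizes the fuzzy Fisher criterion function subject to an orthogonality constraint, a conjugate orthogonality constraint, or both. The two discriminant vectors then form, respectively, an unsupervised optimal discriminant plane, an unsupervised statistically uncorrelated optimal discriminant plane, or an improved unsupervised statistically uncorrelated optimal discriminant plane.
(3) A method is proposed for extending the optimal set of discriminant vectors to the unsupervised setting. The basic idea is to use the defined fuzzy Fisher criterion function to turn FLD into a semi-fuzzy clustering algorithm, use that algorithm to obtain the optimal discriminant vector and the fuzzy scatter matrices, and from these construct the optimal set of discriminant vectors. Experimental results show that, although the method cannot outperform the traditional supervised optimal set of discriminant vectors, its performance is comparable to that of principal component analysis, which is likewise an unsupervised feature extraction method.
(4) A classifier-independent feature selection algorithm based on posterior probability is proposed for imbalanced data. The algorithm first introduces an imbalance factor estimated by the Parzen-window method, then iterates from the midpoints of Tomek links as initial values to find decision-boundary points at which the posterior probabilities are equal. Projecting the normal vectors at these points yields weights that reflect the importance of each feature. Experiments show that, for imbalanced data and without lowering the overall performance of the classifier, the algorithm not only reduces dimensionality effectively and saves computational cost, but also avoids the tendency of conventional feature selection algorithms to neglect the minority class when applied to imbalanced data.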
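The abstract does not write out the fuzzy Fisher criterion on which the first three contributions rest. The following is a minimal sketch under a stated assumption: that the fuzzy scatter matrices are the usual FLD scatter matrices with each sample weighted by its membership raised to a fuzzifier `m`, so that the criterion for a direction `w` is the ratio of projected fuzzy between-class scatter to projected fuzzy within-class scatter. The function name, the argument layout, and the default `m=2.0` are illustrative choices, not taken from the thesis, and the sketch only evaluates the criterion; it does not reproduce the FBSC update equations.

```python
def fuzzy_fisher_criterion(X, U, w, m=2.0):
    """Evaluate an assumed fuzzy Fisher criterion along direction w.

    X: list of n samples, each a list of d floats.
    U: list of c membership rows, U[i][j] = membership of sample j in cluster i.
    w: direction vector (list of d floats).
    m: fuzzifier exponent (assumption: memberships enter as U[i][j] ** m).
    """
    n, d = len(X), len(X[0])
    # Plain (unweighted) mean of all samples.
    xbar = [sum(x[k] for x in X) / n for k in range(d)]

    between = 0.0  # w^T S_fb w, accumulated as a scalar
    within = 0.0   # w^T S_fw w, accumulated as a scalar
    for memberships in U:
        weights = [u ** m for u in memberships]
        total = sum(weights)
        # Fuzzy cluster centre: membership-weighted mean of the samples.
        centre = [sum(wt * x[k] for wt, x in zip(weights, X)) / total
                  for k in range(d)]
        # Projected offset of the centre from the overall mean.
        proj_centre = sum(wk * (ck - xk) for wk, ck, xk in zip(w, centre, xbar))
        between += total * proj_centre ** 2
        # Projected, membership-weighted spread around the centre.
        for wt, x in zip(weights, X):
            proj = sum(wk * (xk - ck) for wk, xk, ck in zip(w, x, centre))
            within += wt * proj ** 2
    # Raises ZeroDivisionError if the data have no spread along w.
    return between / within
```

For two groups of points separated along the first coordinate, with memberships that mostly agree with that split, the criterion is large for `w = [1.0, 0.0]` and close to zero for `w = [0.0, 1.0]`. The value is unchanged when `w` is rescaled, which is why only the direction of a discriminant vector matters.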
【Abstract】 Clustering analysis and feature dimension reduction are two important research topics in pattern recognition. As an important unsupervised pattern recognition tool, clustering analysis has been used in diverse fields such as data mining, biology, computer vision, and document analysis. It aims to partition a dataset so that the most similar data fall in the same cluster and the most dissimilar data fall in different clusters. Feature dimension reduction, which includes feature extraction and feature selection, plays a very important role in pattern recognition. It helps to remove noisy features and reduce the dimensionality of the original dataset. This thesis addresses several issues in fuzzy clustering and feature dimension reduction, including fuzzy Fisher criterion based semi-fuzzy clustering, unsupervised feature extraction, and feature selection for imbalanced datasets. The original research results of this thesis are:
(1) Fisher linear discriminant (FLD) is extended to fuzzy FLD, and a novel fuzzy clustering algorithm based on fuzzy FLD is proposed, called the fuzzy Fisher criterion based semi-fuzzy clustering algorithm (FBSC). The proposed algorithm incorporates the discriminant vector into its update equations, so that these equations do not take the commonly used FCM-like form. Strictly speaking, the proposed algorithm is rooted in both the fuzzy within-class scatter matrix and the fuzzy between-class scatter matrix, unlike most fuzzy clustering algorithms, such as FCM, which are rooted only in the fuzzy within-class scatter matrix. Thus, in the sense that the fuzzy Fisher criterion serves as its objective function, FBSC can be viewed as a novel fuzzy clustering algorithm. This study also opens up a new application of FLD.
(2) A method is presented for extending optimal discriminant plane feature extraction to the unsupervised setting. The basic idea is to optimize the defined fuzzy Fisher criterion function to obtain the first optimal discriminant vector and the fuzzy scatter matrices in the unsupervised setting. Based on these, a second discriminant vector is obtained that maximizes the fuzzy Fisher criterion function under the orthogonality constraint, the conjugate orthogonality constraint, or both constraints together. These two discriminant vectors then make up an unsupervised optimal discriminant plane (UODP), an unsupervised uncorrelated optimal discriminant plane (UUODP), or an improved unsupervised uncorrelated optimal discriminant plane (IUUODP), respectively.
(3) An extension of the optimal set of discriminant vectors to the unsupervised setting is presented. The basic idea is to extend FLD to a novel semi-fuzzy clustering algorithm through the defined fuzzy Fisher criterion function. With the proposed algorithm, an optimal discriminant vector and the fuzzy scatter matrices can be computed, and from them an unsupervised optimal set of discriminant vectors can be obtained. The experimental results demonstrate that, although this method is unable to surpass the traditional supervised optimal set of discriminant vectors, its performance is comparable to that of principal component analysis, which is also an unsupervised feature extraction method.
(4) A novel classifier-independent feature selection algorithm based on posterior probability is proposed for imbalanced datasets. First, an imbalance factor is introduced and computed by Parzen-window estimation. The midpoints of Tomek links are chosen as the initial points, and the algorithm iterates from them to find boundary points at which the posterior probabilities are equal. Projecting the normal vectors at these points yields a weight for each feature, which indicates the importance of that feature. The experimental results demonstrate that the proposed algorithm not only reduces the computational cost but also overcomes the shortcoming of conventional feature selection algorithms, in which the minority class may be ignored.
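The feature selection algorithm starts its iteration from the midpoints of Tomek links. As a minimal sketch of that initial step only, the code below uses the standard definition of a Tomek link (two samples from different classes that are each other's nearest neighbour) and returns the midpoint of every such pair. The function name, the Euclidean distance, and the brute-force neighbour search are illustrative assumptions; the abstract does not specify the Parzen-window imbalance factor or the iteration toward the boundary in enough detail to reproduce them, so they are not sketched here.

```python
import math


def tomek_link_midpoints(X, y):
    """Return the midpoints of all Tomek links in a labelled dataset.

    X: list of samples, each a list of floats.
    y: list of class labels, one per sample.
    A Tomek link is a pair of mutual nearest neighbours with different labels.
    """
    def nearest(i):
        # Brute-force nearest neighbour of sample i (Euclidean distance).
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: math.dist(X[i], X[j]))

    nn = [nearest(i) for i in range(len(X))]
    midpoints = []
    for i, j in enumerate(nn):
        # i < j reports each mutual pair once; labels must differ.
        if i < j and nn[j] == i and y[i] != y[j]:
            midpoints.append([(a + b) / 2 for a, b in zip(X[i], X[j])])
    return midpoints


# Usage: samples 1 and 2 are mutual nearest neighbours from different classes.
X = [[0.0, 0.0], [1.0, 0.0], [1.5, 0.0], [5.0, 0.0]]
y = [0, 0, 1, 1]
print(tomek_link_midpoints(X, y))  # prints [[1.25, 0.0]]
```

Because each Tomek link joins samples of opposite classes that lie closest to one another, its midpoint is a natural first guess for a point near the decision boundary, which is presumably why the thesis uses it to seed the search.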
【Key words】 Fuzzy clustering; Feature dimension reduction; Unsupervised pattern; Fisher criterion; Fuzzy Fisher criterion; Feature extraction; Optimal discriminant vector; Optimal discriminant plane; Optimal set of discriminant vectors; Imbalanced data; Feature selection;