Classification Methods Based on Clustering and Manifold Regularization
【Author】 Liu Bing
【Supervisor】 Xia Shixiong
【Author Information】 China University of Mining and Technology, Computer Application Technology, 2013, Doctoral dissertation
【Abstract (translated)】 Classifier design has long been one of the important research topics in pattern recognition. Over the past decade, with deepening research on statistical learning and kernel function theory, many new methods have emerged. These theories and methods have largely addressed problems in pattern classification such as local optima, overfitting, and the curse of dimensionality. However, building on kernel classification methods represented by the support vector machine, several new research hotspots have arisen in recent years, often corresponding to weaknesses of traditional pattern classification methods, e.g., classification of massive high-dimensional data, classification under class overlap and noise interference, multi-label classification, classification of class-imbalanced data, optimization of kernel functions (matrices) in nonlinear classification, and fast nonlinear classification. Against this background, this thesis conducts in-depth research on fast robust clustering algorithms, classification of imbalanced samples, kernel optimization, and fast semi-supervised classification based on manifold regularization, and proposes new methods for class imbalance, kernel optimization, and fast classification. The main work includes the following four aspects:

(1) To address sample overlap and noise interference in practical applications, a sample-weighted possibilistic fuzzy clustering algorithm and a robust possibilistic fuzzy kernel clustering algorithm are proposed. The first mainly solves approximately linearly separable problems: by assigning smaller weights to outliers and noise points, it narrows the convergence range of the typicality values and reduces their influence on clustering. Based on a convergence analysis, the algorithm is proved to converge faster than the traditional IPCM (Improved Possibilistic C-Means) algorithm, effectively reducing time complexity while achieving good clustering accuracy. The second algorithm mainly solves linearly inseparable problems; to select kernel parameters in the unsupervised setting, a kernel-parameter optimization method is also proposed. The resulting algorithm can handle both linearly inseparable and partially overlapping datasets, is more robust, and achieves good clustering accuracy under noise.

(2) To address classification with imbalanced numbers of positive and negative samples, a Possibilistic Fuzzy Support Vector Machine (PFSVM) model is established on the basis of the two robust clustering algorithms, yielding an imbalanced-data classification method based on possibilistic fuzzy clustering. The classifier handles class imbalance, outliers, and noise: the robust clustering algorithms assign fuzzy memberships and typicality values to the training samples, reducing the impact of outliers and noise on the SVM's classification accuracy and generalization ability.

(3) To overcome the low efficiency of multiple kernel learning and its need for a predefined set of kernel functions, an unsupervised non-parametric kernel learning model is established, which extends easily to supervised learning, and a non-parametric kernel learning classification method is proposed. By relaxing the multiple kernel learning optimization problem, the method converts it into a series of sparse eigenvalue-decomposition subproblems, each iteration requiring only a closed-form computation, which improves the performance and efficiency of kernel learning. The model organically combines spectral kernel learning with the margin-maximization criterion, fully exploiting the low-dimensional manifold structure of the data, enhancing the smoothness of the decision function, and effectively using unlabeled data for maximum-margin classification. Experiments verify the effectiveness of non-parametric kernel learning; in both supervised and unsupervised settings the proposed method outperforms multiple kernel learning.

(4) To solve the fast semi-supervised learning problem, an Extended Manifold Regularization framework (E-MR) is established, and a generalized representer theorem for the decision function, together with theorems relating the single-output and multi-output extreme learning machines to the manifold regularization framework, are proved. These theorems provide the theoretical basis for the proposed fast semi-supervised classification model and algorithm, and show that the Manifold Regularized Extreme Learning Machine (MRELM) model is a special case of the E-MR framework, whose essence is a random discretization of the kernel function; the proposed algorithm is therefore an approximation of traditional kernel classification. MRELM inherits ELM's advantage of requiring no model-parameter tuning and provides a unified analytical solution for different learning tasks. Experimental results verify the effectiveness of the MRELM algorithm.

This thesis covers three aspects: classification methods for imbalanced data, classification methods based on non-parametric kernel optimization, and fast semi-supervised classification. Building on related prior work, several classification and learning models are established, new learning algorithms are proposed, and the algorithms are tested on benchmark datasets and several face datasets. Comparison with related algorithms further verifies the effectiveness of the proposed algorithms. The results of this thesis enrich the ways of solving classification problems and have a certain theoretical significance and good application prospects.
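The sample-weighted clustering idea in (1) can be illustrated with a minimal NumPy sketch: a generic weighted fuzzy c-means in which low-weight points pull the cluster centers less. This is not the thesis's exact IPCM-based algorithm (which also maintains typicality values); the function name, parameters, and weighting scheme here are illustrative assumptions.

```python
import numpy as np

def weighted_fcm(X, c, w, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Weighted fuzzy c-means: samples with small weight w[i]
    (suspected outliers/noise) pull the cluster centers less."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)                # fuzzy memberships, rows sum to 1
    for _ in range(n_iter):
        um = (U ** m) * w[:, None]                   # weight-scaled fuzzified memberships
        V = um.T @ X / um.sum(axis=0)[:, None]       # cluster centers
        d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2) + 1e-12
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)  # standard FCM membership update
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return U, V
```

Setting w close to zero for suspected outliers keeps the estimated centers near the true clusters even when an extreme point is present, mimicking how small typicality values suppress noise in possibilistic fuzzy clustering.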
【Abstract】 In recent years, classifier design has been one of the important research topics in the field of pattern recognition; it has made rapid progress in both theory and application, and many new approaches have emerged. These new research hotspots often correspond to the disadvantages of traditional pattern classification methods, such as classification of massive high-dimensional data, classification under noise jamming and class overlapping, multi-labeled data classification, classification on class-imbalanced data sets, optimization of kernel matrices in nonlinear classification, and fast non-linear classification. Against this background, this thesis makes further research on fast robust clustering, classification of class-imbalanced samples, kernel optimization, and fast semi-supervised classification based on manifold regularization. Several new models and methods are presented to solve the problems of class imbalance, kernel optimization, and fast classification. The studies mainly include the following four aspects:

(1) To solve the problems of class overlapping and noise jamming, a novel possibilistic fuzzy clustering algorithm based on the sample-weighting idea is proposed. In this method, outliers and noise points receive smaller weights, which limits the convergence range of the typicality values, so their contribution to the clustering process is reduced. We prove that its convergence is faster than that of the IPCM (Improved Possibilistic C-Means) algorithm; it not only reduces the time complexity effectively but also obtains good clustering accuracy. To solve clustering in the linearly inseparable case, a robust possibilistic fuzzy kernel clustering algorithm is also proposed.
This method can not only handle linearly inseparable and class-overlapping datasets, but also overcome noise interference effectively.

(2) Existing class-imbalance learning methods can decrease the sensitivity of the SVM to imbalanced classes, but they still suffer from noises and outliers. We propose a new class-imbalance method based on fuzzy and typicality memberships, which can handle the imbalance problem. Since the improved clustering algorithm is robust to noise, this method is not only effective on class-imbalanced datasets but also robust to noises and outliers. Experimental results on artificial and real datasets show that the proposed method is effective for solving the class-imbalance problem, especially for imbalanced datasets containing noises and outliers.

(3) Most research on non-parametric kernel learning (NPKL) has tended to focus on the semi-supervised scenario. In this thesis, we propose a novel unsupervised non-parametric kernel learning method, which seamlessly combines the spectral embedding of unlabeled data and manifold Regularized Least Squares (RLS) to learn non-parametric kernels efficiently. The proposed algorithm enjoys a closed-form solution in each iteration, which can be computed efficiently by the Lanczos sparse eigen-decomposition technique. Meanwhile, it extends naturally to supervised kernel learning.

(4) Compared with traditional computational intelligence techniques such as the support vector machine (SVM), the extreme learning machine (ELM) provides better generalization performance at a much faster learning speed, without tuning model parameters. In order to deal with unlabeled data, we extend the manifold regularization framework and demonstrate the relationship between the extended MR framework and the ELM.
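The weighting idea behind PFSVM in (2) — scaling each training sample's loss by a membership so that noise matters less and the minority class matters more — can be sketched with a Pegasos-style stochastic subgradient solver for a weighted linear SVM. This is a simplified stand-in, not the thesis's PFSVM; the step-size schedule and the weighting rule in the example are assumptions.

```python
import numpy as np

def weighted_linear_svm(X, y, w, lam=0.01, epochs=100, seed=0):
    """Pegasos-style SGD for a linear SVM where each sample's hinge loss
    is scaled by a per-sample weight w[i] (e.g. a fuzzy membership that is
    larger for minority-class points and smaller for suspected noise)."""
    Xa = np.hstack([X, np.ones((len(X), 1))])   # bias folded in as a constant feature
    n, d = Xa.shape
    a = np.zeros(d)
    rng = np.random.default_rng(seed)
    for t in range(1, epochs * n + 1):
        i = rng.integers(n)
        eta = 1.0 / (lam * t)                   # standard Pegasos step size
        margin = y[i] * (Xa[i] @ a)
        a *= 1.0 - eta * lam                    # shrink from the L2 regularizer
        if margin < 1.0:                        # weighted hinge subgradient step
            a += eta * w[i] * y[i] * Xa[i]
    return a[:-1], a[-1]                        # (normal vector, bias)
```

A simple choice is to set w[i] inversely proportional to class frequency, optionally multiplied by a typicality value from the clustering step, so that minority points are emphasized while outliers are suppressed.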
A manifold regularized extreme learning machine (MRELM) is derived from the proposed framework; it maintains the properties of the ELM, especially in solving the problem of large-scale data training.

This thesis mainly studies three aspects: classification methods for class-imbalanced samples, classification methods based on non-parametric kernel optimization, and fast semi-supervised classification. Based on the related preparatory work, several classification and learning models have been established and new algorithms designed. Experiments on benchmark datasets and face datasets validate the effectiveness and efficiency of the proposed algorithms. The results of this thesis will enrich the ways to solve classification problems, and have a certain theoretical significance and good application prospects.
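A minimal sketch in the spirit of MRELM in (4): a random hidden layer (the ELM part) with output weights obtained in closed form from a ridge-regularized least-squares fit on the labeled points plus a graph-Laplacian smoothness penalty over all points. The k-NN graph construction, the tanh activation, and the hyperparameters are assumptions, not the thesis's exact formulation.

```python
import numpy as np

def mr_elm(X, y_onehot, labeled, n_hidden=50, C=10.0, lam=0.1, k=5, seed=0):
    """Semi-supervised ELM sketch: random hidden layer, closed-form output
    weights with a graph-Laplacian smoothness penalty on all samples."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(size=(d, n_hidden))
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)                       # random feature map (ELM hidden layer)
    # symmetric k-NN graph Laplacian over labeled + unlabeled samples
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    A = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(D2[i])[1:k + 1]:     # skip self (distance 0)
            A[i, j] = A[j, i] = 1.0
    L = np.diag(A.sum(axis=1)) - A
    Cd = np.diag(np.where(labeled, C, 0.0))      # penalize errors on labeled points only
    Y = np.where(labeled[:, None], y_onehot, 0.0)
    # closed-form output weights: ridge + labeled fit + manifold smoothness
    beta = np.linalg.solve(np.eye(n_hidden) + H.T @ Cd @ H + lam * H.T @ L @ H,
                           H.T @ Cd @ Y)
    return lambda Xn: np.tanh(Xn @ W + b) @ beta  # class scores; argmax to predict
```

As in the thesis's description, the hidden layer is never trained; only the linear output weights are solved for analytically, so adding the Laplacian term keeps the single closed-form solve.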
【Key words】 Clustering; manifold regularization; kernel optimization; non-parametric kernel learning; extreme learning machine