

Research on Training Error Minimized Subspace Algorithm for Classification

【Author】 沈道义 (Shen Daoyi)

【Supervisor】 俞能海 (Yu Nenghai)

【Author Information】 University of Science and Technology of China, Signal and Information Processing, 2008, PhD


【Abstract】 As a significant research direction, subspace methods have long attracted wide attention from scholars in the field of pattern recognition. Fisher Linear Discriminant Analysis (FLD or LDA) and related subspace methods play an outstanding role in classification problems. These subspace methods nevertheless have shortcomings. The main one is that traditional subspace methods such as LDA do not relate their feature extraction criteria directly to the training error; the criteria are instead derived from statistical features of the training-data distribution, which is generally assumed to be Gaussian. When those statistics fail to reflect the actual data distribution, the methods are likely to fail, so traditional subspace methods are not competent for problems with complex data distributions. All of the methods proposed in this dissertation address this problem.

In Chapter 3 we first point out that, owing to an inherent flaw of LDA, in multi-class problems it may fail to find the optimal classification subspace even when every class follows a Gaussian homoscedastic distribution. By analyzing the relationship between the data distribution and the projection directions of LDA, we show that the LDA result is tied to the eigenvalues of the between-class and within-class scatter matrices. Based on this observation we propose a modified LDA method built on a genetic algorithm: taking the minimum training classification error on the subspace as its objective, the genetic algorithm adjusts the eigenvalues of the between-class scatter matrix to search for the optimal feature subspace. Experiments on both synthetic and real data show that the proposed method achieves higher classification accuracy than existing linear subspace methods.

The AdaBoost (Adaptive Boosting) algorithm from ensemble learning builds a classifier with minimum training error as its criterion. In Chapter 4 we combine AdaBoost with the LDA subspace method and propose a feature extraction algorithm based on boosting bootstrap LDA projections, which performs feature extraction and combination for two-class problems. AdaBoost is a learning framework that boosts a number of weak hypotheses, each only slightly better than random guessing, into a strong classifier, and it requires the weak hypotheses to be unstable and diverse. We therefore first use the bootstrap sampling principle of the Bagging (Bootstrap Aggregating) algorithm to randomly sample the training data into a number of bootstrap subsets, then combine LDA with a nearest neighbor (NN) classifier to derive a weak hypothesis from each subset, and finally boost these hypotheses into a strong classifier with AdaBoost. The method overcomes the weakness of traditional subspace methods noted above; the resulting classifier has good generalization performance and handles classification problems with complex data distributions well. Experiments on two-class problems with complex distributions demonstrate the feasibility and superiority of the method.

Because multi-class problems, and face recognition in particular, have broader application value, Chapter 5 builds on Chapter 4 and combines the AdaBoost.M2 algorithm with the LDA subspace method to generalize the above approach to multi-class problems, yielding a classification algorithm based on boosting bootstrap LDA subspaces. An improved bootstrap sampling step makes AdaBoost.M2 concentrate more on hard-to-classify samples while preserving the diversity of the weak hypotheses, so that the LDA-based hypotheses are boosted and combined more effectively. In experiments on handwritten digit image recognition and face image recognition we compare the algorithm with traditional subspace methods and other ensemble learning based classifiers; the results show that it matches or surpasses the other methods.
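As an illustration of the Chapter 3 idea, here is a minimal Python sketch, not the thesis implementation: LDA in which the eigenvalues of the between-class scatter matrix are rescaled by a weight vector, with a toy genetic algorithm searching for the weights that minimize training error in the projected subspace. The function names, the nearest-centroid fitness classifier, and the GA operators (truncation selection, Gaussian mutation) are all illustrative assumptions.

```python
import numpy as np

def scatter_matrices(X, y):
    """Within-class (Sw) and between-class (Sb) scatter matrices."""
    d = X.shape[1]
    mean = X.mean(axis=0)
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean).reshape(-1, 1)
        Sb += len(Xc) * (diff @ diff.T)
    return Sw, Sb

def weighted_lda(X, y, weights, dim):
    """LDA projection after rescaling the eigenvalues of Sb by `weights`."""
    Sw, Sb = scatter_matrices(X, y)
    lam, V = np.linalg.eigh(Sb)                    # eigen-decompose Sb
    Sb_w = V @ np.diag(weights * lam) @ V.T        # reweighted between-class scatter
    # Generalized eigenproblem Sb_w v = mu Sw v, via (regularized) Sw^-1 Sb_w.
    evals, evecs = np.linalg.eig(np.linalg.inv(Sw + 1e-6 * np.eye(len(Sw))) @ Sb_w)
    order = np.argsort(evals.real)[::-1]
    return evecs.real[:, order[:dim]]

def training_error(X, y, W):
    """Nearest-centroid error in the projected subspace (the GA fitness)."""
    Z = X @ W
    centroids = {c: Z[y == c].mean(axis=0) for c in np.unique(y)}
    pred = [min(centroids, key=lambda c: np.linalg.norm(z - centroids[c])) for z in Z]
    return float(np.mean(np.array(pred) != y))

def ga_lda(X, y, dim, pop=20, gens=30, seed=0):
    """Search eigenvalue weights that minimize training error; return the projection."""
    rng = np.random.default_rng(seed)
    population = rng.uniform(0.0, 2.0, size=(pop, X.shape[1]))
    for _ in range(gens):
        fitness = [training_error(X, y, weighted_lda(X, y, w, dim)) for w in population]
        elite = population[np.argsort(fitness)[:pop // 2]]         # truncation selection
        children = elite + rng.normal(0.0, 0.1, size=elite.shape)  # Gaussian mutation
        population = np.vstack([elite, np.clip(children, 0.0, None)])
    fitness = [training_error(X, y, weighted_lda(X, y, w, dim)) for w in population]
    return weighted_lda(X, y, population[int(np.argmin(fitness))], dim)
```

The design point mirrors the abstract: the fitness is the training error itself, so the subspace search is driven directly by classification performance rather than by a fixed statistical criterion.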
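The Chapter 4 procedure (bootstrap subsets, LDA projection plus nearest-neighbor weak hypotheses, AdaBoost combination) can be sketched as follows, assuming binary labels coded as -1/+1 and scikit-learn's LDA and 1-NN. The weighted-bootstrap variant and the function names are my assumptions, and the sketch ignores the corner case of a bootstrap subset that contains a single class.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier

def boost_bootstrap_lda(X, y, rounds=10, seed=0):
    """AdaBoost over (bootstrap LDA projection + 1-NN) weak hypotheses; y in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n = len(X)
    w = np.full(n, 1.0 / n)                    # AdaBoost sample weights
    ensemble = []
    for _ in range(rounds):
        # Weighted bootstrap sampling: supplies the instability/diversity the
        # weak hypotheses need, and focuses later rounds on hard samples.
        idx = rng.choice(n, size=n, replace=True, p=w)
        lda = LinearDiscriminantAnalysis(n_components=1).fit(X[idx], y[idx])
        nn = KNeighborsClassifier(n_neighbors=1).fit(lda.transform(X[idx]), y[idx])
        pred = nn.predict(lda.transform(X))    # evaluate on the full training set
        err = w[pred != y].sum()               # weighted training error
        if err >= 0.5:                         # no better than chance: stop boosting
            break
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))
        w *= np.exp(-alpha * y * pred)         # up-weight the mistakes
        w /= w.sum()
        ensemble.append((lda, nn, alpha))
    return ensemble

def predict(ensemble, X):
    """Sign of the alpha-weighted vote of all weak hypotheses."""
    score = sum(alpha * nn.predict(lda.transform(X)) for lda, nn, alpha in ensemble)
    return np.sign(score)
```

Unlike plain Bagging, the weights `w` couple the bootstrap sampling to the training error, which is exactly the link the abstract says traditional subspace criteria lack.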
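For Chapter 5, here is one reading of AdaBoost.M2 over bootstrap LDA subspaces, again a hedged sketch rather than the thesis code: weak hypotheses output class confidences h(x, y) in [0, 1] (a softmax over negative nearest-centroid distances, my own choice), samples are bootstrapped in proportion to their mislabel mass so later subspaces focus on hard-to-classify samples, and the standard M2 pseudo-loss drives the weight updates. For clarity it assumes every bootstrap subset still contains all classes.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def fit_weak(X, y, classes):
    """One weak hypothesis: LDA subspace + soft nearest-centroid confidences."""
    lda = LinearDiscriminantAnalysis().fit(X, y)
    Z = lda.transform(X)
    centroids = np.stack([Z[y == c].mean(axis=0) for c in classes])
    def h(Xq):                                 # h(Xq)[i, k] in [0, 1], rows sum to 1
        D = np.linalg.norm(lda.transform(Xq)[:, None, :] - centroids, axis=2)
        e = np.exp(-D)
        return e / e.sum(axis=1, keepdims=True)
    return h

def adaboost_m2(X, y, rounds=10, seed=0):
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    n, K = len(X), len(classes)
    yi = np.searchsorted(classes, y)           # index of each sample's true class
    # Mislabel distribution over (sample, wrong label) pairs.
    D = np.ones((n, K)) / (n * (K - 1))
    D[np.arange(n), yi] = 0.0
    hyps, betas = [], []
    for _ in range(rounds):
        # Bootstrap in proportion to each sample's total mislabel mass, so the
        # next LDA subspace concentrates on the hard-to-classify samples.
        p = D.sum(axis=1) / D.sum()
        idx = rng.choice(n, size=n, replace=True, p=p)
        h = fit_weak(X[idx], y[idx], classes)
        H = h(X)                               # (n, K) class confidences
        true_conf = H[np.arange(n), yi][:, None]
        eps = 0.5 * np.sum(D * (1.0 - true_conf + H))   # M2 pseudo-loss
        eps = float(np.clip(eps, 1e-12, 1.0 - 1e-12))
        beta = eps / (1.0 - eps)
        D *= beta ** (0.5 * (1.0 + true_conf - H))      # down-weight solved pairs
        D[np.arange(n), yi] = 0.0
        D /= D.sum()
        hyps.append(h); betas.append(beta)
    def predict(Xq):
        score = sum(np.log(1.0 / b) * h(Xq) for h, b in zip(hyps, betas))
        return classes[np.argmax(score, axis=1)]
    return predict
```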

  • 【CLC Number】TP301.6
  • 【Downloads】409