Imbalanced Data Learning Based on Kernel Methods

【Author】 Lin Zhiyong

【Supervisor】 Hao Zhifeng

【Author Information】 South China University of Technology, Computer Application Technology, 2009, Ph.D.

【Abstract】 Imbalanced data learning (IDL), which has attracted wide attention in recent years, is a special kind of supervised (classification) learning. Its main goal is to handle classification problems in which the training examples are unevenly distributed across classes, i.e., the so-called class imbalance problems (CIPs). CIPs arise in many important real-world domains, including medical diagnosis and intrusion detection. Most existing learning algorithms are designed under the assumptions of a balanced class distribution and accuracy maximization; when applied to CIPs, they tend to "over-learn" the majority class and thereby degrade the overall performance of the trained classifier. Objectively speaking, CIPs have posed an enormous challenge to the machine learning community. Focusing on how to handle CIPs reasonably and effectively, this dissertation carries out the following series of studies based on the newly developed kernel methods, especially the support vector machine (SVM):

(1) A basic issue of IDL: how to evaluate classifier performance reasonably. We first systematically summarize and analyze a set of commonly used evaluation metrics and explore, from a theoretical standpoint, why the traditional accuracy metric is unsuitable for IDL. Then, using a meta-learning method, we experimentally study the performance differences between SVM classifiers optimized under different metrics. The results show that although SVM is a state-of-the-art learning method, classifiers selected under the accuracy criterion are still readily class-biased, tending to predict examples as majority-class; optimizing under other, more reasonable metrics instead yields "bias-rectified" SVM classifiers with better overall performance (a toy illustration appears after part (2) below). These results not only expose the distinctions among evaluation metrics but also offer useful guidance for SVM model selection.

(2) How to apply several extended SVMs to CIPs by weighting the training examples asymmetrically. Extended SVMs such as the least squares SVM and the proximal SVM are used as widely as the standard SVM because they are easy to solve and perform well. Applying them directly to IDL, however, rarely yields satisfactory results, and weighting the training examples asymmetrically is one of the simplest and most practical ways to improve their ability to handle CIPs. To overcome the deficiencies of some existing weighting methods, a new weighting strategy is proposed: it assigns more weight to minority-class examples than to majority-class examples, while also reducing, as far as possible, the weights of abnormal examples. Different weighting strategies can easily be combined with different extended SVMs; using 15 benchmark datasets, we experimentally compare the various SVM-strategy combinations. The results show that the new strategy has a fairly clear performance advantage in some cases.
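To make part (1)'s point concrete, here is a toy comparison, my own illustration rather than the dissertation's experiment, with the G-mean assumed as a representative "more reasonable" metric: on a roughly 1:99 test set, a classifier that always predicts the majority class earns 99% accuracy while failing completely on the minority class, and only the G-mean exposes this.

```python
# Toy illustration (not from the dissertation): accuracy vs. G-mean on
# imbalanced data, where a "majority-only" classifier looks good under
# accuracy but useless under G-mean.

def accuracy(tp, fn, tn, fp):
    return (tp + tn) / (tp + fn + tn + fp)

def g_mean(tp, fn, tn, fp):
    # Geometric mean of minority recall (sensitivity) and majority recall
    # (specificity); zero whenever one class is entirely missed.
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return (sens * spec) ** 0.5

# Test set: 10 minority examples vs. 990 majority examples.
# Classifier A always predicts "majority": tp=0, fn=10, tn=990, fp=0.
print(accuracy(0, 10, 990, 0))  # 0.99  -> looks excellent
print(g_mean(0, 10, 990, 0))    # 0.0   -> total failure on the minority class

# Classifier B trades some majority accuracy for minority recall.
print(accuracy(8, 2, 940, 50))  # 0.948 -> "worse" than A by accuracy
print(g_mean(8, 2, 940, 50))    # ~0.87 -> far better balanced performance
```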
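Likewise, a minimal sketch of part (2)'s asymmetric-weighting idea, assuming scikit-learn and a standard SVC in place of the extended SVMs; the inverse-class-frequency weights and the neighbourhood-based outlier down-weighting are illustrative stand-ins, not the dissertation's actual formula:

```python
# Sketch of asymmetric example weighting for an SVM (assumptions noted above).
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import SVC

def asymmetric_weights(X, y, k=5, outlier_discount=0.5):
    # More weight for the rarer class: inverse class frequency.
    class_freq = {c: np.mean(y == c) for c in np.unique(y)}
    w = np.array([1.0 / class_freq[c] for c in y])
    # Down-weight suspected abnormal examples: those whose k nearest
    # neighbours mostly carry the opposite label.
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    for i in range(len(y)):
        if np.mean(y[idx[i, 1:]] != y[i]) > 0.5:  # idx[i, 0] is the point itself
            w[i] *= outlier_discount
    return w

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (95, 2)), rng.normal(2, 1, (5, 2))])
y = np.array([0] * 95 + [1] * 5)  # 95 majority vs. 5 minority examples
clf = SVC(kernel="rbf").fit(X, y, sample_weight=asymmetric_weights(X, y))
```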
(3) Inspired by the margin-maximization and structural-risk-control training principles of the standard SVM, we propose a new model for training large-margin kernel classifiers; this is one of the main innovations of this dissertation. The proposed model has an intuitive geometric meaning and, more importantly, emphasizes optimizing the classifier's generalization capacity. The original optimization problem is non-convex and hard to solve, but after appropriate relaxation it can be transformed into two different, easily solved second-order cone programming (SOCP) formulations. With the help of SeDuMi, a freely available optimization toolbox, numerical experiments are conducted on 12 benchmark datasets. The results demonstrate that both SOCP models enjoy a certain performance advantage over the standard SVM on balanced as well as imbalanced datasets, and that one of them is also notably more stable (the standard SVM primal that this model builds on is recalled after part (4) below).

(4) Since under-sampling may discard useful information in the training examples, we propose combining it with ensemble learning to enhance the efficacy of SVM on CIPs. Bagging and AdaBoost are used as the ensemble frameworks into which under-sampling is integrated. To overcome the deficiencies of some existing ensemble algorithms, two new ones, the "Clustering Based Asymmetric Bagging Ensemble" (CABagE) and the "Modified Asymmetric AdaBoost Ensemble" (MAAdaBE), are proposed; this is another main innovation of this dissertation. Experiments on 20 benchmark datasets show that, compared with a traditional single SVM classifier, ensembles of SVMs significantly improve the prediction of the minority class and usually achieve better overall performance. Compared with existing ensemble algorithms, both CABagE and MAAdaBE build SVM ensembles with higher minority-class prediction accuracy. Furthermore, a comparison across multiple evaluation metrics shows that MAAdaBE has the best overall performance, which can be attributed to an effective example-weight smoothing mechanism embedded in it.
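Returning to part (3): for reference, the baseline it builds on is the well-known soft-margin primal of the standard SVM below, whose (1/2)||w||^2 term realizes margin maximization and whose parameter C controls the trade-off between margin and empirical error; the dissertation's non-convex model and its two SOCP relaxations are not spelled out in the abstract and are not reproduced here.

```latex
% Standard soft-margin SVM primal (the baseline generalized by part (3)).
\[
\begin{aligned}
\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}} \quad
  & \tfrac{1}{2}\lVert\mathbf{w}\rVert^{2} + C\sum_{i=1}^{n}\xi_{i} \\
\text{s.t.} \quad
  & y_{i}\bigl(\mathbf{w}^{\top}\phi(\mathbf{x}_{i}) + b\bigr) \ge 1 - \xi_{i},
    \qquad \xi_{i} \ge 0,\quad i = 1,\dots,n.
\end{aligned}
\]
```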
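Finally, a minimal sketch of part (4)'s under-sampling-plus-ensemble idea, again assuming scikit-learn; this is generic asymmetric Bagging, not CABagE or MAAdaBE themselves, whose clustering and example-weight-smoothing mechanisms go beyond what the abstract specifies:

```python
# Sketch: asymmetric Bagging of SVMs, keeping all minority examples and
# under-sampling the majority class in every bag (assumptions noted above).
import numpy as np
from sklearn.svm import SVC

def asymmetric_bagging_fit(X, y, n_estimators=11, seed=0):
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    models = []
    for _ in range(n_estimators):
        # Each bag: every minority example plus an equal-sized random subset
        # of the majority, so no single model trains on a skewed bag while
        # the ensemble as a whole still covers most majority examples.
        sub = rng.choice(majority, size=len(minority), replace=False)
        idx = np.concatenate([minority, sub])
        models.append(SVC(kernel="rbf").fit(X[idx], y[idx]))
    return models

def majority_vote(models, X):
    # Aggregate the base learners' 0/1 predictions by simple majority vote.
    votes = np.stack([m.predict(X) for m in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)

# Usage: preds = majority_vote(asymmetric_bagging_fit(X, y), X)
```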
