
Research on Classification Algorithms for Imbalanced Data Sets Based on Support Vector Machines

【Author】 Hao Shuwen (郝姝雯)

【Advisor】 Zhang Jianpei (张健沛)

【Author information】 Harbin Engineering University, Computer Application Technology, 2011, Master's thesis

【Abstract】 The rapid development of modern computer technology has led to the accumulation of large amounts of data in every area of scientific research and social life, and data mining emerged and developed rapidly as a way to convert these data into useful information and knowledge. One kind of data set, known as an imbalanced data set, contains far more samples of one class than of the other, and the minority class often carries the more important information, so the classification of imbalanced data sets has become a hot topic in data mining research. The support vector machine (SVM) is a classification method built on statistical learning theory. It has a solid theoretical foundation and classifies ordinary data sets better than other classification algorithms, but it does not perform well on imbalanced data sets. This thesis first examines the characteristics of imbalanced data sets and proposes a cluster-based under-sampling method. After analyzing why SVMs fail on imbalanced data sets, it applies the proposed method to under-sample the support vectors of the majority class, removing some majority-class samples to reduce the degree of imbalance between the majority and minority classes. An SVM with a different penalty for each class is then trained on the new sample set to improve classification accuracy. Cost-sensitive learning is one of the currently popular approaches to classifying imbalanced data sets, but the SVM itself is not cost-sensitive and is therefore not suited to cost-sensitive data mining. This thesis proposes a cost-sensitive SVM based on data set decomposition: using posterior probability outputs and a meta-learning process, it reconstructs a new sample set that incorporates the misclassification costs, and trains a cost-sensitive SVM on the reconstructed set so as to minimize the misclassification cost. Simulation experiments were carried out for each algorithm using different evaluation criteria. The results and their analysis show that the two algorithms perform well at improving classification accuracy and at minimizing misclassification cost, respectively.
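The abstract does not spell out how the new sample set that incorporates misclassification costs is built. As a hedged illustration only, the sketch below shows one common way such a step can work: given per-sample posterior probabilities (for example from a probabilistic SVM output), each sample is relabeled with the class that minimizes its expected misclassification cost, in the spirit of MetaCost-style meta-learning. The function names, the cost matrix `COST`, and the posterior values are all assumptions made for this example and are not taken from the thesis.

```python
from typing import List, Sequence


def min_cost_label(posterior: Sequence[float],
                   cost: Sequence[Sequence[float]]) -> int:
    """Return the class j that minimizes sum_i P(i | x) * cost[i][j].

    cost[i][j] is the (assumed) cost of predicting class j when the
    true class is i. Ties are broken in favor of the lower class index.
    """
    n_classes = len(posterior)
    expected = [
        sum(posterior[i] * cost[i][j] for i in range(n_classes))
        for j in range(n_classes)
    ]
    return min(range(n_classes), key=expected.__getitem__)


def relabel(posteriors: Sequence[Sequence[float]],
            cost: Sequence[Sequence[float]]) -> List[int]:
    """Relabel every sample with its minimum-expected-cost class."""
    return [min_cost_label(p, cost) for p in posteriors]


# Hypothetical cost matrix: class 0 = majority, class 1 = minority.
# Missing a minority sample (true 1, predicted 0) costs 5 times as much
# as a false alarm (true 0, predicted 1).
COST = [[0.0, 1.0],
        [5.0, 0.0]]

# Hypothetical posterior estimates [P(0 | x), P(1 | x)] for three samples.
posteriors = [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8]]

new_labels = relabel(posteriors, COST)
print(new_labels)
```

With these made-up numbers the second sample is relabeled as minority even though its posterior favors the majority class, because the expected cost of predicting class 0 (0.3 × 5 = 1.5) exceeds that of predicting class 1 (0.7 × 1 = 0.7). A classifier trained on the relabeled set then reflects the cost structure in its decision boundary.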

【Key words】 data mining; imbalanced data set; SVM; cost-sensitive