节点文献

基于支持向量机的代价敏感数据挖掘研究与应用

Cost Sensitive Data Mining Based on Support Vector Machines: Theories and Applications

【作者】 郑恩辉

【导师】 李平; 宋执环;

【作者基本信息】 浙江大学 , 控制科学与工程, 2006, 博士

【摘要】 产生于20世纪90年代的数据挖掘(Data Mining,DM)技术是一种基于海量数据获取知识的技术。随着计算机和存储技术的快速发展,人们已经积累了大量的历史数据,迫切需要把这些历史数据转化为可用的知识,因此DM相关内容已得到广泛的研究,并有一些成功的应用。但当面对的挖掘任务涉及不同类型的代价时,大多现有DM算法的直接应用不能很好地完成DM任务,需引入代价敏感数据挖掘(Cost Sensitive DM,CSDM)。对于给定的样本集,常用的DM算法假定每个样本的误分类代价都相等,以泛化精度为学习目标;而CSDM则是考虑到不同样本的误分类代价不相等,以最小化期望代价为学习目标。 支持向量机(Support Vector Machines,SVM)源于统计学习理论(Statistical Learning Theory,SLT),是一种强有力的DM算法。不同于神经网络、决策树等传统算法基于经验风险最小化(Empirical Risk Minimization,ERM)准则,SVM基于结构风险最小化(Structural Risk Minimization,SRM)准则,即同时考虑经验风险和模型复杂度,因而获得良好的泛化性能。但和传统算法一样,SVM不具有代价敏感性,不能直接用于CSDM。 针对CSDM问题,本论文提出一系列基于改进SVM的CSDM算法,并进行应用研究。本论文主要内容如下: 1.基于SVM及其启发,提出并证明了支持向量率(和数)与边界支持向量率(和数)的界,并把这些界分别扩展到正例与反例;提出并证明了正例的支持向量率与边界支持向量率分别依概率大于反例的支持向量率与边界支持向量率;证明了正例的分类性能依概率差于反例的分类性能,即证明SVM算法应用于不平衡数据挖掘时同传统基于精度的算法一样存在“有偏性”。虚拟数据集试验和Benchmark数据集试验表明了假设的合理性和结论的正确性。 2.基于SVM实现SRM原则的启发,在SVM中嵌入拒识代价和误分类代价,提出了SVM-RMC分类器的设计,并基于修改的SMO算法给出了该优化问题的求解方法。在SVM-RMC中,决策函数和拒识区域的确定在训练过程中实现。试验结果表明:相比于SVM,SVM-RMC减少平均测试代价,提高分类可靠性。 3.基于SVM,通过引入概率估计和代价最小化过程,提出了一个基于SVM的CSDM算法CS-SVM,在此基础上提出了一个通用CSDM算法G-CSC。CS-SVM和G-CSC以误分类代价最小作为优化目标,G-CSC中包含的分类算法可以是任意的,只要把分类算法的输出构造成后验概率的形式。基于虚拟和Benchmark数据集的试验结果表明CS-SVM能有效减小平均测试误分类代价。 4.基于K最近邻(KNN)算法,提出了确定噪音代价的方法,并将其引入到SVC和SVR算法,进而提出了嵌入噪音代价的代价敏感SVC(SVC-NC)算法和代价敏感SVR(SVR-NC)算法。基于虚拟和Benchmark数据集的试验结果表明,

【Abstract】 Data mining, emerged during the late 1980’s, has made great strides and is expected to continue to flourish. Data mining is the process of extracting knowledge hidden from large volumes of raw data. There is growing interest in data mining theories and applications in recent years due to the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge. However, the majority of the data mining literature ignores all types of cost (unless accuracy is interpreted as a type of cost measure) involved in so many real-world applications as medical diagnosis and fraud detection fields, so these algorithms without taking all types of cost into account do not perform well, and cost sensitive data mining should be introduced. Cost sensitive data mining is defined as the problem of learning decision model minimizing expected total costs, given a training set. For example, in medical diagnosis, the cost of erroneously diagnosing a patient to be healthy may be much bigger than that of mistakenly diagnosing a healthy person as being sick, because the former kind of error may result in the loss of a life.Support vector machines (SVM) are a new class of data mining algorithms, motivated by the structural risk minimization (SRM) induction principle of statistical learning theory. SVM captures the main insight of statistical learning theory (in order to obtain a small risk (cost) function, one needs to control both training error and model complexity) and shows better generalization ability than data mining algorithms based on empirical risk minimization (ERM) principle. SVM have proven to be effective in many practical applications. However, SVM are not cost sensitive, like traditional algorithms.Some implementation algorithms of cost sensitive data mining based on SVM are proposed in this thesis aiming to develop practical and near-optimal algorithms for learning how to solve cost sensitive data mining tasks. In detail, the major contributions of this dissertation are as following:1. Based on SVM, firstly, the bound of both the SV number (and rate) and BSV number (and rate) is proposed and is proved, further the bounds are extended to positive class and negative class respectively. Secondly, it is presented and testified that the SV rate and BSV rate of positive class is higher than that of negative class. Thirdly, that the positive class yields poorer classification and predictive accuracy than the negative class does is attested. Experimental study based on German credit and Heart disease data sets shows that the hypothesis and conclusion proposed is true and effective.2. A novel cost sensitive data mining method for support vector machine classifiers with reject cost and unbalanced misclassification cost (SVM-RMC) is proposed based on SRM principle. In SVM-RMC, the decision function, rejection region included, can be determined during the training phase of a classifier, by the learning algorithm. To implement SVM-RMC, we develop a novel formulation of the training problem, and a specific algorithm to solve it. Experimental results based on some artificial and benchmark data sets shows that SVM-RMC reduced the total cost and improved the classification reliability.3. A novel general cost sensitive data mining (G-CSDM) algorithm for making an arbitrary classifier cost sensitive is proposed by wrapping probability estimation and a cost minimizing procedure around it, and a particular implementation based on SVM, called CS-SVM, is achieved. Experimental results based on artificial and benchmark data sets shows that CS-SVM reduced the total misclassification cost.4. In order to overcome the overfitting problem caused by noise in training data set, a noise cost model based on k nearest neighbors (KNN) algorithm in feature space is presented and is applied to SVC and SVR algorithms, then SVC algorithm with noise cost (SVC-NC) and SVR algorithm with noise cost (SVR-NC) are proposed. Experimental results show that both SVC-NC and SVR-NC algorithms can largely reduce the effect of noise in training set on learning model, and have better generalization ability.5. Under some restrictions, the functional equivalence between SVM and a kind of FIS is proposed, and further MBFIS-SRM and MBFIS-SRM-MC is devised based on SRM principle. In MBFIS-SRM and MBFIS-SRM-MC, the number of rules and rules base generate automatically by algorithm. Experimental results based a few benchmark data sets show that MBFIS-SRM have better generalization ability and MBFIS-SRM-MC reduced the average test misclassification cost.6. A few data mining process model and data mining software are introduced and general data mining patterns and technologies are reviewed. A novel bidirectional feedback data mining process model (BFDM) is proposed base on the understanding of both data mining process model and metallurgy process industry. A novel data mining system software DMP is implemented in which some data mining models in metallurgy process industry are built to solve real industrial problems.

  • 【网络出版投稿人】 浙江大学
  • 【网络出版年期】2006年 09期
节点文献中: 

本文链接的文献网络图示:

本文的引文网络