基于支持向量机的数据挖掘应用研究

The Research of Data Mining Based on Support Vector Machine

【Author】 Wang Congsheng (王从胜)

【Advisor】 Wang Shitong (王士同)

【Author information】 Jiangnan University, Computer Software and Theory, 2008, Master's thesis

【Abstract】 Data mining is the process of rapidly extracting novel and useful knowledge from large volumes of complex data. The support vector machine (SVM) is a relatively new data mining technique that uses optimization methods to solve machine learning problems. It is a general-purpose learning machine built on statistical learning theory, with the advantages of a globally optimal solution, a simple structure, and strong generalization ability. The traditional SVM is a supervised learning algorithm: it requires the class labels of the training samples to be known. In practical applications, however, often only a few labeled samples are available while most samples are unlabeled, and the traditional SVM cannot handle such problems. To address this, T. Joachims proposed the transductive learning method TSVM (Transductive Support Vector Machine). Chen Yi-song et al. improved TSVM and proposed PTSVM (Progressive Transductive Support Vector Machine). This thesis improves PTSVM further and proposes SDSVM (Separation Degree Support Vector Machine), a semi-supervised classification algorithm that combines separation degree with the support vector machine. The algorithm uses the sample separation degree from the Fisher criterion as its metric and the Fisher criterion function as its evaluation function. By the end of training it tries to find a separating hyperplane for which samples of the same class lie as close together as possible and samples of different classes lie as far apart as possible. This reduces the amount of training and its time complexity while improving test accuracy. The simple SVM handles only binary classification and cannot deal with multiclass problems directly.
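
The Fisher criterion mentioned above can be illustrated with a minimal sketch. It assumes the textbook two-class form, J(w) = (m_a - m_b)^2 / (s_a + s_b), where m is the mean and s the scatter of each class after projection onto a direction w. The abstract does not say how SDSVM plugs this value into training, so the function names `project` and `fisher_criterion` are hypothetical and this is not the thesis's implementation.

```python
from typing import Sequence

Vector = Sequence[float]


def project(w: Vector, x: Vector) -> float:
    """Project sample x onto direction w (plain dot product)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def fisher_criterion(w: Vector,
                     class_a: Sequence[Vector],
                     class_b: Sequence[Vector]) -> float:
    """Textbook two-class Fisher criterion along direction w.

    Larger values mean the projected class means are far apart
    relative to the scatter inside each class.
    """
    proj_a = [project(w, x) for x in class_a]
    proj_b = [project(w, x) for x in class_b]
    mean_a = sum(proj_a) / len(proj_a)
    mean_b = sum(proj_b) / len(proj_b)
    # Within-class scatter: sum of squared deviations from the class mean.
    scatter_a = sum((p - mean_a) ** 2 for p in proj_a)
    scatter_b = sum((p - mean_b) ** 2 for p in proj_b)
    within = scatter_a + scatter_b
    if within == 0:
        raise ValueError("within-class scatter is zero along this direction")
    return (mean_a - mean_b) ** 2 / within


# Toy data (made up for illustration only).
CLASS_A = [(0.0, 0.0), (2.0, 0.0)]
CLASS_B = [(4.0, 1.0), (6.0, 1.0)]
```

On the toy data, a direction that separates the two classes more cleanly scores higher, for example `fisher_criterion((1.0, 1.0), CLASS_A, CLASS_B)` exceeds `fisher_criterion((1.0, 0.0), CLASS_A, CLASS_B)`.
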
In the real world, most data are multiclass, so the traditional SVM must be extended to handle multiclass problems. This thesis introduces several SVM algorithms for multiclass classification, including one-against-rest, one-against-one, the directed acyclic graph SVM (DAGSVM), and the decision-tree-based SVM, and compares their advantages and disadvantages. By analyzing the shortcomings of SDSVM, we improve it further and successfully combine it with multiclass SVM algorithms. The experimental results show that SDSVM performs better than PTSVM on semi-supervised multiclass classification problems.
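
The one-against-rest and one-against-one decompositions named above can be sketched as follows. For k classes, one-against-rest builds k binary problems and one-against-one builds k(k-1)/2, with the final label chosen by majority vote. All names here are hypothetical, and the nearest-centroid rule is only a stand-in for a trained binary SVM, not the thesis's classifier.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Hashable, List, Sequence, Tuple

Label = Hashable


def one_against_rest_tasks(classes: Sequence[Label]) -> List[Tuple[Label, List[Label]]]:
    """One binary task per class: that class versus all the others."""
    return [(c, [o for o in classes if o != c]) for c in classes]


def one_against_one_tasks(classes: Sequence[Label]) -> List[Tuple[Label, Label]]:
    """One binary task per unordered pair of classes."""
    return list(combinations(classes, 2))


def predict_one_against_one(x: float,
                            classes: Sequence[Label],
                            binary_predict: Callable[[Label, Label, float], Label]) -> Label:
    """Run every pairwise classifier and return the majority winner.

    Ties go to the label that collected its votes first.
    """
    winners = [binary_predict(a, b, x) for a, b in one_against_one_tasks(classes)]
    return Counter(winners).most_common(1)[0][0]


# Stand-in binary classifier (assumption): nearest centroid on 1-D data.
CENTROIDS = {"a": 0.0, "b": 5.0, "c": 10.0}


def nearest_centroid(a: Label, b: Label, x: float) -> Label:
    """Pick whichever of the two classes has the closer centroid."""
    return a if abs(x - CENTROIDS[a]) <= abs(x - CENTROIDS[b]) else b


predicted = predict_one_against_one(4.0, list(CENTROIDS), nearest_centroid)
```

In a real setting `binary_predict` would wrap one trained binary SVM per class pair. The trade-off is that one-against-one trains more classifiers, each on a smaller subset of the data.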

  • 【Online publication contributor】 Jiangnan University
  • 【Online publication issue】 2009, Issue 03