Studies of Some Problems in Support Vector Machines and Semi-supervised Learning
(支持向量机及半监督学习中若干问题的研究)

【Author】 Xue Zhenxia (薛贞霞)

【Supervisor】 Liu Sanyang (刘三阳)

【Author Information】 Xidian University, Applied Mathematics, 2009, Doctoral dissertation

【Abstract】 With the rapid development of information technology, the data that people collect and process grow ever larger in scale and more complex in composition, which has made machine learning attract increasing attention and become one of the hot topics of current research. Statistical learning theory, proposed by Vapnik, provides a theoretical basis for machine learning; it focuses on the statistical laws and learning properties of finite samples and, by using the principle of structural risk minimization, effectively improves the generalization ability of learning algorithms. The support vector machine (SVM), the latest development of statistical learning theory, has advantages such as global optimality, strong adaptability, strong generalization ability, and sparse solutions. It can handle practical difficulties such as small samples, nonlinearity, overfitting, the curse of dimensionality, and local minima, and is another milestone in the field of machine learning; it has therefore been widely applied to pattern recognition, regression estimation, function approximation, density estimation, and other fields. In recent years, inspired by these advantages, researchers have proposed several extensions of SVM, such as the least squares support vector machine, the center support vector machine, the hypersphere support vector machine (also called support vector domain description), and sphere-based pattern classification, which refine and complement SVM from different aspects. In many machine learning problems, only a small part of the large amount of available data can be labeled easily, while the relatively large remainder stays unlabeled for various reasons (labels are hard or expensive to obtain); learning that uses both labeled and unlabeled samples is called semi-supervised learning. This thesis studies several problems in SVM, some of its extensions, and semi-supervised learning. The main work is as follows:

1. We study how to improve the learning speed and accuracy of SVM on large sample sets. Since training on large-scale sample sets is slow and classification accuracy is easily affected by outliers, we propose an SVM algorithm based on hull vectors and center vectors. Its basic steps are: first, compute the hull vectors and the center vector of each class; then train a standard SVM on the hull vectors as the new training set to obtain the normal vector of the hyperplane; finally, update the normal vector with the center vectors to reduce the influence of outliers and obtain the final classifier. Experiments show that this learning strategy not only speeds up training but also improves classification accuracy.

2. We study the classification of imbalanced data sets with two extensions of SVM, the least squares support vector machine and sphere-based pattern classification. For the least squares support vector machine, we adjust the separating hyperplane by taking into account both the number of samples and the degree of dispersion of each class, overcoming the shortcoming of traditional algorithms that consider only the imbalance in sample size and improving generalization ability. For sphere-based pattern classification, we introduce two parameters that control the upper bounds of the two classes' error rates separately, which not only improves classification and prediction performance on imbalanced data sets but also greatly narrows the range of parameter selection. Experiments show that our methods can effectively improve classification performance on imbalanced data.

3. We study transductive learning in semi-supervised learning along two lines. First, to remedy the slow training, frequent backtracking, and unstable performance of the progressive transductive support vector machine proposed by Chen, we propose two improved progressive transductive SVM algorithms. They inherit the progressive labeling and dynamic adjustment rules of the original algorithm, use support vector information or confidence values to select newly labeled unlabeled samples, and combine incremental SVM or support vector pre-extraction to reduce the training cost. Experimental results show that the proposed algorithms run considerably faster and, in general, are more accurate. Second, we propose a transductive learning strategy for sphere-based pattern classification, an extension of SVM: a hypersphere separates the two classes with the maximum separation ratio, and a progressive transductive algorithm builds the hypersphere classifier from both labeled and unlabeled samples. When the labeled samples do not carry enough information, the algorithm exploits the additional information provided by the unlabeled samples and achieves better classification performance. Experimental results show that the algorithm indeed performs better.

4. We study semi-supervised outlier detection when a few labeled samples and a large number of unlabeled samples are available. Outlier detection has always been a difficult problem in machine learning; in many practical problems, such as network intrusion detection, fault diagnosis, and disease diagnosis, the outliers are often the more interesting and more important samples. This thesis applies rough set and fuzzy set theory to semi-supervised outlier detection and proposes a fuzzy rough semi-supervised outlier detection method. With the help of a few labeled samples and the fuzzy rough C-means clustering algorithm, the method minimizes, through one objective function, the squared clustering error, the classification error on labeled samples, and the number of outliers. Each cluster is represented by a center, a crisp lower approximation, and a fuzzy boundary, and only the points located in the boundary are further examined for the possibility of being outliers. Experimental results show that the proposed method generally improves detection precision, reduces the false alarm rate, and reduces the number of candidate outliers that need further examination.

【Abstract】 With the rapid development of information technology, the data sets that people collect and process grow ever larger and more complex. These facts have made machine learning receive more and more attention and become one of the hot research topics. Statistical Learning Theory (SLT), proposed by Vapnik, provides a theoretical basis for machine learning. SLT mainly concerns the statistical laws and learning properties of limited samples and can effectively improve the generalization ability of algorithms by using the principle of Structural Risk Minimization (SRM). As the latest development of SLT, the Support Vector Machine (SVM) has many advantages, such as global optimality, excellent adaptability, strong generalization ability, and sparse solutions. It can handle many practical difficulties such as small samples, nonlinearity, overfitting, the curse of dimensionality, and local minima, and is a new milestone in the field of machine learning. SVM has therefore been widely used in pattern recognition, regression estimation, function approximation, density estimation, etc. Recently, inspired by the above advantages of SVM, researchers have proposed extended algorithms of SVM, including Least Squares Support Vector Machines (LSSVM), the Center Support Vector Machine (CSVM), Hypersphere Support Vector Machines (also called Support Vector Domain Description, SVDD), and Sphere-based Pattern Classification (SSPC). These algorithms improve and complement SVM from different aspects. In many machine learning problems a large amount of data is available, but only a few samples can be labeled easily, while the relatively large remainder cannot be labeled for various reasons (labels are hard or expensive to obtain).
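As an illustration (not taken from the thesis) of the SVM properties listed above — nonlinear kernel classification, a soft margin, and a sparse solution — here is a minimal scikit-learn sketch on synthetic XOR-like data; all data and parameter choices are assumptions for the example:

```python
# Minimal illustration (not from the thesis): an RBF-kernel SVM
# separating a small XOR-like (nonlinear) synthetic data set.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # XOR-style labels: not linearly separable

clf = SVC(kernel="rbf", C=1.0, gamma="scale")  # soft margin controlled by C
clf.fit(X, y)

acc = clf.score(X, y)
print(f"training accuracy: {acc:.2f}")
# Only a subset of the points become support vectors: the sparse solution.
print(f"support vectors: {len(clf.support_)} of {len(X)} points")
```

The ratio of support vectors to training points reflects the sparsity of the solution mentioned in the abstract.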
The problem of combining unlabeled and labeled data to learn the labels of the unlabeled ones is called semi-supervised learning. This thesis focuses on some problems in SVM, several extensions of SVM, and semi-supervised learning. The main work of the thesis is as follows:

1. We study how to improve the learning speed and classification accuracy of SVM on large-scale sample sets. SVM takes a very long time to train when the training set is large, and its classification precision is easily influenced by outliers, so we propose an SVM algorithm based on hull vectors and center vectors. First, we find the convex hull vectors and the center vector of each class. Second, the obtained convex hull vectors are used as the new training samples to train a standard SVM and obtain the normal vector of the hyperplane. Finally, in order to weaken the influence of outliers, we use the center vectors to update the normal vector and obtain the final classifier. Experiments show that this learning strategy not only quickens training but also improves classification accuracy.

2. We study the imbalanced data set classification problem for two variations of SVM, LSSVM and SSPC. For LSSVM on imbalanced data sets, we take both the number of samples and the degree of dispersion of each class into consideration and adjust the separating hyperplane of standard LSSVM. This overcomes the disadvantage of traditional methods, which consider only the imbalance in sample size, and improves the generalization ability of LSSVM. As for SSPC, we introduce two parameters that control the upper bounds of the two classes' error rates separately. As a result, classification and prediction performance on imbalanced data sets can be improved, and the range of parameter selection can be greatly narrowed.
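The hull-vector reduction step of contribution 1 above can be sketched roughly as follows. The abstract does not specify the exact rule for updating the normal vector with the center vectors, so only the reduction step and the standard SVM training on the reduced set are shown; the synthetic data and all parameters are assumptions for illustration:

```python
# Sketch of the data-reduction idea in contribution 1: train a standard
# SVM only on each class's convex-hull vertices instead of the full set.
# (The thesis's center-vector update of the normal vector is NOT shown,
# because the abstract does not give its concrete rule.)
import numpy as np
from scipy.spatial import ConvexHull
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X0 = rng.normal(loc=[-2.0, 0.0], scale=0.8, size=(300, 2))  # class 0
X1 = rng.normal(loc=[2.0, 0.0], scale=0.8, size=(300, 2))   # class 1

def hull_vectors(X):
    """Return the convex-hull vertices of X (the 'hull vectors')."""
    return X[ConvexHull(X).vertices]

H0, H1 = hull_vectors(X0), hull_vectors(X1)
c0, c1 = X0.mean(axis=0), X1.mean(axis=0)  # class center vectors

# Train a standard linear SVM on the much smaller hull-vector set.
Xr = np.vstack([H0, H1])
yr = np.array([0] * len(H0) + [1] * len(H1))
clf = SVC(kernel="linear", C=1.0).fit(Xr, yr)

n_full, n_reduced = len(X0) + len(X1), len(Xr)
print(f"training set reduced from {n_full} to {n_reduced} points")

full_X = np.vstack([X0, X1])
full_y = np.array([0] * 300 + [1] * 300)
full_acc = clf.score(full_X, full_y)
print(f"accuracy on the full set: {full_acc:.2f}")
```

Because a linear SVM's separating hyperplane is determined by boundary points, the hull vertices preserve the information the solver needs while discarding most interior points.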
Experimental results show that the method can effectively enhance classification performance on imbalanced data sets.

3. We study transductive learning in the field of semi-supervised learning along the following two lines. First, the progressive transductive support vector machine (PTSVM) proposed by Chen has obvious deficiencies, such as slow training, many backtracking steps, and unstable learning performance. To overcome these shortcomings, we give two improved progressive transductive support vector machine algorithms. They inherit PTSVM's progressive labeling and dynamic adjustment rules, use the information of support vectors or confidence values to select new unlabeled samples to label, and combine incremental support vector machines or a support vector pre-extraction algorithm to reduce the computational cost. Experimental results show that the proposed algorithms obtain satisfactory learning performance. Second, we propose a transductive learning strategy for an extended algorithm of SVM, SSPC. The proposed algorithm seeks a hypersphere that separates the data with the maximum separation ratio and constructs the classifier from both the labeled and the unlabeled data. The method exploits the additional information of the unlabeled samples and obtains better classification performance when insufficient labeled data are available. Experimental results show that the proposed algorithm can indeed yield better performance.

4. We study semi-supervised outlier detection (SSOD) in the situation where few labeled data and a wealth of unlabeled data are available. Outlier detection has always been a difficult task. In many applications, such as network intrusion detection, fraud detection, and medical diagnosis, the outliers that deviate significantly from the majority of samples are more interesting and useful than the common samples.
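The progressive labeling idea behind contribution 3 above can be sketched as a simple self-training loop: train on the labeled pool, label the unlabeled points the classifier is most confident about, and retrain. The confidence measure (decision-function margin), batch size, and data are assumptions for illustration, not the thesis's concrete selection and adjustment rules:

```python
# Sketch of progressive (transductive-style) labeling with an SVM.
# Assumed rules: confidence = |decision_function|, fixed batch size 20.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X0 = rng.normal([-2.0, 0.0], 1.0, size=(100, 2))
X1 = rng.normal([2.0, 0.0], 1.0, size=(100, 2))
X = np.vstack([X0, X1])
y_true = np.array([0] * 100 + [1] * 100)

# Only a few labeled points per class; the rest are unlabeled.
lab0 = rng.choice(100, size=5, replace=False)
lab1 = 100 + rng.choice(100, size=5, replace=False)
labeled = np.concatenate([lab0, lab1])
unlabeled = np.setdiff1d(np.arange(len(X)), labeled)
XL, yL = X[labeled], y_true[labeled]

while len(unlabeled) > 0:
    clf = SVC(kernel="linear", C=1.0).fit(XL, yL)
    margin = np.abs(clf.decision_function(X[unlabeled]))
    take = np.argsort(margin)[-20:]      # most confident unlabeled points
    idx = unlabeled[take]
    XL = np.vstack([XL, X[idx]])         # adopt the predicted labels
    yL = np.concatenate([yL, clf.predict(X[idx])])
    unlabeled = np.delete(unlabeled, take)

final_acc = clf.score(X, y_true)
print(f"accuracy after progressive labeling: {final_acc:.2f}")
```

A real PTSVM-style algorithm would also dynamically revise (backtrack on) earlier labels; this sketch only illustrates the forward, progressive part of the scheme.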
Fuzzy rough semi-supervised outlier detection (FRSSOD) is proposed, which applies the theory of rough sets and fuzzy sets to SSOD. With the help of a few labeled samples and the fuzzy rough C-means clustering algorithm, the method introduces an objective function that simultaneously minimizes the sum of squared clustering errors, the deviation from the known labeled examples, and the number of outliers. Each cluster is represented by a center, a crisp lower approximation, and a fuzzy boundary, and only the points located in the boundary are further examined for the possibility of being reassigned as outliers. Experimental results show that the proposed method, on average, keeps or improves detection precision, reduces the false alarm rate, and reduces the number of candidate outliers to be further examined.
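A heavily simplified sketch of the lower-approximation/fuzzy-boundary idea in contribution 4: points close to a cluster center form the crisp lower approximation, points in an intermediate band form the fuzzy boundary (the only candidates to examine further), and distant points are flagged. The distance thresholds `t_low` and `t_out` are hypothetical stand-ins for the fuzzy rough C-means machinery, which additionally penalizes misclassified labeled samples and the number of outliers:

```python
# Simplified lower-approximation / fuzzy-boundary sketch (one cluster).
# t_low and t_out are hypothetical quantile thresholds, not the thesis's
# objective-function-driven memberships.
import numpy as np

rng = np.random.default_rng(3)
cluster = rng.normal(0.0, 1.0, size=(200, 2))   # one normal cluster
outliers = rng.uniform(5.0, 8.0, size=(5, 2))   # a few far-away points
X = np.vstack([cluster, outliers])

center = X.mean(axis=0)
d = np.linalg.norm(X - center, axis=1)

t_low, t_out = np.quantile(d, 0.80), np.quantile(d, 0.975)
lower = d < t_low                         # crisp lower approximation
candidates = (d >= t_low) & (d < t_out)   # fuzzy boundary: examine further
flagged = d >= t_out                      # clearly outlying

print(f"{lower.sum()} points in the lower approximation, "
      f"{candidates.sum()} boundary candidates, {flagged.sum()} flagged")
```

The practical payoff mirrors the abstract's claim: only the (small) boundary set needs further examination, so the number of candidate outliers to discuss shrinks sharply.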
