支持向量机模型选择研究

Research on Model Selection for Support Vector Machine

【Author】 汪廷华

【Supervisor】 田盛丰

【Author Information】 Beijing Jiaotong University, Computer Software and Theory, 2010, PhD

【Abstract (Chinese)】 Statistical learning theory (Statistical Learning Theory, STL) provides a fairly complete theoretical framework for the systematic study of machine learning problems with finite samples. The support vector machine (Support Vector Machine, SVM) is a new machine learning method developed within this framework; it deals well with small-sample, nonlinear, overfitting, curse-of-dimensionality, and local-minimum problems, and it has strong generalization ability. SVM is now widely applied in pattern recognition, regression estimation, probability density estimation, and many other fields. Moreover, the advent of SVM has driven the rapid development of kernel-based learning methods (Kernel-based Learning Methods), which allow researchers to analyze nonlinear relationships with an efficiency that previously only linear algorithms could achieve. Kernel methods, with SVM as their main representative, are currently one of the focal research topics in machine learning.

It is well known that the performance of SVM depends mainly on two factors: (i) the choice of the kernel function, and (ii) the choice of the penalty coefficient (regularization parameter) C. For a specific problem, determining the kernel function and the penalty coefficient of the SVM is the so-called model selection problem. Model selection, and kernel selection in particular, is one of the central topics of SVM research. This dissertation studies the model selection problem, especially kernel selection, in considerable depth. The main work and contributions are as follows:

1. The relevant theory and algorithms of statistical learning theory, kernel feature spaces, and support vector machines are systematically summarized. This material is the foundation of the dissertation; the presentation strives to be concise without losing completeness or systematic structure, and the author's own insights from studying this material are woven into the exposition.

2. The basic semantics of the SVM parameters are studied. It is pointed out that the influence of different features and different samples in a data set on the classification result can be characterized by the kernel parameters and the penalty coefficient respectively, so the assessment of sample importance and feature importance can be reduced to the SVM model selection problem. Building on an analysis of sample-weighted SVM models (for example, the fuzzy SVM), a feature-weighted SVM model, FWSVM, is proposed. FWSVM is essentially the combination of SVM with feature weighting: the feature weights are introduced into the construction of the kernel function, so the influence of feature weighting on SVM classification performance can be studied from the perspective of the kernel. Both theoretical analysis and numerical experiments show that FWSVM generalizes better than the standard SVM (a minimal code sketch of such a feature-weighted kernel follows this abstract).

3. After a systematic summary of the common methods for SVM model selection, especially kernel parameter selection (for example, cross-validation, minimizing the LOO error or its upper bounds, and optimizing kernel evaluation measures), the geometric meaning of kernel polarization is investigated further. It is pointed out that a high kernel polarization value means that same-class data points move closer together while different-class data points move farther apart. A parameter selection algorithm for general Gaussian kernels based on optimizing kernel polarization, KPG, is proposed. Compared with the optimized standard Gaussian kernel, an SVM using a general Gaussian kernel optimized by KPG has better generalization ability. In addition, a variant of KPG, the KPFS algorithm, is proposed, and experiments provide preliminary evidence of the effectiveness of KPFS for SVM feature selection.

4. Inspired by the local Fisher discriminant analysis algorithm, kernel evaluation measures in the presence of local structure information are discussed in depth. It is pointed out that the commonly used kernel evaluation measures all ignore the influence of the local structure of same-class data on classification performance, and such "global" evaluation measures may limit the degrees of freedom available for enhancing data separability. To address this shortcoming, a "localized" kernel evaluation measure, local kernel polarization, is proposed. By introducing affinity coefficients, local kernel polarization preserves to some extent the local structure of same-class data and can further enhance the separability between different-class data. The effectiveness of this measure is fully validated by experiments on UCI data sets.
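The dissertation presents FWSVM analytically; as a concrete illustration, the following minimal Python sketch shows one standard way to build feature weights into a Gaussian kernel, K(x, z) = exp(-gamma * sum_k w_k^2 (x_k - z_k)^2). The weight values, the toy data, and the use of scikit-learn's SVC are illustrative assumptions rather than the dissertation's experimental setup.

```python
import numpy as np
from sklearn.svm import SVC

def feature_weighted_rbf(weights, gamma=1.0):
    """Return K(X, Z) = exp(-gamma * sum_k w_k^2 (x_k - z_k)^2).

    All weights equal to 1 recovers the standard Gaussian (RBF) kernel;
    a weight near 0 effectively removes the corresponding feature."""
    w2 = np.asarray(weights, dtype=float) ** 2

    def kernel(X, Z):
        X, Z = np.asarray(X, dtype=float), np.asarray(Z, dtype=float)
        # Weighted squared distances between all row pairs of X and Z.
        d2 = (((X[:, None, :] - Z[None, :, :]) ** 2) * w2).sum(axis=-1)
        return np.exp(-gamma * d2)

    return kernel

# Illustrative usage: in FWSVM the weights would come from some measure of
# feature importance; the values used here are placeholders.
X = np.random.randn(40, 3)
y = np.where(X[:, 0] + 0.1 * np.random.randn(40) > 0, 1, -1)
clf = SVC(kernel=feature_weighted_rbf([1.0, 0.2, 0.2]), C=1.0).fit(X, y)
print(clf.score(X, y))
```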

【Abstract (English)】 The main goal of statistical learning theory (STL) is to provide a comparatively complete theoretical basis for studying machine learning problems with finite training examples. The support vector machine (SVM) is a new learning algorithm introduced within the framework of STL. Compared with traditional learning algorithms, SVM can overcome problems such as small samples, nonlinearity, overfitting, the curse of dimensionality, and local minima, and it generalizes well to unseen data. Nowadays, SVM has been applied successfully to a wide range of data analysis problems, such as pattern recognition, regression estimation, and probability density estimation. Furthermore, SVM has driven the growing popularity of kernel-based learning methods, which can analyze nonlinear relationships with an efficiency previously available only to linear algorithms. Currently, SVM and other kernel methods are among the research focuses of the machine learning community.

It is well known that the performance of SVM depends mostly on the selection of the kernel function and the penalty coefficient (regularization parameter) C. Given a specific problem, choosing the kernel function and regularization parameter is known as the model selection issue. Model selection, especially kernel selection, is one of the central interests in SVM research. In this work we concentrate on model selection, especially kernel selection, for SVM and explore several aspects of this issue in depth. The main contents and contributions of this dissertation are as follows:

1. We systematically summarize statistical learning theory, kernel feature spaces, and SVM, which form the basis of this work. We present these topics concisely while striving to preserve completeness and systematic structure, and we add some of our own understanding along the way.

2. We explore the semantic interpretation of the SVM parameters and point out that the influence of different features and samples on the classification results can be measured by the kernel parameters and the regularization parameter respectively; hence the investigation of feature and sample importance for SVM can be reduced to a model selection issue. Based on an analysis of sample-weighted SVM models (such as the fuzzy SVM), a new model, the feature-weighted SVM (FWSVM for short), is proposed. FWSVM is in essence the combination of feature weighting and SVM: we introduce the feature weighting into the construction of the kernel function, so the influence of feature weighting on SVM classification performance can be analyzed from the perspective of the kernel. Theoretical analysis and experimental results show that FWSVM has better generalization ability than the standard SVM.

3. We first systematically summarize the commonly used model selection (especially kernel parameter selection) methods, such as the cross-validation technique, minimizing the LOO error or its upper bounds, and optimizing a kernel evaluation measure. We then investigate the geometric significance of kernel polarization and point out that a high kernel polarization value means that within-class data pairs are kept close while between-class data pairs are kept apart. Subsequently, we propose an algorithm for learning general Gaussian kernels by optimizing kernel polarization, namely the kernel polarization-based gradient ascent algorithm (KPG for short). Compared with the optimized standard Gaussian kernel, a general Gaussian kernel adapted by KPG yields better SVM generalization performance. Additionally, we propose a variant of KPG for SVM feature selection, KPFS, whose effectiveness is demonstrated preliminarily on some UCI machine learning benchmarks. (A sketch of polarization-driven kernel learning follows this abstract.)

4. Enlightened by local Fisher discriminant analysis (LFDA), we explore the design of kernel evaluation measures in the case of multimodality, where samples of the same class form several separate clusters, i.e., the data of each class has local structure. We point out that the commonly used kernel evaluation measures all neglect the influence of this local structure on classification performance, and that the 'globality' of these measures may leave fewer degrees of freedom for increasing separability. To overcome this disadvantage, we propose a 'localized' kernel evaluation measure, local kernel polarization. By introducing affinity coefficients between data pairs, local kernel polarization preserves to some extent the local structure of same-class data and can further increase the separability of between-class data points. Local kernel polarization is demonstrated on some UCI machine learning benchmarks. (A hedged sketch of one possible localized measure also follows.)
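The abstract describes KPG only at a high level. The following Python sketch illustrates the underlying idea using common definitions from the kernel-polarization literature: the polarization of a Gram matrix is P = sum_{i,j} y_i y_j K(x_i, x_j) with labels in {-1, +1}, and the general Gaussian kernel K_ij = exp(-sum_k theta_k (x_ik - x_jk)^2) carries one nonnegative scale theta_k per feature, adapted by gradient ascent on P. The initialization, step size, normalization, and iteration count are illustrative assumptions, not the dissertation's settings.

```python
import numpy as np

def polarization_and_grad(theta, X, y):
    """Kernel polarization P = sum_{i,j} y_i y_j K_ij for the general
    Gaussian kernel K_ij = exp(-sum_k theta_k (x_ik - x_jk)^2), plus its
    gradient with respect to the per-feature scales theta."""
    D = (X[:, None, :] - X[None, :, :]) ** 2        # (n, n, d) squared diffs
    K = np.exp(-(D * theta).sum(axis=-1))           # (n, n) Gram matrix
    Y = np.outer(y, y)                              # pair labels y_i * y_j
    P = (Y * K).sum()
    grad = -((Y * K)[:, :, None] * D).sum(axis=(0, 1))  # dP/d(theta_k)
    return P, grad

def kpg(X, y, n_steps=200, lr=0.05):
    """Gradient-ascent sketch in the spirit of KPG: adapt the scales of a
    general Gaussian kernel by maximizing kernel polarization."""
    theta = np.full(X.shape[1], 0.5)
    for _ in range(n_steps):
        _, g = polarization_and_grad(theta, X, y)
        g /= np.linalg.norm(g) + 1e-12              # normalize for stability
        theta = np.clip(theta + lr * g, 0.0, None)  # keep scales nonnegative
    return theta

X = np.random.randn(60, 4)
y = np.where(X[:, 0] > 0, 1.0, -1.0)  # only feature 0 is informative
print(kpg(X, y))  # informative features tend to end up with larger scales
```

A scale theta_k driven toward zero effectively removes feature k from the kernel, which suggests how a polarization-based criterion can also rank features; the exact procedure of the KPFS variant is not specified in the abstract.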

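The abstract likewise does not give the exact form of local kernel polarization. Below is a hedged rendering patterned after LFDA-style affinity weighting: within-class pairs are weighted by a heat-kernel affinity A_ij, so only nearby same-class points are pulled together and separate same-class clusters are not forced to merge, while between-class pairs keep weight -1 as in plain polarization. The affinity definition and the bandwidth sigma are assumptions.

```python
import numpy as np

def local_kernel_polarization(K, X, y, sigma=1.0):
    """Hedged sketch of a 'localized' polarization: weight within-class
    pairs by a heat-kernel affinity so that the local (possibly multimodal)
    structure of each class is respected; between-class pairs keep -1."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    affinity = np.exp(-d2 / (2.0 * sigma ** 2))   # A_ij in (0, 1]
    same = y[:, None] == y[None, :]               # within-class indicator
    W = np.where(same, affinity, -1.0)            # localized pair weights
    return (W * K).sum()

# Example: score a standard RBF Gram matrix under the localized measure.
X = np.random.randn(30, 2)
y = np.where(X[:, 0] > 0, 1, -1)
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
print(local_kernel_polarization(K, X, y))
```

In the KPG sketch above, substituting this weight matrix W for the plain label matrix np.outer(y, y) would localize the learning objective in the same spirit.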