
SVM文本分类中基于法向量的特征选择算法研究 (Research on a Normal-Vector-Based Feature Selection Algorithm for SVM Text Classification)

Normal Weight Based Feature Selection Method in SVM Text Categorization

【Author】 姜鹤 (Jiang He)

【Supervisor】 陈丽亚 (Chen Liya)

【Author Information】 Shanghai Jiao Tong University, Communication and Information Systems, 2010, Master's thesis

【摘要 (Abstract)】 With the rapid development of the Internet, text classification has become one of the core tasks in organizing online information and a key component of many applications. Compared with other learning algorithms, SVM shows superior performance in text classification. In SVM-based text classification, the dimensionality of the vector space representing the documents is usually very large, so the training process consumes a great deal of system resources. When resources are limited, the documents often cannot be processed directly in their original dimensionality, which makes an effective feature selection algorithm necessary. This thesis introduces a feature selection method based on the weights of the SVM normal vector and applies it to SVM-based Chinese text classification. The method provides an effective way to significantly reduce the dimensionality of the feature space while largely preserving classifier performance, thereby improving the utilization of system resources. The key points of this thesis are as follows. First, to describe the consumption of computing resources during SVM training, the concept of "sparsity" is introduced; sparsity here refers to the average number of non-zero feature terms in the vector representation of each text sample. The sparsity of the document vectors directly determines the resource cost, which includes both the memory occupied by the sparse vectors and the time spent on computation. Second, a feature selection method based on normal-vector weights is introduced: a subset of the training data is selected and used to pre-train an SVM model, the weights of the normal vector are taken as the evaluation measure of the feature terms, and the features are then ranked accordingly. Third, under limited computing resources, text classification performance is studied in two situations: using feature selection to retain only part of the features while keeping as many training documents as possible, versus reducing the number of training documents while keeping as many features as possible. Fourth, for a linear SVM classifier, the impact on classification performance of the normal-vector-based feature selection algorithm is compared with that of the traditional odds-ratio-based and information-gain-based feature selection algorithms. Experiments show that, for a linear SVM classifier, retaining only part of the features through feature selection so that more training documents can be kept yields better classification performance than keeping all features but only part of the training documents, which provides strong support for using feature selection under resource constraints. Comparing the classification performance of the normal-vector-based, odds-ratio-based, and information-gain-based feature selection algorithms further shows that, for a linear SVM classifier, the normal-vector-based algorithm achieves the best classification performance. The normal-vector-based feature selection algorithm can greatly reduce the consumption of computing resources while largely maintaining classifier performance, and thus provides a practical approach to SVM text classification under resource-constrained conditions.
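The "sparsity" measure described in the abstract can be illustrated with a short sketch. This is a minimal example, not code from the thesis: the scikit-learn workflow, TfidfVectorizer, the helper name average_sparsity, and the toy documents are all illustrative assumptions.

```python
# Minimal sketch (not from the thesis): "sparsity" as the average number of
# non-zero feature terms per document vector.
from sklearn.feature_extraction.text import TfidfVectorizer

def average_sparsity(docs):
    """Mean count of non-zero components per document vector (assumed helper)."""
    X = TfidfVectorizer().fit_transform(docs)  # sparse document-term matrix
    return float(X.getnnz(axis=1).mean())      # non-zero entries per row, averaged

docs = ["svm text classification", "feature selection for svm", "text mining"]
print(average_sparsity(docs))  # average number of non-zero terms per document
```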

【Abstract】 With the rapid growth of the Internet, text classification has become one of the key tasks in organizing online information and a key component of many applications. Compared with other learning algorithms, the SVM algorithm performs better in text classification. For text classification based on SVM, there is usually an abundance of training data, which consumes a great deal of computing resources during training, so classifiers cannot always be trained over the full data set when computing resources are limited. In this situation it is important to introduce feature selection methods. This thesis introduces a feature selection method based on the weights of the normal vector of the SVM model and applies it to SVM-based text classification. The method provides an effective way to maintain classification performance while reducing the dimension of the feature space, and thus significantly improves the efficiency of computing-resource usage. The research covers the following points. First, in order to describe the cost of computing resources in the SVM training process, we introduce the concept of "sparsity", defined here as the average number of non-zero components in the vector representation of the data. The sparsity of the vectors directly affects the cost of computing resources, which includes both system memory and computation time. Second, we introduce a feature selection method based on the weights of the normal vector of the SVM model: a linear SVM is first trained on a subset of the training data to create an initial classifier, and the weights of the normal vector are then taken as the measure by which features are ranked. Third, when computing resources are limited, we compare text classification performance in two situations: eliminating part of the features through feature selection in order to retain as much training data as possible, versus eliminating part of the training data in order to retain as many features as possible. Fourth, for a linear SVM classifier, we explore the performance of the normal-weight-based feature selection method by comparing it with two traditional feature selection methods, odds ratio and information gain. Experimental results show that, for a linear SVM classifier, eliminating part of the features through feature selection so as to retain as much training data as possible performs better than eliminating part of the training data to retain as many features as possible, which provides strong evidence for performing feature selection when computing resources are limited. At the same time, compared with the traditional odds-ratio and information-gain methods, the normal-weight-based method yields better classification performance. This feature selection method therefore provides an effective way to maintain classification performance while reducing the dimension of the feature space, and significantly improves the efficiency of computing-resource usage.
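As a rough illustration of the normal-weight-based selection procedure summarized above (pre-train a linear SVM on a subset of the training data, rank features by the absolute weights of the normal vector, then retrain on the reduced feature space), the following sketch uses scikit-learn. LinearSVC, TfidfVectorizer, the helper name select_by_normal_weights, and the toy corpus are assumptions for illustration, not the author's implementation.

```python
# Minimal sketch (assumed workflow, not the thesis code): rank features by the
# absolute weights of the SVM normal vector w and keep only the top-k features.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def select_by_normal_weights(X_subset, y_subset, k):
    """Return indices of the k features with the largest |w| from a pre-trained linear SVM."""
    pre_model = LinearSVC().fit(X_subset, y_subset)   # pre-train on a data subset
    weights = np.abs(pre_model.coef_).max(axis=0)     # one importance score per feature
    return np.argsort(weights)[::-1][:k]              # indices of the top-k features

# Usage: vectorize, pre-train on a subset, then train the final classifier
# on the reduced feature space built from the selected columns.
docs   = ["good svm text", "bad spam text", "good feature selection", "bad spam offer"]
labels = [1, 0, 1, 0]
X = TfidfVectorizer().fit_transform(docs)
top_features = select_by_normal_weights(X[:2], labels[:2], k=3)
final_model = LinearSVC().fit(X[:, top_features], labels)
```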
