The Research and Application of Multi-Category Text Classification Algorithm Based on SVM (基于SVM的多类文本分类算法及其应用研究)
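【Author】 Cheng Yanjie (成艳洁);
【Advisor】 Wang Jianren (王建仁);
【Author information】 Xi'an University of Technology (西安理工大学), Management Science and Engineering, 2009, Master's thesis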
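【Abstract (translated from Chinese)】 With the rapid development of communication and computer technology, and of the Internet in particular, information of every kind is growing geometrically, and text, the traditional information carrier, is no exception. To obtain useful knowledge and information from massive text collections promptly and accurately, text representation and automatic text classification have attracted wide attention. The support vector machine (SVM), a machine learning method based on statistical learning theory, handles practical problems such as nonlinearity, high dimensionality, and local minima well, and is a new research focus in machine learning. Text classification is the core technology of content-based automatic information management. Text vectors are highly sparse and high-dimensional, and their features are strongly correlated; SVMs are insensitive to feature correlation and sparsity and have a clear advantage on high-dimensional problems. SVMs are therefore well suited to text classification, have great application potential there, and are a current research focus. This thesis studies in depth the problems SVMs face in practical applications such as text classification. The main work is as follows. First, the thesis analyzes the overall model of text classification, including information preprocessing, feature representation, and feature extraction, with emphasis on feature representation and extraction techniques and on text classification algorithms. SVMs were proposed for two-class problems, and how to extend them effectively to multi-class classification is still not fully solved. The thesis analyzes the shortcomings of existing multi-class methods and then introduces the half-against-half (HAH) classification algorithm. On this basis, and according to the characteristics of tree-structured SVMs, it proposes an SVM-based half-against-half multi-class classification method. The method designs the tree structure of the tree-structured SVM and overcomes its error accumulation problem. Experiments show that, compared with other SVM multi-class methods, it achieves higher classification accuracy and training speed, improving SVM performance on multi-class problems. Second, the thesis studies the main content of statistical learning theory and the basic principles of the SVM algorithm, discusses kernel functions, and reviews the state of SVM research and applications and the problems they face. Combining the semantic concept space, it proposes an HAH multi-class classification algorithm based on SVMs and the semantic concept space. Experiments show that this algorithm not only improves classification accuracy but also greatly reduces the amount of labeled data required. Finally, building on the advantages of SVMs in text classification, the thesis applies SVMs to feature extraction and proposes an SVM-based word clustering method. The method uses an SVM to measure how much each word contributes to classification and merges words with consistent contributions into a single feature of the text vector. Experiments show that, with essentially no loss of classification information, the method substantially reduces the dimensionality of text vectors and the correlation among text features, and improves the precision and recall of text classification.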
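The half-against-half idea named in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's method: the abstract does not say how classes are grouped at each node, so the split used here (sorting class centroids along the axis with the widest spread and cutting the list in half) is an assumption, as are the function names `train_linear_svm`, `build_hah`, and `predict` and the small fixed-step hinge-loss trainer that stands in for a real SVM solver.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def train_linear_svm(X: List[Vector], y: List[int], lam: float = 0.001,
                     eta: float = 0.1, epochs: int = 50) -> Tuple[List[float], float]:
    """Fixed-step subgradient descent on the L2-regularised hinge loss.

    Labels in y must be +1 or -1. Stand-in for a real SVM solver.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            w = [(1.0 - eta * lam) * wj for wj in w]  # regularisation shrink
            if margin < 1.0:  # hinge loss is active for this sample
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
    return w, b


def centroid(points: List[Vector]) -> List[float]:
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]


def build_hah(X: List[Vector], labels: List[str], classes=None):
    """Return a leaf (a class label) or a node (w, b, left_subtree, right_subtree)."""
    if classes is None:
        classes = sorted(set(labels))
    if len(classes) == 1:
        return classes[0]
    cents = {c: centroid([x for x, l in zip(X, labels) if l == c]) for c in classes}
    # Assumed grouping rule: split along the axis where centroids spread most.
    axis = max(range(len(X[0])),
               key=lambda j: max(cents[c][j] for c in classes)
               - min(cents[c][j] for c in classes))
    ordered = sorted(classes, key=lambda c: cents[c][axis])
    half = len(ordered) // 2
    left, right = ordered[:half], ordered[half:]
    # One binary problem per node: the left half of the classes vs the right half.
    Xn = [x for x, l in zip(X, labels) if l in classes]
    yn = [1 if l in left else -1 for l in labels if l in classes]
    w, b = train_linear_svm(Xn, yn)
    return (w, b, build_hah(X, labels, left), build_hah(X, labels, right))


def predict(tree, x: Vector) -> str:
    """Walk from the root to a leaf; each node costs one decision-function call."""
    while isinstance(tree, tuple):
        w, b, left, right = tree
        score = sum(wj * xj for wj, xj in zip(w, x)) + b
        tree = left if score >= 0 else right
    return tree


# Toy data: four well-separated classes in two dimensions.
X = [(0, 0), (1, 1), (0, 5), (1, 6), (10, 0), (11, 1), (10, 5), (11, 6)]
labels = ["A", "A", "B", "B", "C", "C", "D", "D"]
tree = build_hah(X, labels)
print([predict(tree, p) for p in [(0.5, 0.5), (0.5, 5.5), (10.5, 0.5), (10.5, 5.5)]])
```

With k classes the tree holds k - 1 binary classifiers and a prediction evaluates about log2(k) of them, which is the usual argument for this family of methods. An error at an upper node cannot be undone lower down, which is the error accumulation problem the abstract refers to.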
【Abstract】 With the rapid development of communication technology and the Internet, information of all kinds is increasing exponentially, and text, the most typical information carrier, is no exception. To manage and retrieve valuable information, research on automatic text categorization (TC) has become very important. Support vector machines (SVMs), a machine learning method based on statistical learning theory, have attracted growing attention and become a hot topic in machine learning because they handle practical problems such as nonlinearity, high dimensionality, and local minima well. Text categorization is a key technique in content-based automatic information management. Text vectors are high-dimensional and extremely sparse, and contain many correlated features. SVMs are particularly well suited to text categorization and have great potential there, because they are insensitive to correlated features and sparse data and have advantages on high-dimensional problems; this is also an active research area. This thesis focuses on the drawbacks of SVMs in practical applications, including text categorization. The main work is as follows. First, the thesis analyzes the overall model of text categorization, including information preprocessing, feature representation, and feature extraction, with particular attention to feature representation, feature extraction, and text categorization algorithms. SVMs were originally designed for binary classification, and how to extend them effectively to multi-class classification is still an open research issue. Several existing multi-class SVM methods are compared and analyzed, and a half-against-half (HAH) algorithm is presented. Next, according to the characteristics of tree-structured SVMs, a tree-structured SVM multi-class classification method based on the HAH algorithm is proposed. Using a semi-fuzzy kernel clustering algorithm, the method designs the tree structure and overcomes the misclassification (error accumulation) of tree-structured SVMs. Experimental results indicate that the method has higher precision and faster training than other multi-class SVM methods, and improves the performance of SVMs on multi-class classification. Second, the thesis studies statistical learning theory (SLT) and support vector machine (SVM) theory, discusses kernel functions, reviews the state of SVM research and applications, and points out some important open issues. Combining the semantic concept space, it proposes an incremental transductive support vector machine algorithm for text classification based on the semantic concept space. Experiments show that the algorithm not only improves classification precision to some degree but also greatly reduces the amount of labeled data required. Third, taking advantage of SVMs in text categorization, the thesis applies SVMs to text feature extraction and proposes an SVM-based word clustering method. The method uses SVMs to evaluate the contribution of each word to classification and combines words with similar contributions into one text feature. The experimental results indicate that the method loses almost no classification information, dramatically decreases the dimensionality of text vectors and the number of correlated features, and improves the precision and recall of text categorization.
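The word clustering step described at the end of the abstract can be sketched in a few lines of Python. The abstract does not define how a word's contribution is measured or how similar contributions are grouped, so this sketch assumes the contribution is a per-word weight taken from an already trained linear SVM and groups words by fixed-width bins over that weight. The names `cluster_words` and `merge_features`, the `width` parameter, and the example weights are all hypothetical.

```python
import math
from typing import Dict, List


def cluster_words(weights: Dict[str, float], width: float = 0.5) -> Dict[str, int]:
    """Map each word to a cluster id.

    Words whose (assumed) SVM weights fall into the same bin of size `width`
    share one id. Ids are numbered from the most negative bin upward.
    """
    bins = sorted({math.floor(w / width) for w in weights.values()})
    bin_to_id = {b: i for i, b in enumerate(bins)}
    return {word: bin_to_id[math.floor(w / width)] for word, w in weights.items()}


def merge_features(doc: Dict[str, float], word_to_cluster: Dict[str, int],
                   n_clusters: int) -> List[float]:
    """Project a bag-of-words document onto cluster features by summing counts."""
    vec = [0.0] * n_clusters
    for word, count in doc.items():
        if word in word_to_cluster:  # words unseen during clustering are skipped
            vec[word_to_cluster[word]] += count
    return vec


# Hypothetical weights from a linear SVM separating sports from finance texts.
weights = {"goal": 1.2, "match": 1.1, "stock": -0.9, "market": -0.8, "the": 0.05}
word_to_cluster = cluster_words(weights)
n_clusters = len(set(word_to_cluster.values()))

doc = {"goal": 2, "match": 1, "the": 5, "stock": 1}
print(merge_features(doc, word_to_cluster, n_clusters))
```

Here five word features collapse into three cluster features, because "goal" and "match" share one bin and "stock" and "market" share another. A real implementation would need a principled similarity criterion in place of fixed-width bins.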
【Key words】 Support Vector Machines; Text Categorization; Multi-Class Classification; Semantic Concept Space; Feature Extraction;
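- 【Online publication contributor】 Xi'an University of Technology 【Online publication year and issue】 2011, Issue S1
- 【Classification number】 F270.7; F224
- 【Citation count】 3
- 【Download count】 176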