Research on Text Categorization Based on Feature Selection and Centroid Construction

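【Author】 Xie Hua (谢华)

【Advisor】 Wang Jian (王健)

【Author Information】 Dalian University of Technology, Computer Application Technology, 2010, Master's thesis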

【Abstract】 With the development of information technology, the amount of information available to people is growing explosively. As this volume keeps increasing, processing it by purely manual means becomes more and more difficult, and automated tools are needed to help people manage and filter it. Text categorization is an automated text-processing technique proposed against this background.

Text categorization assigns each document in a collection to one of a fixed set of predefined categories. With a machine learning approach, a classifier is learned from labeled examples and then used to classify new documents automatically, which makes this a supervised learning problem. Many text categorization methods exist, including Naive Bayes, k-nearest neighbor, neural networks, centroid-based approaches, and SVM. Text categorization has been widely applied in many fields, such as the classification of web resources and spam filtering.

This thesis studies text representation methods based on rich semantic information and proposes a new centroid-based text categorization method called FSCC. First, the background and current state of research on text categorization are introduced. Next, the general text categorization workflow is described, covering text representation, classifier selection and training, and evaluation of the classification results. Semantic-based text representation methods are then studied and compared with the traditional bag-of-words (BOW) representation. Finally, building on the traditional centroid-based classification method, the thesis proposes an improved centroid-based method, FSCC. FSCC first uses a feature selection method to compute a feature selection score between each feature and each category. It then defines a new formula, based on these scores, for the weight of a feature in a centroid, from which the centroid vector of each category is obtained. Lastly, a denormalized cosine measure is used to compute the similarity between a document and a centroid. Experiments on different corpora show that FSCC significantly outperforms both the classical centroid-based method and the SVM classifier.
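
The three FSCC steps named in the abstract (feature selection scores, centroid construction from those scores, denormalized cosine similarity) can be illustrated with a minimal sketch. The abstract does not give the actual formulas, so everything specific here is an assumption made for illustration: the chi-square statistic stands in for the feature selection method, the centroid weight is simply the chi-square score of positively associated terms, and "denormalized cosine" is taken to mean that the document vector is length-normalized while the centroid is not. The function names and the toy data are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter, defaultdict


def chi_square(n11, n10, n01, n00):
    """Chi-square score of a term/class pair from a 2x2 document-count table."""
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return 0.0 if denom == 0 else n * (n11 * n00 - n10 * n01) ** 2 / denom


def build_centroids(docs, labels):
    """Return {class: {term: weight}} with the chi-square score as the weight.

    Assumption (not from the thesis): a term enters a class centroid only
    when it is positively associated with that class.
    """
    n_docs = len(docs)
    class_sizes = Counter(labels)
    df = Counter()                      # document frequency per term
    df_in_class = defaultdict(Counter)  # document frequency per class and term
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_in_class[label][term] += 1

    centroids = {}
    for label, size in class_sizes.items():
        weights = {}
        for term, n11 in df_in_class[label].items():
            n10 = df[term] - n11       # term present, other classes
            n01 = size - n11           # term absent, this class
            n00 = n_docs - size - n10  # term absent, other classes
            if n11 * n00 > n10 * n01:
                weights[term] = chi_square(n11, n10, n01, n00)
        centroids[label] = weights
    return centroids


def denormalized_cosine(tokens, centroid):
    """Dot product of the unit-length document vector with the raw centroid."""
    tf = Counter(tokens)
    norm = math.sqrt(sum(v * v for v in tf.values()))
    if norm == 0:
        return 0.0
    return sum(v * centroid.get(t, 0.0) for t, v in tf.items()) / norm


def classify(tokens, centroids):
    """Assign the class whose centroid has the highest similarity score."""
    return max(centroids, key=lambda c: denormalized_cosine(tokens, centroids[c]))


# Hypothetical toy corpus, already tokenized.
train_docs = [
    ["cheap", "pills", "offer"],
    ["cheap", "offer", "now"],
    ["meeting", "agenda", "notes"],
    ["project", "meeting", "notes"],
]
train_labels = ["spam", "spam", "ham", "ham"]

centroids = build_centroids(train_docs, train_labels)
print(classify(["cheap", "offer", "today"], centroids))
print(classify(["meeting", "notes"], centroids))
```

Leaving the centroid unnormalized means that a category with strongly discriminative terms keeps its larger weights at scoring time, which is one plausible motivation for a denormalized measure. Whether FSCC uses exactly this weighting cannot be determined from the abstract alone.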
