Research on High Performance Chinese Text Classification Based on Machine Learning
(基于机器学习的高性能中文文本分类研究)

【Author】 杨创新 (Yang Chuangxin)

【Supervisor】 彭宏 (Peng Hong)

【Author Information】 South China University of Technology (华南理工大学), Computer Application Technology, 2009, PhD

【Abstract】 Text classification is an important research direction in information processing. With the development of information technology, and in particular the maturing of machine-learning-based methods in the 1990s, text classification has found wide application in areas such as natural language processing and understanding, information organization and management, and content filtering. The continuing demand from these areas has greatly advanced research on text classification and made it a hot topic in computer technology.

According to the learning paradigm, machine-learning-based text classification falls into three kinds: supervised, semi-supervised and unsupervised classification. Supervised classification is usually called text categorization (TC); its main task is to decide, given a predefined set of class labels, which class a document belongs to according to its content. Unsupervised classification is called text clustering: a document collection is organized or partitioned under some criterion so that similar documents fall into the same cluster and dissimilar documents fall into different clusters. Semi-supervised learning lies between the two; it is mainly concerned with obtaining a learner with good generalization ability, and hence correct category discrimination, when labelled training samples are insufficient or part of the data information is missing.

Whatever classification algorithm is used, for high-dimensional text data feature extraction and feature selection are the principal means of dimensionality reduction; they are key to lowering computational complexity and improving classifier performance. Like the classification algorithms themselves, they face the challenges of massive, unstructured data, the curse of dimensionality and skewed data sets.

This thesis studies Chinese text classification, concentrating on four aspects: feature extraction, feature selection, classification and clustering. We first propose a feature extraction algorithm based on sentence elements, a balanced feature selection algorithm and a lower bound on the feature selection dimensionality; we then propose KNN classification algorithms based on feature indexing and on feature compensation, and apply balanced feature selection to nonlinear semi-supervised classification; finally, building on the work of Hartuv and Shamir, we propose a weighted-graph clustering algorithm, WGC. The main contributions are:

1. Feature extraction based on sentence elements. Feature extraction often picks up terms that are unrelated to the topic. Starting from the observation that different sentence elements contribute differently to expressing the topic, we use syntactic analysis to label sentence elements and propose a feature extraction algorithm based on them. Experiments show that the algorithm not only filters out topic-irrelevant terms effectively but also avoids the limitations of stop-word lists and part-of-speech filtering (sketched below).

2. Balanced feature selection. Because the usual assumptions about the class distribution are hard to satisfy in practice and real data sets are often skewed, we analyse the text classification objective function and propose a balanced feature selection algorithm. Theoretical analysis and experiments on public text collections show that it handles skew between sub-classes effectively. We also give a method for computing the lower bound on the number of features a selection function requires for a given document collection, together with a non-uniform-dimensionality feature selection algorithm under that lower bound (sketched below).

3. High-performance text classification. To speed up classification by reducing comparisons between unlabelled samples and irrelevant training vectors, we use the selected feature set as an index for the documents to be classified and propose a nearest-neighbour classification algorithm based on feature-space indexing; experiments show that its classification time is only slightly affected by growing dimensionality (sketched below). To improve accuracy, we treat discriminative terms that are not included in the feature space as a compensation feature set and propose a feature-compensation KNN algorithm. Finally, combining balanced feature selection with robust path regularization, we realize nonlinear semi-supervised text classification.

4. Weighted-graph clustering based on minimum cuts. Building on the work of Hartuv and Shamir, we propose WGC, a graph-theoretic clustering algorithm for weighted graphs. It has low polynomial complexity and provable clustering properties, and it determines the number of clusters automatically during clustering (sketched below).
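The abstract does not reproduce the algorithms themselves, so the following sketches are only minimal illustrations of the underlying ideas, not the thesis's implementations. For contribution 1, once a syntactic parser has labelled each term with its sentence element, terms can be kept or dropped by role instead of by a stop-word list. The role tags, the retained-role set TOPIC_ROLES and the function extract_features are assumptions made for this sketch; the parser itself is out of scope.

# Minimal sketch of role-based term filtering (contribution 1), assuming a
# syntactic parser has already produced (term, sentence-element) pairs.
TOPIC_ROLES = {"subject", "predicate", "object"}  # assumed topic-bearing roles

def extract_features(labelled_tokens):
    """Keep only terms whose sentence element tends to carry the topic."""
    return [term for term, role in labelled_tokens if role in TOPIC_ROLES]

# Modifiers such as adverbials are dropped without any stop-word list.
parsed = [("机器学习", "subject"), ("有效地", "adverbial"),
          ("提高", "predicate"), ("分类", "object"), ("性能", "object")]
print(extract_features(parsed))  # ['机器学习', '提高', '分类', '性能']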
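For contribution 2, the thesis derives its own selection function and dimensionality lower bound, which are not reproduced here. The sketch below only illustrates the general "balanced" idea under assumed choices: terms are scored per class (plain chi-square is used purely as a stand-in), and every class receives its own feature quota so that minority classes are not crowded out when the data are skewed.

# Illustrative per-class ("balanced") feature selection; unoptimized on purpose.
def chi_square(term, cls, docs):
    """Chi-square association between a term and a class.
    docs: list of (set_of_terms, label) pairs."""
    n = len(docs)
    a = sum(1 for terms, lab in docs if term in terms and lab == cls)
    b = sum(1 for terms, lab in docs if term in terms and lab != cls)
    c = sum(1 for terms, lab in docs if term not in terms and lab == cls)
    d = n - a - b - c
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def balanced_select(docs, quota_per_class):
    """Give each class its own quota instead of taking one global top-k."""
    vocabulary = set().union(*(terms for terms, _ in docs))
    selected = set()
    for cls in {label for _, label in docs}:
        ranked = sorted(vocabulary, key=lambda t: chi_square(t, cls, docs),
                        reverse=True)
        selected.update(ranked[:quota_per_class])
    return selected

A single global top-k under the same scorer would let the majority class dominate the selected set; the per-class quota is the simplest way to see the balancing effect the thesis aims at.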
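For contribution 3, the indexing idea can be realized with an ordinary inverted index over the selected feature set, so that an unlabelled document is compared only against training documents sharing at least one selected feature with it. The data layout, the cosine scoring and the names build_feature_index / knn_classify are assumptions for this sketch; the thesis's feature-compensation step is omitted.

# Minimal sketch of KNN over a feature-space index (contribution 3).
from collections import defaultdict, Counter
import math

def build_feature_index(train_docs, selected_features):
    """Inverted index: selected feature term -> ids of training docs containing it.
    train_docs: list of (terms, label); selected_features: set of terms."""
    index = defaultdict(set)
    for doc_id, (terms, _label) in enumerate(train_docs):
        for term in set(terms) & selected_features:
            index[term].add(doc_id)
    return index

def cosine(a, b):
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * \
           math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

def knn_classify(terms, train_docs, index, k=5):
    """Score only the candidate documents reached through the index."""
    candidates = set().union(*(index.get(t, set()) for t in terms))
    query = Counter(terms)
    scored = sorted(((cosine(query, Counter(train_docs[i][0])), train_docs[i][1])
                     for i in candidates), key=lambda p: p[0], reverse=True)
    votes = Counter(label for _score, label in scored[:k])
    return votes.most_common(1)[0][0] if votes else None

Because only documents reached through the index are scored, the cost per query depends on the size of the candidate set rather than on the full training collection, which is what keeps classification time largely insensitive to growing dimensionality.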
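For contribution 4, the WGC algorithm itself is not reproduced; the sketch below only shows the Hartuv-and-Shamir-style recursion it builds on, applied to a weighted similarity graph. It uses networkx's Stoer-Wagner global minimum cut, and the stopping rule (cut weight greater than half the number of nodes) is the original unweighted HCS criterion, used here as a stand-in for the thesis's weighted test.

# Recursive minimum-cut clustering on a weighted similarity graph
# (in the spirit of contribution 4; not the thesis's WGC implementation).
import networkx as nx

def min_cut_clusters(graph):
    """Split at the global minimum cut until each part passes the
    (stand-in) 'highly connected' test; each surviving part is a cluster,
    so the number of clusters is decided during the recursion itself."""
    if graph.number_of_nodes() <= 2:
        return [set(graph.nodes)]
    if not nx.is_connected(graph):
        return [cluster for comp in nx.connected_components(graph)
                for cluster in min_cut_clusters(graph.subgraph(comp).copy())]
    cut_weight, (part_a, part_b) = nx.stoer_wagner(graph, weight="weight")
    if cut_weight > graph.number_of_nodes() / 2:  # stand-in connectivity test
        return [set(graph.nodes)]
    return (min_cut_clusters(graph.subgraph(part_a).copy())
            + min_cut_clusters(graph.subgraph(part_b).copy()))

# Example: two dense groups joined by one weak edge split into two clusters.
g = nx.Graph()
g.add_weighted_edges_from([("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0),
                           ("d", "e", 1.0), ("e", "f", 1.0), ("d", "f", 1.0),
                           ("c", "d", 0.1)])
print(min_cut_clusters(g))  # e.g. [{'a', 'b', 'c'}, {'d', 'e', 'f'}]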
