中文文本分类中特征选择算法及分类算法的研究

Research on Feature Selection Algorithm and Classification Algorithm in Chinese Text Categorization

【Author】 Chi Lin (迟麟)

【Advisor】 Liu Wenyuan (刘文远)

【Author Information】 Yanshan University, Computer Software and Theory, 2010, Master's thesis

【Abstract】 In recent years, with the rapid development of information technology and especially the spread of the Internet, the volume of electronic text on the web has grown sharply. Organizing and managing this mass of information effectively, and retrieving what users need quickly and accurately, is a major challenge for information resource management. Automatic text classification allows electronic text to be organized and managed by category, which meets the demand for convenient, efficient information processing and helps locate the required resources precisely. This thesis studies text classification in depth from three angles: word segmentation algorithms, feature selection algorithms, and text classification algorithms. First, after analyzing the characteristics of Chinese text classification in preprocessing, the vector space model representation of Chinese text, and two mechanical segmentation methods, we improve the segmentation method in four respects: the dictionary structure, the matching scheme, the strategy for handling ambiguous words, and the strategy for recognizing out-of-vocabulary (unregistered) words. The improvements are verified experimentally. Second, building on text preprocessing, and in order to further improve how well feature terms discriminate between categories, and thereby raise classification accuracy and reduce the computational cost of the classifier, we analyze the Categorical Proportional Difference (CPD) feature selection algorithm, improve it with respect to both the frequency and the redundancy of feature terms, and validate the improved CPD algorithm through comparative experiments. Finally, we analyze two shortcomings of the traditional k-nearest neighbor (KNN) classification algorithm: its heavy computational cost, and the drop in accuracy when categories have much in common, that is, when features overlap heavily across training samples. We propose an improved KNN text classification algorithm and compare it experimentally with the traditional KNN algorithm on two datasets, the Chinese text classification corpus TanCorpV1.0 and a Sohu web page corpus.
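
For orientation, the baseline CPD score that the abstract refers to can be sketched as follows. This is a minimal illustration that assumes the commonly cited definition CPD(w, c) = (A - B) / (A + B), where A is the number of documents in category c that contain term w and B is the number of documents in all other categories that contain w, with a term scored by its maximum over categories. It is not the improved variant proposed in the thesis, whose details the abstract does not give. The function name `cpd_scores` and the toy corpus are hypothetical, and the English tokens stand in for already segmented Chinese words.

```python
from collections import defaultdict


def cpd_scores(docs):
    """Score each term by its maximum CPD over categories.

    docs: list of (category, tokens) pairs, tokens already segmented.
    Assumes CPD(w, c) = (A - B) / (A + B) as described in the lead-in.
    """
    # df[term][category] = number of documents in that category containing the term
    df = defaultdict(lambda: defaultdict(int))
    for category, tokens in docs:
        for term in set(tokens):  # count each term once per document
            df[term][category] += 1

    scores = {}
    for term, per_category in df.items():
        total = sum(per_category.values())  # A + B, identical for every category
        # With B = total - A, (A - B) / (A + B) simplifies to (2A - total) / total.
        scores[term] = max((2 * a - total) / total for a in per_category.values())
    return scores


# Hypothetical toy corpus: two categories, two documents each.
toy_corpus = [
    ("sports", ["match", "team", "champion"]),
    ("sports", ["team", "coach", "news"]),
    ("finance", ["stock", "market", "news"]),
    ("finance", ["market", "bank"]),
]
scores = cpd_scores(toy_corpus)
```

Under this definition a term that occurs in only one category ("team", "market") scores 1.0, while a term spread evenly over both categories ("news") scores 0.0, which is why low-scoring terms are the natural candidates for removal.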

  • 【Online Publication Contributor】 Yanshan University
  • 【Online Publication Issue】 2010, No. 08