
文本分类及其相关技术研究

Research on Text Classification and Its Related Technologies

【Author】 李荣陆

【Supervisor】 胡运发

【Author Information】 Fudan University, Computer Software and Theory, 2005, Doctorate

【Abstract】 With the rapid development and growing popularity of the Internet, electronic text information is expanding quickly. How to organize and manage this information effectively, and to find the information a user needs quickly, accurately, and comprehensively, is a major challenge facing information science and technology. As a key technology for organizing and processing large amounts of text data, text classification can alleviate the problem of information disorder to a large extent, helping users locate and route the information they need. As the technical basis of information filtering, information retrieval, search engines, text databases, digital libraries, and related fields, text classification also has broad application prospects.

This dissertation studies text classification and its related technologies. Aiming to improve the speed, accuracy, and stability of classification methods, it proposes several effective solutions and improvements. It also investigates text genre classification, a new research direction in text classification, and text information filtering, an important application area of text classification. The main research content and contributions are as follows.

(1) Selection of training samples. The selection of training samples strongly affects classifier construction: atypical samples not only increase training time but also tend to introduce noise into the training set. For KNN, a widely used text classification method, the dissertation analyzes what constitutes a typical sample and proposes a density-based sample selection algorithm. The number of samples within a sample's ε-neighborhood is used to estimate the local density, and the number of samples from different classes within that neighborhood is used to locate class boundaries. Samples in high-density regions are pruned to reduce the number of atypical samples, while samples near class boundaries are retained as far as possible to preserve classification accuracy.

(2) Chinese text classification based on the maximum entropy model. Chinese and English text classification differ in many respects, including how text features are generated and how sparse they are, so classification results also differ; this is especially true for the maximum entropy model, because the entropy of Chinese is higher than that of English. Starting from feature generation for Chinese text, the dissertation uses two methods, word segmentation and N-grams, applies absolute discounting to smooth feature probabilities, and compares the maximum entropy model with Naive Bayes, KNN, and SVM. Experiments show that the maximum entropy model is not sufficiently stable, so Bagging is combined with it to improve its stability.

(3) Using hierarchical classification to improve the performance of flat classification. Unlike previous hierarchical classification approaches, the dissertation uses a hierarchy that is essentially a graph and applies it to the flat classification problem in order to improve precision and recall. In an ordinary category hierarchy, the confusion relation between sibling categories under the same parent is symmetric, whereas in reality the confusion between categories is not symmetric. Starting from the classifier's confusion matrix, the dissertation introduces the concept of confusion categories; the category hierarchy built from confusion categories considers the relations between categories in terms of precision and recall and captures the asymmetry of the confusion relation.
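Point (1) above describes a density-based procedure for pruning atypical KNN training samples. The following is a minimal sketch of that idea, not the dissertation's implementation: the function name, the ε radius, and the density threshold are illustrative assumptions, and a real implementation would keep a representative subset of each dense region rather than discarding its interior entirely.

```python
import numpy as np

def select_training_samples(X, y, epsilon=1.0, density_threshold=10):
    """Return indices of training samples kept for a KNN classifier."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = len(X)
    # Pairwise Euclidean distances (adequate for a small illustrative corpus).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    keep = []
    for i in range(n):
        # Samples inside the epsilon-neighborhood of sample i (excluding i itself).
        neighbors = np.where((dists[i] <= epsilon) & (np.arange(n) != i))[0]
        density = len(neighbors)
        # Sample i lies near a class boundary if its neighborhood mixes classes.
        on_boundary = density > 0 and np.any(y[neighbors] != y[i])
        # Keep boundary samples; prune interior samples of high-density regions.
        if on_boundary or density < density_threshold:
            keep.append(i)
    return np.array(keep, dtype=int)
```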
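Point (2) mentions absolute discounting for smoothing feature probabilities. The sketch below shows the standard form of that technique with a uniform distribution over unseen features as the backoff; the dissertation's exact backoff distribution and discount value are not stated here, so those choices are assumptions, as is the non-empty `counts` input.

```python
def absolute_discount(counts, vocab_size, discount=0.5):
    """
    Absolute-discounting smoothing: subtract a fixed discount from every
    observed count and redistribute the freed probability mass uniformly
    over unseen features. `counts` maps observed features to their counts.
    """
    total = sum(counts.values())
    seen = len(counts)
    unseen = vocab_size - seen
    # Probability mass freed by discounting, spread over unseen features.
    leftover = discount * seen / total

    def prob(feature):
        if feature in counts:
            return (counts[feature] - discount) / total
        return leftover / unseen if unseen > 0 else 0.0

    return prob
```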
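Point (2) also stabilizes the maximum entropy classifier with Bagging. Below is a hedged sketch of that combination, assuming scikit-learn is available: multinomial logistic regression stands in for the maximum entropy model, character N-grams stand in for the dissertation's Chinese feature generation, and the class name `BaggedMaxEnt` and all parameter values are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

class BaggedMaxEnt:
    """Bagging ensemble of maximum entropy (logistic regression) classifiers."""

    def __init__(self, n_estimators=10, random_state=0):
        self.n_estimators = n_estimators
        self.rng = np.random.default_rng(random_state)
        self.models = []

    def fit(self, X, y):
        y = np.asarray(y)
        n = X.shape[0]
        self.models = []
        for _ in range(self.n_estimators):
            # Train each member model on a bootstrap resample of the data.
            idx = self.rng.integers(0, n, size=n)
            model = LogisticRegression(max_iter=1000)
            model.fit(X[idx], y[idx])
            self.models.append(model)
        return self

    def predict(self, X):
        # Majority vote over the member models for each document.
        votes = np.array([m.predict(X) for m in self.models])
        preds = []
        for column in votes.T:
            values, counts = np.unique(column, return_counts=True)
            preds.append(values[counts.argmax()])
        return np.array(preds)

# Character uni- and bi-grams approximate N-gram feature generation for Chinese.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 2))
# Usage on a toy corpus (texts and labels are placeholders):
# X_train = vectorizer.fit_transform(train_texts)
# clf = BaggedMaxEnt(n_estimators=10).fit(X_train, train_labels)
# predictions = clf.predict(vectorizer.transform(test_texts))
```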
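Point (3) builds a category hierarchy from the classifier's confusion matrix so that asymmetric confusion between categories can be expressed. The sketch below only shows how such asymmetric "confusion category" edges might be extracted from a confusion matrix; the threshold and the dictionary-of-lists graph representation are assumptions, not the dissertation's construction.

```python
import numpy as np

def confusion_categories(conf_matrix, threshold=0.1):
    """
    conf_matrix[i, j] counts documents of true class i predicted as class j.
    Returns a directed graph: class i -> classes j that i is often mistaken for.
    The relation is asymmetric: i may be confused with j without the reverse.
    """
    conf = np.asarray(conf_matrix, dtype=float)
    # Row-normalize so entry (i, j) is the rate at which class i is labeled j.
    rates = conf / conf.sum(axis=1, keepdims=True)
    graph = {}
    for i in range(len(rates)):
        graph[i] = [j for j in range(len(rates))
                    if j != i and rates[i, j] >= threshold]
    return graph

# Example: class 0 is often mislabeled as class 1, but not vice versa.
cm = np.array([[80, 15, 5],
               [ 2, 95, 3],
               [ 4,  6, 90]])
print(confusion_categories(cm))  # {0: [1], 1: [], 2: []}
```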

  • 【Online Publication Contributor】 Fudan University
  • 【Online Publication Year/Issue】 2005, No. 07
  • 【CLC Number】 TP391.1
  • 【Cited By】 202
  • 【Downloads】 4532