On Text Classification and Its Applications (文本分类技术与应用研究)

【Author】 Hao Xiulan (郝秀兰)

【Advisor】 Hu Yunfa (胡运发)

【Author information】 Fudan University, Computer Software and Theory, 2008, PhD

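【Abstract, translated from the Chinese】 The Internet is awash with information of every kind, and some of it, such as messages spread online by terrorist organizations, directly affects national security and stability. Traditional interception by IP address or by topic no longer meets current needs; monitoring today focuses mainly on content. Because most information on the Internet exists as text, these techniques largely depend on understanding text content, and their core is text classification and clustering. The explosive growth of text places new demands on both the accuracy and the speed of text understanding: accuracy must improve while training and processing also become faster. This dissertation studies three difficulties in text classification: data-set skew (the distribution of the data set over classes is skewed, i.e. class skew), feature selection, and the small-sample problem (the labeling bottleneck). With the aim of making classification faster and more accurate, it proposes several effective solutions or improvements. It also studies topic detection and tracking, an important application area of text clustering and classification. The main contributions are the following three.

1. Handling class skew in kNN text classifiers. Class skew is a common problem in data mining. kNN is widely used in text classification, but its performance drops markedly when the training samples are class-skewed. When we applied a kNN classifier in a text content security project, we found that almost all test samples from small classes were misclassified into other, large classes. To address this, we propose the concept of the critical point (CP) of a training set and modify the traditional kNN decision function according to the lower (upper) approximation LA (UA) of CP and the number of training samples; this is adaptive weighted kNN classification. Experiments on skewed text data sets show that LA and UA are good shrinkage factors, and that adaptive weighted kNN text classification outperforms both traditional kNN and random resampling.

2. Selection of training samples. The choice of training samples matters greatly when building a classifier: atypical samples not only lengthen training but also tend to introduce "noise" into the training set. As an instance-based method, a kNN classifier has heavy computational and storage requirements. An unbalanced distribution of training data also degrades kNN performance. To address these shortcomings, we first improve the MultiEdit and Condensing algorithms and then propose a sampling method that combines feature selection with Condensing. The method has two steps: first, several traditional feature selection methods produce the features of each class of training data; second, redundant training instances are removed according to each document's own class features, combined with the Condensing strategy. Extensive experiments show that the method markedly reduces the size of the training set, thereby lowering the time and space cost of the algorithm, and improves classifier performance.

3. Semi-supervised text classification. Traditional classifiers are trained only on labeled data, but labeled instances are usually hard to obtain because they are expensive and time-consuming to produce, which creates the labeling bottleneck. Semi-supervised learning combines large amounts of unlabeled data with labeled data to build well-performing classifiers and so addresses the bottleneck. Because semi-supervised learning needs less human involvement while achieving relatively high precision, it is significant both in theory and in practice. Building on a study of existing semi-supervised learning algorithms, and targeting the case where labeled data are so scarce that statistical methods cannot be used to evaluate labeling confidence, we propose two-phase co-training based on kNN and SVM; experiments confirm that the method is effective.

As an application of text classification and clustering, we study topic detection and tracking for BBS. From a text mining perspective, topic detection resembles text clustering, while topic tracking resembles multi-class text classification. The goal of topic detection and tracking research is to find, organize, and use multilingual information from multiple news media by topic. Such technology is urgently needed in practice: for example, automatically monitoring information sources (such as radio and television) and identifying breaking events, new events, and new information about known events, which is widely applicable to information security, securities market analysis, and other fields. It can also find all reports on a topic a user is interested in and trace how that topic developed. Building on a study of the various algorithms for topic detection and tracking, and taking the characteristics of BBS content into account, we built a BBS-oriented topic detection and tracking system. On the basis of the above research, we developed a prototype system for text content security management.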
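The first contribution above reweights the kNN decision function so that small classes are not swamped by large ones. The abstract does not give the CP, LA, or UA formulas, so the following is only a minimal sketch of the general idea using a swapped-in, simpler scheme: each neighbour's similarity vote is divided by the size of its class. The function names, the inverse-class-size weight, and the toy data are all assumptions made for illustration, not the dissertation's method.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train, query, k=5, weighted=True):
    """Classify `query` by similarity-weighted voting among its k nearest neighbours.

    train: list of (vector, label) pairs, vectors being sparse dicts.
    weighted=True divides each vote by the size of the neighbour's class.
    This inverse-class-size weight is an assumed stand-in for the
    LA/UA-based factor, which the abstract does not specify.
    """
    class_size = Counter(label for _, label in train)
    neighbours = sorted(train, key=lambda ex: cosine(ex[0], query), reverse=True)[:k]
    scores = Counter()
    for vec, label in neighbours:
        vote = cosine(vec, query)
        scores[label] += vote / class_size[label] if weighted else vote
    return max(scores, key=scores.get)


# Hypothetical skewed toy set: 8 documents in a large class, 2 in a small one.
train = [({"news": 1.0, "game": 1.0}, "sports")] * 8 \
      + [({"news": 1.0, "attack": 1.0}, "alert")] * 2
query = {"news": 1.0, "attack": 1.0, "game": 0.5}
```

On this toy set the two "alert" documents are the closest neighbours of the query, yet with k=5 the plain vote still goes to "sports" because three moderately similar majority documents outvote them; the class-size weight reverses that outcome.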

【Abstract】 The Internet is awash with information of all kinds, some of which, such as material spread by terrorist groups, threatens national security. Traditional techniques that block information by IP address or by theme are out of date; the state of the art is to monitor the content of the information. Because text is the main representation of information, many monitoring techniques depend on understanding text, and text classification and clustering are the key techniques. The explosive growth of text poses a new challenge: text understanding must become faster, more efficient, and more accurate. This dissertation explores three challenges in text categorization: class imbalance, feature selection, and the annotation bottleneck. Several methods are presented to improve the speed and accuracy of classification. Topic detection and tracking, an important application of text classification and clustering, is also discussed. Our main contributions are as follows.

1. A strategy for dealing with class imbalance in kNN classification. Class imbalance is one of the problems that plague the data mining community. The performance of kNN, a widely used algorithm in text categorization, deteriorates when the training data are skewed across classes. When used in a text content security project, kNN classified almost all test samples of minority classes into majority ones. To overcome this defect, we propose the critical point (CP) of a training set and revise the traditional kNN decision function using LA or UA, the lower and upper approximations of CP. We call the result adaptive kNN with weight adjustment. Experiments on biased data sets show that it outperforms both traditional kNN and random resampling.

2. Selection of training samples. The selection of training samples is vital when building a classifier. Atypical samples not only increase training time but also introduce noise into the training set. As an instance-based algorithm, kNN has large computational and storage costs, and an imbalanced distribution of training data further degrades its performance. To deal with these defects, the MultiEdit and Condensing algorithms are first modified, and then a sampling method based on feature selection and Condensing is proposed. First, several traditional feature selection methods are combined to form the features of each class. Second, redundant cases are removed by combining the class features contained in each case with the Condensing algorithm. Extensive experiments show that the size of the training set decreases sharply, which reduces space and time cost and improves classification quality.

3. Semi-supervised text categorization. Traditional classifiers are trained only on labelled data, but labelling is expensive and time-consuming: it is tedious work that requires experienced annotators, ample time, and sometimes special equipment. This is the so-called annotation bottleneck. Unlabelled data, by contrast, are easy to obtain and can be used in diverse ways. Semi-supervised learning builds good classifiers from labelled data together with large amounts of unlabelled data and thereby eases the annotation bottleneck. Because it needs less manual work, it is important both in theory and in practice. After examining existing semi-supervised learning methods, we propose two-phase co-training based on kNN and SVM. Experiments show that the method is effective.

We also discuss a practical application of text classification and clustering: topic detection and tracking (TDT) oriented to BBS. From the point of view of text mining, topic detection is similar to text clustering and topic tracking is similar to text categorization. TDT aims to organize and exploit multilingual news from various news sources according to topic. The technique is much needed in applications such as automatically monitoring information sources (for instance, radio and TV) and recognizing unexpected events, new events, and new information about existing events. It can be widely used in information security and in securities market analysis. TDT can also retrieve all the news on a topic that interests a user and trace how that topic evolves. On the basis of a survey of TDT, we develop a TDT system oriented to BBS. We apply the above results in a prototype system for text content security.
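The second contribution builds on Condensing, which shrinks a kNN training set by discarding instances that the remaining ones already classify correctly. As a rough illustration only, the sketch below implements the classic condensed nearest neighbour rule (Hart's algorithm) with a 1-NN check. It does not include the dissertation's modifications, the MultiEdit step, or the class-feature criterion, none of which are specified in the abstract; the function names and toy points are assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_label(store, point):
    """Label of the stored example closest to `point` (1-NN)."""
    return min(store, key=lambda ex: sq_dist(ex[0], point))[1]


def condense(train):
    """Classic condensed nearest neighbour rule (baseline, not the thesis variant).

    Starts from one example and repeatedly adds every training example that
    the current store misclassifies, until a full pass adds nothing.
    train: non-empty list of (vector, label) pairs.
    """
    store = [train[0]]
    changed = True
    while changed:
        changed = False
        for vec, label in train:
            if (vec, label) in store:
                continue  # already kept; also guards against endless re-adding
            if nearest_label(store, vec) != label:
                store.append((vec, label))
                changed = True
    return store


# Hypothetical toy data: two well-separated clusters of four points each.
points = [
    ((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"), ((1, 1), "a"),
    ((10, 10), "b"), ((10, 11), "b"), ((11, 10), "b"), ((11, 11), "b"),
]
kept = condense(points)
```

For these two clusters one stored example per class is enough to classify every original point correctly with 1-NN, which is the kind of reduction in storage and comparison cost that the abstract reports for its own, more elaborate sampling method.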

  • 【Online publication submitter】 Fudan University
  • 【Online publication issue】 2009, No. 08