面向内容安全的文本分类研究

Research on Text Categorization Method Oriented to Content Security

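【Author】 Zhang Bofeng (张博锋)

【Advisor】 Su Jinshu (苏金树)

【Author information】 National University of Defense Technology, Computer Science and Technology, 2007, Ph.D.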

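【Abstract (translated from the Chinese)】 With the development of Internet application technology, the political, economic, military, social and cultural problems caused by the misuse of information have attracted growing attention, and content security has gradually become a basic component of information security. Text categorization is one of the most powerful means and core techniques for organizing, managing, identifying and filtering information according to its content, and the demands of Internet content security pose new challenges for it. Content security requires that abnormal content be monitored efficiently and responded to promptly, so the categorization system must inspect passing text at high speed and in real time. Content on the Internet is diverse and frequently updated. In some cases additional labeled samples of the content of interest can be supplied for classifier training only at great cost, or not at all, and this becomes the main bottleneck in building a categorization system. Semi-supervised learning, which trains a classifier from a small number of labeled samples and a large number of unlabeled ones, has therefore become an active research topic. The diversity of content and the overlap among topics also mean that observers from different areas of content security may want to respond to similar or even identical content. Multi-label learning, which addresses instances that may belong to several classes at once, has become a new research direction. Focusing on text categorization under the requirements of Internet content security, this dissertation studies three problems in depth: efficient training and prediction methods for text categorization, semi-supervised learning to ease the labeling bottleneck, and multi-label text categorization. The main results and contributions are summarized as follows:

1. Efficient multi-class SVM learning. A multi-class SVM method cascaded with Rocchio, Roc-SVM, is proposed. The Rocchio classifier quickly and accurately filters out most irrelevant classes, greatly reducing the number of binary SVM decisions required. In the experiments it reduces the classification time of both the one-vs-one and the one-vs-rest multi-class SVM methods by about an order of magnitude, while classification accuracy is essentially unaffected. To optimize one-vs-rest multi-class SVM training, a concise class-incremental multi-class SVM method, CI-SVM, is proposed. Experiments show that its training time is substantially lower than that of the one-vs-rest method and that classification efficiency also improves markedly.

2. Improving the accuracy of the naïve Bayes classifier through the class hierarchy. The training result of naïve Bayes is affected by the global class distribution of the subjectively selected training data. Exploiting the characteristics of hierarchical categorization, the naïve Bayes classifier is improved by introducing a new probability condition into the computation of the class posterior probabilities and by making decisions within the local data of the subclasses of each internal class. The improved method, EHNB, reduces the influence of the global data distribution on the classifier and partly alleviates the imbalance of samples across classes, so that naïve Bayes performs noticeably better in hierarchical categorization.

3. Semi-supervised learning that integrates self-training and EM. The dissertation proposes integrating the training processes of self-training, which aggressively labels unlabeled samples, and EM, which conservatively adjusts the label status of unlabeled samples, and provides two semi-supervised learning methods, ESTM and SEMT. ESTM uses intermediate results within the EM iterations to label samples definitively, whereas SEMT replaces the supervised naïve Bayes learner inside self-training with the semi-supervised EM method. Experiments show that ESTM and SEMT effectively combine the strengths of self-training and EM and are better able to use unlabeled samples to improve classifier accuracy.

4. Feature set splitting for co-training. A measure of conditional independence between feature subsets is defined, and it is proven that independence is preserved when feature subsets are grouped and merged. On this basis, a localized splitting strategy is proposed in which the local feature subset of each class is split separately and the parts are then grouped and merged. Two splitting methods are given, one based on locally adaptive clustering of samples and one based on partitioning a feature relevance graph; both aim to preserve conditional independence between the subsets as far as possible. In tests on two data sets, the resulting feature set splits allow co-training to make better use of unlabeled samples in improving naïve Bayes classification, which extends the applicability of co-training based on feature set splitting.

5. Multi-label learning based on label status vectors. By re-mining the multi-label information contained in the associations among label status values in the label status vector space (LSVS) of ranking methods, a two-stage multi-label learning framework based on label status vectors is proposed. Within this framework, a BOL (bag of labels) model and a Bayes multi-label learning method are given on the kNN LSVS, and the ML-kNN method is improved on the LSVS. On the naïve Bayes LSVS, linear least squares fit (LLSF) is used for multi-label training and prediction, and it is proven that the squared error of LLSF gives an upper bound on the classifier's Hamming training loss. Applications to 11 multi-label classification problems show that the classifiers trained by the various multi-label methods under the two-stage framework achieve good multi-label classification performance.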
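The Rocchio-then-SVM cascade in item 1 can be illustrated with a minimal sketch. This is not the dissertation's Roc-SVM implementation: the toy data, the function names, the candidate count `k`, and in particular the `pairwise` stubs (a nearest-centroid rule standing in for trained binary SVMs) are all assumptions made for illustration. The sketch only shows why prefiltering classes cuts the number of binary decisions: one-vs-one voting over N classes needs N(N-1)/2 decisions, whereas voting over k surviving candidates needs k(k-1)/2.

```python
import math
from itertools import combinations

def centroid(vectors):
    """Mean vector of a list of equal-length feature vectors (Rocchio prototype)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rocchio_candidates(x, centroids, k):
    """Return the k classes whose centroids are most similar to x."""
    ranked = sorted(centroids, key=lambda c: cosine(x, centroids[c]), reverse=True)
    return ranked[:k]

def cascade_predict(x, centroids, pairwise, k):
    """One-vs-one voting restricted to the Rocchio candidates.

    pairwise[(a, b)] is a hypothetical stand-in for a trained binary SVM:
    a callable that takes x and returns either a or b.
    """
    candidates = rocchio_candidates(x, centroids, k)
    votes = {c: 0 for c in candidates}
    evaluations = 0
    for a, b in combinations(sorted(candidates), 2):
        votes[pairwise[(a, b)](x)] += 1
        evaluations += 1
    # max() keeps the first maximum, so ties go to the better Rocchio rank.
    winner = max(candidates, key=lambda c: votes[c])
    return winner, evaluations

# Toy term-frequency vectors for four classes (illustrative data only).
train = {
    "sports":  [[5, 1, 0, 0], [4, 0, 1, 0]],
    "finance": [[0, 5, 1, 0], [1, 4, 0, 0]],
    "health":  [[0, 1, 5, 0], [0, 0, 4, 1]],
    "travel":  [[0, 0, 1, 5], [1, 0, 0, 4]],
}
centroids = {c: centroid(v) for c, v in train.items()}

def make_stub(a, b):
    """Nearest-centroid rule used here in place of a real binary SVM."""
    return lambda x: a if cosine(x, centroids[a]) >= cosine(x, centroids[b]) else b

pairwise = {(a, b): make_stub(a, b) for a, b in combinations(sorted(centroids), 2)}

doc = [3, 0, 1, 0]
label, used = cascade_predict(doc, centroids, pairwise, k=2)
full = len(pairwise)  # binary decisions a plain one-vs-one vote would need
print(label, used, full)
```

With four classes and `k=2`, the cascade consults one binary classifier instead of six. The saving grows quadratically with the number of classes, which is consistent in spirit with the large speed-up the abstract reports, although the actual figures depend on the real classifiers and data.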

【Abstract】 With the development of Internet application technology, the problems that the abuse of information causes in politics, the economy, the military, society and culture have drawn more and more attention, and content security has become one of the basic issues in information security. Text categorization is one of the most powerful means and key techniques for organizing, managing, recognizing and filtering information, and the needs of Internet content security pose new challenges for it. To ensure the security of information, abnormal content must be monitored efficiently and responded to in time, so fast, real-time inspection of passing texts is necessary. Because content on the Internet is varied and changes frequently, it is difficult, and sometimes impossible, to provide enough labeled samples of interest for training the classifier. This becomes the bottleneck in constructing a classification system. Semi-supervised learning, which trains with a few labeled samples and many unlabeled ones, has therefore become a research hotspot. The variety of content and the overlap of topics also lead observers from different areas to pay attention to similar or even identical content. Multi-label learning, which addresses the problem of an instance belonging to more than one class, has become a new research area. Against the background of the requirements of Internet content security, this dissertation studies three questions: efficient training and prediction for text categorization, semi-supervised learning to alleviate the labeling bottleneck, and multi-label text categorization. The main work and contributions of this dissertation are as follows:

1. Efficient multi-class SVM learning method. A multi-class method that cascades Rocchio with SVM is proposed. The Rocchio classifier filters out most of the irrelevant classes and greatly reduces the number of SVM judgments needed. In the test experiments, the cascading method decreased the classification time of both the 1 vs. 1 and the 1 vs. rest methods by about an order of magnitude. A concise class-incremental multi-class SVM method, CI-SVM, is also presented. In the experiments, its training time was reduced and its testing efficiency was also improved significantly.

2. Enhancement of the naïve Bayes classifier under a class hierarchy. The performance of naïve Bayes text categorization depends heavily on the global distribution of the subjectively selected samples over the classes. It can be enhanced by taking advantage of hierarchical characteristics and by introducing a conditional probability. The enhancement makes decisions in the local data belonging to the child classes of an internal class, thus lessening the influence of the global data distribution and partially overcoming the problem of data skewness. Experiments showed that the enhanced method notably improved the effectiveness of hierarchical categorization with naïve Bayes.

3. Semi-supervised learning method based on the integration of self-training and EM. A method is proposed that integrates the training processes of EM, which conservatively adjusts the label status of samples, and self-training, which labels the samples directly. Two semi-supervised learning methods, ESTM and SEMT, are provided. ESTM decisively labels some samples using the intermediate results of the EM iterations, and SEMT substitutes the semi-supervised EM method for the supervised naïve Bayes learner. Experiments demonstrated that ESTM and SEMT combine the advantages of self-training and EM and make better use of unlabeled samples to improve the classifier.

4. Feature set splitting for co-training in text categorization. This dissertation presents a quantitative definition of the conditional independence of feature subsets given the class and suggests a strategy for splitting the feature set locally in this sense. It is also proven that independence is preserved when two groups of feature subsets are united. Two feature set splitting methods are proposed, based respectively on locally adaptive clustering and on relevancy graph partitioning, both under the precondition of independence. Applications to two data sets show that, with the feature divisions produced by these methods, the combined effectiveness of the co-trained naïve Bayes classifiers is improved by using the unlabeled samples. As a result, the applicability of the co-training method is extended.

5. Multi-label learning method based on the label status vector (LSV). A two-stage learning framework based on the label status vector is proposed. It re-mines the multi-label information contained in the relations between label status values in the label status vector space (LSVS) of ranking methods. Under this framework, this dissertation presents the bag of labels (BOL) model in the kNN LSVS, proposes a Bayes method for that model, and improves the ML-kNN method. In the naïve Bayes LSVS, linear least squares fit (LLSF) is provided for multi-label training and prediction. It is also proven that the squared error of LLSF upper-bounds the Hamming training loss. Applications to 11 multi-label problems have shown that the two-stage framework and the above learning methods are effective in multi-label classification.
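The abstract does not give the proof that the LLSF squared error bounds the Hamming training loss. The sketch below shows only one elementary way such a bound can arise, under assumptions that are not taken from the dissertation: labels are coded as 0/1, and a real-valued score is turned into a label by thresholding at 0.5. Under those assumptions every wrong label has |score - label| >= 0.5, so each error contributes at most 4 * (score - label)^2, and the Hamming loss is at most four times the mean squared error. The function names and the toy matrices `Y` and `F` are hypothetical.

```python
def hamming_loss(true_labels, scores, threshold=0.5):
    """Fraction of (instance, label) slots whose thresholded score is wrong."""
    wrong = total = 0
    for y_row, f_row in zip(true_labels, scores):
        for y, f in zip(y_row, f_row):
            wrong += int((f >= threshold) != bool(y))
            total += 1
    return wrong / total

def mean_squared_error(true_labels, scores):
    """Average squared gap between real-valued scores and 0/1 labels."""
    sq = total = 0
    for y_row, f_row in zip(true_labels, scores):
        for y, f in zip(y_row, f_row):
            sq += (f - y) ** 2
            total += 1
    return sq / total

# Two instances, three labels each (illustrative values only).
Y = [[1, 0, 1],
     [0, 1, 0]]
F = [[0.9, 0.6, 0.2],   # scores such as a least-squares fit might produce
     [0.1, 0.7, 0.4]]

hl = hamming_loss(Y, F)
mse = mean_squared_error(Y, F)
print(hl, 4 * mse)
```

With labels coded as -1/+1 and a threshold at 0, the same argument gives the bound without the factor of 4. Which coding and constant the dissertation uses is not stated in the abstract.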
