节点文献

面向垃圾信息过滤的主动多域学习文本分类方法研究

Research on Text Categorization Method by Active Multi-Field Learning for Spam Filtering

【作者】 刘伍颖

【导师】 王挺;

【作者基本信息】 国防科学技术大学 , 计算机科学与技术, 2011, 博士

【摘要】 垃圾信息过滤是提高网络信息可用性的关键技术之一。虽然该领域已有许多研究成果,但随着社会对垃圾信息过滤的迫切需要,以及垃圾信息过滤技术在实际应用和测试中表现出的许多不足,近年来许多研究机构都在十分活跃地进一步深入研究垃圾信息过滤领域的各种关键技术,以提高垃圾信息过滤的性能和解决实际应用中的问题。目前的研究大多采用基于统计的文本分类方法来解决垃圾信息过滤问题。在这种背景下,本文对用于垃圾信息过滤的基于统计的在线二值文本分类总体框架问题、域文档分割问题、域分类结果组合问题、时空高效域分类问题和有代价反馈问题进行了深入研究,提出了一系列针对这些问题的应对方法。我们采用TREC07P邮件语料上的垃圾邮件过滤实验、CSMS中文手机短信语料上的垃圾手机短信过滤实验以及TanCorp网页新闻语料上的多类别文档分类实验来验证提出方法的有效性。本文主要的研究工作包括:(1)分析了信息文档的文本结构,揭示了信息文档普遍具有多域结构特性。根据这一特性,提出了一种多域学习框架。该框架采用分而治之的研究思路,把一个复杂的多域文档的文本分类问题划分成几个简单的域分类子问题,每个域分类子问题有其自身的特征空间和统计文本分类模型。实验结果表明多域学习框架是一种有效的基于统计的在线二值文本分类总体框架。在多域学习框架下,域间文本特征的独立性更强,而域内文本分类模型针对性更强;并且在每个域分类子问题中,无论是文本特征抽取还是文本分类模型构造都更加简洁高效。(2)研究了域文档分割问题,提出了自然域文档分割策略和特定属性域文档分割策略。自然域文档分割就是根据文档本身具有的多域结构化特点,通过识别域分隔点,将一个文本文档分割成几个域文本文档。特定属性域文档分割是一种文本特征复用技术,它将那些具备较强区分能力的文本通过某种规则抽取出来,组成一个原来并不真实存在的文本域。实验结果表明前一种策略具有较强的通用性,因为信息文档普遍具有多域结构特性;而后一种策略更加适合短文本文档,因为可以克服短文本文档的特征稀少问题。(3)研究了域分类结果组合问题,提出了均权组合策略、支持向量模型权组合策略、域分类器历史性能权组合策略、域文档信息量权组合策略和复合权组合策略。实验结果表明在多域学习框架下,这五种组合策略都能提高已有文本分类算法的性能,其中综合考虑域分类器历史性能和当前域文档信息量两方面因素的复合权组合策略在时间复杂度和分类准确率上都能达到更理想的性能。(4)分析了信息文档集合中的Token频率分布,揭示了Token频率分布普遍服从幂律的特性。根据这一特性,提出了一种基于Token频率索引的文本分类算法。该算法采用文本检索的研究思路解决文本分类问题,利用等概率随机采样方法进行在线标注文档压缩,能够有效应对传统在线文本分类研究中难以将离线批处理后验规则变成在线可计算的先验规则的困难。由于Token频率索引数据结构具备每次查询和增量更新的时间复杂度都很低的优势,还具备索引的原始文本压缩特性和基于随机采样的压缩特性,所以能够高效地捕获文档内容的变化和垃圾概念的漂移。实验结果表明基于Token频率索引的文本分类算法能够很好地解决时空高效域分类问题,而且将该算法集成到多域学习框架下,能够达到低时空复杂度和高分类准确率的最佳性能。此外,还扩展了Token频率索引的研究思路,提出了一种基于多类别Token频率索引的文本分类算法。实验结果表明该算法在多类别文档分类中也是有效的。(5)研究了有代价反馈问题,提出了时序优先主动学习策略、先验区间主动学习策略和基于方差的非确定采样主动学习策略。其中基于方差的非确定采样主动学习策略充分利用了多个域分类器之间的决策差异,通过比较域分类结果间的当前方差和历史方差阈值,挑选信息丰富的文档请求用户反馈。实验结果表明在这三种主动学习策略中,基于方差的非确定采样效果最好,它能够在大量减少用户反馈的情况下,仍然达到较理想的分类性能,而且由于计算方差的时空复杂度比较低,所以基于方差的非确定采样是一种有效的主动学习策略。综上所述,本文研究了垃圾信息过滤面临的若干关键问题,提出了以多域学习为核心的一系列文本分类方法,较好地满足了垃圾信息过滤的实际应用需求。进一步的工作仍然围绕多域学习这一核心,可以预见多域学习的进一步完善和发展能够获得更好的效果。

【Abstract】 Spam filtering is one of the key technologies to improve the availability of network information. Although, spam filtering has been extensively investigated and many advances have been made on it, there are still many problems expected to be solved which are shown in actual applications and evaluations. As a consequence, in recent years, many academies and industries have been making an in-depth research on spam filtering technologies. Currently, many academies tend to use statistical text categorization (TC) methods to solve the problem of spam filtering. This dissertation has explored the main framework problem of statistical online binary TC for spam filtering, the splitting problem of field documents, the combining problem of field results, the space-time-efficient classifying problem of field documents and the costly feedback problem, and proposed a series of methods to solve the problems. We have used the email spam filtering experiment on the TREC07P collection, the short message service spam filtering experiment on the CSMS collection, the multiclass document classification experiment on the TanCorp collection to validate the availability of proposed methods, and make the following contributions:(1) The text structure of information documents is detailedly investigated, and it is found that many information documents have a multi-field structure. According to this finding, we propose a multi-field learning (MFL) framework, which uses a divide-and-conquer idea to break a complex TC problem of multi-field documents into several simple field sub-problems. Each sub-problem has its own feature space and statistical TC model. Experimental results show that the MFL framework is an effective main framework of statistical online binary TC. In this framework, text features are more independent between fields and TC model is more targeted in each field. And in each sub-problem, feature extraction and TC model construction are both straightforward and efficient.(2) We investigate the splitting problem of field documents, and propose a natural field document splitting (NFDS) strategy and an attribute-specific field document splitting (ASFDS) strategy. The NFDS strategy splits a document into several field documents according to the splitting positions identified by the natural field structure. The ASFDS strategy, a reuse technique of text features, extracts the texts with strong distinguishability by some rules to form a field document, which does not really exist in the original document. Experimental results show that the NFDS strategy is general for the common multi-field structure of information documents, and the ASFDS strategy is more suitable for short text documents because it can overcome the problem of sparse features.(3) We also deeply investigate the combining problem of field results, and propose five combining strategies (arithmetical average, support vector model, historical performance of field classifiers, text quantity of field documents, and compound). Experimental results show that the five strategies can improve the performance of previous TC algorithms, and the compound strategy, considering the historical performance of field classifiers and the current text quantity of field documents, can achieve the promising performance of time complexity and precision.(4) The token frequency distribution is detailedly investigated in document collections, and it is found that the token frequency distribution commonly follows a power law. According to this finding, we propose a token frequency index (TFI) based TC algorithm. This algorithm, transfering a research idea of text retrieval to TC problems, uses an equal-probability-based random sampling to compress labeled documents online, and can solve the hard problem of turning a posteriori rule of offline batches to a priori online computable rule in traditional statistical TC methods. The TFI data structure has an advantage of the low time complexity for each query and each incremental update, and has a raw text compression property of indexes and a compression property based on random sampling. So the TFI can capture the varied content and the concept drift space-time-efficiently. Experimental results show that the TFI-based TC algorithm can solve the space-time-efficient classifying problem of field documents, and integrated in the MFL framework, this algorithm can achieve the state-of-the-art performance of the low space-time complexity and the high precision. Moreover, we extend the research idea of TFI, and propose a multiclass token frequency index (MTFI) based TC algorithm. Experimental results show that the MTFI-based TC algorithm is effective in the multiclass document classification.(5) In the research of the costly feedback problem, three active learning strategies (chronological priority, priori range, variance-based uncertainty sampling) are proposed. The variance-based uncertainty sampling active learning strategy makes use of the decision difference of several field classifiers, and compares the current variance of field results with the threshold of historical variances to choose informative documents to require user feedback. Experimental results show that the variance-based uncertainty sampling is the best active learning strategy among the three strategies, and the best strategy can achieve the promising performance with greatly reduced requirements of user feedback. And owing to the low space-time complexity of variance computing, the variance-based uncertainty sampling is an effective active learning strategy.In conclusion, this dissertation investigates some key problems in spam filtering, and proposes a series of TC methods around the MFL idea. The proposed TC methods can meet the practical application requirement of spam filtering. The further research of the MFL can be expected to achieve higher performance.

节点文献中: