Research on Adversarial Classification Problems in Spam Filtering
Adversarial Classification for Email Spam Filtering
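【Author】 Deng Wei (邓蔚);
【Advisor】 Qin Zhiguang (秦志光);
【Author information】 University of Electronic Science and Technology of China, Computer System Architecture, 2011, PhD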
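【Abstract (translated from Chinese)】 Machine learning, as an important intelligent information processing technology, is widely used in spam filtering systems. In real adversarial network environments, however, spam filters face never-ending malicious attacks from spammers. As a result, machine learning algorithms that perform well in experimental settings may perform very poorly in practice. Adversarial classification was proposed to meet this challenge and has become an active research topic in machine learning, with significant theoretical and practical value. This dissertation studies adversarial classification in spam filtering in three areas: the attack-defense game in adversarial classification, resistance to Chinese good word attacks in spam filtering, and robust classification based on Kolmogorov complexity. It makes the following five contributions.
1. An adversarial classification model based on a Stackelberg game with reaction delay is proposed. Earlier Stackelberg-game models of adversarial classification cannot explain why spammers keep attacking after the Nash equilibrium has been reached. This model introduces the follower's real-world reaction delay into the Stackelberg game, analyzes in detail how that delay affects the payoffs of the leader and the follower, and uses a genetic algorithm to find the Nash equilibrium. Simulation experiments verify the correctness of the model. The model shows that the spammer has a first-mover advantage and earns excess payoff during the data miner's reaction delay, and therefore keeps launching new attacks.
2. An adversarial classification model based on a Stackelberg game with uncertainty is proposed. Existing Stackelberg-game models of adversarial classification usually assume that the follower acts optimally and rationally, which is unrealistic in practical spam filtering. This model introduces the follower's bounded rationality and limited observation into the Stackelberg game and analyzes in detail how the uncertainty parameters affect classifier performance. Experiments on a real email dataset verify the effectiveness of the model.
3. A multiple-instance logistic regression model that resists good word attacks in Chinese spam is proposed. There is still little research on Chinese good word attacks. The model preprocesses messages with Chinese word segmentation and feature selection, then learns and classifies with a multiple-instance mechanism and logistic regression. Experiments on a Chinese email dataset show that the model resists good word attacks in Chinese spam effectively and is more robust than single-instance logistic regression and single-instance support vector machine models.
4. A spam image classification model based on Kolmogorov complexity is proposed. Traditional spam image classifiers suffer from poor robustness and from image features that are sensitive to the particular dataset. This model uses data compression and a Kolmogorov classification mechanism to classify spam images accurately. Experiments on a spam image dataset verify that the model classifies spam images effectively, and a security analysis of the model's update mechanism is given. The model needs neither text extraction from images nor the definition and selection of image features; it is a data-driven, parameter-free classification method.
5. A malware detection framework based on Kolmogorov complexity is proposed. Spam is an effective channel for spreading malware, and traditional signature-based methods have difficulty detecting new and variant malware. The framework offers a general malware detection method that classifies code samples with dynamic Markov compression. Experimental results verify that it classifies malware accurately. The framework is simple to implement, requires no signature extraction, and identifies new and variant malware effectively.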
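To illustrate why a multiple-instance view can help against good word attacks (contribution 3), the following is a minimal toy sketch and not the dissertation's model. It assumes the message has already been segmented into tokens, uses hand-set hypothetical weights (`WEIGHTS`, `BIAS`) in place of a trained logistic regression, and splits a message into fixed-size instances; the dissertation's actual splitting and training procedure are not described in this record.

```python
import math

# Hypothetical hand-set token weights standing in for a trained model:
# positive values are spam indicators, negative values are "good words".
WEIGHTS = {"viagra": 3.0, "winner": 2.5, "free": 1.5,
           "meeting": -2.0, "report": -2.0, "thanks": -1.5}
BIAS = -1.0


def spam_probability(tokens):
    """Single-instance logistic score over all tokens of a message."""
    z = BIAS + sum(WEIGHTS.get(t, 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-z))


def split_into_instances(tokens, size=3):
    """Cut a token list into consecutive chunks (an assumed splitting rule)."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def bag_spam_probability(tokens, size=3):
    """Multiple-instance score: the bag is as spammy as its worst instance."""
    instances = split_into_instances(tokens, size) or [[]]
    return max(spam_probability(inst) for inst in instances)
```

With these toy weights, the message `["viagra", "winner", "free"]` padded with two copies of `["meeting", "report", "thanks"]` scores about 0.007 as a single instance, because the good words outweigh the spam words, but about 0.998 as a bag, because the first instance is still scored on its own.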
【Abstract】 As an important intelligent information processing technology, machine learning is widely used in spam filtering systems. In practical adversarial environments, however, spam filters face never-ending malicious attacks by spammers, so machine learning algorithms that perform well in experimental environments may perform badly in practice. Adversarial classification was proposed to meet this challenge. It is now an active topic in machine learning and has great theoretical and practical value.
This dissertation studies adversarial classification problems in spam filtering: the game between attacker and defender in adversarial classification, combating Chinese good word attacks in spam filtering, and robust classification methods based on Kolmogorov complexity. The five contributions of the dissertation are as follows.
1. A Stackelberg game-theoretic model with reaction-time delay is proposed for adversarial classification. Previous Stackelberg game-theoretic models of adversarial classification could not explain why the spammer continues to launch attacks after the Nash equilibrium is reached. This model incorporates the data miner's reaction-time delay into the Stackelberg game and analyzes in detail how the delay affects the spammer and the data miner. The Nash equilibrium is computed with a genetic algorithm, and the model's correctness is verified by experiments. The model shows that the spammer, who has the advantage of moving first, obtains extra payoff during the data miner's reaction-time delay and can therefore keep launching new attacks.
2. A Stackelberg game-theoretic model with uncertainties is proposed for adversarial classification. Existing Stackelberg game models for adversarial classification assume that the data miner plays optimally and rationally, which does not hold in practical spam filtering. The proposed model accounts for the data miner's bounded rationality and limited observation of the spammer's strategy, and analyzes in detail how different uncertainty parameters affect the classifier. The model's effectiveness is verified on a real spam dataset.
3. A multiple-instance logistic regression model for combating Chinese good word attacks is proposed. There is currently little research on Chinese good word attacks. The model uses Chinese word segmentation and feature selection for preprocessing, then uses a multiple-instance learning mechanism and logistic regression for learning and classification. Experimental results on large Chinese spam corpora show that the model combats Chinese good word attacks effectively and that it is more robust than single-instance logistic regression and single-instance support vector machine models.
4. A Kolmogorov complexity based spam image classification model is proposed. Traditional classification algorithms for spam images suffer from poor robustness and from image features that are strongly sensitive to the particular image dataset. The model uses data compression and a Kolmogorov complexity classification mechanism to classify spam images. Experimental results on a spam image database show that the model classifies spam images accurately. The security of the model's updating mechanism is also analyzed. The model needs neither text extraction from images nor feature definition and feature selection for images; it is a data-driven, parameter-free classification method.
5. A Kolmogorov complexity based malware detection framework is proposed. Spam is an effective way to transmit malware, and traditional signature-based approaches have difficulty detecting new or obfuscated malware. The proposed general malware detection framework uses dynamic Markov compression to classify code instances. Experimental results show that the framework detects malware accurately. It is easy to implement, requires no malware signature selection, and detects unknown and obfuscated malware effectively.
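The compression-based classification idea behind contributions 4 and 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation: it swaps in Python's standard `zlib` (DEFLATE) compressor where the dissertation uses dynamic Markov compression, and the function names and the "extra compressed bytes" scoring rule are hypothetical choices. A sample is assigned to the class whose reference corpus lets it be compressed most cheaply, which approximates conditional Kolmogorov complexity without any feature extraction.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data, a crude proxy for complexity."""
    return len(zlib.compress(data, 9))


def extra_cost(corpus: bytes, sample: bytes) -> int:
    """Additional compressed bytes needed for the sample once the corpus is known."""
    return compressed_size(corpus + sample) - compressed_size(corpus)


def classify(sample: bytes, corpora: Mapping[str, bytes]) -> str:
    """Return the label whose corpus gives the smallest extra compression cost."""
    return min(corpora, key=lambda label: extra_cost(corpora[label], sample))
```

A usage example would pass raw bytes (message text, image files, or code samples) as `sample` and one concatenated reference corpus per class in `corpora`. Because `zlib` only looks back over a 32 KiB window, a stronger compressor is needed for large corpora; the sketch is meant to show the decision rule only.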
【Key words】 spam filtering; adversarial classification; Stackelberg games; Kolmogorov complexity; Chinese good word attacks;