垃圾邮件行为模式识别与过滤方法研究

Research on Spam Behavior Patterns and Recognition Methods

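【Author】 Wang Meizhen (王美珍)

【Advisor】 Li Zhitang (李芝棠)

【Author information】 Huazhong University of Science and Technology (华中科技大学), Computer System Architecture, 2009, Ph.D.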

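【Abstract (translated from the Chinese)】 E-mail has become one of the most common means of communication. However, weaknesses in SMTP (Simple Mail Transfer Protocol), in particular the absence of any authentication of or control over senders, have allowed spam to proliferate. Spam filtering is a complex problem. Although it has been studied extensively and with considerable success, no existing technique filters all spam perfectly. As disguise techniques develop, spam is becoming harder to recognize, so content-based filtering suffers a high misclassification rate, and it also spends a great deal of processing time on the large volume of suspected spam. New methods and algorithms are therefore needed.

A framework for a behavior-recognition spam filtering system based on data mining is proposed. Behavior features are extracted from the collected data and divided into session behavior features, header behavior features, and statistical behavior features. A feature selection algorithm selects the features that effectively predict the class attribute of the training data, and, after preprocessing, rules for identifying spam behavior are mined from the data.

A multi-level model for mining spam behavior patterns is proposed, in which each type of behavior feature is handled by a different mining algorithm. For behavior features in the MTA (Mail Transport Agent) session stage, a decision-tree model for recognizing spam-sending behavior is proposed. It does not need to receive the whole message: by mining the behavior exhibited during the mail session, it filters spam early, at the session stage. For user sending behavior, a histogram distance method detects abnormal sending behavior. For attachments, fingerprint and statistical features are computed to build a feature vector, and a support vector machine models the attachment behavior of spam. For URLs (Uniform Resource Locators), the similarity between URLs is computed and similar URLs are gathered into groups; the minimum distance between a sample and the URL groups is converted into a confidence value, which serves as the classification output used to identify spam behavior.

Because traditional Bayesian spam filtering ignores the losses caused by false positives and false negatives, an improved Bayesian algorithm with a loss factor is proposed; it minimizes the risk of misclassifying mail as spam without reducing accuracy. With a suitable loss factor, both accuracy and recall reach fairly good levels. This algorithm is used to combine the outputs of the individual models. A performance comparison of the combined Bayesian model with the attachment model, the sending behavior model, and the URL model confirms that the improved combined Bayesian model classifies considerably better than any single model.

A classification method based on fuzzy decision trees is proposed. Because attributes in the real world are not always crisp, attribute membership degrees describe behavior features more naturally and reasonably, which makes a fuzzy decision tree more suitable than a crisp one. The fuzzy decision tree algorithm extends decision tree learning to handle uncertainty: it deals sensibly with imprecise information during learning and inference, offers stronger classification ability and robustness, and, because it can generate rules at different levels and with different confidence degrees, gives decision makers richer information.

MailGate, a mail filtering system that combines behavior pattern recognition with other filtering techniques, is designed and implemented as a prototype. Experimental results show that MailGate achieves good recall and a good false positive rate in spam filtering.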
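
The histogram distance check on user sending behavior can be illustrated with a minimal sketch. The abstract does not say which quantity is binned or which distance is used, so the hour-of-day histogram, the L1 distance, and the threshold of 1.0 below are all assumptions, and the function names are hypothetical.

```python
from collections import Counter


def hourly_histogram(send_hours, bins=24):
    """Normalized histogram of send times (hour of day, 0-23)."""
    counts = Counter(h % bins for h in send_hours)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * bins
    return [counts.get(b, 0) / total for b in range(bins)]


def histogram_distance(p, q):
    """L1 distance between two normalized histograms (0 = identical, 2 = disjoint)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def is_anomalous(baseline_hours, recent_hours, threshold=1.0):
    """Flag the recent window if it drifts too far from the user's baseline.

    The threshold is an assumed value; in practice it would be tuned on data.
    """
    distance = histogram_distance(
        hourly_histogram(baseline_hours), hourly_histogram(recent_hours)
    )
    return distance > threshold


# Hypothetical send times for one user.
baseline = [9, 10, 10, 11, 14, 15, 16, 9, 10, 15]
normal_day = [9, 10, 11, 15, 16]
night_burst = [2, 2, 3, 3, 3, 2, 3, 2]

print(is_anomalous(baseline, normal_day))
print(is_anomalous(baseline, night_burst))
```

A day that follows the user's usual working hours stays under the threshold, while a burst of mail at hours the user has never sent from is flagged.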

【Abstract】 E-mail has become one of the most common means of modern communication. However, the imperfections of SMTP (Simple Mail Transfer Protocol), especially its lack of authentication and control for e-mail senders, have led to a flood of spam. Spam filtering is a complex research problem. Although much research has been done and many results have been achieved, no technical solution yet filters all spam perfectly. With the development of camouflage techniques, spam has become more obscure, which leads to a higher false positive rate for content-based filtering. Content-based filters also spend a great deal of time processing the large volume of suspected spam. Therefore, new methods and algorithms are needed.

A framework for a spam filtering system based on mining behavior patterns is proposed. Behavior features are extracted from the collected data and divided into session features, message header features, and statistical features. A feature selection algorithm chooses the features that effectively predict the class attribute of the training data, and, after data preprocessing, rules for determining spam behavior are mined from the training data.

A model for mining spam behavior patterns, based on a multi-level structure, is proposed. Different pattern mining algorithms are used for different types of behavior features. For session features in the MTA (Mail Transport Agent) stage, a decision tree is used to recognize spammers' behavior. It does not need to receive the entire message: by mining behavior patterns from features of the conversation, it can filter spam early in the session. A histogram distance method is applied to user sending behavior to detect abnormal sending. Fingerprint features and statistical features of attachments are calculated to generate a feature vector, and a Support Vector Machine (SVM) model is used to model attachment behavior. By calculating the similarity between URLs (Uniform Resource Locators), similar URLs are grouped into URL cliques. The minimum distance between a sample and the URL cliques is converted into a confidence level, which is used as the classifier output to determine spam behavior.

A collaborative filtering model based on a Bayesian algorithm is proposed; the model correlates the results of the various models. Because traditional Bayesian spam filtering does not consider the losses caused by false negatives and false positives, an improved Bayesian algorithm is proposed. The algorithm introduces a loss factor to minimize the risk of false positives without reducing the accuracy of filtering. With an appropriate loss factor, both accuracy and recall reach good levels. A performance comparison of the combined Bayesian model with the attachment model, the user sending behavior model, and the URL model shows that the improved combined Bayesian model greatly improves filtering ability over the single models.

A classification method based on fuzzy decision trees is proposed. Because absolutely crisp attributes do not always exist in the real world, the degree of attribute membership describes behavior characteristics more naturally and reasonably, so a fuzzy decision tree is more suitable than a crisp decision tree. The fuzzy decision tree algorithm expands the scope of application of decision trees and can handle uncertainty. It deals with imprecise information in the process of learning and inference, and has stronger classification ability and robustness. It can generate rules at different levels and with different confidence degrees, providing decision makers with rich decision information.

The filtering system MailGate, which combines behavior-based pattern recognition with other e-mail filtering techniques, is designed and implemented. Experiments show that MailGate achieves good recall and false positive (FP) rates for spam filtering.
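
The role of the loss factor in the combined Bayesian model can be sketched as follows. The abstract does not give the thesis's formula, so this uses the standard cost-sensitive Bayes decision rule as a stand-in: if misclassifying a legitimate message costs `loss_factor` times as much as missing a spam message, expected loss is minimized by labeling a message as spam only when the posterior odds of spam exceed `loss_factor`. The naive Bayes combination of sub-model outputs, the likelihood values, and all names are assumptions made for illustration.

```python
import math


def spam_log_odds(votes, likelihoods, prior_spam=0.5):
    """Log posterior odds of spam under a naive Bayes combination.

    votes: {model_name: bool}, True if that sub-model flagged the message.
    likelihoods: {model_name: (P(flag | spam), P(flag | legitimate))}.
    """
    log_odds = math.log(prior_spam / (1.0 - prior_spam))
    for name, flagged in votes.items():
        p_spam, p_legit = likelihoods[name]
        if flagged:
            log_odds += math.log(p_spam / p_legit)
        else:
            log_odds += math.log((1.0 - p_spam) / (1.0 - p_legit))
    return log_odds


def classify(votes, likelihoods, loss_factor=9.0, prior_spam=0.5):
    """Label as spam only if the posterior odds exceed the loss factor."""
    return spam_log_odds(votes, likelihoods, prior_spam) > math.log(loss_factor)


# Hypothetical per-model flag rates; real values would be estimated from training data.
LIKELIHOODS = {
    "attachment": (0.70, 0.05),
    "sending": (0.80, 0.10),
    "url": (0.75, 0.05),
}

all_flagged = {"attachment": True, "sending": True, "url": True}
only_sending = {"sending": True}  # the other sub-models gave no output

print(classify(all_flagged, LIKELIHOODS))
print(classify(only_sending, LIKELIHOODS, loss_factor=1.0))
print(classify(only_sending, LIKELIHOODS, loss_factor=9.0))
```

With `loss_factor=1.0` the rule reduces to an ordinary maximum-posterior decision. Raising it makes the filter demand stronger evidence before rejecting a message, which trades some recall for fewer false positives.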
