基于神经网络集成的垃圾邮件过滤系统设计

The Design of Spam Filtering System Based on Neural Network Ensemble

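【Author】 Liu Baoping (刘宝萍)

【Advisor】 Li Aijun (李爱军)

【Author information】 Shanxi University of Finance and Economics, Computer Application Technology, 2010, Master's thesis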

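【Chinese abstract, translated】 The spread and application of networks has made e-mail an important means of exchanging information, but the spam that has come with it seriously affects people's work and daily life. Research on spam filtering technology is therefore of great significance. Existing spam filtering techniques have many shortcomings and cannot filter out spam completely. To approach the ideal of filtering out all spam, a more effective filtering technique is needed that raises the accuracy of e-mail classification. Ensembles can improve the accuracy of a classifier. Among the machine learning schemes currently applied to spam filtering, neural networks are one of the more effective methods. However, neural networks easily fall into local minima, which causes e-mails to be misclassified. This thesis therefore ensembles neural networks: neural network ensemble techniques combine several different single neural network classifiers into one classifier, and the output of the ensemble is determined jointly by the outputs of its member networks. The thesis studies how this idea can improve the generalization ability of the learning system and the filtering performance of the filtering system. The e-mail filtering system model designed here consists of three parts: e-mail preprocessing, feature extraction, and classifier design. E-mail preprocessing represents the data in a standard e-mail corpus in vector space model (VSM) form, which is easy for a computer to recognize and process; feature extraction uses the information gain (IG) algorithm to reduce the dimensionality of the data and improve running efficiency; classifier design uses the neural network ensemble methods Boosting and Bagging to construct the e-mail classifier, training it by combining the outputs of multiple single classifiers, determining the category of each e-mail, and filtering out spam. Experiments were carried out on each of the PU series of spam corpora. In addition to traditional evaluation metrics, the thesis also uses the confusion matrix as an evaluation method; comparison with the filtering performance of a single RBF neural network classifier demonstrates that the neural network ensemble filters spam well.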

【Abstract】 The spread and adoption of networks has made e-mail an important means of information exchange, but the accompanying spam problem seriously affects people's work and daily life, so research on spam filtering technology is of great significance. Existing spam filtering techniques have many shortcomings and cannot filter out spam completely. To approach the ideal of complete filtering, a more effective spam filtering technique is needed, one that improves the accuracy of e-mail classification. Ensemble methods can improve the classification accuracy of a classifier. Among the machine learning approaches currently applied to spam filtering, neural networks are one of the more effective. However, a neural network easily falls into a local minimum and then assigns e-mails to the wrong category. A neural network ensemble combines a number of different neural networks into a single classifier whose output is determined jointly by the outputs of its member networks. This thesis studies the use of this idea to improve the generalization ability of the learning system and the performance of the filtering system. The spam filtering system model designed in this thesis consists of three parts: preprocessing, feature selection, and classifier design. Preprocessing represents the data of a standard e-mail corpus in vector space model (VSM) form, which a computer can identify and process easily. Feature selection uses the information gain (IG) algorithm to reduce the dimensionality of the data and improve the running efficiency of the algorithm. Classifier design uses the neural network ensemble methods Boosting and Bagging to construct an e-mail classifier, which determines the category of an e-mail by combining the outputs of multiple single classifiers and filters out spam accordingly. Experiments are conducted on the PU series of spam corpora. In addition to traditional evaluation metrics, an evaluation method based on the confusion matrix is used, and comparison with a single RBF neural network classifier shows that the neural network ensemble filters spam more effectively.
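
The three stages named in the abstract (information gain feature selection, a Bagging ensemble decided by combining member outputs, and confusion matrix evaluation) can be illustrated with a minimal sketch. The abstract does not give the thesis's implementation, so everything below is an assumption made for illustration: documents are modeled as sets of tokens (a binary term-presence simplification of VSM), labels are the strings "spam" and "ham", members are combined by plain majority vote, and `train_fn` is a hypothetical stand-in for whatever routine trains one neural network. Boosting is not shown.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(docs, labels, term):
    """IG of the binary feature 'term occurs in document': H(C) - H(C | term)."""
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


def select_features(docs, labels, k):
    """Keep the k terms with the highest IG (ties broken alphabetically)."""
    vocabulary = set().union(*docs)
    ranked = sorted(vocabulary,
                    key=lambda t: (-information_gain(docs, labels, t), t))
    return ranked[:k]


def train_bagging(train_fn, docs, labels, n_models=5, seed=0):
    """Train n_models members, each on a bootstrap resample of the data.

    train_fn is a hypothetical callable: it receives a list of (doc, label)
    pairs and returns a function mapping a document to a predicted label.
    """
    rng = random.Random(seed)
    pairs = list(zip(docs, labels))
    models = []
    for _ in range(n_models):
        sample = [rng.choice(pairs) for _ in pairs]  # sampling with replacement
        models.append(train_fn(sample))
    return models


def bagging_predict(models, doc):
    """Combine member outputs by majority vote."""
    votes = Counter(model(doc) for model in models)
    return votes.most_common(1)[0][0]


def confusion_matrix(y_true, y_pred, positive="spam"):
    """Count true/false positives and negatives, treating spam as positive."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts
```

In a spam filter the "fp" cell (legitimate mail flagged as spam) is usually the costly error, which is one reason a confusion matrix can be more informative than overall accuracy alone. An odd `n_models` avoids tied votes in the two-class case.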
