
基于贝叶斯技术的邮件过滤研究

Research of E-mail Filtering Based on Bayes Technology

【Author】 Li Wen (李雯)

【Advisor】 Liu Peiyu (刘培玉)

【Author information】 Shandong Normal University, Computer Software and Theory, 2008, Master's thesis

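【Abstract (translated from Chinese)】 Electronic mail (e-mail) needs no introduction. With the rapid growth of the Internet, e-mail was quickly adopted by network users because it is convenient, fast, and cheap, and it has become one of the most popular ways of exchanging information. That convenience has been accompanied by spam, which causes serious harm. In recent years the flood of commercial, pornographic, and reactionary spam and of e-mail viruses has troubled and harmed Internet users and has had a strongly negative effect on society, and the security of mail systems has drawn close attention from the industry. The problem is severe in China, which is now the world's third-largest source of spam, so countering spam is urgent and research on spam filtering has great practical significance. Comprehensive control of spam requires legal and administrative measures as well as good filtering technology. This thesis concentrates on the technical measures and studies spam filtering. The main work is as follows:
1. Analysis and study of e-mail filtering technology and Bayesian techniques. The thesis first analyzes the background and current state of spam filtering research, including the harm spam causes and its characteristic types, and explains why spam proliferates and persists despite repeated bans. It surveys the mainstream anti-spam techniques in use in China and abroad and points out the characteristics and shortcomings of each. It then examines the basic principles of Bayesian techniques and the Naïve Bayes algorithm and their application to e-mail filtering.
2. An improvement on Naïve Bayes: the improved Naïve Bayes algorithm. Naïve Bayes, which is based on probability and statistics, is simple, fast, and accurate, and is widely used in text classification. In e-mail filtering, however, misclassifying legitimate mail as spam can cause the user heavy losses. The traditional Naïve Bayes algorithm does not take full account of the different characteristics of legitimate mail and spam when classifying and filtering, which limits its usefulness for e-mail filtering. The thesis therefore introduces the idea of loss minimization, combines it with Naïve Bayes, and adapts the result to the characteristics of spam, giving an improved Naïve Bayes spam filtering algorithm. The algorithm achieves the filtering behavior the user requires by adjusting the value of k.
3. Introduction of Boosting into e-mail filtering and a second improvement on Naïve Bayes: the improved Bayes algorithm based on the Boosting method. Although the improved Naïve Bayes algorithm lets the system weight its filtering of incoming mail through the choice of k, a value of k that is too large or too small lowers filtering precision. The main strength of Boosting is that it raises the accuracy of an algorithm: it can turn a low-accuracy "weak learning algorithm" into a high-accuracy "strong learning algorithm". To raise filtering precision, the thesis applies Boosting to e-mail filtering, uses it to boost Naïve Bayes, and proposes a new filtering algorithm, the improved Bayes algorithm based on the Boosting method. Experimental results show that the algorithm improves classification accuracy, lowers the misclassification rate, reduces the information loss and misjudgments of traditional methods, and improves overall filtering performance.
4. Design and implementation of an e-mail filtering system based on the improved Bayes algorithms. The improved algorithms proposed in the thesis were tested at the application level on an e-mail filtering platform. The experimental data confirm that the algorithms are reliable and effective, and the system gave satisfactory results when classifying and filtering spam.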
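The k-controlled decision rule described in item 2 can be illustrated with a small sketch. The abstract does not say how k enters the algorithm, so the code below assumes one common cost-sensitive reading: a message is flagged as spam only when the Naïve Bayes posterior odds of spam over legitimate mail exceed k, so a larger k protects legitimate mail at the price of letting more spam through. The function names (`train`, `log_odds`, `is_spam`) and the toy corpus are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter

def train(docs):
    """Fit a multinomial Naive Bayes model from (tokens, label) pairs.

    Labels are "spam" or "ham" (legitimate mail)."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for tokens, label in docs:
        doc_counts[label] += 1
        word_counts[label].update(tokens)
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocab

def log_odds(tokens, model):
    """Return log P(spam | tokens) - log P(ham | tokens)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    score = {}
    for label in ("spam", "ham"):
        n_words = sum(word_counts[label].values())
        s = math.log(doc_counts[label] / total_docs)  # class prior
        for t in tokens:
            if t in vocab:  # ignore words never seen in training
                # Laplace (add-one) smoothed word likelihood
                s += math.log((word_counts[label][t] + 1) / (n_words + len(vocab)))
        score[label] = s
    return score["spam"] - score["ham"]

def is_spam(tokens, model, k=1.0):
    """Assumed rule: flag as spam only if the posterior odds exceed k."""
    return log_odds(tokens, model) > math.log(k)

# Hypothetical toy corpus, for illustration only.
TOY_DOCS = [
    (["cheap", "pills", "buy"], "spam"),
    (["buy", "now", "cheap"], "spam"),
    (["meeting", "tomorrow", "agenda"], "ham"),
    (["project", "meeting", "notes"], "ham"),
]
toy_model = train(TOY_DOCS)
borderline = ["cheap", "buy", "meeting"]
flag_plain = is_spam(borderline, toy_model, k=1.0)     # plain Naive Bayes decision
flag_cautious = is_spam(borderline, toy_model, k=9.0)  # stricter threshold
```

With k = 1 the rule reduces to the ordinary Naïve Bayes decision. Raising k makes the filter more reluctant to discard mail, which is one way to express the asymmetric loss the abstract describes.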

【Abstract】 Electronic mail (e-mail) is familiar to everyone. With the rapid development of the Internet, e-mail has become one of the most popular means of communication because it is convenient, fast, and cheap. But spam (also called "junk mail") has emerged alongside that convenience and harms users. In recent years the flood of spam of all kinds has become a serious problem for users and for society, and mail system security has attracted wide interest and become a research focus in the industry. Spam is a severe problem in China, which is now the world's third-largest source of spam, so research on spam filtering is of great significance. Dealing with spam effectively requires not only legislation and management measures but also good filtering technology. This thesis studies spam filtering technology. Its contents are as follows:
1. Analysis and study of e-mail filtering technology and Bayesian techniques. We analyze the research background and current state of spam filtering, including the harm and characteristics of spam and the reasons it keeps growing. We survey the anti-spam technologies now prevalent worldwide and point out their advantages and disadvantages. We then study Bayesian techniques and the Naïve Bayes algorithm in detail.
2. An improved filtering algorithm based on Naïve Bayes: the improved Naïve Bayes algorithm. Naïve Bayes is widely used in text classification because the method is simple and classifies texts quickly and accurately. Misclassifying legitimate mail as spam costs the user more than misclassifying spam as legitimate mail. However, the traditional Naïve Bayes method does not consider the different characteristics of legitimate mail and spam when classifying and filtering mail, and it does not take into account the loss caused by misclassifying legitimate mail as spam, which limits its usefulness for e-mail filtering. This thesis presents an improved spam filtering algorithm that minimizes the user's loss. The improved Naïve Bayes algorithm can be tuned to the user's needs by changing the value of k.
3. A second improved algorithm that combines Naïve Bayes with the Boosting method: the improved Naïve Bayes algorithm based on the Boosting method. The improved Naïve Bayes algorithm lets the filtering system favor different types of e-mail through the choice of k. However, a value of k that is too large or too small reduces accuracy. The main strength of the Boosting method is that it raises the accuracy of an algorithm: it can effectively transform a weak learning algorithm into a strong learning algorithm. The improved Naïve Bayes algorithm based on the Boosting method is proposed to improve the accuracy of the spam filter. The experimental results show that it reduces information loss and the rate of misclassified mail, and that it performs better than the traditional Naïve Bayes method.
4. Design and implementation of an e-mail filtering system based on the improved Bayes algorithms. Finally, we apply the improved Bayes algorithms in an e-mail filtering system. The experimental results show that the algorithms are reliable and valid, and the system achieves satisfactory test results.
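The idea of boosting Naïve Bayes described in item 3 can be sketched as follows. The abstract does not name the Boosting variant or say how it is combined with the k-adjusted classifier, so this sketch assumes plain discrete AdaBoost with a weighted multinomial Naïve Bayes as the weak learner, and leaves k out. Labels are +1 for spam and -1 for legitimate mail. The names `train_weighted_nb`, `adaboost_nb`, and `classify` and the toy data are hypothetical.

```python
import math
from collections import defaultdict

def train_weighted_nb(docs, labels, weights):
    """Multinomial Naive Bayes in which each document counts with its weight."""
    vocab = {t for d in docs for t in d}
    prior = {+1: 0.0, -1: 0.0}
    counts = {+1: defaultdict(float), -1: defaultdict(float)}
    for d, y, w in zip(docs, labels, weights):
        prior[y] += w
        for t in d:
            counts[y][t] += w
    totals = {y: sum(counts[y].values()) for y in (+1, -1)}

    def predict(doc):
        score = {}
        for y in (+1, -1):
            s = math.log(prior[y] + 1e-12)  # guard against a zero-weight class
            for t in doc:
                if t in vocab:
                    # add-one smoothing on the weighted counts
                    s += math.log((counts[y][t] + 1.0) / (totals[y] + len(vocab)))
            score[y] = s
        return +1 if score[+1] > score[-1] else -1

    return predict

def adaboost_nb(docs, labels, rounds=5, eps=1e-10):
    """Discrete AdaBoost over weighted Naive Bayes weak learners."""
    n = len(docs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        # Scale by n so that uniform weights reproduce ordinary counts.
        h = train_weighted_nb(docs, labels, [w * n for w in weights])
        preds = [h(d) for d in docs]
        err = sum(w for w, p, y in zip(weights, preds, labels) if p != y)
        if err >= 0.5:  # no better than chance: stop adding learners
            break
        alpha = 0.5 * math.log((1.0 - err + eps) / (err + eps))
        ensemble.append((alpha, h))
        if err == 0:  # training set already separated
            break
        # Raise the weight of misclassified mail, lower the rest, renormalize.
        weights = [w * math.exp(-alpha * y * p)
                   for w, p, y in zip(weights, preds, labels)]
        z = sum(weights)
        weights = [w / z for w in weights]
    return ensemble

def classify(doc, ensemble):
    """Weighted vote of the weak learners; +1 means spam."""
    vote = sum(alpha * h(doc) for alpha, h in ensemble)
    return +1 if vote > 0 else -1

# Hypothetical toy data, for illustration only.
toy_docs = [["cheap", "pills", "buy"], ["buy", "now", "cheap"],
            ["meeting", "tomorrow", "agenda"], ["project", "meeting", "notes"]]
toy_labels = [+1, +1, -1, -1]
toy_ensemble = adaboost_nb(toy_docs, toy_labels)
```

Each round retrains Naïve Bayes with more weight on the messages the previous round got wrong, and the final decision is a vote weighted by each round's accuracy. Whether the thesis uses this scheme or another Boosting variant cannot be determined from the abstract.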

  • 【Classification number】 TP393.098
  • 【Times cited】 4
  • 【Downloads】 349