
Research on Spam Filtering Based on the Bayesian Algorithm (基于贝叶斯算法的垃圾邮件过滤研究)

Study on Spam Filtering Technology Based on Bayes

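【Author】 Lu Qingmei (陆青梅)

【Advisor】 Yin Siqing (尹四清)

【Author information】 North University of China (中北大学), Computer Application Technology, 2008, Master's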

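【Abstract (translated from the Chinese)】 With the rapid growth of the Internet, e-mail has become a primary means of modern communication. At the same time, spam has spread across the network and causes considerable trouble for users, so preventing spam effectively is a practical problem of real importance. This thesis first surveys a large body of domestic and international anti-spam literature and data, and analyzes and summarizes existing spam filtering techniques. Spam filtering is a key anti-spam measure; the main current approaches are filtering based on security authentication, filtering based on rules, and filtering based on statistical learning, of which the latter two are content-based. The thesis studies content-based spam filtering algorithms, focusing on the Bayesian algorithm and its classification models. The PG Bayesian, GR Bayesian, and naive Bayesian algorithms are analyzed in detail and compared experimentally, with emphasis on the strengths and weaknesses of naive Bayes for spam filtering. To address those weaknesses, a feature selection algorithm based on the chi-square distribution is adopted to further improve the accuracy and efficiency of Chinese word segmentation; a minimum-risk factor is introduced to lower the risk of misjudging spam, which reduces how often users must intervene and improves recognition efficiency; and a cognitive learning algorithm is proposed to strengthen the model's self-learning ability while greatly reducing the difficulty of recognizing spam in high-dimensional vector spaces, so that the model achieves better precision and recall. Building on minimum-risk naive Bayes, the thesis further introduces cognitive learning theory, providing a sound technical solution for filtering spam represented by high-dimensional vectors. Experimental results show that the method further raises the spam recognition rate and, in particular, handles spam filtering in high-dimensional feature vector spaces well, laying a foundation for research on spam filtering based on artificial intelligence.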

【Abstract】 With the rapid development of the Internet, e-mail has become a primary means of modern communication. However, spam (also called "junk mail") has spread widely online at the same time, causing a great deal of trouble for users. Preventing and controlling spam effectively is therefore an important practical problem. The thesis first examines a considerable body of anti-spam literature and data from both home and abroad, and then analyzes and summarizes existing anti-spam techniques. E-mail filtering is an important measure against spam; at present it mainly takes the form of filtering based on IP address, on rules, and on statistical learning, and the latter two are content-based. The thesis concentrates on content-based spam filtering algorithms, which treat the task as text categorization: the text of a message is preprocessed and spam is then recognized by a text classifier. The Bayesian algorithm and its categorization models are studied in depth.
The PG Bayesian, GR Bayesian, and naive Bayesian algorithms are analyzed in detail and compared through experiments, with the discussion centered on the strengths and limitations of the naive Bayesian algorithm in spam filtering. To increase the accuracy and efficiency of Chinese word segmentation, a feature selection algorithm based on the χ² (chi-square) statistic is chosen, together with a method of balancing the key words. By introducing a minimum-risk factor, the risk of misjudging spam is reduced, which lowers the frequency of user intervention and increases the efficiency of recognition. By proposing a cognitive learning algorithm, the self-learning capability of the model is increased and the difficulty of recognizing spam in high-dimensional vector spaces is reduced, so that the model reaches better precision and recall. Building on the minimum-risk naive Bayesian algorithm and introducing cognitive learning, the thesis offers a better solution for filtering spam represented by high-dimensional vectors. The experiments show that the method increases the spam recognition rate, in particular for spam filtering in high-dimensional feature vector spaces, and thereby contributes a basis for research on spam filtering based on artificial intelligence.
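
As a rough illustration of the pipeline the abstract describes (chi-square feature selection feeding a naive Bayes classifier with a minimum-risk decision rule), here is a minimal Python sketch. It is not the thesis's implementation: the function names, the toy pre-tokenized messages, the multinomial model with Laplace smoothing, and the `risk_factor` threshold are all assumptions made for illustration. Chinese word segmentation and the cognitive learning step are not covered.

```python
import math
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic of a 2x2 term/class contingency table.

    a: spam messages containing the term, b: ham messages containing it,
    c: spam messages without it,          d: ham messages without it.
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def select_features(docs, labels, k):
    """Return the k terms with the highest chi-square score (ties broken by name)."""
    n_spam = sum(labels)
    n_ham = len(labels) - n_spam
    df_spam, df_ham = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        (df_spam if label else df_ham).update(set(tokens))
    vocab = set(df_spam) | set(df_ham)
    scores = {
        t: chi_square(df_spam[t], df_ham[t], n_spam - df_spam[t], n_ham - df_ham[t])
        for t in vocab
    }
    return set(sorted(vocab, key=lambda t: (-scores[t], t))[:k])


def train(docs, labels, features):
    """Fit a multinomial naive Bayes model (Laplace smoothing) on the selected terms."""
    counts = {True: Counter(), False: Counter()}
    for tokens, label in zip(docs, labels):
        counts[bool(label)].update(t for t in tokens if t in features)
    model = {"prior": {}, "cond": {}}
    for c in (True, False):
        n_c = sum(1 for label in labels if bool(label) == c)
        total = sum(counts[c].values()) + len(features)
        model["prior"][c] = math.log(n_c / len(labels))
        model["cond"][c] = {t: math.log((counts[c][t] + 1) / total) for t in features}
    return model


def is_spam(model, tokens, risk_factor=1.0):
    """Minimum-risk decision: flag spam only if the log-odds exceed log(risk_factor).

    risk_factor is the assumed cost of blocking a legitimate message relative
    to the cost of letting a spam message through; 1.0 gives the plain
    maximum-posterior rule.
    """
    score = {
        c: model["prior"][c]
        + sum(model["cond"][c][t] for t in tokens if t in model["cond"][c])
        for c in (True, False)
    }
    return score[True] - score[False] > math.log(risk_factor)


# Toy, already-tokenized corpus (hypothetical data).
docs = [
    ["free", "prize", "click"],
    ["free", "offer", "click"],
    ["meeting", "schedule", "project"],
    ["project", "report", "free"],
]
labels = [True, True, False, False]  # True = spam

features = select_features(docs, labels, k=2)
model = train(docs, labels, features)
```

Raising `risk_factor` above 1 makes the filter more conservative: a message is flagged only when the evidence for spam outweighs the assumed extra cost of losing a legitimate message, which is one common way to realize a minimum-risk Bayes decision.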

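  • 【Online publication contributor】 North University of China (中北大学)
  • 【Online publication year/issue】 2008, Issue 11
  • 【Classification number】 TP393.098
  • 【Citation count】 3
  • 【Download count】 368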