反垃圾邮件中贝叶斯方法的应用研究

Applied Research of Bayesian Method on the Technology of Anti-Spam

【Author】 Xu Songpu (徐松浦)

【Advisor】 Miao Fang (苗放)

【Author Information】 Chengdu University of Technology (成都理工大学), Applied Mathematics, 2005, Master's degree

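【Abstract (translated from the Chinese)】 This thesis draws on the research of a major special project under the National High Technology Research and Development Program (863 Program): "Research on Key Technologies and Applications of a Public Information Platform Based on Domestic Linux", one of the first batch of projects in "Narrowing the Digital Divide: Western Action" (project number: 2003AA1Z2530).

In recent years, Internet use in China has entered a period of rapid growth. E-mail has brought users great convenience, but it has also created a new problem: large volumes of spam. How to filter out the messages that belong to the "spam" category has become a major concern for e-mail users. This is the so-called "anti-spam" problem, and it is also one that the public information platform based on domestic NC and domestic Linux must solve.

Controlling spam requires a combined effort on three fronts: legislation, organization, and technology. On the technical side, we must recognize clearly that the contest between the techniques for producing and spreading spam and the techniques for countering it, like the struggle between people and computer viruses, is a long-running process in which each side alternately gains and loses ground. To that end, this thesis studies the theory and background of anti-spam technology, automatic text classification systems, Bayesian classification models, and the combination of multiple classifiers.

Bayesian classification is a classification method based on probability and statistics. It has a clear theoretical basis, runs fast, and classifies accurately, so it is widely used for text classification in many fields, with good results. This thesis studies in depth a number of Bayesian variants: the naive Bayesian classifier (NBC), the boosted naive Bayesian classifier (Boosted NBC), the semi-naive Bayesian classifier (SNBC), the tree-augmented naive Bayesian network classifier (TAN), the incremental Bayesian classifier, and Bayesian networks (BN).

On this basis, the thesis proposes a multiple-classifier combination model for anti-spam based on Bayesian techniques, and proposes an improved method for optimizing the model's threshold settings. Experimental results show that the model achieves high precision and recall and can provide theoretical support for designing better anti-spam solutions.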

【Abstract】 This thesis is based on "Key Technology and Applied Research of the Public Information Platform Based on Domestic Linux", one of the first batch of projects in "Narrowing the Digital Divide: West Program", a major special project of the national 863 Program (serial number: 2003AA1Z2530). With the rapid development of the Internet, electronic mail brings users both convenience and trouble, especially the latter, because so much junk mail appears in users' mailboxes. How to filter out junk mail and retain useful e-mail is a significant problem not only for e-mail users but also for the public information platform based on domestic Linux and NC. This is the so-called "anti-spam" problem. Dealing with junk mail requires action on three fronts: legislation, organization, and technology. On the technical side, the fight against junk mail senders, like the fight against computer viruses, is long and hard. In this regard, the author has studied the theories and techniques of anti-spam, text filtering, Bayesian classifier models, and the combination of multiple classifiers. The Bayesian classification algorithm is a filtering method based on probability and statistics, and it performs fairly well in text classification. Accordingly, the author studies in depth the naive Bayesian classifier (NBC), Boosted NBC, Semi-NBC (SNBC), the Tree-Augmented Naive Bayesian Classifier (TAN), the incremental Bayesian classifier, and Bayesian networks (BN). Building on this work, the author proposes a Bayesian multiple-classifier combination model for anti-spam and an improved method for setting the model's threshold. Experimental results show that the model achieves fairly satisfactory performance in mail filtering and may provide theoretical support for designing anti-spam software.

  • 【Classification No.】O212.8
  • 【Times Cited】7
  • 【Downloads】353