Research and Implementation of an E-Mail Filtering System Based on an Information Fusion Criterion

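【Author】 Wu Shuo

【Advisor】 Jing Xiaojun

【Author information】 Beijing University of Posts and Telecommunications, Communication and Information Systems, 2008, Master's degree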

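【Chinese Abstract】 Content-based spam filtering is a key research problem in Internet security. Applying machine learning methods to spam classification is an effective way to process large volumes of spam. Considering the characteristics of e-mail, this thesis analyzes the shortcomings of traditional filtering techniques and, building on a statistical analysis of a large body of spam, studies e-mail filtering based on an information fusion criterion. The thesis covers three main areas. (1) It surveys the state of research on spam filtering, including the definition of spam, the harm it causes, and the main filtering techniques in current use. After summarizing and comparing common feature extraction methods and filtering algorithms, it proposes the Term Frequency Cross Entropy (TFCE) algorithm, which performs classification using expected cross entropy (CE) in place of the IDF function of the Term Frequency Inverse Document Frequency (TFIDF) algorithm. (2) Drawing on information fusion techniques and theoretical analysis, and addressing the weakness of traditional spam decisions that rely on a single criterion, it focuses on a fused spam decision criterion based on triangular norm (t-norm) operators. It then describes the principle of the criterion, its evaluation results, and its implementation, including the system architecture, the functional and organizational models, the mail filtering workflow, and the spam feedback module. (3) It verifies the effectiveness of the algorithms experimentally. The simulations fall into two parts. The first compares the strengths, weaknesses, and feature extraction accuracy of the evaluation-function-based feature extraction methods used in mail filtering systems: document frequency (DF), mutual information (MI), information gain (IG), expected cross entropy (CE), TFIDF, and the proposed TFCE. The second compares the t-norm-based information fusion decision criterion with single-criterion decision methods based on term frequency or document frequency. The thesis closes with suggestions for further refining and improving the mail filtering system built on TFCE and the information fusion criterion, with the aim of reaching the best decision and effectively reducing the probability of missed and false detections, and it offers a new avenue of exploration for e-mail filtering technology.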
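The abstract names the TFCE idea (term frequency weighted by expected cross entropy instead of IDF) but does not give its formula. The following is a minimal sketch under stated assumptions: it uses the common textbook form CE(t) = P(t) · Σ_c P(c|t) · log(P(c|t) / P(c)), estimates probabilities from document counts, and multiplies by a length-normalized term frequency. The function names, the tokenized-list input format, and the normalization are illustrative choices, not the thesis's own definitions.

```python
import math
from collections import Counter


def expected_cross_entropy(term, docs, labels):
    """Estimate CE(t) = P(t) * sum_c P(c|t) * log(P(c|t) / P(c)).

    docs is a list of token lists and labels the matching class labels.
    Probabilities are estimated from document counts (an assumption).
    """
    n_docs = len(docs)
    labels_with_term = [lab for doc, lab in zip(docs, labels) if term in doc]
    if not labels_with_term:
        return 0.0

    p_term = len(labels_with_term) / n_docs
    class_counts = Counter(labels)
    class_counts_with_term = Counter(labels_with_term)

    total = 0.0
    for cls, count in class_counts.items():
        p_cls = count / n_docs
        p_cls_given_term = class_counts_with_term[cls] / len(labels_with_term)
        if p_cls_given_term > 0:  # skip zero terms, since 0 * log(0) is taken as 0
            total += p_cls_given_term * math.log(p_cls_given_term / p_cls)
    return p_term * total


def tfce(term, doc, docs, labels):
    """Hypothetical TFCE weight: normalized term frequency times CE(t)."""
    if not doc:
        return 0.0
    term_frequency = doc.count(term) / len(doc)
    return term_frequency * expected_cross_entropy(term, docs, labels)


if __name__ == "__main__":
    corpus = [["free", "money", "free"], ["free", "offer"],
              ["meeting", "tomorrow"], ["project", "meeting"]]
    classes = ["spam", "spam", "ham", "ham"]
    for word in ("free", "meeting", "offer"):
        print(word, round(tfce(word, corpus[0], corpus, classes), 4))
```

Under this form, a term that is spread evenly across classes gets a CE of zero, while a term concentrated in one class gets a positive weight, which is the behavior a class-aware replacement for IDF would be expected to have.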

【Abstract】 E-mail is one of the most common network applications and has become one of the most important means of communication. Content-based spam filtering is an important problem in Internet security. Applying machine learning approaches such as text categorization to spam detection is an efficient way to handle large volumes of spam. Considering the characteristics of e-mail, this thesis analyzes the shortcomings of traditional spam filtering techniques on the basis of extensive statistical analysis. We compare the advantages, disadvantages, and scope of application of various feature selection methods, and propose Term Frequency Cross Entropy (TFCE), which replaces the IDF function of the Term Frequency Inverse Document Frequency (TFIDF) algorithm with expected cross entropy (CE). We also propose a new decision criterion based on triangular norm (t-norm) fusion, which further improves accuracy and reduces the probability of false and missed detections. The thesis consists of the following parts. First, it summarizes the state of spam filtering, including the definition of spam, the harm it causes, and current filtering techniques, and reviews common approaches to feature pruning, anti-spam filters, and mail corpora, with emphasis on feature selection methods, filtering algorithms, and the theory behind TFCE. Second, it describes the framework and implementation of the new algorithms, including the architecture, function model, organization model, and spam filtering flowchart; based on a study of information fusion technology, it gives a detailed analysis of the spam fusion decision criterion. Third, it presents simulation results to verify performance: one experiment compares feature selection methods, including TFCE, and the other compares the triangular-norm-based information fusion criterion with single decision criteria. The simulation results suggest that the average accuracy of TFCE is higher than that of the traditional feature selection methods tested, and that the information fusion criterion based on triangular norms performs better than the single decision criteria. Finally, the thesis offers suggestions for further improving the spam filtering system based on TFCE feature selection and triangular norm fusion so as to reduce false and missed detections, and it provides a new avenue for the development of e-mail filtering technology.
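The fusion criterion is described only at a high level here, and the specific operator used in the thesis is not stated. As an illustration of the general idea, the sketch below fuses two single-criterion spam scores in [0, 1] (for example, one based on term frequency and one based on document frequency) with standard triangular-norm operators from fuzzy logic: the product t-norm and the probabilistic-sum t-conorm. The function names, the 0.5 threshold, and the choice of operators are assumptions made for this example.

```python
def t_norm_product(a, b):
    """Product t-norm: a conjunctive ("both agree") combination."""
    return a * b


def t_conorm_prob_sum(a, b):
    """Probabilistic-sum t-conorm: a disjunctive ("either suffices") combination."""
    return a + b - a * b


def fuse_and_decide(score_tf, score_df, threshold=0.5, operator=t_conorm_prob_sum):
    """Fuse two single-criterion spam scores and compare with a threshold.

    Both scores must lie in [0, 1]. Returns (fused_score, is_spam).
    The threshold and default operator are illustrative assumptions.
    """
    for score in (score_tf, score_df):
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must lie in [0, 1]")
    fused = operator(score_tf, score_df)
    return fused, fused >= threshold


if __name__ == "__main__":
    # Two criteria that disagree about the same message.
    print(fuse_and_decide(0.8, 0.3))
    print(fuse_and_decide(0.8, 0.3, operator=t_norm_product))
```

The t-conorm flags a message when either criterion is confident, which lowers missed detections at the cost of more false detections; the t-norm does the opposite. Which trade-off the thesis adopts cannot be determined from the abstract.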

  • 【Classification code】TP393.098
  • 【Downloads】60