基于模糊支持向量机的垃圾邮件过滤技术研究

The Research of Spam Filtering Technology Based on Fuzzy Support Vector Machines

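【Author】 Zhao Haitao (赵海涛)

【Advisor】 Wei Yan (魏延)

【Author information】 Chongqing Normal University (重庆师范大学), Computer Software and Theory, 2010, Master's thesis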

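【Abstract (translated from the Chinese)】 With the rapid development of the Internet, e-mail has become a widely used means of modern communication, but while people enjoy its many conveniences they are also harassed by large volumes of spam. The support vector machine (SVM) is a machine learning method grounded in statistical learning theory; because it is based on structural risk minimization, it works with small samples, generalizes well, and reaches a global optimum. SVM has been applied successfully in many fields and has become a research focus in spam filtering. This thesis studies how e-mail works, the mail format, and mail preprocessing techniques, and derives the vector representation of a message that the filter takes as input. It also studies support vector machines and fuzzy support vector machines (FSVM) in depth, introduces FSVM into spam filtering, designs a new fuzzy membership function, and, in view of the serious consequences of misclassifying legitimate mail, introduces different penalty parameters C. Finally, it proposes an FSVM spam filtering method based on misclassification loss and carries out simulation experiments. The main research contents are as follows:
1) The working principle of e-mail, the related mail protocols, and e-mail preprocessing techniques are studied, with emphasis on feature extraction and the vector representation of a message: Chinese word segmentation of the mail text combines forward maximum matching with reverse maximum matching, features are selected by document frequency, and the vector space model is built with the TF_IDF function.
2) SVM-based mail filtering is compared with other mail filtering techniques. SVM-based spam filtering works with small samples, generalizes well, and reaches a global optimum, but it has two clear defects. First, mail classification is in fact a problem of processing uncertain information, yet SVM treats it as a deterministic one. Second, SVM-based methods give legitimate mail and spam the same chance of being misclassified, ignoring the fact that misclassifying legitimate mail is more serious than misclassifying spam.
3) FSVM is introduced into spam filtering, with emphasis on its membership function and penalty factor. A new fuzzy membership function based on the class center is designed, and an FSVM spam filtering method based on misclassification loss is proposed.
4) More suitable evaluation measures for mail filtering are studied and designed (LP, LR, WR, and others). The recall of legitimate mail (LR) and other composite indicators are used as the main evaluation measures, and simulation experiments compare the filtering performance of the proposed FSVM method with that of SVM.
The simulation results show that the FSVM spam filtering method that accounts for misclassification loss maintains a high spam interception rate while also maintaining a high recall of legitimate mail. It effectively mitigates the serious consequences of misclassifying legitimate mail, which demonstrates the feasibility and effectiveness of the proposed method.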
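The preprocessing chain in item 1) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the dictionary `vocab`, the window `max_len`, the threshold `min_df`, the tie-breaking rule used to combine the forward and reverse segmentations (fewer tokens, then fewer single-character tokens, else the reverse result), and the particular TF-IDF variant (relative term frequency times log(N/df)) are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def forward_max_match(text, vocab, max_len=4):
    """Scan left to right, always taking the longest dictionary word."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:  # single characters always pass
                tokens.append(piece)
                i += size
                break
    return tokens


def reverse_max_match(text, vocab, max_len=4):
    """Scan right to left, always taking the longest dictionary word."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


def bidirectional_match(text, vocab, max_len=4):
    """Combine both scans (assumed heuristic, not taken from the thesis)."""
    fwd = forward_max_match(text, vocab, max_len)
    rev = reverse_max_match(text, vocab, max_len)
    if len(fwd) != len(rev):
        return fwd if len(fwd) < len(rev) else rev

    def singles(tokens):
        return sum(1 for t in tokens if len(t) == 1)

    return fwd if singles(fwd) < singles(rev) else rev


def select_by_document_frequency(docs, min_df=2):
    """Keep terms that occur in at least min_df documents."""
    df = Counter(term for doc in docs for term in set(doc))
    features = sorted(t for t, n in df.items() if n >= min_df)
    return features, df


def tfidf_vectors(docs, features, df):
    """One vector per document: relative term frequency * log(N / df)."""
    n_docs = len(docs)
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1
        vectors.append([(counts[t] / total) * math.log(n_docs / df[t])
                        for t in features])
    return vectors


# Hypothetical toy dictionary; the two scans disagree on this sentence.
vocab = {"研究", "研究生", "生命", "起源"}
print(bidirectional_match("研究生命的起源", vocab))
```

On the toy sentence the forward scan greedily takes "研究生" and strands "命" as a single character, while the reverse scan yields "研究 / 生命 / 的 / 起源"; the assumed tie-break prefers the latter because it has fewer single-character tokens.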

【Abstract】 With the rapid development of the Internet, e-mail has become a widely used means of modern communication. While people enjoy the convenience it brings, they are also plagued by large amounts of spam. The Support Vector Machine (SVM) is a machine learning method based on statistical learning theory; because it rests on structural risk minimization, it copes with small samples, generalizes well, and finds a global optimum. SVM has been applied successfully in many fields and has become a hot research topic in spam filtering. This thesis studies the working principles of e-mail, mail formats, and e-mail preprocessing, and obtains a vector representation of each message as input to the filter. It then focuses on support vector machines and fuzzy support vector machines (FSVM), introduces FSVM into spam filtering, designs a new fuzzy membership function, and, in view of the serious consequences of misclassifying legitimate e-mail, uses different penalty parameters C. Finally, it proposes an FSVM spam filtering method based on misclassification loss and evaluates it in simulation experiments. The main research contents and innovations of this thesis are as follows:
1) The working principle of e-mail, the related e-mail protocols, and e-mail preprocessing are studied, with a focus on feature extraction and the vector representation of a message: Chinese word segmentation of the message text combines forward maximum matching with reverse maximum matching, features are selected by document frequency, and a vector space model is built with the TF_IDF function.
2) SVM-based e-mail filtering is compared with other mail filtering techniques. SVM-based spam filtering copes with small samples, generalizes well, and finds a global optimum, but it has two obvious problems. First, mail filtering is in fact a problem of processing uncertain information, whereas the SVM-based method treats it as a deterministic one. Second, the SVM-based approach gives legitimate mail and spam the same chance of being misclassified, ignoring the fact that misclassifying a legitimate message is more serious than misclassifying spam.
3) FSVM is introduced into spam filtering, with a focus on its fuzzy membership function and penalty parameters. A new fuzzy membership function based on the class center is designed, and an FSVM spam filtering method based on misclassification loss is proposed.
4) More appropriate evaluation measures for spam filtering are studied and designed, e.g. LP, LR, and WR, using mainly the recall of legitimate mail (LR) together with other composite indicators. Simulation experiments compare the filtering performance of the proposed FSVM method with that of SVM.
The simulation results show that the FSVM method that takes misclassification loss into account maintains a high spam interception rate while also maintaining a high recall of legitimate e-mail. It thereby addresses the problem that misclassifying a legitimate e-mail is more serious than misclassifying spam, which demonstrates the feasibility and effectiveness of the proposed method.
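The idea in item 3), a class-center membership combined with penalties that reflect misclassification loss, can be illustrated with a short Python sketch. The abstract does not give the thesis's new membership function, so the code below uses the common class-center form s_i = 1 - d_i / (r + delta), where d_i is the distance of sample i from its class mean and r is the class radius. The label convention (+1 legitimate, -1 spam), the values `c_legit` and `c_spam`, and the product s_i * C_class as the per-sample penalty are likewise assumptions for illustration only.

```python
import math


def class_center(vectors):
    """Component-wise mean of the vectors in one class."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def center_memberships(vectors, delta=1e-3):
    """s_i = 1 - d_i / (r + delta): samples far from the class center,
    which are more likely noise or outliers, receive a lower membership."""
    center = class_center(vectors)
    dists = [euclidean(v, center) for v in vectors]
    radius = max(dists)
    return [1.0 - d / (radius + delta) for d in dists]


def per_sample_penalties(X, y, c_legit=10.0, c_spam=1.0, delta=1e-3):
    """Effective penalty s_i * C_class for every training sample.

    Assumed convention: y == +1 is legitimate mail, y == -1 is spam.
    A larger c_legit makes slack on legitimate mail more expensive,
    so the classifier is pushed away from misclassifying it.
    """
    penalties = [0.0] * len(X)
    for label, c_class in ((+1, c_legit), (-1, c_spam)):
        members = [k for k, target in enumerate(y) if target == label]
        if not members:
            continue
        scores = center_memberships([X[k] for k in members], delta)
        for k, s in zip(members, scores):
            penalties[k] = s * c_class
    return penalties


# Hypothetical two-dimensional feature vectors.
X = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0], [10.0, 10.0], [10.0, 12.0]]
y = [+1, +1, +1, -1, -1]
print(per_sample_penalties(X, y))
```

Values computed this way could be handed to any SVM solver that accepts per-sample weights; in scikit-learn, for example, the `sample_weight` argument of `SVC.fit` rescales C per sample.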
