Research on Internet Information Filtering Technology Based on RS Theory and SVM

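【Author】 Liu Yang (刘杨)

【Advisor】 Yi Zhian (衣治安)

【Author information】 Daqing Petroleum Institute (大庆石油学院), Computer Application Technology, 2008, Master's thesis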

【Abstract】 With the rapid development of the Internet, people have gained access to abundant information. At the same time, harmful information has proliferated; reactionary, pornographic, and violent content in particular seriously endangers social stability and people's physical and mental health, and network "junk" has invaded our lives. How to filter out information irrelevant to one's needs, and how to obtain the required information quickly and accurately without being disturbed by illegal information, has become a research hotspot in Internet development. This thesis proposes a new approach to network information filtering that combines rough set (RS) theory with a binary-tree multi-class SVM algorithm. Improved heuristic relative attribute reduction and value reduction eliminate redundant attributes and values. For the transformed data table, a statistical rough set algorithm with a relaxation factor generates the decision rules, which makes the mined rules more concise and more reliable, effectively avoids rules that arise by chance, and thereby lowers the misclassification rate. The SVM is then trained with the binary-tree multi-class SVM algorithm, which converts the multi-class problem into binary classifications. The algorithm clusters first and classifies afterwards: it computes the maximum similarity between a test sample and the subclass centers, and the degree of separation between subclasses, in order to construct the optimal separating hyperplane at each decision node. A C-class problem needs only C − 1 decision functions, which saves training time. Experiments show that combining RS theory with the binary-tree multi-class SVM reduces the complexity of the trained model, which to some extent reduces overfitting, improves the generalization ability and training speed of the SVM, and achieves good filtering results. The thesis implements an intelligent mail filtering system that resides on the mail client, learns from existing mail, and automatically classifies and filters newly arrived mail. The system is based on the POP3 and SMTP protocols and acts as a filtering layer between the user's mail server and the mail-receiving software. Filtering is done in two levels. In the first level, after a message is retrieved, it is first filtered by its header content, then decomposed, analyzed, and subjected to feature extraction to form a feature vector. The core of the second level is a multi-class filter based on the binary-tree SVM, with a radial basis function (RBF) as the kernel. Finally, the system is tested on a large number of e-mails, mail filtering evaluation measures are computed, and the results are compared with the Naive Bayes, KNN, and Boosting Trees methods. The experimental results show that the system can monitor mail in real time and update its filtering module automatically, making mail filtering more efficient and more accurate. In e-mail filtering, because the URL addresses contained in spam are obtained through authorization, the thesis also adopts a URL-based spam filtering method that captures the URL information contained in spam; this method filters spam containing URLs quickly and effectively, which other filtering methods find difficult to achieve.
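The claim that a C-class problem needs only C − 1 decision functions can be illustrated with a small sketch of the binary-tree structure. This is not the thesis's implementation: the class labels and centers are hypothetical, the split rule (seed two groups with the two farthest class centers) is an assumed stand-in for the thesis's similarity and separation measures, and a nearest-group-centroid test stands in for the binary SVM that would be trained at each decision node.

```python
import math


def centroid(points):
    """Mean vector of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def build_tree(class_centers):
    """Recursively split a {label: center} dict into a binary tree.

    A leaf is a label. An internal node is a dict with two subtrees and the
    centroid of each side; it marks where one binary classifier (an SVM in
    the thesis) would be trained.
    """
    labels = sorted(class_centers)
    if len(labels) == 1:
        return labels[0]
    # Assumption: seed the two groups with the farthest-apart pair of classes.
    a, b = max(
        ((x, y) for i, x in enumerate(labels) for y in labels[i + 1:]),
        key=lambda pair: math.dist(class_centers[pair[0]], class_centers[pair[1]]),
    )
    left, right = {a: class_centers[a]}, {b: class_centers[b]}
    for lab in labels:
        if lab in (a, b):
            continue
        c = class_centers[lab]
        near_a = math.dist(c, class_centers[a]) <= math.dist(c, class_centers[b])
        (left if near_a else right)[lab] = c
    return {
        "left": build_tree(left),
        "right": build_tree(right),
        "left_center": centroid(list(left.values())),
        "right_center": centroid(list(right.values())),
    }


def count_decision_nodes(tree):
    """Number of internal nodes, i.e. binary decision functions needed."""
    if not isinstance(tree, dict):
        return 0
    return 1 + count_decision_nodes(tree["left"]) + count_decision_nodes(tree["right"])


def classify(tree, x):
    """Walk from the root to a leaf; the centroid test stands in for an SVM."""
    while isinstance(tree, dict):
        go_left = math.dist(x, tree["left_center"]) <= math.dist(x, tree["right_center"])
        tree = tree["left"] if go_left else tree["right"]
    return tree


# Hypothetical mail categories with made-up 2-D feature centers.
centers = {
    "ham": (0.0, 0.0),
    "newsletter": (0.0, 1.0),
    "ads": (5.0, 5.0),
    "phishing": (5.0, 6.0),
}
tree = build_tree(centers)
print(count_decision_nodes(tree))  # C - 1 decision nodes for C classes
print(classify(tree, (4.8, 5.1)))
```

Because every internal node has exactly two non-empty children, a tree with C leaves always has C − 1 internal nodes, and a test sample passes through at most C − 1 of them on its way to a leaf.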
