Design and Implementation of a Web Text Filtering System

Original title: 一个WEB文本过滤系统设计与实现

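【Author】 Shen Fengxian (沈凤仙)

【Advisor】 Zhu Qiaoming (朱巧明)

【Author information】 Soochow University (苏州大学), Computer Application Technology, 2009, Master's thesis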

【Abstract】 With the rapid growth of the Internet, the volume of online information has increased explosively. Text information filtering has advanced considerably, and the filtering of Web text has become an active research topic. Building on earlier work on the IPCG control gateway, and aiming to strengthen that billing gateway's ability to supervise public information network services, this thesis studies real-time content filtering under Linux and related text filtering techniques, and designs and implements a Web text filtering system based on the IPCG control gateway.

The thesis first presents the overall framework and design goals of the system and proposes a distributed implementation. The system is managed by a central early-warning module and combines online filtering with offline filtering. Synchronization of the distributed databases borrows the database synchronization algorithm of the OSPF routing protocol, so that filtering information is shared across the whole network.

The real-time online filtering module consists of two sub-processes: packet preprocessing, and filtering based on IP addresses and keywords. Packet preprocessing performs data analysis and structural analysis of Web pages to extract the correct page data. The IP-based and keyword-based filtering uses a hash tree to organize the IP blacklist and a cache-splicing strategy to store the content to be filtered; keyword filtering is combined with statistical information to reach an overall decision on each page.

The offline filtering module further analyzes pages in the positive class and the uncertain class, and then updates the IP blacklist and the keyword list used by the online filtering module. Offline filtering uses an improved feature-word extraction algorithm and an improved filtering strategy. The extraction algorithm takes into account the length of feature words, the structural features of Web pages, and the sentiment orientation of words. The filtering strategy uses an SVM in the early stage of filtering and an improved adaptive template-based method in the middle and later stages. Templates (profiles) are updated with an improved coefficient adjustment strategy, and a feature attenuation factor is introduced to improve filtering accuracy.

Experimental results show that the proposed method keeps content-filtering analysis independent of packet forwarding while improving both the speed and the accuracy of online filtering.
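The abstract mentions a hash tree for organizing the IP blacklist but gives no details of its layout. As a rough illustration only, one plausible reading is a four-level tree in which each level is a hash map keyed by one octet of an IPv4 address. The class name `IPBlacklist` and its methods are hypothetical and are not taken from the thesis; this is a minimal sketch under that assumption, not the author's implementation.

```python
from __future__ import annotations


class IPBlacklist:
    """Hypothetical IPv4 blacklist stored as a tree of hash maps.

    Assumption: one tree level per address octet. The thesis abstract
    does not specify the actual structure of its hash tree.
    """

    def __init__(self) -> None:
        self._root: dict = {}

    @staticmethod
    def _octets(ip: str) -> list[int]:
        # Split a dotted-quad string into four integers in 0..255.
        parts = ip.split(".")
        if len(parts) != 4:
            raise ValueError(f"not an IPv4 address: {ip!r}")
        octets = [int(p) for p in parts]
        if any(o < 0 or o > 255 for o in octets):
            raise ValueError(f"octet out of range in: {ip!r}")
        return octets

    def add(self, ip: str) -> None:
        # Walk down the tree, creating one hash-map node per octet.
        node = self._root
        for octet in self._octets(ip):
            node = node.setdefault(octet, {})

    def __contains__(self, ip: str) -> bool:
        # Four hash lookups; stop at the first missing octet.
        node = self._root
        for octet in self._octets(ip):
            node = node.get(octet)
            if node is None:
                return False
        return True
```

With this layout a lookup costs at most four hash-map probes regardless of how many addresses are stored, which is the kind of property an online filter on the packet path would want. Addresses that share a prefix also share the upper tree nodes.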

  • 【Online publication contributor】 Soochow University
  • 【Online publication issue】 2009, Issue 10