Study of the Information Content Security Filter Method in WEB (互联网信息内容安全过滤方法研究)

【Author】 Li Dongyan (李东艳)

【Advisor】 Zhang Yongkui (张永奎)

【Author Information】 Shanxi University, Computer Software and Theory, 2004, Master's thesis

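【摘要 (Abstract)】 Internet Information Content Security Filtering (ICSF) means identifying illegitimate texts that contain harmful content among the massive volume of Web text so that they can be blocked. It has become a new research area within information filtering.

This thesis studies several key techniques in content security filtering, including text representation, algorithms for identifying illegitimate text, and dynamic learning from new texts. It also designs an experimental ICSF system that implements training on illegitimate texts, extraction and updating of rules, and classification of new documents.

The work and contributions of this thesis are mainly the following:

1. It systematically analyzes the characteristics of illegitimate text, summarizes the features of its content and word usage, and gives a formal representation of them.

2. It implements information content filtering with a rule-based algorithm. Using learning from examples over a large set of training instances, an improved version of the OCAT mining algorithm for extracting logical rules is applied to extract text classification rules. This produces separate recognition rules for the positive and negative example sets and performs binary classification of texts. In addition, by analyzing the word-form features specific to illegitimate text, discriminant rules are given for computing the credibility that a text contains such features. Finally, the rules extracted from the training set are combined with the special-word rules to judge new documents.

3. Different update algorithms are applied to the different kinds of rules so that newly appearing illegitimate documents are recognized automatically. The logical rules are modified according to feedback from misclassified documents, which steadily increases their ability to recognize new illegitimate documents and achieves incremental rule learning. An automatic special-word recognition algorithm is also proposed: it identifies special words that appear in new illegitimate texts and uses them to extend the special-word table on which the special-word rules are based, thereby updating those rules.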
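The rule-extraction step in point 2 can be illustrated with a minimal sketch of the basic OCAT ("One Clause At a Time") idea: greedily build a conjunction of clauses, each of which accepts every positive example while rejecting as many of the remaining negative examples as possible. This is not the improved variant used in the thesis, whose details the abstract does not give. The set-of-terms document representation, the scoring heuristic, the choice of flagged documents as the positive class, and all names below are assumptions made for illustration only.

```python
from typing import FrozenSet, List, Set, Tuple

Doc = Set[str]                # assumption: a document is reduced to its set of terms
Literal = Tuple[str, bool]    # (term, True) = term present, (term, False) = term absent
Clause = FrozenSet[Literal]   # a clause is a disjunction (OR) of literals


def satisfies(doc: Doc, lit: Literal) -> bool:
    term, present = lit
    return (term in doc) == present


def clause_accepts(doc: Doc, clause: Clause) -> bool:
    return any(satisfies(doc, lit) for lit in clause)


def cnf_accepts(doc: Doc, clauses: List[Clause]) -> bool:
    # The learned rule is a conjunction (AND) of clauses.
    return all(clause_accepts(doc, c) for c in clauses)


def learn_clause(pos: List[Doc], neg: List[Doc], vocab: Set[str]) -> Clause:
    """Greedily pick literals until every positive example is accepted."""
    candidates = [(t, p) for t in sorted(vocab) for p in (True, False)]
    uncovered = list(pos)
    clause: Set[Literal] = set()
    while uncovered:
        def score(lit: Literal) -> float:
            pos_cov = sum(satisfies(d, lit) for d in uncovered)
            neg_acc = sum(satisfies(d, lit) for d in neg)
            return pos_cov / (neg_acc + 1)   # hypothetical heuristic
        best = max(candidates, key=score)
        clause.add(best)
        uncovered = [d for d in uncovered if not satisfies(d, best)]
    return frozenset(clause)


def ocat(pos: List[Doc], neg: List[Doc]) -> List[Clause]:
    """Add one clause at a time until all negative examples are rejected."""
    vocab: Set[str] = set().union(*pos, *neg)
    remaining = list(neg)
    clauses: List[Clause] = []
    while remaining:
        clause = learn_clause(pos, remaining, vocab)
        still_accepted = [d for d in remaining if clause_accepts(d, clause)]
        if len(still_accepted) == len(remaining):
            raise ValueError("greedy search found no clause that rejects a remaining negative")
        clauses.append(clause)
        remaining = still_accepted
    return clauses


# Toy data, invented for this example.
flagged = [{"casino", "bonus"}, {"casino", "jackpot"}, {"bonus", "win"}]
benign = [{"weather", "report"}, {"win", "match"}, {"news"}]
rules = ocat(flagged, benign)
```

On this toy data the greedy search needs a single clause ("bonus" present OR "casino" present), which accepts all three flagged documents and rejects all three benign ones. Swapping the two example sets would yield rules for the other class, in the spirit of the separate positive and negative rule sets the abstract mentions.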

【Abstract】 Internet information content security filtering refers to identifying illegitimate texts that contain harmful content and screening them out. As the amount of illegitimate text on the Web increases, content security filtering has become a new research area within information filtering.

This thesis studies several key problems in content security filtering, such as the representation of training texts, the identification of illegitimate text, and automatic learning from new texts. We also design an ICSF experimental system that implements all of the functions mentioned above.

The main work and contributions of this thesis are:

1. The characteristics of illegitimate text are analyzed thoroughly. We summarize the content and vocabulary features of illegitimate texts and put forward a formal representation of them.

2. We realize content security filtering with a rule-based approach. Based on a large number of training examples, we adopt a learning-from-examples approach that produces rules with an extended OCAT algorithm in order to classify texts. At the same time, we put forward rules for special words to calculate the credibility of a text. Finally, we combine the trained rules and the special-word rules to identify new documents.

3. Two automatic learning algorithms are used, one for each kind of rule, to improve the produced rules. First, we modify the logical rules according to feedback information, which improves their ability to identify new illegitimate content and implements incremental learning. We also present an algorithm that automatically picks up new special words in new illegitimate documents. The system can thus keep up with newly emerging illegitimate information.
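As a rough illustration of how special-word rules, a credibility score, and feedback-driven updates could fit together, the sketch below makes several assumptions that the abstract does not confirm: that "special words" are obfuscated spellings with separators inserted between letters, that credibility is a noisy-OR over per-word weights, and that the final decision is a simple OR of the logical-rule result and a credibility threshold. Every function name, weight, and threshold is hypothetical.

```python
import re
from typing import Dict, Iterable, Set

# Assumption: obfuscation inserts whitespace or symbols between letters,
# e.g. "c-a-s-i-n-o" or "j.a.c.k.p.o.t".
_SEPARATORS = re.compile(r"[\s*\-._]+")


def credibility(text: str, special_words: Dict[str, float]) -> float:
    """Noisy-OR of the weights of special words found in the text.

    Separators are stripped from the whole text first, so matches can also
    span word boundaries; a real system would need to guard against that.
    """
    squeezed = _SEPARATORS.sub("", text.lower())
    miss = 1.0
    for word, weight in special_words.items():
        if word in squeezed:
            miss *= 1.0 - weight
    return 1.0 - miss


def is_illegitimate(text: str, rule_hit: bool,
                    special_words: Dict[str, float],
                    threshold: float = 0.6) -> bool:
    """Combine the logical-rule result with the special-word credibility.

    `rule_hit` stands in for the output of the trained classification rules.
    """
    return rule_hit or credibility(text, special_words) >= threshold


def learn_special_words(flagged_texts: Iterable[str],
                        special_words: Dict[str, float],
                        benign_vocab: Set[str],
                        initial_weight: float = 0.5) -> Dict[str, float]:
    """Extend the special-word table from newly flagged documents.

    Hypothetical criterion: a token is a new special word if removing
    separators changes it and the result is not a known benign word.
    """
    updated = dict(special_words)
    for text in flagged_texts:
        for token in text.lower().split():
            word = _SEPARATORS.sub("", token)
            if word and word != token and word not in benign_vocab:
                updated.setdefault(word, initial_weight)
    return updated
```

For example, with the table `{"casino": 0.7}`, the text "Play c-a-s-i-n-o now" is flagged at the default threshold even when no logical rule fires, and feeding a misclassified document containing "j.a.c.k.p.o.t" to `learn_special_words` adds "jackpot" to the table so that later documents using it raise the credibility score.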

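  • 【Online Publication Contributor】 Shanxi University
  • 【Online Publication Year/Issue】 2004, Issue 03
  • 【Classification Number】 TP393.092
  • 【Citation Count】 10
  • 【Download Count】 441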