基于信息搜集与内容分析的互联网不良信息监测技术研究
Research on Technologies for Detecting Bad Information on the Internet Based on Information Gathering and Content Analysis
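【Author】 Huang Xu (黄旭);
【Advisor】 Zhu Yanqin (朱艳琴);
【Author information】 Soochow University (苏州大学), Computer Application Technology, 2008, Master's thesis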
【Abstract】 The Internet's unprecedented capacity for spreading information brings great convenience to daily life, but it has also made the Internet a carrier of harmful content such as reactionary, pornographic, and violent material. The spread of such content, especially sensitive information bearing on national security, has become a serious social problem. Identifying this content quickly and effectively within massive volumes of data, so as to stop its illegal dissemination and keep online content secure, is now an important research topic in content security. Most existing work concentrates on information filtering and automatic blocking at the gateway or on the client, whereas national security agencies still check suspicious sites largely by hand, which is inefficient. To address this, the thesis takes information gathering and content analysis as its basic approach and studies the automatic discovery and handling of harmful information. It draws on the principles and techniques of Internet architecture, natural language processing, artificial intelligence, and machine learning, and the concrete work covers web page collection, analysis of the formal features of keywords, text feature extraction, and text classification. First, starting from the structure of the Web, the thesis studies a content-based method for computing link weights and proposes a crawler search strategy based on content evaluation. Second, it analyzes the formal characteristics of harmful information in light of its inherent features and, because such information is concealed and changes frequently, studies a feature extraction method based on repeated strings. Third, building on Bayesian theory, it proposes a design for a real-time text classifier together with a document feature feedback mechanism to improve classification performance. Finally, taking the real network environment into account, it proposes an implementation framework for an Internet harmful-information monitoring platform. Given the rapid growth of Internet applications, this work can help the relevant agencies work more efficiently, clean up the online environment, and contribute to building a harmonious society, and it represents a useful exploration of content security in a networked setting. The results also promote the combined use of networking, natural language processing, and artificial intelligence techniques in information security.
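To make the idea of a Bayesian text classifier with document feature feedback more concrete, here is a minimal Python sketch. The abstract gives no implementation details, so everything below is an assumption for illustration only: the class name `FeedbackNaiveBayes`, whitespace tokenization, add-one smoothing, the toy labels, and the interpretation of "feedback" as folding a reviewer-confirmed document back into the term counts. It is not the thesis's actual design.

```python
import math
from collections import Counter, defaultdict


class FeedbackNaiveBayes:
    """Multinomial naive Bayes with add-one smoothing and incremental updates.

    Hypothetical illustration; not the classifier described in the thesis.
    """

    def __init__(self):
        self.doc_counts = Counter()              # documents seen per label
        self.term_counts = defaultdict(Counter)  # term frequencies per label
        self.vocabulary = set()                  # all terms seen so far

    def learn(self, tokens, label):
        """Fold one labelled document into the counts (training or feedback)."""
        self.doc_counts[label] += 1
        self.term_counts[label].update(tokens)
        self.vocabulary.update(tokens)

    def log_scores(self, tokens):
        """Return log P(label) + sum of log P(token | label) for each label."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + vocab_size
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                # Add-one smoothing keeps unseen terms from zeroing the score.
                score += math.log((counts[tok] + 1) / denom)
            scores[label] = score
        return scores

    def classify(self, tokens):
        """Return the label with the highest log score."""
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


clf = FeedbackNaiveBayes()
clf.learn("cheap pills buy now".split(), "bad")
clf.learn("buy illegal pills online".split(), "bad")
clf.learn("weather report for today".split(), "normal")
clf.learn("local news and weather".split(), "normal")

first_guess = clf.classify("fresh offer today".split())

# Feedback step (assumed): a reviewer confirms the page was "bad",
# so its terms are added to that class before the next page arrives.
clf.learn("fresh offer today".split(), "bad")
second_guess = clf.classify("fresh offer".split())
```

In this toy run the first page is labelled `normal` because its only known term appears in the normal class; after the confirmed document is fed back, a page that reuses the new wording is labelled `bad`. Because updates only increment counters, the model can absorb feedback without retraining, which is one plausible way to keep a classifier usable in real time.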
【Key words】 Information Security; Content Security; Search Strategy; Repeats; Bayesian Theory; Feedback;
- 【Online publication contributor】 Soochow University (苏州大学) 【Online publication issue】 2008, No. 11
- 【Classification code】 TP393.06
- 【Times cited】 4
- 【Downloads】 385