基于主动搜索的论坛内容监管技术研究

Research on BBS Content Supervision Technology Based on Active Search

【Author】 耿乐群

【Supervisor】 王巍

【Author Information】 Harbin Engineering University, Computer Application Technology, 2011, Master's thesis

【Abstract】 With the growing popularity of the Internet, it has become an indispensable medium for spreading information. At the same time, harmful online content such as subversive and pornographic material spreads along with it, endangering social stability and the well-being of the public. Forums (BBS) are among the most widely used Internet applications; while they bring great convenience to users, they also face the problem of disseminating harmful information. To maintain a healthy online culture and environment, content supervision of forums is necessary.

Forum content supervision can be implemented in two modes, active and passive. The active mode has its own advantages, and this thesis studies and implements solutions to the following two problems it faces.

First, the active mode uses web crawler technology to obtain forum pages as the raw content for supervision. For forums whose content can only be viewed after logging in, however, the crawler usually retrieves nothing but the login page, which is useless for content supervision. Based on a detailed analysis of the user login process and its principles, this thesis designs and implements a scheme that combines Cookies with a web crawler to obtain restricted forum content: an authentication Cookie is acquired in a relatively automated way and then used to fetch restricted pages. Experiments verify the feasibility of the scheme.

Second, while a web crawler is running, duplicate URLs must be removed quickly and efficiently to avoid downloading the same page more than once. Hash-based deduplication is an important research direction. This thesis studies a URL deduplication method based on the K-Picked hash algorithm, analyzes the principle and shortcomings of the original algorithm, and improves it by expanding the range of ordinary characters, increasing the dispersion of the divisor, and randomizing the value of K, which lowers the collision rate of the final compressed encoding. A series of experiments shows that the improved algorithm achieves good results in URL deduplication.
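The record itself contains no source code. Purely as an illustration of the Cookie-plus-crawler idea summarized above, the following Python sketch logs in once, keeps the returned authentication Cookie in a session, and reuses it to fetch a login-restricted page. The URLs, form field names, and account credentials are hypothetical placeholders, not details taken from the thesis.

```python
# Illustrative sketch only: reuse an authentication Cookie to crawl
# login-restricted forum pages. URLs and form fields are hypothetical.
import requests

LOGIN_URL = "http://bbs.example.com/login.php"               # hypothetical login endpoint
RESTRICTED_URL = "http://bbs.example.com/thread.php?id=42"   # hypothetical restricted page

def get_authenticated_session(username: str, password: str) -> requests.Session:
    """Log in once; the Session object stores the returned authentication Cookie."""
    session = requests.Session()
    session.post(LOGIN_URL, data={"username": username, "password": password})
    return session

def fetch_restricted_page(session: requests.Session, url: str) -> str:
    """Fetch a page that is only visible to logged-in users.

    Because the Session carries the authentication Cookie, the server
    returns the real content instead of redirecting to the login page.
    """
    response = session.get(url)
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    s = get_authenticated_session("crawler_account", "crawler_password")
    html = fetch_restricted_page(s, RESTRICTED_URL)
    print(html[:200])  # hand the HTML on to the content-supervision pipeline
```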

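Likewise, the sketch below only shows where hash-based URL deduplication sits in a crawler. It uses an ordinary MD5 digest as a stand-in for the compressed code and does not reproduce the K-Picked algorithm or the thesis's improvements (expanded character range, more dispersed divisor, randomized K).

```python
# Illustrative sketch of hash-based URL deduplication in a crawler frontier.
# This is NOT the thesis's K-Picked algorithm; it only shows how a compressed
# hash code can be used to drop URLs that were already visited.
import hashlib

class URLDeduplicator:
    """Remember compressed codes of seen URLs and reject duplicates."""

    def __init__(self) -> None:
        self._seen: set[bytes] = set()

    def _compress(self, url: str) -> bytes:
        # Stand-in for the thesis's improved K-Picked compression:
        # any hash function with a low collision rate can fill this role.
        return hashlib.md5(url.encode("utf-8")).digest()

    def is_new(self, url: str) -> bool:
        """Return True the first time a URL is seen, False afterwards."""
        code = self._compress(url)
        if code in self._seen:
            return False
        self._seen.add(code)
        return True

if __name__ == "__main__":
    dedup = URLDeduplicator()
    urls = [
        "http://bbs.example.com/thread.php?id=1",
        "http://bbs.example.com/thread.php?id=2",
        "http://bbs.example.com/thread.php?id=1",  # duplicate, should be skipped
    ]
    print([u for u in urls if dedup.is_new(u)])  # keeps only the two distinct URLs
```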