Research on Web Forum Crawling and Hot Topic Detection (网络论坛采集及热点话题发现研究)

Key Technology Research on Web Forums Crawling and Hot Topic Detection

【Author】 Li Hengxun (李恒训)

【Advisors】 Wang Bin (王斌); Liu Jingang (刘金刚)

【Author Information】 Capital Normal University, Computer Application Technology, 2011, Master's degree

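【Abstract (translated from the Chinese)】 In recent years the Internet has grown rapidly and become an indispensable part of daily life. Web forums, being interactive, immediate, and open, have attracted large numbers of users and become an important part of the Internet. Forums are a key channel for publishing and obtaining information and play an indispensable role in life, work, and entertainment. Users communicate through forums by posting a topic for everyone to discuss or by raising a question for everyone to solve. A forum is therefore a platform for sharing language and culture between people; it holds a large amount of valuable information, forms a huge knowledge base, and is an important data source for search engines. In addition, Chinese Internet users are now more outspoken than ever. Hot topics continually form on web forums, and some grow into major social events; this force cannot be ignored and often triggers serious public opinion crises. Forum crawling is therefore an important foundation for information retrieval, data mining, and public opinion monitoring. However, the particular structure of forums makes them very difficult to crawl, and most general-purpose search engines either avoid forum crawling or handle it only superficially.

This thesis studies the key technologies of forum crawling. It examines in depth the problems of complex forum structure, deep link hierarchies, hard-to-identify page-flipping (pagination) links, and crawler traps, and proposes a fairly general method for automatic forum crawling. First, a randomized algorithm combining depth-first and breadth-first traversal samples a certain number of pages from the forum for analysis. Through page structure clustering, dynamic page link clustering, page validity identification, and related steps, the logical structure of the forum is analyzed offline to obtain the optimal crawling path, and forum posts behind deep links are crawled by identifying page-flipping links. Based on the results of the offline analysis plus a small amount of manual adjustment, the thesis designs and implements an efficient and fast forum crawling framework, analyzes and discusses performance issues in large-scale crawling, and applies the framework on a distributed file system for distributed crawling. Experimental results show that, compared with traditional crawling methods, the proposed method substantially improves the effective rate and coverage of forum crawling.

Building on forum crawling, the thesis studies forum-based hot topic detection, proposes a fast clustering algorithm based on topic words, and builds a prototype hot topic detection system. The system detects, effectively and in real time, the hot topics in a forum over a period of time together with the posts belonging to each topic, and it has been applied successfully in practice.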
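The abstract names dynamic page link clustering and page-flipping link identification as steps of the offline analysis but does not give the algorithms. The following is only a minimal sketch of one common way to approach both steps, not the method of the thesis. It assumes that sampled links all come from one forum, that links can be grouped by a coarse URL pattern (digits in the path replaced by `{n}`, query string reduced to its parameter names), and that pagination is carried in a query parameter. The function names and the `page_keys` list are hypothetical.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl


def url_pattern(url):
    """Reduce a URL to a coarse pattern (host is ignored: one forum assumed).

    Digits in the path become '{n}'; the query string is reduced to its
    sorted parameter names, so only the shape of the link remains.
    """
    parts = urlsplit(url)
    path = re.sub(r"\d+", "{n}", parts.path)
    keys = sorted({k for k, _ in parse_qsl(parts.query, keep_blank_values=True)})
    return path + ("?" + "&".join(keys) if keys else "")


def cluster_links(urls):
    """Group sampled URLs that share the same coarse pattern."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[url_pattern(url)].append(url)
    return dict(clusters)


def is_page_flip(a, b, page_keys=("page", "p", "start")):
    """Heuristic: same path, and the only differing query parameters are
    paging parameters (the names in page_keys are an assumption)."""
    pa, pb = urlsplit(a), urlsplit(b)
    if pa.path != pb.path:
        return False
    qa = dict(parse_qsl(pa.query, keep_blank_values=True))
    qb = dict(parse_qsl(pb.query, keep_blank_values=True))
    differing = [k for k in set(qa) | set(qb) if qa.get(k) != qb.get(k)]
    return bool(differing) and all(k in page_keys for k in differing)
```

With such a grouping, thread pages, board (list) pages, and user profile pages would typically land in separate clusters, and a crawler could follow only the clusters judged useful. A real forum would need site-specific tuning of the pattern rules and the paging parameter names.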

【Abstract】 The Internet has boomed in recent years and has become an indispensable part of people's lives. Because forums are interactive, immediate, and open, they have gradually attracted a large number of users and become an important part of the Internet. Forums are an essential channel for publishing and acquiring information and play an indispensable role in daily life, work, and entertainment. Internet users communicate through forums by posting a topic for everyone to explore or by asking a question that others help to solve. A forum is thus a platform for sharing language and culture that contains a wealth of information; it is a huge knowledge base and an important data source for search engines. In addition, Internet users in China are commenting more actively than ever before. Their comments continually give rise to hot topics online, and some become the focus of social events, showing a power that cannot be ignored and often leading to major public opinion crises. Forum crawling is therefore an important basis for information retrieval, data mining, and public opinion monitoring. However, the unique structure of forums makes forum data hard to obtain, and most search engines avoid crawling forums.

This thesis studies the key technologies of forum crawling, addressing complex structures, deep link levels, page-flipping links, crawler traps, and other problems, and proposes a general forum crawling method. First, we use an algorithm combining depth-first and breadth-first traversal to randomly sample a certain number of pages from the forum. Through web page structure identification, web page clustering, dynamic web link clustering, and other methods, we obtain the logical structure of the forum. Then we design and implement a fast and efficient distributed forum crawling framework for large-scale crawling, and analyze and discuss its performance problems. Compared with traditional crawling methods, our method greatly increases the effective rate and coverage of forum crawling.

Building on forum crawling, we developed a prototype hot topic detection system. The system effectively detects the hot topics in a forum over a given time period and finds the posts belonging to each topic. Finally, we successfully applied it to a public opinion monitoring system at ICT, CAS, where it achieved good practical results.
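The abstract says the hot topic detection builds on a fast clustering algorithm based on topic words, without describing it. As an illustration only, the sketch below uses a generic single-pass clustering scheme, which is a swapped-in technique and not necessarily the algorithm of the thesis. It assumes each post has already been reduced to a list of topic words; the Jaccard similarity, the `threshold` value, and all function names are hypothetical choices.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity between two collections of words."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def single_pass_cluster(posts, threshold=0.3):
    """Assign each post to the most similar existing cluster, or start a new one.

    posts: iterable of (post_id, topic_words) pairs, processed in order.
    Returns a list of clusters: {"words": Counter, "posts": [post_id, ...]}.
    """
    clusters = []
    for post_id, words in posts:
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = jaccard(words, cluster["words"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["posts"].append(post_id)
            best["words"].update(words)  # the cluster vocabulary grows
        else:
            clusters.append({"words": Counter(words), "posts": [post_id]})
    return clusters


def hot_topics(clusters, top_k=3):
    """Rank clusters by number of posts, a simple stand-in for 'hotness'."""
    return sorted(clusters, key=lambda c: len(c["posts"]), reverse=True)[:top_k]
```

Each post is compared only against the current clusters, so the cost is one pass over the posts, which is why schemes of this kind suit near-real-time use. Ranking by post count alone is a simplification; a practical system would likely also weigh replies, views, and the time window.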
