
Design and Implementation of an Internet Public Opinion Monitoring System Based on Web Crawlers and the Lucene Index

Research and Development of Internet Public Opinion Monitoring Model Based on Web Crawler and Lucene Index

【Author】 Zhou Xiaoli (周小丽)

【Advisor】 Shi Xiaohu (时小虎)

【Author information】 Jilin University, Software Engineering, 2013, Master's degree

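【Abstract (translated from the Chinese)】 With the continued development of computer technology, governments and enterprises are paying increasing attention to the use of IT to monitor public opinion on virtual networks. Emergency management of sudden network incidents is a core issue in safeguarding public security, and it is closely tied to online public opinion. Over the past decade or more, the spread of informatization and the explosive growth of information content have made discovering and handling emergency information in massive volumes of network data both more important and more difficult. Emergency response is highly time-sensitive and often demands immediate action, which traditional collection and analysis methods can hardly support in real time. It is therefore necessary to build an Internet public opinion monitoring system, one that not only detects events but also detects them early and detects them comprehensively.

According to surveys by authoritative bodies, by 2012 the number of Internet users in China had passed the 500 million mark, domestic Internet penetration had reached 38.3%, and 350 million of those users went online from mobile devices. Participation in Internet activity has grown markedly. The Internet is now called the "fourth medium" after television, radio, and print. With a steady stream of users joining, it has displaced those media as the barometer of public opinion, chiefly through news sites, well-known forums, Tieba post bars, blogs, and similar platforms, which are collectively referred to as the virtual society. Because regulation of the network is lax and even full of loopholes, there is essentially no threshold for participation and the cost of activity is essentially zero, yet its influence is broader and penetrates deeper than that of the offline world, and its social impact cannot be ignored. If it is left to develop without guidance, large amounts of negative public opinion will flood the virtual society, harming long-term social stability and planting hidden dangers. For government bodies, strengthening the supervision of and response to public opinion in the virtual society and actively defusing crises has real practical significance for maintaining social stability and for advancing the country's modernization and economic development.

The Internet is a treasure trove, especially in the era of big data, and timely, comprehensive monitoring of online public opinion with the help of IT has become urgent. This thesis presents the design and implementation of an Internet public opinion monitoring system, together with the advantages of the web crawler and the Lucene index and their application in such a system.

The system designed in this thesis consists of four modules: information collection, information retrieval, data analysis, and data display. The core of the information collection module is a web crawler whose scope covers the whole Internet, including news media, forums, blogs, microblogs, and video sites. The core function of the information retrieval module is fast, accurate retrieval over big data; it uses a Mongo database with Lucene index support, which brings retrieval time down to under 5 seconds. The data analysis module performs semantic analysis of text, and the data display module presents the final data.

A web crawler is also known as a spider, a web robot, or a bot. The name matters little; what matters is that crawlers are what give search engines their wealth of resources. Search engines have raised our ability to retrieve information to an unprecedented level and effectively lowered its cost, and they can be regarded as a successful example of combining modern computer and Internet technology with traditional indexing theory. As the network has spread, its influence has grown and information has multiplied rapidly, so that the network is beyond doubt the largest carrier of information today. Search engines provide an effective way to obtain information from the vast Internet. However, the online world is complex and diverse, while users seek data with a direction and a purpose. General-purpose search engines that serve the whole virtual society, such as Google and Baidu, increasingly show their limitations, and how a search engine can offer users fast, accurate, and in-depth topic-based queries is a hard problem. As the core component of a search engine, the web crawler is naturally the main focus of this research: behind every search engine, however powerful, there is an efficient web crawler serving it. This thesis also introduces a second key technology, the Lucene index, an efficient data retrieval tool that plays an indispensable role in the public opinion monitoring system described here.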

【Abstract】 With the continuous development of computer technology, government and enterprises are paying more attention to the use of IT for monitoring public opinion on virtual networks. Emergency management of network incidents is a core issue of public security, and emergency management is closely related to network public opinion. Over the past ten years or more, with the spread of information technology and the explosion of information content, discovering and handling emergency information in vast amounts of network data has become more and more important and difficult. Emergency response has very high timeliness requirements and often calls for immediate measures, while traditional methods of collection and analysis can hardly meet this real-time need; setting up a virtual-society emergency management command system is therefore necessary. The system should not only find events but also analyze the complex relationships between events and describe and predict their development trends.

According to surveys by authoritative organizations, by 2012 China's Internet population had reached 500 million and the domestic Internet penetration rate had reached 38.3%; among these users, 350 million were mobile Internet users. The number of participants in Internet activity has increased significantly. Today the Internet is called "the fourth medium", after television, radio, and newspapers. With a steady stream of Internet users taking part, the Internet has instead become the barometer of public opinion. This is mainly reflected on news websites, well-known BBS forums, post bars, blogs, and similar platforms, and this type of media is also referred to as the virtual society. Because network regulation is not strict and is even flawed, there is basically no threshold for participation and the cost of taking part is nearly zero, yet its influence is more extensive and penetrates more deeply than that of the real world, and the social impact it causes cannot be ignored. If it is left to develop without direction, a large amount of negative Internet public opinion will fill the virtual community, which will certainly affect social stability and security and bury hidden social troubles. For government agencies, strengthening supervision of public opinion in the virtual society and actively resolving crises has very important practical significance for maintaining social stability and for moving the country's modernization and economic development forward.

The Internet is a treasure, especially in the era of big data, and timely and comprehensive monitoring of virtual network public opinion with the aid of IT has become urgent. This paper mainly introduces the structure of the Internet public opinion monitoring system and how the web crawler and the Lucene index are applied in it.

In this paper, the Internet public opinion monitoring system consists of an information acquisition module, an information retrieval module, a data analysis module, and a data display module. The core of the information acquisition module is the crawler, which can crawl data from news websites, BBS forums, blogs, microblog websites, and video websites. The information retrieval module provides fast and accurate retrieval over big data; here the Lucene index keeps retrieval time within 5 seconds. Finally, we also introduce the data analysis module and the data display module, which are used respectively to analyze the semantics of the text and to present the final data.

A web crawler is also known as a spider, a network robot, a bot, and so on. These names do not matter; the most important thing is that the existence of crawlers is what gives search engines a wealth of resources. Search engines have given us an unprecedented increase in the ability to retrieve information and have effectively reduced its cost; the search engine is, so to speak, a successful model of combining modern computer technology and Internet technology with traditional indexing theory. With the popularization of the network, its growing influence, and the rapid growth of information, the network has without doubt become the largest carrier of information today. Search engines give us an effective way to obtain information from the mass of the Internet. But the network world is complex and diversified, while users always access data with a purpose, so universal search engines oriented to the whole virtual society increasingly show their limitations. How to offer users fast, accurate, and in-depth topic-based queries is a difficult problem in front of us. The web crawler, as a core component of the search engine, has naturally become the main direction of this research: behind every powerful search engine there is a highly effective web crawler serving it. We also introduce another key technology in this paper, the Lucene index, an efficient data retrieval tool that plays an indispensable role in the public opinion monitoring system.
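The collection and retrieval modules described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation and does not use Lucene or a Mongo database: it assumes a hypothetical in-memory "web" (`FAKE_WEB`) in place of real HTTP fetching, a breadth-first crawl frontier, and a plain-Python inverted index standing in for the kind of term-to-document structure an index such as Lucene maintains. All names and URLs below are invented for illustration.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# A real crawler would fetch and parse pages over HTTP instead.
FAKE_WEB = {
    "http://news.example/a": (
        "flood warning issued for river city",
        ["http://forum.example/b", "http://blog.example/c"],
    ),
    "http://forum.example/b": (
        "residents discuss flood response",
        ["http://news.example/a"],
    ),
    "http://blog.example/c": ("weekend recipes and travel notes", []),
}


def crawl(seed, web, max_pages=100):
    """Breadth-first crawl from `seed`; returns {url: text} for visited pages."""
    seen = {seed}              # URLs already queued, to avoid revisiting
    frontier = deque([seed])   # FIFO queue gives breadth-first order
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url not in web:     # unreachable page in this toy model
            continue
        text, links = web[url]
        pages[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages


def build_index(pages):
    """Build an inverted index: term -> set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index


def search(index, query):
    """Return the sorted URLs that contain every term of `query` (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return sorted(result)


pages = crawl("http://news.example/a", FAKE_WEB)
index = build_index(pages)
print(search(index, "flood"))
```

Looking a term up in the index avoids scanning every collected page at query time, which is the general reason an index-backed retrieval module can answer quickly over a large crawl. Whitespace tokenization is a simplification here; Chinese text in particular would need a proper word segmenter.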

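  • 【Online publication contributor】 Jilin University
  • 【Online publication year and issue】 2013, Issue 08
  • 【Classification number】 TP391.3
  • 【Times cited】 5
  • 【Downloads】 1479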