
Research on Methods for Mining Public Opinion Information on the Internet

Public Opinion Mining on the Internet

【Author】 杜阿宁 (Du Aning)

【Advisor】 方滨兴 (Fang Binxing)

【Author information】 Harbin Institute of Technology, Computer System Architecture, 2007, PhD

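【Abstract (Chinese version)】 Keeping track of public opinion trends in a timely way and actively guiding public discourse are important measures for maintaining social stability and the governing security of the ruling party. With the rapid growth of the Internet, its user base keeps expanding, and it has gradually become the main medium through which the public publishes, obtains, and passes on information. Internet-based public opinion mining has therefore attracted increasingly wide attention. Public opinion here means the subjective response of social groups, within a given scope and period, to certain social phenomena and realities. As an effective means of public opinion mining, Internet public opinion mining has become a research hotspot. Existing research, however, has exposed problems with the sheer volume of information, the timeliness of processing, and the accuracy of early warning, so breakthroughs are urgently needed in both the theoretical framework and the mining methods.

This thesis studies Internet public opinion mining. After clarifying the notion of public opinion and its related concepts, it concentrates on the architecture of Internet public opinion mining and on the different mining methods used at different stages in the formation of Internet public opinion information. The main contents are as follows.

The architecture of Internet public opinion mining is an important research topic in its own right. The thesis proposes a four-layer architecture consisting of an attribute layer, an information collection layer, a mining layer, and a disposal layer. The attribute layer covers the general regularities in where public opinion information exists, when it arises, how it trends, and how it transforms. The information collection layer covers the questions that arise during collection, such as the topic classes of interest, the collection space, and the collected content. The mining layer covers the mining methods used at different mining occasions and for different mining purposes. The disposal layer covers the means of evaluating, analyzing, and handling Internet public opinion information. This four-layer architecture is the foundation of Internet public opinion mining.

For the generation stage of Internet public opinion information, the thesis proposes a monitoring method for content-sensitive web pages, which supports both the monitoring of sensitive information and the filtering of harmful information. For this method it introduces the concept of user interest focusing degree: the user's filtering requirement is viewed as an informal continuous space centered on the things the user is interested in, with different user interest focusing degrees as radii, and this is used to express the user's requirements regarding filtering bias. Based on the user interest focusing degree, the thesis proposes a filtering algorithm for sensitive Chinese web pages. On the one hand, it combines URL analysis, topic sentence analysis, and body text analysis of the page structure. On the other hand, it quantifies the user interest focusing degree and introduces it into the training phase of the machine learning algorithm used for body text analysis. Experimental results show that the algorithm effectively improves filtering accuracy and processing speed, solving the problem of discovering public opinion at the generation stage.

For the propagation stage, the thesis proposes a monitoring method that mines the news topics read by most users, so that the hot issues the public cares about are identified in time and certain problems are kept from erupting into sudden incidents. For monitoring frequently accessed topics, it proposes Frequent Sketch (FS), a deterministic algorithm for monitoring frequent items in a data stream, based on a delta-encoded doubly linked list. FS has space complexity O(log(εn)/ε) and amortized per-item processing time O(1), and the global synopsis S it produces is an ε-deficient synopsis. Based on FS and FS-Win, its extension to windowed data streams, the thesis proposes an algorithm for mining frequently accessed topics on the Internet. Experimental analysis shows that the algorithm can mine users' frequently accessed topics in real time, solving the monitoring problem at the propagation and reading stage.

For the reposting stage, the thesis proposes a measurement method that mines the news topics reposted by most web pages, in order to understand the current state of Internet public opinion topics and to detect the occurrence of hot public opinion events and the public's leanings toward them. For measuring the public opinion situation, it proposes the NISAC index method. Borrowing from the way economic and social indices are compiled, the NISAC index is compiled on the basis of the number of pages on the Internet that contain specific words. Data analysis shows that the NISAC index can monitor, evaluate, and give early warning of the social operation security situation reflected on the Internet, solving the problem of keeping control at the reposting stage.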
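The abstract characterizes FS only by its guarantees (an ε-deficient synopsis in O(log(εn)/ε) space), so the sketch below does not reproduce FS. As a stand-in it uses Lossy Counting, a different and well-known deterministic stream algorithm that gives the same kind of ε-deficient synopsis within the same space bound. It does not use a delta-encoded doubly linked list and does not achieve the O(1) amortized update time claimed for FS. The class name `LossyCounter` and its methods are hypothetical names chosen for illustration.

```python
import math


class LossyCounter:
    """Lossy Counting over a stream (a stand-in, NOT the thesis's FS algorithm).

    For every item, the stored count underestimates the true count by at most
    epsilon * n, where n is the number of items seen so far.
    """

    def __init__(self, epsilon):
        if not 0 < epsilon < 1:
            raise ValueError("epsilon must be in (0, 1)")
        self.epsilon = epsilon
        self.width = math.ceil(1 / epsilon)  # bucket width
        self.n = 0                           # items seen so far
        self.entries = {}                    # item -> [count, max_undercount]

    def add(self, item):
        self.n += 1
        bucket = math.ceil(self.n / self.width)  # current bucket id
        if item in self.entries:
            self.entries[item][0] += 1
        else:
            # The item may have been dropped earlier; it can have been
            # missed at most (bucket - 1) times.
            self.entries[item] = [1, bucket - 1]
        if self.n % self.width == 0:
            # Bucket boundary: drop entries that cannot be frequent.
            self.entries = {
                k: v for k, v in self.entries.items() if v[0] + v[1] > bucket
            }

    def estimate(self, item):
        """Return the stored (possibly underestimated) count of item."""
        return self.entries.get(item, (0, 0))[0]

    def frequent(self, support):
        """Return items whose stored count is at least (support - epsilon) * n."""
        threshold = (support - self.epsilon) * self.n
        return {k: c for k, (c, _) in self.entries.items() if c >= threshold}
```

Feeding a stream of topic identifiers through `add` and then calling `frequent(s)` returns every topic whose true frequency exceeds `s * n`, and no topic whose true frequency is below `(s - epsilon) * n`, which is what an ε-deficient synopsis guarantees.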

【Abstract】 Tracking and guiding public opinion is an important means of maintaining social stability and the governing security of the ruling party. With the rapid expansion of information technology, the Internet, with its huge number of users, has become the main platform for releasing, exchanging, and acquiring information. As an alternative to public opinion surveys, public opinion mining on the Internet is becoming more and more important. Public opinion is the aggregate of individual attitudes or beliefs held by the adult population in some area over some period. As a method of collecting public opinion, public opinion mining on the Internet has become a research focus. However, existing techniques have problems with huge-volume processing, high-speed mining, and high-accuracy early warning, which call for improvements in both the architecture and the mining algorithms.

This thesis focuses on Internet public opinion mining techniques. After clarifying the notion of public opinion and related concepts, it studies the architecture of Internet public opinion mining and the mining algorithms for the different phases in which public opinion information forms. The main contents are as follows.

The architecture of Internet public opinion mining is an important research topic. The thesis proposes a four-level architecture consisting of an Attribute Level, an Information Collecting Level, a Mining Level, and a Disposing Level. The Attribute Level covers the basic rules of public opinion collecting, catching, tracking, and leading. The Information Collecting Level covers what is collected, where it is collected, and how it is collected. The Mining Level covers a three-phase model of public opinion formation (Releasing, Acquiring, and Citation) and the mining algorithms for each phase. The Disposing Level covers methods for evaluating, analyzing, and disposing of public opinion information. The four-level architecture is the basis of Internet public opinion mining.

During the Releasing phase, the thesis monitors content-suspicious pages in order to filter harmful information and monitor suspicious information. It proposes the notion of User Interest Focusing Degree (UIFD), which measures user interest by how the set of interests is constituted. User interest is thus regarded as an informal continuum, with different UIFDs, around the objects the user is interested in. The thesis implements a UIFD-based filtering approach for Chinese web pages, which combines structural analysis of the page's URL, title, and body with a machine learning algorithm whose training procedure incorporates the UIFD. The UIFD-based filtering algorithm achieves high efficiency in filtering Chinese content-suspicious web pages.

During the Acquiring phase, the thesis maintains an up-to-date list of frequently accessed news topics on the Internet, in order to identify hot topics in time and keep them from turning into unexpected incidents. It puts forward Frequent Sketch (FS), an algorithm for maintaining frequent items, which keeps an ε-deficient synopsis by maintaining a sorted doubly-linked list of groups with the frequency deltas stored between them and by pruning the counters periodically. Compared with existing algorithms, FS performs better in accuracy, processing speed, and memory use. A mining approach built on FS-Win (FS extended to windowed streams) and a topic similarity algorithm can obtain frequently accessed news topics in a timely manner.

During the Citation phase, the thesis measures how widely news topics spread, to help users understand the current state of public opinion propagation and to find out what the hot topics are and what people's attitudes toward them are. It introduces a measurement model for Internet public opinion, the NISAC indexes. Similar to the way economic and social indexes are compiled, NISAC indexes are compiled from the number of web pages that contain certain keywords. They help describe the public opinion situation quantitatively and show how widely a hot topic has spread. Unexpected incidents with an abnormal degree of spread can be detected by monitoring the indexes of certain keywords contained in pages relating to those incidents. In short, NISAC indexes are used to monitor, evaluate, and give early warning of the social security situation reflected on the Internet.
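The abstract says only that NISAC indexes are compiled from keyword page counts in the manner of economic and social indexes; it does not give the formula. The sketch below is therefore not the NISAC computation. It assumes the simplest fixed-base index (page count in each period divided by the count in a base period, scaled to 100) and a simple spike rule against a trailing average. The function names, the `window` and `ratio` parameters, and the sample counts are all hypothetical.

```python
from statistics import mean


def fixed_base_index(page_counts, base_period=0, base_value=100.0):
    """Convert per-period keyword page counts into a fixed-base index.

    Assumption: a plain ratio-to-base index, not the thesis's NISAC formula.
    """
    base = page_counts[base_period]
    if base <= 0:
        raise ValueError("base-period page count must be positive")
    return [base_value * count / base for count in page_counts]


def flag_spikes(index_series, window=3, ratio=2.0):
    """Return the periods whose index exceeds ratio times the trailing mean."""
    flagged = []
    for t in range(window, len(index_series)):
        baseline = mean(index_series[t - window:t])
        if baseline > 0 and index_series[t] > ratio * baseline:
            flagged.append(t)
    return flagged


# Hypothetical daily counts of pages containing one keyword.
counts = [200, 210, 190, 205, 820, 400]
index = fixed_base_index(counts)
spikes = flag_spikes(index)
```

With these made-up counts, only the fifth period (position 4) stands out against its trailing three-period average, which is the kind of abnormal spread the abstract describes detecting by monitoring keyword indexes.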

  • 【Classification number】TP311.13
  • 【Citation count】53
  • 【Download count】6671