

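【Author】 刘德鹏

【Supervisors】 徐谡; 许慧新

【Author information】 University of Electronic Science and Technology of China (电子科技大学), Software Engineering, 2011, Master's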

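【Abstract, translated from the Chinese】 With the rapid development of the Internet, the network has given people an unprecedentedly open and convenient platform for sharing and publishing information. More and more people use it to express their opinions, ideas, emotions, and attitudes. This includes information that has a positive, constructive effect on how events develop as well as information that is negative. At the same time, the openness, directness, and concealment of online platforms give online public opinion an ever greater influence on people's ideology. Timely and effective monitoring and analysis of large volumes of public opinion information is therefore of real practical significance for maintaining social stability and promoting national development.

Network public opinion monitoring systems are closely tied to natural language processing technology. Limited by the state of that technology, traditional systems deal mainly with topic identification and pay little attention to the emotional factors in public opinion. Some scholars have studied the mining of sentiment and opinion in public opinion data, but the results depend heavily on the corpus, which limits their practical use.

In recent years, as natural language processing research has deepened, shallow semantic analysis has begun to emerge, and in applied research it has proven more intelligent and practical than part-of-speech tagging and syntactic analysis. Shallow semantic analysis is a simplified form of semantic analysis: it exploits the key role the verb plays in understanding a sentence and gives a formal representation of sentence meaning centered on the verb. Semantic role labeling, one kind of shallow semantic analysis, labels the constituents of a sentence that serve as semantic roles of a given verb predicate. Its task is clearly defined and its results are easy to evaluate.

Drawing on this recent technology and on a comparative analysis of existing public opinion monitoring algorithms, we designed and implemented a network public opinion monitoring and analysis system. Its contributions are: (1) a new tendency computing algorithm that combines the word semantic similarity algorithm published with HowNet and a character-based tendency computing algorithm, together with an optimized integration of existing topic identification and tracking techniques; (2) regularities in how tendency is expressed in language, obtained through statistical analysis of a large number of samples and expressed as a role-feature probability table and a role-emotion probability table, which provide an objective data basis for subsequent analysis.

The main content of this thesis is as follows.
(1) Framework and module design of the public opinion monitoring and analysis system. Based on the characteristics of online public opinion information, an overall system framework is proposed, and the information preprocessing, information mining, and information service modules are designed.
(2) Hot topic identification for public opinion. News subjects that appear in large numbers online over a period of time are extracted and tracked. Hot topic identification and tracking is achieved by integrating ICTCLAS word segmentation, document frequency feature extraction, TFIDF weighting, and the K-means clustering algorithm.
(3) Shallow semantic analysis of public opinion information. Semantic role labeling tools are trained and tested, then used to label the semantic roles in text.
(4) Tendency analysis of public opinion information. Opinions, emotions, and similar information are extracted from text. This work includes building the emotion lexicon and the feature lexicon, studying the emotional tendency computing algorithm, and knowledge discovery in the corpus.

The work described here has been applied to the analysis of related domestic events. It can effectively assist public opinion monitoring and reduce human intervention, and it is expected to be of benefit in future network information management.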
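The abstract does not say how the role-emotion probability table is built. As a minimal sketch, assuming the table records the relative frequency with which words in each semantic role are emotion-bearing, it could be estimated by simple counting. The function name `role_probability_table`, the PropBank-style role labels (`A0`, `A1`, `AM-ADV`), the category names, and the sample observations are all hypothetical and are not taken from the thesis.

```python
from collections import Counter, defaultdict


def role_probability_table(observations):
    """Estimate P(category | role) by relative frequency.

    observations: iterable of (role, category) pairs, one pair per
    labeled word or phrase in the sample corpus (assumed input format).
    """
    counts = defaultdict(Counter)
    for role, category in observations:
        counts[role][category] += 1

    table = {}
    for role, categories in counts.items():
        total = sum(categories.values())
        table[role] = {c: n / total for c, n in categories.items()}
    return table


# Hypothetical observations: semantic role of a constituent and whether
# it carries emotion. Real data would come from a labeled corpus.
observations = [
    ("A0", "neutral"), ("A0", "neutral"), ("A0", "emotion"), ("A0", "neutral"),
    ("A1", "emotion"), ("A1", "neutral"),
    ("AM-ADV", "emotion"), ("AM-ADV", "emotion"),
]

table = role_probability_table(observations)
for role in sorted(table):
    print(role, table[role])
```

A role-feature table could be estimated the same way by replacing the emotion category with a flag for whether the constituent names a feature of the topic. That reading is also an assumption.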

【Abstract】 With the rapid development of the Internet, the network provides people with an unprecedentedly open and convenient platform for sharing and publishing information. More and more people express their opinions, ideas, feelings, and attitudes online. These include positive information that helps events develop well and negative information that makes them worse. At the same time, the openness, directness, and concealment of the network give it a growing influence on people's ideology. Monitoring and analyzing the huge volume of network information in a timely and effective way therefore has practical significance for maintaining social stability and promoting national development.

Network public opinion monitoring systems are closely related to Natural Language Processing technology. Because that technology is limited, traditional systems address topic recognition and related content but pay less attention to the emotional factors in public opinion. Although some scholars have studied opinion mining for public opinion, the results depend closely on the corpus, which makes them less practical.

In recent years, as Natural Language Processing research has deepened, shallow semantic analysis has begun to emerge, and in related applications and research it performs more intelligently and practically than part-of-speech tagging and syntactic analysis. Shallow semantic analysis is a simplified semantic analysis that represents the meaning of a sentence around its verb, which is the key to understanding the whole sentence. Semantic role labeling is one kind of shallow semantic analysis: it labels the semantic roles that words and expressions in a sentence play for a given verb. Its advantages include a clearly defined analysis task and ease of evaluation.

Based on a comparative analysis of existing public opinion monitoring algorithms, we design and implement a network public opinion monitoring and analysis system that incorporates this new Natural Language Processing technology. We put forward a novel tendency algorithm that integrates the word semantic similarity algorithm released with HowNet with a tendency computing algorithm based on single characters, and we optimize the existing hot topic identification and tracking techniques. In addition, based on the statistical analysis of a large number of samples, we find regular patterns in tendency texts. These are represented as a role-feature probability table and a role-emotion probability table, and they provide an objective data basis for subsequent analysis.

This thesis mainly covers the following.
(1) The design of the system framework and main modules. According to the characteristics of public opinion, the thesis designs the system framework and its main modules: the information preprocessing module, the information mining module, and the information service module.
(2) Research on hot topic identification and tracking. To extract and track topics that appear with high frequency over a period of time, the thesis integrates ICTCLAS word segmentation, document frequency feature extraction, TFIDF weighting, and the K-means clustering algorithm.
(3) Research on shallow semantic analysis. The thesis uses semantic role labeling tools, after training and testing, to label the semantic roles of words in texts, which can significantly improve the efficiency of text tendency analysis.
(4) Research on text tendency analysis. The thesis presents methods for extracting feelings and opinions from texts, mainly emotional lexicon construction, feature lexicon construction, an emotional tendency computing algorithm, and knowledge discovery in the corpus.

The work in this thesis has been applied to the analysis of domestic events. It can effectively help network public opinion monitoring and reduce human intervention, and it is expected to contribute positively to future network information management.
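The hot topic identification step named in item (2) chains document frequency feature selection, TFIDF weighting, and K-means clustering. The sketch below illustrates that chain in plain Python under several assumptions: the documents are already segmented into words (ICTCLAS is not called here), the `min_df` threshold, the `log(N/df)` form of the inverse document frequency, the first-k centroid seeding, and the toy documents are all illustrative choices, and topic tracking is not covered. It is not the thesis's implementation.

```python
import math
from collections import Counter


def select_features(docs, min_df=2):
    """Keep terms that occur in at least min_df documents."""
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(t for t, n in df.items() if n >= min_df)
    return vocab, df


def tfidf_vectors(docs, vocab, df):
    """Build L2-normalized TFIDF vectors over the selected vocabulary."""
    n_docs = len(docs)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = [tf[t] * math.log(n_docs / df[t]) for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain K-means; the first k vectors seed the centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        for i, v in enumerate(vectors):
            labels[i] = min(range(k),
                            key=lambda c: squared_distance(v, centroids[c]))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels


# Toy, pre-segmented documents (hypothetical data).
docs = [
    ["earthquake", "rescue", "relief"],
    ["stock", "market", "price"],
    ["earthquake", "relief", "donation"],
    ["stock", "price", "index"],
]

vocab, df = select_features(docs, min_df=2)
labels = kmeans(tfidf_vectors(docs, vocab, df), k=2)
print(vocab)
print(labels)
```

Seeding the centroids with the first k vectors keeps the sketch deterministic. A real system would more likely use random restarts or a smarter initialization, and would rank the resulting clusters by size over a time window to decide which topics count as hot.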

  • 【Classification number】 TP393.09
  • 【Times cited】 6
  • 【Downloads】 586

