

Technology Research on Sentiment Analysis for Chinese Web Reviews

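【Author】 Zhou Cheng (周城)

【Advisor】 Xiao Weidong (肖卫东)

【Author information】 National University of Defense Technology (国防科学技术大学), Management Science and Engineering, 2011, Master's thesis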

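【Abstract, translated from the Chinese】 With the rapid development of network technology, the Web has become an important source of information for a growing number of people and, at the same time, a platform on which they express their own views. Mining and analyzing the rapidly growing body of online text, especially the reviews that users post on their own initiative, to identify sentiment orientation and how it evolves makes it possible to better understand user behavior and to analyze hot-topic public opinion, and can give governments, enterprises, and other institutions an important basis for decision-making. This thesis first introduces the research background and application prospects of sentiment analysis. It then takes Chinese Web reviews as the object of study and introduces their concept and characteristics. Next, following the sentiment analysis workflow for Web reviews, it studies two aspects in depth: the acquisition and preprocessing of Web reviews, and sentiment analysis methods for Web reviews. For the latter, it examines text sentiment analysis based on text classification and based on a sentiment dictionary. The value of text sentiment analysis lies in drawing summary conclusions from the reviews on a given topic, which first requires collecting a large volume of review data from the Web. Reviews on the same topic are usually concentrated on certain sites, and the pages of a single site are highly structured. Exploiting this characteristic, the thesis designs a real-time Web page processing technique based on message-oriented middleware that downloads and preprocesses pages in parallel, yielding review data ready for sentiment analysis. The thesis then applies two sentiment analysis methods built on different ideas. (1) Text classification: building on traditional feature selection methods, it first proposes a joint feature selection algorithm based on relevance and redundancy, which aims to remove redundant features and retain those that help classification, thereby improving text sentiment classification; it then uses a support vector machine text classifier to classify sentiment polarity. (2) Sentiment dictionary: it builds a sentiment dictionary from HowNet (《知网》) and computes the sentiment orientation of Chinese words, then computes the sentiment orientation value of phrases in the text according to phrase structure, and finally sums these to obtain the sentiment orientation value of the whole review. Finally, using a publicly available online review dataset and a manually annotated dataset collected in this project as experimental test data, the two methods are compared. The experimental results show that both proposed methods are effective, and that the dictionary-based method performs slightly better than the classification-based method.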

【Abstract】 With the rapid development of Web technology, the Web has become an important source of information for more and more people. At the same time, it has become a significant platform on which people express their viewpoints. Mining and analyzing this rapidly expanding body of text, especially the sentiment of online reviews posted by users, improves our understanding of user behavior and public opinion. It can also support decision-making in institutions such as enterprises and governments.

This thesis first introduces the background and prospects of sentiment analysis, and describes the concept and features of Chinese Web reviews. Then, following the sentiment analysis workflow for Web reviews, it studies two aspects: gathering and preprocessing Web reviews, and sentiment analysis techniques. For sentiment analysis, it examines two methods, one based on text classification and one based on a sentiment dictionary.

The main value of sentiment analysis lies in drawing summary conclusions from many reviews on the same topic, which first requires collecting large numbers of reviews from across the Web. Reviews on one topic are generally concentrated on a few Websites, and the Web pages within a single Website are highly structured. This thesis therefore designs a real-time Web page processing technique based on Message-Oriented Middleware that downloads and preprocesses Web pages in parallel, producing the review data used for sentiment analysis.

The thesis then proposes two approaches to sentiment analysis. The first is based on text classification: we propose a joint feature selection method based on relevance and redundancy that eliminates redundant features and retains features significant for classification, thereby improving the accuracy of text sentiment classification; a well-known classifier, the support vector machine, is then used to classify sentiment polarity. The second is based on a sentiment dictionary: we use HowNet to construct a sentiment dictionary, which is used to compute the sentiment orientation of words and then of phrases in the reviews. The sentiment orientations of the phrases are then summed to obtain the sentiment orientation of each review.

Finally, we apply the two proposed methods to a public data set and to the data sets collected in this research. The experimental results show that the feature selection method and the sentiment-dictionary-based method proposed in this thesis are effective, and that the sentiment-dictionary-based method outperforms the text-classification-based method.
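
The abstract does not give the details of the joint feature selection algorithm, so the following is only a minimal sketch of the general idea of trading relevance against redundancy. It assumes a greedy criterion in the style of mRMR (mutual information with the class label minus mean mutual information with features already chosen), binary word-presence features, and a toy data set; the function names, the criterion, and the data are all illustrative assumptions rather than the thesis's method. The subsequent support vector machine step is omitted.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def select_features(columns, labels, k):
    """Greedily pick k features by relevance minus mean redundancy.

    columns: dict mapping a feature name to its per-document values.
    This scoring rule is an assumed, mRMR-style stand-in.
    """
    relevance = {f: mutual_information(col, labels) for f, col in columns.items()}
    selected = []
    remaining = set(columns)

    def score(f):
        if not selected:
            return relevance[f]
        redundancy = sum(
            mutual_information(columns[f], columns[s]) for s in selected
        ) / len(selected)
        return relevance[f] - redundancy

    while remaining and len(selected) < k:
        # sorted() makes tie-breaking deterministic (alphabetical order)
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (hypothetical): 8 reviews, label 1 = positive, 0 = negative.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
columns = {
    "good":  [1, 1, 1, 0, 0, 0, 0, 0],
    "great": [1, 1, 1, 0, 0, 0, 0, 0],  # exact duplicate of "good"
    "bad":   [0, 0, 0, 0, 1, 1, 0, 0],
    "the":   [1, 1, 1, 1, 1, 1, 1, 1],  # carries no class information
}

selected = select_features(columns, labels, 2)
print(selected)
```

On this toy data the sketch picks "good" first and then "bad", skipping "great" because it duplicates a feature that is already selected. A relevance-only ranking would have kept both "good" and "great".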

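
The dictionary-based approach described in the abstract (word orientation, then phrase orientation, then a sum over the review) can be sketched roughly as follows. The lexicon here is a hypothetical toy table, not one derived from HowNet, and the phrase rule (negation words and degree adverbs scale the next sentiment word) is an assumption, since the abstract does not spell out the phrase structures used. Tokens are assumed to be already segmented.

```python
# Hypothetical toy lexicon; the thesis builds its dictionary from HowNet.
WORD_POLARITY = {"good": 1.0, "excellent": 2.0, "bad": -1.0, "terrible": -2.0}
NEGATIONS = {"not", "never"}                   # assumed: flip the sign
INTENSIFIERS = {"very": 1.5, "slightly": 0.5}  # assumed: scale the strength


def phrase_scores(tokens):
    """Return one orientation value per sentiment phrase found in tokens."""
    scores = []
    weight = 1.0  # accumulated effect of modifiers preceding a sentiment word
    for tok in tokens:
        if tok in NEGATIONS:
            weight *= -1.0
        elif tok in INTENSIFIERS:
            weight *= INTENSIFIERS[tok]
        elif tok in WORD_POLARITY:
            scores.append(weight * WORD_POLARITY[tok])
            weight = 1.0
        else:
            weight = 1.0  # any other word ends the current modifier run
    return scores


def review_orientation(tokens):
    """Sum the phrase orientations to get the orientation of the whole review."""
    return sum(phrase_scores(tokens))


def polarity_label(score):
    """Map a numeric orientation to a coarse polarity label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


tokens = "the screen is very good but the battery is not good".split()
scores = phrase_scores(tokens)
total = review_orientation(tokens)
print(scores, total, polarity_label(total))
```

Here "very good" contributes 1.5 and "not good" contributes -1.0, so the review sums to 0.5 and is labeled positive. A real system would need Chinese word segmentation and a much larger lexicon.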