Web文本观点挖掘及隐含情感倾向的研究

Research on the Opinion Mining and Hidden Sentiment Inclination for Web Text

【Author】 Yang Hui (杨卉)

【Advisor】 Zhou Chunguang (周春光)

【Author information】 Jilin University, Computer Application Technology, 2011, Ph.D.

【Abstract】 An opinion is a person's idea and understanding of something: a judgment and evaluation of it. An opinion is not a fact, because it has been neither verified nor proved and confirmed. If an opinion is later proved and confirmed, it is no longer an opinion but a fact. From a Web visitor's point of view, it is therefore more appropriate to treat all information published on the Web as opinion rather than fact. Knowing what other people think and how they judge things has become one of the most important inputs to decision making. The Internet now lets us learn the opinions and attitudes of people and experts we do not know personally, and more and more people share their own feelings and experiences online. The growing wealth of opinion resources on the Web, such as personal blogs and online reviews, brings new opportunities and challenges. Opinion mining is the use of information technology to mine and understand other people's opinions.

Sentiment inclination analysis analyzes and mines the content that users actively publish on the Web (also called user-generated content) in order to identify its sentiment inclination (approval, opposition, happiness or sadness) and even to predict how sentiment evolves over time. Analyzing the sentiment inclination of user-generated content helps us understand users' consumption habits, assess public opinion on current hot events, and help enterprises and governments make sound decisions.

However, the information retrieval technology in wide use today, search engines in particular, is keyword based and cannot retrieve by sentiment or opinion. There are two reasons. First, sentiments and opinions cannot be expressed and indexed by simple keywords. Second, the ranking strategies used in information retrieval are not suited to opinion mining. At present, most sentiment analysis algorithms rely on people expressing their feelings about products and services in simple terms. Cultural factors, linguistic nuances and differing contexts, however, make it hard to reduce a piece of written text to a simple favorable or unfavorable sentiment.

This thesis therefore first studies sentiment inclination evaluation models and Web text feature extraction methods in depth, and proposes a continuous sentiment evaluation model and a sentiment evaluation model based on Chinese dependency grammar. On this basis, in order to mine topic communities and sentiment trends in Web text, it combines the hidden sentiment inclination evaluation model with Web text community mining algorithms and with the K-Means text clustering method. The result is a fast Web text community mining algorithm, a multi-agent Web text community mining algorithm, and a Web text clustering algorithm based on hidden sentiment. The main contributions are as follows:

(1) Building on the vector space model of Web text, we propose a subjective-word feature extraction method based on Chinese dependency grammar. Following Chinese dependency grammar rules, the method extracts the subjective words in a text while avoiding noise as far as possible. The experiments compare IG, MI, CE and our algorithm under a KNN classifier, across different feature vector spaces and with imbalanced sample counts.

(2) Because discrete sentiment inclination evaluation cannot accurately describe how sentiment changes, we propose two continuous sentiment inclination evaluation models for Chinese: the Chinese continuous sentiment evaluation model and the sentiment evaluation model based on Chinese dependency grammar. The Chinese continuous sentiment evaluation model aims to provide a comprehensive and accurate analysis of Chinese sentiment inclination. It first identifies the sentiment words in each sentence, determines each sentence's sentiment inclination from the syntactic structure of its context, and then combines the inclinations of all sentences to predict the sentiment inclination of the whole document. Experiments show that the method accurately traces the trend of Web text sentiment over a given period. The sentiment evaluation model based on Chinese dependency grammar uses Chinese dependency grammar rules to determine the prior polarity and the modified polarity of subjective words. Experiments show that, on real Web data, its sentiment classification is more accurate than that of the traditional SVM and NB algorithms.

(3) We study Web text community mining algorithms. For the two kinds of Web community structure, static and dynamic, we propose respectively a fast Web text community mining algorithm based on hidden sentiment and a multi-agent Web text community mining algorithm. The multi-agent algorithm is a dynamic community mining algorithm: it can effectively mine Web text communities that share a topic and a sentiment without knowing the community structure in advance. Both algorithms take hidden sentiment into account in Web text community mining. The experimental results show that they improve not only the precision of Web text mining but also its recall.

(4) We improve the classic K-Means text clustering algorithm and propose a Web text clustering algorithm based on hidden sentiment, HSK-Means. The algorithm includes a similarity measure based on both hidden sentiment and text features, and an initial center selection algorithm built on a new grading mechanism. A good initial center not only represents the center of its cluster but also separates that center more clearly from the others. Experiments on online text sets of different types compared K-Means, Bisecting K-Means, UPGMA and the proposed HSK-Means algorithm. The algorithms with initial center selection (such as Bisecting K-Means and HSK-Means) performed significantly better than the text clustering algorithms without it.

In summary, this thesis studies Web text opinion mining and hidden sentiment inclination analysis of Chinese text in depth. It concentrates on how to evaluate the hidden sentiment inclination of a text more accurately, that is, on continuous sentiment inclination evaluation. It also gives two different algorithms for mining static and dynamic Web text communities, and finally a Web text clustering algorithm based on hidden sentiment and initial center selection. Combining hidden sentiment analysis with community mining not only gives a more accurate and comprehensive understanding of what opinion holders really think, but also helps people who need to use and draw on these opinions make the right decisions. The algorithms and implementation methods in this thesis are novel and have high theoretical and practical value, and the thesis is of significance for further research on opinion mining and sentiment analysis.
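The three steps described under contribution (2), namely finding sentiment words, adjusting each by its context, and combining sentence scores into a continuous document score, can be illustrated roughly as follows. This is only a minimal sketch and not the thesis's model. The lexicon, the negator and intensifier lists, the two-token context window (a crude stand-in for dependency-grammar rules) and the averaging scheme are all assumptions made for illustration, and English tokens are used purely for readability.

```python
from typing import Dict, List

# Hypothetical toy lexicon: prior polarity of sentiment words in [-1, 1].
LEXICON: Dict[str, float] = {"good": 0.8, "great": 1.0, "bad": -0.8, "slow": -0.4}
# Hypothetical context modifiers; a real system would derive these from parses.
NEGATORS = {"not", "never"}
INTENSIFIERS: Dict[str, float] = {"very": 1.5, "slightly": 0.5}


def sentence_score(tokens: List[str]) -> float:
    """Average the context-adjusted polarity of the sentiment words in a sentence."""
    scores = []
    for i, tok in enumerate(tokens):
        if tok not in LEXICON:
            continue
        polarity = LEXICON[tok]
        # Inspect up to two preceding tokens as a stand-in for syntactic context.
        for prev in tokens[max(0, i - 2):i]:
            if prev in NEGATORS:
                polarity = -polarity
            elif prev in INTENSIFIERS:
                polarity *= INTENSIFIERS[prev]
        # Clip so every word contributes a value in [-1, 1].
        scores.append(max(-1.0, min(1.0, polarity)))
    return sum(scores) / len(scores) if scores else 0.0


def document_score(sentences: List[List[str]]) -> float:
    """Combine sentence scores into one continuous document score in [-1, 1].

    Sentences without any sentiment word (score 0.0) are ignored.
    """
    scored = [sentence_score(s) for s in sentences]
    opinionated = [s for s in scored if s != 0.0]
    return sum(opinionated) / len(opinionated) if opinionated else 0.0


if __name__ == "__main__":
    doc = [["very", "good"], ["not", "good"], ["the", "end"]]
    print(document_score(doc))
```

Because the output is a real number rather than a positive/negative label, scores for documents published on successive days can be plotted directly to follow a sentiment trend, which is the motivation the abstract gives for preferring a continuous model over a discrete one.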

  • 【Online publication contributor】 Jilin University
  • 【Online publication issue】 2012, No. 05