文本情感分类及观点摘要关键问题研究

Research on Key Problems in Text Sentiment Classification and Opinion Summarization

【Author】 Zhang Dongmei (张冬梅)

【Advisor】 Ma Jun (马军)

【Author information】 Shandong University, Computer Application Technology, 2012, Ph.D.

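【Summary】 Natural-language text carries two kinds of information: objective facts and information colored by human subjective feeling. Text that carries subjective information reflects people's attitudes, positions, and opinions toward a particular object. Text sentiment analysis takes such subjective text as its object of study and aims to identify, classify, extract, and annotate the sentiments, opinions, and influence expressed in it. With the rapid growth of the internet, subjective reviews on social media such as forums, communities, blogs, and shopping sites keep multiplying, at times explosively. More and more people and organizations are used to searching online reviews to help them make decisions. However, the sheer volume of information on the Web forces them, after searching, to read, check, and judge a huge number of reviews one by one before reaching an overall judgment. If this mass of reviews could be summarized, the resulting opinion summary would be of great reference value to both consumers and manufacturers. This task is opinion-based multi-document summarization. Likewise, if the reviews could be analyzed automatically to determine which are positive toward the reviewed object, which are negative, and how strongly, users could obtain review information far more efficiently. This task is sentiment classification.

This thesis studies two subtopics of text sentiment analysis, multi-document opinion summarization and sentiment classification. The main work covers three aspects:

(1) An opinion-based multi-document summarization method. Most existing opinion-based multi-document summarization methods summarize by the reviewed feature/aspect and are known as feature/aspect-based opinion summarization. Such summaries rely heavily on accurately identifying the reviewed features and the opinion words, yet in practice sentences often lack an explicitly stated feature or opinion word. Such sentences are easily overlooked in feature-based opinion mining, which degrades the resulting summary. Accurately mining features and opinion words in turn requires domain knowledge, which creates domain dependence. Moreover, feature/aspect-based summarization concentrates on the evaluation of each feature and cannot provide an overview covering the main topics and basic opinions across all reviews. To address these problems, this thesis proposes a general, domain-independent multi-document opinion summarization method. It uses traditional extractive multi-document summarization techniques combined with the probabilistic topic model LDA (Latent Dirichlet Allocation) and semantic orientation. The method first models the set of sentences from the documents with LDA to uncover latent topics, using Gibbs sampling to obtain each sentence's probability distribution over topics and each topic's probability distribution over words; in parallel, it performs part-of-speech analysis on the sentences and uses WordNet and SentiWordNet to compute the semantic orientation value of each word. It then computes topic importance and word importance in turn and, on that basis, combines them with word semantic orientation to compute sentence importance. Finally, it extracts sentences in order of importance and removes redundant sentences according to topic to produce an extractive summary. The method uses the LDA model to find the important topics in review text and semantic orientation to find the strongly subjective opinions on those topics. Experiments show that summaries produced by this method are closer to expert summaries.

(2) An ensemble-learning-based sentiment classification method for imbalanced data sets. Current research on binary sentiment classification concentrates on improving classifier performance but overlooks the common real-world situation in which one class has several times as many samples as the other, that is, class imbalance in the sentiment data. Most current research is conducted on balanced data sets, so classifiers that perform well on balanced data struggle to maintain that performance in practice. Studying how to classify imbalanced sentiment data and improve classification performance is therefore very important, and it is a problem that must be solved before sentiment classification can truly be put into practice. To address it, this thesis proposes a sentiment classification method that combines techniques from imbalanced-data classification and ensemble learning. As a hybrid method, it works at both the algorithm level and the data level: within an ensemble learning framework, it combines under-sampling, Bootstrap resampling, and random feature selection to process the training set so as to gain the advantages of all three. This produces a number of training subsets that differ in both samples and feature space, which yield highly diverse base classifiers and ultimately improve the performance of the combined classifier. Experiments on imbalanced sentiment data sets show that the method significantly improves classification performance on such data.

(3) Fine-grained sentiment classification and a study of how text-classification preprocessing affects sentiment classification. Much sentiment classification research focuses on binary classification, that is, assigning subjective text to a positive or a negative class. In reality, subjective text does not always fall into two classes; for example, reviews in many online stores carry ratings from 1 to 5 stars, and classifying them only as positive or negative does not meet practical needs. This thesis therefore proposes classifying subjective text more finely, called fine-grained sentiment classification, which considers not only the positive or negative polarity of a review but also its strength level. The thesis also analyzes how fine-grained sentiment classification differs in nature from ordinary multi-class classification. Because sentiment classification and traditional topic-based classification have different goals, and in order to study fine-grained sentiment classification better, the thesis analyzes, for supervised machine learning methods, the factors that affect sentiment classification. It compares the effects of the number of feature words, stop-word lists, text feature selection, feature weighting, and text classification methods on this particular problem, and finds that text classification techniques behave differently on sentiment classification than on topic classification with respect to stop words, classification methods, and other aspects. Finally, for fine-grained sentiment classification of Chinese text, the thesis runs machine learning experiments on reviews of Chinese scientific papers. The experiments use the rating level attached to each review as the class label, which removes the need for manual annotation. They find that fine-grained sentiment classification not only differs in nature from topic-based multi-class classification but is also harder than both traditional multi-class classification and binary sentiment classification.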
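The stop-word finding in part (3) can be made concrete with a small Python illustration. The abstract does not say why stop words behave differently, so the explanation assumed here is only one plausible reason: general-purpose stop lists built for topic classification often contain negation and intensity words that carry sentiment. Both stop lists and the function name `extract_features` are hypothetical.

```python
# Hypothetical stop list of the kind used for topic classification.
TOPIC_STOP_WORDS = {"the", "this", "is", "a", "not", "no", "very"}
# Assumed sentiment-aware variant: negation and intensity words are kept.
SENTIMENT_STOP_WORDS = TOPIC_STOP_WORDS - {"not", "no", "very"}


def extract_features(text, stop_words):
    """Lower-case the text, split on whitespace, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in stop_words]


positive = "This phone is very good"
negative = "This phone is not good"

topic_pos = extract_features(positive, TOPIC_STOP_WORDS)
topic_neg = extract_features(negative, TOPIC_STOP_WORDS)
senti_pos = extract_features(positive, SENTIMENT_STOP_WORDS)
senti_neg = extract_features(negative, SENTIMENT_STOP_WORDS)

# With the topic-style list both reviews reduce to the same features,
# so no classifier could tell them apart.
print(topic_pos == topic_neg)
# With the sentiment-aware list the negation survives.
print(senti_pos, senti_neg)
```

Under the topic-style list the two opposite reviews become indistinguishable, while the sentiment-aware list keeps the word that separates them. This is a toy case, not a reproduction of the thesis experiments.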

【Abstract】 Human natural-language text contains two kinds of information: objective and subjective. Subjective information expresses a person's attitude, standpoint, and opinion toward a specific object. Text sentiment analysis focuses on subjective information in order to recognize, classify, extract, and annotate the sentiment, opinion, and effect expressed in the content.

With the rapid growth of internet use, more and more subjective information appears on social media such as forums, communities, blogs, and shopping websites. Both individuals and organizations have come to rely heavily on online reviews when making decisions. However, because of the huge amount of information available, one has to search, check, and judge reviews one by one before reaching a final decision. In this situation it is very useful to first summarize the relevant information; such a summary is valuable to both customers and manufacturers. This task is called opinion-based multi-document summarization. Furthermore, customers can obtain information far more efficiently if the reviews are analyzed automatically to determine which express a positive attitude, which a negative one, and to what extent. This task is called sentiment classification.

This thesis focuses on opinion-based multi-document summarization and sentiment classification, two subfields of text sentiment analysis. It consists of the following three parts:

1) A new method for opinion-based multi-document summarization

Most current opinion-based multi-document summarization methods summarize by the feature or aspect under review and are called feature/aspect-based opinion summarization.
This approach depends heavily on accurate recognition of opinion features and opinion words, yet in practice the opinion feature or opinion word often does not appear explicitly in the sentence. Feature/aspect-based opinion mining therefore misses opinions that are only implied, which degrades the subsequent summarization. Accurately recognizing features and aspects also requires domain knowledge, which makes the approach domain dependent. Furthermore, the feature/aspect-based method concentrates on recognizing and evaluating each feature, so it cannot provide an overview of the main topics and basic opinions across all reviews.

To overcome these problems, this thesis proposes a general, domain-independent multi-document opinion summarization method. It builds on traditional extractive summarization, combining Latent Dirichlet Allocation (LDA) with semantic orientation for multi-document summarization. The method first models the set of sentences from the documents with LDA to uncover the latent topics, obtains the sentence-topic and topic-word distributions through Gibbs sampling, performs part-of-speech analysis, and computes the semantic orientation of each word with WordNet and SentiWordNet. Second, it evaluates the importance of topics and then of words, and from these results together with the semantic orientation of words it evaluates the importance of each sentence. Finally, it ranks the sentences by importance and extracts them in order, removing redundancy according to topic, to obtain the extractive summary. In this way the method identifies the important topics in opinion text with the LDA model and the strongly subjective opinions on those topics with semantic orientation.
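The ranking and extraction steps described above can be illustrated with a minimal Python sketch. The abstract does not give the scoring formulas, so every formula here is an assumption made for illustration: topic importance is taken as the mean topic probability over sentences, word importance as the topic-weighted word probability, and the sentence score as the average word importance boosted by the absolute semantic orientation. The function name `summarize`, its parameters, and the toy numbers are hypothetical; in practice the two distributions would come from an LDA model fitted with Gibbs sampling and the orientation values from SentiWordNet.

```python
from collections import Counter


def summarize(sentences, sent_topic, topic_word, orientation,
              max_sentences=2, per_topic=1):
    """Rank tokenised sentences and return (chosen indices, all scores).

    sentences   : list of token lists
    sent_topic  : sent_topic[i][k] = P(topic k | sentence i), from LDA
    topic_word  : topic_word[k][w] = P(word w | topic k), from LDA
    orientation : word -> semantic orientation in [-1, 1]
    """
    n_topics = len(topic_word)
    # Assumed topic importance: mean probability of the topic over sentences.
    topic_imp = [sum(row[k] for row in sent_topic) / len(sent_topic)
                 for k in range(n_topics)]

    # Assumed word importance: P(word | topic) weighted by topic importance.
    def word_imp(w):
        return sum(topic_imp[k] * topic_word[k].get(w, 0.0)
                   for k in range(n_topics))

    # Assumed sentence score: mean word importance, boosted for words
    # with a strong positive or negative orientation.
    scores = []
    for toks in sentences:
        total = sum(word_imp(w) * (1.0 + abs(orientation.get(w, 0.0)))
                    for w in toks)
        scores.append(total / len(toks) if toks else 0.0)

    # Extract in score order; skip a sentence when its dominant topic
    # has already contributed per_topic sentences (redundancy removal).
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i],
                    reverse=True)
    used, chosen = Counter(), []
    for i in ranked:
        dominant = max(range(n_topics), key=lambda k: sent_topic[i][k])
        if used[dominant] >= per_topic:
            continue
        chosen.append(i)
        used[dominant] += 1
        if len(chosen) == max_sentences:
            break
    return chosen, scores


# Toy inputs standing in for real LDA and SentiWordNet output.
sentences = [["battery", "life", "great"], ["battery", "lasts", "long"],
             ["screen", "terrible"], ["screen", "is", "big"]]
sent_topic = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
topic_word = [
    {"battery": 0.4, "life": 0.2, "great": 0.2, "lasts": 0.1, "long": 0.1},
    {"screen": 0.5, "terrible": 0.3, "big": 0.2},
]
orientation = {"great": 0.8, "terrible": -0.9, "long": 0.1}

chosen, scores = summarize(sentences, sent_topic, topic_word, orientation)
print(chosen)
```

With these toy inputs the sketch picks one sentence per topic, preferring the sentence in each topic that carries the more strongly oriented word.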
Experimental results indicate that the summaries produced by this method are comparable to expert summaries.

2) A new ensemble-learning-based method for sentiment classification of unbalanced data

Current work on binary sentiment classification has focused on improving classification performance while neglecting unbalanced data, in which one category has several times as many samples as the other. Most studies of sentiment classification use balanced data, so the resulting methods perform well on balanced data but cannot maintain that performance in practical applications. It is therefore imperative to develop methods that handle unbalanced data and improve sentiment classification in practice.

To this end, this thesis proposes a new sentiment classification method that combines unbalanced-data classification with ensemble learning. As a hybrid method, it works at both the algorithm level and the data level. Within an ensemble learning framework, it integrates three techniques to process the training set: under-sampling, Bootstrap resampling, and random feature selection. It thus combines the advantages of all three to obtain training subsets that differ in both sample space and feature space, which leads to more diverse base classifiers and, in the end, a stronger ensemble classifier. Experiments on unbalanced sentiment data show that this approach significantly improves classification performance.

3) Fine-grained sentiment classification and the effect of text preprocessing on sentiment classification

Most studies of sentiment classification focus on binary sentiment classification, which categorizes subjective text as positive or negative.
However, in reality, text with subjective information cannot always be classified simply as positive or negative. For example, reviews on many shopping websites carry ratings from 1 star to 5 stars. In this case, classifying them only as positive or negative does not meet practical needs. To solve this problem, this thesis proposes fine-grained sentiment classification, which considers not only the positive or negative polarity of the review text but also its rating strength. The thesis further analyzes the essential difference between fine-grained sentiment classification and traditional multi-class categorization.

Considering the difference between sentiment classification and traditional topic-based categorization, and to study fine-grained sentiment classification better, this thesis uses supervised machine learning to analyze the factors that affect sentiment classification. Specifically, it compares the performance of combinations of the number of features, stop-word lists, text feature selection, feature weighting, and text categorization methods on sentiment classification. These studies indicate that applying stop-word lists and feature selection in text categorization produces different results for sentiment classification than for topic-based classification. Finally, to study fine-grained sentiment classification of Chinese text, the thesis runs machine learning experiments on reviews of Chinese scientific literature. In these experiments, using the rating attached to each review as the category label removes the need for manual annotation. The experiments show that fine-grained sentiment classification is not only different from topic-based multi-class categorization but also more difficult than both traditional multi-class categorization and binary sentiment classification.
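The ensemble scheme of part 2) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the thesis implementation: the abstract names the three ingredients (under-sampling, Bootstrap resampling, random feature selection) but not how they are ordered or which base classifier is used. Here each round under-samples every class to the minority size, bootstraps within each class, draws a random feature subset, and trains a nearest-centroid base learner; predictions are combined by majority vote. All function names and the toy data are hypothetical.

```python
import random


def train_centroid(X, y, feats):
    """Nearest-centroid base learner restricted to the feature indices in feats."""
    cents = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        cents[label] = [sum(r[f] for r in rows) / len(rows) for f in feats]
    return feats, cents


def predict_centroid(model, x):
    """Return the label whose centroid is closest in squared distance."""
    feats, cents = model
    def dist(c):
        return sum((x[f] - v) ** 2 for f, v in zip(feats, c))
    return min(sorted(cents), key=lambda label: dist(cents[label]))


def train_ensemble(X, y, n_estimators=11, n_feats=2, seed=0):
    rng = random.Random(seed)
    by_class = {}
    for x, t in zip(X, y):
        by_class.setdefault(t, []).append(x)
    n_min = min(len(rows) for rows in by_class.values())
    models = []
    for _ in range(n_estimators):
        Xs, ys = [], []
        for label, rows in by_class.items():
            # Under-sampling: keep n_min rows per class without replacement
            # (this leaves the minority class complete).
            kept = rng.sample(rows, n_min)
            # Bootstrap: resample the kept rows with replacement.
            Xs.extend(rng.choice(kept) for _ in range(n_min))
            ys.extend([label] * n_min)
        # Random feature selection: each base learner sees its own subset.
        feats = rng.sample(range(len(X[0])), n_feats)
        models.append(train_centroid(Xs, ys, feats))
    return models


def predict_ensemble(models, x):
    """Majority vote over the base learners."""
    votes = [predict_centroid(m, x) for m in models]
    return max(sorted(set(votes)), key=votes.count)


# Toy unbalanced data: 12 positive (label 1) and 3 negative (label 0) vectors.
X = [[1, 1, 0, 0], [2, 1, 0, 0], [1, 2, 0, 0], [2, 2, 0, 0]] * 3 \
    + [[0, 0, 1, 2], [0, 0, 2, 1], [0, 0, 2, 2]]
y = [1] * 12 + [0] * 3

models = train_ensemble(X, y)
print(predict_ensemble(models, [0, 0, 2, 2]), predict_ensemble(models, [2, 2, 0, 0]))
```

Bootstrapping within each class, rather than over the pooled sample, is a choice made here so that every base learner sees both classes; the thesis may resample differently.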

  • 【Online publication contributor】 Shandong University
  • 【Online publication year and issue】 2012, Issue 12