Research on Key Technologies of Orientation Analysis Based on Web Comment Information

【Author】 Yang Yuzhen (杨玉珍)

【Supervisor】 Liu Peiyu (刘培玉)

【Author information】 Shandong Normal University, Network and Network Resource Management, 2014, PhD

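【Abstract (Chinese version)】 With the rapid rise of social networks and the steady growth in the number of Internet users, new media represented by the Internet have become an important tool for the public to voice demands, criticize social ills, offer suggestions, and communicate, and an important channel through which people exercise their rights to know, to participate, to express themselves, and to supervise. At the same time, network users have changed from passive recipients of information into producers of it, so that large volumes of user-generated comments accumulate on the Internet. These comments also carry users' emotional attitudes, political leanings, and similar information. Mining the sentiment carried by user-generated content and analyzing users' sentiment orientation is important for product recommendation, public opinion discovery, and information prediction.

Researchers have done a great deal of work on orientation analysis and have advanced the field. However, users' sentiment orientation is mostly embedded in the text they produce, and natural language processing is itself highly challenging. In addition, the sentiment expressed in a comment can change with context. As a result, orientation analysis faces several problems that urgently need to be solved.

Corpus distribution is extremely unbalanced. Corpora for some domains are easy to obtain from the Internet, while corpora for other domains are scarce. How to address this imbalance so that the constructed sentiment lexicon is highly portable across domains, and cross-domain orientation analysis becomes possible, is the first problem to be solved.

Sentiment words are both domain dependent and context dependent: the same sentiment word can show different orientations in different contexts, which greatly reduces system precision. Resolving the context dependence of sentiment words is the key to improving orientation analysis.

For complex language phenomena, how to capture the influence of comparative words, negation words, and sentence patterns on sentence orientation, and whether a reasonable sentence-level model can be built that captures the multiple factors affecting sentence orientation, is one of the problems the field faces.

Flat topic models have difficulty describing the relationship between topics and attributes in comment text, which makes it hard to grasp the global sentiment orientation of a comment topic. Whether a suitable representation model for comment text can be built, one that describes both the vertical hierarchy and the horizontal associations between topics and subtopics and ultimately describes users' global sentiment orientation, is an important open problem.

This thesis defines its research content in response to these problems. The main work is as follows:

(1) A method for automatic cross-domain expansion of sentiment words, which addresses the unbalanced distribution of data across domains. To address corpus imbalance in orientation analysis, a cross-domain orientation analysis method is proposed. It uses labeled information in the source domain to analyze the sentiment orientation of out-of-vocabulary words in the target domain, so that sentiment words in unlabeled domains can be expanded automatically. The method first divides sentiment words into dependent sentiment words and independent sentiment words. On this basis it extends two assumptions of existing orientation analysis and constructs the relationship between the source and target domains. The method has two steps: sentiment word extraction and sentiment word orientation definition. The extraction step combines part-of-speech information with an improved pointwise mutual information measure to compute the strength of dependence between candidate sentiment words and evaluation objects, yielding the set of sentiment words for the target domain. The orientation step constructs word-to-word, word-to-evaluation-object, and word-to-document relations and uses them to compute the orientation strength of each sentiment word, achieving cross-domain sentiment word expansion.

(2) A method for orientation analysis of evaluation phrases, which opens a new route to resolving the context dependence of sentiment word orientation. To address this context dependence, an evaluation phrase orientation analysis method based on the implicit sentiment orientation of evaluation objects is proposed. The method reduces the context of a sentiment word to its evaluation object and quantifies the implicit sentiment of that object. On this basis it constructs the relationships among evaluation objects, sentiment words, and evaluation phrases. Finally, an objective function for phrase orientation analysis is built from heuristic rules. Experiments show that incorporating the implicit sentiment orientation of evaluation objects effectively improves orientation recognition for evaluation phrases.

(3) A method for orientation analysis of negative sentences, which addresses the fuzzy boundary of negation. The thesis analyzes the main factors affecting the orientation analysis of negative sentences and the scope of negation words, converts the negation-boundary problem into a negation-word position problem, and on this basis proposes a negative sentence orientation analysis method based on cascaded HMMs. The method has three levels, of which HMM-1 and HMM-2 identify the evaluation objects contained in a negative sentence; on this basis the sentiment orientation of each phrase is computed. To quantify the effect of negation words on sentence orientation, the negation words in the sentence act as trigger conditions that revise the orientation of the evaluation phrases. Finally, the global orientation of the sentence is computed according to its sentence pattern. The method was entered in the fourth national orientation information evaluation in 2012, where the submitted result performed best among all submissions.

(4) A method for constructing a comment text model, which addresses the difficulty of association detection in flat topic models and lays a foundation for capturing the orientation of topic features globally. The thesis proposes a comment text model construction method that incorporates an extended IB theory. The method treats comment text as a hierarchy. It first divides the text into independent semantic units and further divides each semantic unit into two parts: topic features and semantic-unit features. The topic attribute is used for global association within the same topic or product, while the semantic-unit attribute is used to distinguish the relationships among topics or sub-attributes. For semantic unit segmentation, the traditional Information Bottleneck Method (IB) is extended according to the characteristics of comment text. For association detection between related topics/products, a weighted KL method is used. To test the feasibility of this idea, experiments were run on the TDT4 data set, and the results show that the model captures associations within the same topic/product fairly accurately.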
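The extraction step in item (1) scores the dependence between candidate sentiment words and evaluation objects with an improved pointwise mutual information measure. The abstract does not give the improved formula, so the following is only a minimal sketch of standard PMI at sentence level. The function name `pmi_table`, the input format, and the toy sentences are all hypothetical.

```python
import math
from collections import Counter
from itertools import product


def pmi_table(pairs_per_sentence):
    """Standard PMI between candidate sentiment words and evaluation objects.

    pairs_per_sentence: list of (words, objects) tuples, one per sentence,
    where both elements are sets of strings (hypothetical input format).
    This is plain PMI, not the improved variant used in the thesis.
    """
    n = len(pairs_per_sentence)
    word_count, obj_count, joint_count = Counter(), Counter(), Counter()
    for words, objects in pairs_per_sentence:
        word_count.update(words)
        obj_count.update(objects)
        joint_count.update(product(words, objects))

    scores = {}
    for (word, obj), c in joint_count.items():
        # PMI(w, o) = log2( P(w, o) / (P(w) * P(o)) ), probabilities
        # estimated as sentence frequencies divided by n.
        scores[(word, obj)] = math.log2(c * n / (word_count[word] * obj_count[obj]))
    return scores


# Toy data: "long" co-occurs with two different objects, "sharp" with one.
sentences = [
    ({"long"}, {"battery"}),
    ({"long"}, {"battery"}),
    ({"long"}, {"wait"}),
    ({"sharp"}, {"screen"}),
]
scores = pmi_table(sentences)
```

A pair that only ever appears together, such as ("sharp", "screen") in the toy data, receives a higher score than a word spread across several objects. That is the kind of signal a dependence-strength threshold could use to select target-domain sentiment words.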

【Abstract】 With the rapid rise of social networks and the growing number of Internet users, emerging media, represented by the Internet, have become an indispensable tool for the public to express aspirations, criticize current problems, make recommendations, and communicate effectively, as well as an important channel for the public to exercise their rights to know, to participate, to express, and to supervise. Users have thus turned from recipients of information into producers of it, and a large amount of user-generated information has accumulated on the network. This information carries emotional attitudes, political tendencies, and more. Mining the emotional information carried by user-generated content and analyzing users' emotional tendencies is of great significance to product recommendation, public opinion discovery, and information prediction.

Researchers have done a great deal of work in orientation analysis and have advanced the field. However, users' emotional information is mostly embedded in user-generated text, and natural language processing is itself a very challenging task. In addition, users' emotional information may change with context. These factors leave several problems in orientation analysis that urgently need to be solved:

(1) Corpus distribution is unbalanced. Corpora for some domains are easily available via the Internet, while corpora for other domains are difficult to obtain. How to solve this imbalance, so that the constructed emotional vocabulary transfers well across domains and cross-domain orientation analysis becomes possible, is the primary problem to be solved.

(2) Emotional words are both domain dependent and context dependent: the same emotional word can show different emotional tendencies in different contexts, which significantly reduces system accuracy. Dealing with the context dependence of emotional words is the key to improving orientation analysis.

(3) Sentences may contain negative words, comparative words, emotional words with different tendencies, and other complex language phenomena. Whether a reasonable sentence orientation analysis model can be built that captures the various factors influencing sentence orientation, and so improves sentence orientation analysis, is one of the problems confronting the field.

(4) Flat topic models have difficulty describing the relationships between topics and attributes in comment text, which makes it hard to grasp the global emotional tendency of a comment topic. Whether an appropriate comment text representation model can be built that describes the vertical hierarchy and lateral correlation in comment text, and eventually describes users' overall emotional tendency, is an important open issue.

In response to these problems, this thesis establishes its research content and ultimately makes breakthroughs in the following areas. The major work is as follows:

(1) Automatic extension of emotional words across domains, to deal with the unbalanced distribution of data in different domains. To address the unbalanced corpus in orientation analysis, this thesis proposes a cross-domain sentiment analysis method. The method analyzes the emotional tendency of unknown words in the target domain using the labeled information in the source domain. It first divides emotional words into two categories, dependent emotional words and independent emotional words. On this basis, two assumptions of existing orientation analysis are extended and the relationship between the source domain and the target domain is constructed, so that emotional words can be extended. The method involves two steps: emotional word extraction and emotional word orientation definition. The extraction phase combines part-of-speech information with improved mutual information to calculate the dependence intensity between candidate emotional words and evaluation objects, and obtains the emotional word set of the target domain. For orientation definition, the relationships between words and words, words and evaluation objects, and words and documents are constructed and used to calculate the emotional tendency of each emotional word, ultimately achieving cross-domain emotional word extension.

(2) Orientation analysis of evaluation phrases. An evaluation phrase orientation analysis method based on the emotional expectations of evaluation objects is put forward. To address the context dependence of emotional words, the context of an emotional word is first reduced to its evaluation object, whose potential emotion is used to quantify the impact of the evaluation object on phrase orientation. On this basis, the relationships among evaluation objects, emotional words, and evaluation phrases are constructed. Finally, the objective function of phrase orientation analysis is built from heuristic rules. Experiments show that taking the emotional expectations of evaluation objects into account effectively improves orientation recognition for evaluation phrases.

(3) Orientation analysis of negative sentences. For the negation phenomena that arise in sentence orientation analysis, this thesis analyzes the main factors influencing negative sentence orientation analysis and the scope of negative words, and on this basis puts forward a negative sentence orientation analysis method based on cascaded HMMs. The method is divided into three levels, of which HMM-1 and HMM-2 identify the evaluation objects contained in negative sentences and define the potential emotional tendency of every evaluation object. The negative words contained in the sentence then act as trigger conditions to correct the emotional tendency of evaluation phrases. Finally, the global tendency of the sentence is computed according to sentence rules. This method was entered in Task 1 of the fourth national orientation information evaluation in 2012, which was Chinese negative sentence orientation analysis, and obtained the best evaluation results among all submissions.

(4) Construction of a comment text model, in order to fully capture the emotional tendencies of network users toward a particular topic or product and to overcome the difficulty of capturing global information from evaluation attributes alone. This thesis builds a model for correlation detection in comment text. In this model, comment text is seen as a hierarchy. The comment text is first divided into individual semantic units, and each semantic unit is further divided into two parts: a subject attribute and a semantic unit attribute. The subject attribute is used for global correlation within the same topic or product, and the semantic unit attribute is used to distinguish the relationships between topics or child attributes. For the division of semantic units, the traditional Information Bottleneck Method (IB) is extended based on comment text features. For correlation detection between related topics/products, a weighted KL method is adopted. To verify the feasibility of this idea, tests were conducted on the TDT4 data set, and the results show that the model can capture the correlation between the same topics/products fairly accurately.
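Item (4) names a weighted KL method for correlation detection but does not define the weighting. As a rough illustration only, the sketch below assumes per-term weights applied inside the usual KL sum, D_w(P||Q) = sum_t w_t p_t log(p_t / q_t). The function name `weighted_kl`, the weights, the smoothing floor `eps`, and the toy distributions are all hypothetical and not taken from the thesis.

```python
import math


def weighted_kl(p, q, weights, eps=1e-12):
    """Weighted KL divergence: sum_t w[t] * p[t] * log(p[t] / q[t]).

    p, q: dicts mapping terms to probabilities (each assumed to sum to 1).
    weights: dict of per-term weights; terms not listed default to 1.0.
    The weighting scheme is an assumption, not the thesis's definition.
    """
    total = 0.0
    for term, p_t in p.items():
        if p_t == 0.0:
            continue  # the term 0 * log(0 / q) is taken as 0
        q_t = max(q.get(term, 0.0), eps)  # floor avoids division by zero
        total += weights.get(term, 1.0) * p_t * math.log(p_t / q_t)
    return total


# Toy term distributions for two semantic units.
unit_a = {"battery": 0.5, "screen": 0.5}
unit_b = {"battery": 0.25, "screen": 0.75}

uniform = weighted_kl(unit_a, unit_b, {})                    # all weights 1.0
emphasised = weighted_kl(unit_a, unit_b, {"battery": 2.0})   # stress one term
```

With all weights at 1.0 this reduces to ordinary KL divergence, and a smaller value suggests the two units are more likely to belong to the same topic or product. Plain KL is asymmetric, so a symmetric variant (for example, averaging both directions) may suit correlation detection better.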
