意见挖掘中若干关键问题研究

Researches on Key Issues in Opinion Mining

【Author】 Luo Fang (罗芳)

【Advisor】 Xiong Qianxing (熊前兴)

【Author Information】 Wuhan University of Technology, Computer Application Technology, 2011, PhD

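【Abstract (translated from the Chinese)】 With the spread of the Internet and the rapid growth of e-commerce, the Web now holds a large volume of consumer reviews of products, and these reviews contain positive or negative assessments of aspects such as product performance and functionality. By tracking this information, merchants and manufacturers can obtain consumer feedback promptly and improve their products; prospective consumers can learn from other consumers' experience and make better-informed purchases. However, given the massive amount of unstructured or semi-structured review text on the Web, obtaining this information by manual reading is time-consuming and laborious. Opinion mining of user reviews has therefore emerged and has become a hot topic in Web information processing in recent years. This dissertation studies key issues in opinion mining, namely the identification of opinion targets, the analysis of opinion content, and the acquisition of opinion sentiment; explores how a domain ontology can support these tasks and what it contributes; and examines them in depth with techniques from information extraction, text classification, and natural language processing. The research combines methodological exploration with empirical analysis. The work and its contributions are as follows:

(1) Building on an analysis of existing methods and techniques, and borrowing the life-cycle-based models of software engineering, an incremental iterative method for ontology construction is proposed. The method divides ontology construction into three phases carried out in multiple steps. Applied to the practical setting of this dissertation, it enriches and refines the knowledge structure of the domain ontology by creating instances, and ultimately yields NBO (Notebook Ontology), a domain ontology for notebook computers used in product named entity recognition.

(2) After defining and systematically analyzing the tasks and methods of product named entity recognition, the dissertation studies the use of the Conditional Random Fields (CRFs) model for this task. Key issues in the recognition process, including the size of the "observation window", the modeling granularity, the tag set, and feature selection, are examined experimentally to verify their effectiveness. To further improve performance, a new external feature, the ontology feature, is introduced into the CRFs model. Experiments show that combining internal and external features achieves the desired recognition performance for product name entities, product attribute name entities, and product component name entities.

(3) Building on traditional topic-based text classification, machine learning methods are used for coarse-grained sentiment classification of text. To address data sparseness, a sentiment vector space model is proposed for text representation, and key issues in sentiment classification, such as the choice of classification algorithm, the use of feature selection methods, and the choice of feature dimensionality, are analyzed and compared experimentally. To take into account a feature word's contribution both to the whole corpus and to each category, the ideas of document frequency and the chi-square statistic are combined into CDPNC, a feature selection method based on the chi-square difference between the positive and negative categories. With it, the macro-averaged and micro-averaged F-measure reach 90.18% and 90.08%, respectively.

(4) Building on semantic-analysis-based sentiment classification, dependency parsing is used to extract feature-opinion pairs. For the sentiment classification of opinion words, and in view of the differences between Chinese and English expression, the semantic orientation method based on pointwise mutual information is improved in practical respects such as the selection of positive and negative reference word pairs and the setting of thresholds. This verifies its feasibility for sentiment classification of Chinese review text and compensates for the shortcomings of opinion-word sentiment classification based on HowNet semantic similarity.

(5) On the basis of the above results, the dissertation presents the architecture of an opinion mining system and designs and implements a prototype. The system performs opinion mining at different granularities, covering overall product reviews as well as reviews of a product's general features and detailed features, and presents product and review query results, product opinion query results, and product opinion comparison results to the user in visual form.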

【Abstract】 With the popularization of the Internet and the rapid development of e-commerce, the Web stores a huge number of customer reviews of products. These reviews contain customers' positive or negative feelings about product performance, functionality, and so on. Businesses and manufacturers can analyze these reviews and obtain consumer feedback in time to improve product performance and after-sales service. Potential consumers can learn from the product experience described in online reviews and make more reasonable purchases. However, processing an enormous amount of unstructured or semi-structured reviews manually would be extremely expensive and time-consuming. Research on opinion mining of customer reviews has therefore attracted increasing attention and has become a hot topic in recent research on Web information processing.

This dissertation addresses several key issues in opinion mining, explores the concrete ways in which a domain ontology supports them and the effect it has, and tackles these tasks with a combination of information extraction, text mining, and natural language processing techniques. It emphasizes methodological research combined with empirical analysis, proposes new methods based on domain ontology, and obtains the following results.

Firstly, based on an analysis of existing methods and techniques for domain ontology construction, an incremental iterative method was proposed for constructing a domain ontology. It divides the construction process into three phases and ten levels. Using this method, the knowledge framework of the domain ontology was enriched and refined through the creation of instances, and a Notebook Ontology was constructed for Product Named Entity Recognition (PNER).

Secondly, after exploring and analyzing the tasks and methods of product named entity recognition, a Conditional Random Fields (CRFs) model was applied to PNER, and the key choices in the recognition process, such as the size of the "observation window", the modeling granularity, the labeling scheme, and the feature set, were verified by experiments. To further improve PNER performance, a new external feature, namely the domain ontology feature, was introduced into the CRFs model. Experimental results showed that the combination of internal and external features performed well, and the F-measure for the ETY, ATT, and PART entity types on the test set achieved the desired results.

Thirdly, building on traditional topic-based text classification methods, machine learning was applied to the coarse-grained sentiment classification of reviews. To address data sparseness, a sentiment Vector Space Model (s-VSM) was used to represent text. The critical choices in sentiment classification, i.e. the classification algorithm, the feature selection method, and the feature dimension, were examined experimentally. Furthermore, to account for a feature's contribution both to the entire corpus and to each category, a feature selection method based on the Chi-square Difference between the Positive and Negative Categories (CDPNC) was proposed. It combines document frequency (DF) with the chi-square statistic (CHI) and achieved better classification performance: experiments showed that the Macro-F and Micro-F reached 90.18% and 90.08%, respectively.

Fourthly, semantic analysis was introduced into sentiment classification, and dependency parsing was used to extract feature-opinion pairs. Because of the differences between Chinese and English, semantic orientation computation based on Pointwise Mutual Information (PMI) cannot be applied directly to the sentiment classification of Chinese reviews. With practical application in mind, this dissertation improved the selection of positive and negative reference words, the threshold setting, and related choices, and verified that PMI-based semantic orientation computation is feasible for the sentiment classification of Chinese reviews and can overcome the weaknesses of semantic similarity computation based on HowNet.

Finally, building on the above research, an opinion mining prototype system was designed and implemented. It can comprehensively mine customer reviews of a product both overall and in detail. With this system, users obtain visualized results that support their decision making.
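As a rough illustration of the PMI-based semantic orientation idea mentioned in the abstract, the sketch below scores a word by how much more strongly it co-occurs with positive reference words than with negative ones. It is not the dissertation's implementation: the abstract does not give its reference words, thresholds, or corpus, so the toy corpus `DOCS`, the seed words, the `smoothing` constant, and the function names `pmi`, `so_pmi`, and `classify` are all assumptions made for illustration. Chinese review text would first need word segmentation, which is assumed to have been done already.

```python
import math
from typing import Sequence, Set

# Hypothetical toy corpus: each review is a set of already-segmented tokens.
DOCS = [
    {"screen", "good"},
    {"screen", "excellent", "good"},
    {"battery", "bad"},
    {"battery", "poor", "bad"},
    {"fast", "good"},
    {"slow", "bad"},
]


def pmi(word: str, seed: str, docs: Sequence[Set[str]], smoothing: float = 0.01) -> float:
    """PMI of two words from document-level co-occurrence counts.

    PMI = log2( p(word, seed) / (p(word) * p(seed)) ).
    The small additive constant (an assumption) keeps the logarithm defined
    when a count is zero.
    """
    n = len(docs)
    count_word = sum(1 for d in docs if word in d)
    count_seed = sum(1 for d in docs if seed in d)
    count_both = sum(1 for d in docs if word in d and seed in d)
    numerator = (count_both + smoothing) * n
    denominator = (count_word + smoothing) * (count_seed + smoothing)
    return math.log2(numerator / denominator)


def so_pmi(word: str, pos_seeds: Sequence[str], neg_seeds: Sequence[str],
           docs: Sequence[Set[str]]) -> float:
    """Semantic orientation: association with positive seeds minus negative seeds."""
    positive = sum(pmi(word, s, docs) for s in pos_seeds)
    negative = sum(pmi(word, s, docs) for s in neg_seeds)
    return positive - negative


def classify(word: str, pos_seeds: Sequence[str], neg_seeds: Sequence[str],
             docs: Sequence[Set[str]], threshold: float = 0.0) -> str:
    """Label a word by comparing its orientation score with a threshold."""
    score = so_pmi(word, pos_seeds, neg_seeds, docs)
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"
```

For example, `classify("fast", ["good"], ["bad"], DOCS)` labels "fast" as positive on this toy corpus because it co-occurs only with the positive seed. The choice of seed words and the value of `threshold` are exactly the kinds of settings the abstract says were adapted for Chinese reviews.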
