基于意见挖掘技术的网购评论倾向性分析的研究与应用

Research and Application of Opinion-Mining-Based Sentiment Analysis for Online Product Reviews

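【Author】 Fan Yingxiang (范英翔)

【Advisor】 Song Hui (宋晖)

【Author information】 Donghua University (东华大学), Computer Application Technology, 2012, Master's thesis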

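【Abstract (translated from the Chinese 摘要)】 The rapid growth of the Internet has made online shopping increasingly popular, which has greatly changed how people shop. People's impressions of products and of the shopping process, once passed on by word of mouth, now spread through online reviews. Online reviews are extremely important to both ordinary buyers and product manufacturers. This thesis aims to analyze online reviews and extract people's sentiment orientation toward products, thereby helping consumers choose suitable products and helping manufacturers make targeted improvements to product quality.

Opinion-mining-based text sentiment analysis generally treats a document or sentence as a collection of words, phrases, or patterns: it identifies key words, phrases, or patterns, computes their sentiment values, and sums the results to obtain the sentiment value of the document or sentence under analysis. Text sentiment analysis is generally carried out in four steps: data collection, text preprocessing, sentiment identification and judgment, and result presentation.

This thesis studies existing text sentiment analysis methods in depth and crawls online review data from Jingdong Mall (京东商城). By analyzing and compiling statistics on the data, it summarizes the characteristics of online review data and then proposes a POS-pattern-based extraction and merging algorithm (the POSEM algorithm). The algorithm is applied to extract the effective part-of-speech (POS) patterns from the training data set; pattern-matching rules are then designed according to the characteristics of those patterns; finally, the rules are used to extract center words and opinion words from the test set and to determine the sentiment orientation of review sentences. Experimental results show that the proposed method achieves high precision and recall.

The main work of this thesis is as follows:

(1) Building on existing text sentiment analysis theory, the thesis analyzes the collected online review data in depth and summarizes the characteristics relevant to sentiment analysis. In review sentences, adjectives contribute the most to sentiment determination: the ratio of their occurrences in subjective sentences to their total occurrences is the largest, at 86.87%. Nouns and adverbs come next, at 71.64% and 70.79% respectively. Other parts of speech, such as verbs and prepositions, also play an important role in sentiment analysis.

(2) Based on the analysis of the online review data, the thesis designs and implements the POS-pattern-based extraction and merging algorithm (the POSEM algorithm). The algorithm uses "POS\T\O" to represent POS pattern information and defines a length threshold, a frequency threshold, and upper and lower probability thresholds on, respectively, a pattern's length, its frequency in the data set, and its probability of occurring in subjective sentences. Patterns that satisfy the lower probability threshold are used to negate the sentiment orientation of a review sentence. The extraction algorithm extracts, from the preprocessed training text, the POS patterns that satisfy all thresholds. For patterns that satisfy only the length threshold and the upper or lower probability threshold, the merging algorithm attempts to merge them, while preserving the center-word and opinion-word information they carry, in order to obtain fuzzy patterns that satisfy all thresholds. This design can improve the recall of sentiment analysis to some extent.

(3) Based on an analysis of the POS patterns extracted by the POSEM algorithm, the thesis designs pattern-matching rules and identifies center words and opinion words in the test text. The center words and opinion words extracted with high precision are then used to process the remaining unprocessed text, and finally the sentiment orientation of each review sentence is obtained from the summarized sentiment-determination rules. Analysis of the experimental results shows that the proposed method achieves high precision and recall.

(4) The thesis designs and implements a generic framework for text sentiment analysis. Components of the framework can be replaced flexibly to meet different experimental needs. In the preprocessing module, the system defines a unified format for POS tags, so that when a different word segmentation tool is substituted, only its own POS format needs to be converted to the system's format. In the text analysis module, the training, testing, and application components can be replaced easily. Based on this framework and integrating open-source tools, the thesis designs and implements a prototype experimental platform for text analysis. The platform integrates a data collection module, a text preprocessing module, a text sentiment analysis module, and a result presentation module.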
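The threshold-based extraction step in item (2) can be illustrated with a minimal Python sketch. It is not the thesis's POSEM implementation: the abstract does not give the threshold values, the "POS\T\O" encoding, or the merging procedure, so the sketch assumes that patterns are plain contiguous POS-tag n-grams, that each pattern is counted at most once per sentence, and that the thresholds are arbitrary illustrative values. The function name `extract_pos_patterns` and all parameter names are hypothetical.

```python
from collections import Counter


def extract_pos_patterns(sentences, max_len=3, min_freq=2,
                         upper_p=0.8, lower_p=0.2):
    """Return (subjective_patterns, objective_patterns).

    sentences: iterable of (pos_tags, is_subjective) pairs, where pos_tags
    is a list of POS tag strings for one review sentence.
    All threshold defaults are illustrative assumptions.
    """
    total = Counter()        # number of sentences containing each pattern
    subjective = Counter()   # number of subjective sentences containing it
    for tags, is_subjective in sentences:
        # Collect every contiguous tag n-gram up to the length threshold,
        # counting each pattern at most once per sentence.
        seen = {tuple(tags[i:i + n])
                for n in range(1, max_len + 1)
                for i in range(len(tags) - n + 1)}
        for pattern in seen:
            total[pattern] += 1
            if is_subjective:
                subjective[pattern] += 1

    subjective_patterns, objective_patterns = {}, {}
    for pattern, count in total.items():
        if count < min_freq:          # frequency threshold
            continue
        p = subjective[pattern] / count
        if p >= upper_p:              # upper probability threshold
            subjective_patterns[pattern] = p
        elif p <= lower_p:            # lower probability threshold
            objective_patterns[pattern] = p
    return subjective_patterns, objective_patterns


# Toy training data: PKU-style tags (n = noun, d = adverb, a = adjective,
# v = verb) with a subjectivity label per sentence.
sample = [
    (["n", "d", "a"], True),
    (["n", "d", "a"], True),
    (["n", "v", "n"], False),
    (["n", "v", "n"], False),
    (["d", "a"], True),
]
subj, obj = extract_pos_patterns(sample)
print(sorted(subj))
print(sorted(obj))
```

In this toy data the adverb-adjective pattern occurs only in subjective sentences and is kept as a subjective pattern, while the bare noun pattern occurs equally in both kinds of sentence and is discarded. Patterns that fall below the frequency threshold are simply dropped here; in the thesis they are the candidates for the merging step.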

【Abstract】 The rapid development of the Internet has made online shopping more and more popular, which has greatly changed the way people consume. People's feelings about goods and about the shopping process now spread not only by word of mouth but also through online reviews. Online reviews are therefore important to both consumers and producers. This paper seeks to analyze online reviews and extract people's attitudes and emotions toward goods, and thereby to help consumers choose products and help producers improve product quality.

Generally, opinion-mining-based sentiment analysis treats a text or sentence as a collection of words, phrases, or patterns. The sentiment values of these words, phrases, or patterns are calculated and then accumulated to give the sentiment value of the text or sentence. Sentiment analysis has four steps: data collection, text preprocessing, sentiment identification, and result presentation.

The paper studies the existing methods for sentiment analysis in depth and crawls online reviews from Jingdong Mall. Through data analysis and statistics, it summarizes the characteristics of the data and then presents an extraction and merging algorithm for part-of-speech (POS) patterns (POSEM). With this algorithm, the effective POS patterns are extracted from the training data set. According to the characteristics of the POS patterns, the paper designs pattern-matching rules and uses them to extract the center words and opinion words from the test set, from which the sentiment of the online reviews is determined. The experiments show that the proposed method achieves high precision and recall.

The work in this paper is as follows:

1. Building on a theoretical study of existing text sentiment analysis, the paper conducts in-depth analysis and statistics and summarizes the characteristics of the data that are relevant to sentiment analysis. In review sentences, adjectives contribute the most to sentiment analysis: the ratio of their occurrences in subjective sentences to their total occurrences is 86.87%. Nouns and adverbs follow, with ratios of 71.64% and 70.79%. Other parts of speech, such as verbs and prepositions, also play an important role in sentiment analysis.

2. Based on the analysis of the product review data, the paper designs the extraction and merging algorithm for POS patterns (POSEM). The algorithm represents a POS pattern as "POS\T\O" and sets a length threshold, a frequency threshold, and upper and lower probability thresholds on the length of a pattern, its frequency in the data set, and its probability of occurring in subjective sentences. POS patterns that meet the lower probability threshold are used to negate the sentiment orientation of a review sentence. The extraction algorithm extracts the POS patterns that meet all of the thresholds. POS patterns that meet only the length threshold and the probability thresholds are merged in an attempt to obtain patterns that meet all of the thresholds. This design can improve the recall of sentiment analysis to some extent.

3. Based on an analysis of the POS patterns extracted by the POSEM algorithm, the paper designs pattern-matching rules, with which the center words and opinion words are extracted from the test set. The words extracted with high precision are then used to process the remaining text. The experimental results show that the proposed method reaches high precision and recall.

4. A generic framework is designed and implemented whose components can be replaced flexibly to meet different experimental needs. In the preprocessing module, the system defines a uniform format for POS tags. When a different word segmentation tool is substituted, the system only needs to convert that tool's POS format to the uniform one. In the analysis module, the training, testing, and application components can be replaced easily. Based on the framework and combining open-source tools, the paper designs and implements a prototype experimental platform for text analysis. The system integrates a data collection module, a text preprocessing module, a text sentiment analysis module, and a result presentation module.
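The uniform POS format described in item 4 can be pictured as a small adapter layer, sketched below in Python. The abstract does not specify the framework's actual tag set or interfaces, so everything here is an assumption for illustration: the unified tag names, the two mapping tables (loosely modeled on PKU-style and Penn Chinese Treebank-style tags), and the helper `normalize` are all hypothetical.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, POS tag)

# Hypothetical mapping tables from two tagger-specific tag sets to an
# assumed unified tag set; the thesis's real unified format is not given.
PKU_STYLE: Dict[str, str] = {
    "a": "ADJ", "n": "NOUN", "d": "ADV", "v": "VERB", "p": "PREP",
}
CTB_STYLE: Dict[str, str] = {
    "VA": "ADJ", "JJ": "ADJ", "NN": "NOUN", "AD": "ADV",
    "VV": "VERB", "P": "PREP",
}


def normalize(tokens: List[Token], tag_map: Dict[str, str]) -> List[Token]:
    """Convert tagger-specific POS tags to the unified tags.

    Tags missing from the mapping table fall back to "OTHER".
    """
    return [(word, tag_map.get(tag, "OTHER")) for word, tag in tokens]


# The same sentence as tagged by two different (hypothetical) tools.
from_tool_a = [("屏幕", "n"), ("很", "d"), ("清晰", "a")]
from_tool_b = [("屏幕", "NN"), ("很", "AD"), ("清晰", "VA")]

print(normalize(from_tool_a, PKU_STYLE))
print(normalize(from_tool_b, CTB_STYLE))
```

Under this arrangement, swapping the segmentation tool means supplying one new mapping table, and the downstream analysis components only ever see the unified tags.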

  • 【Online publication contributor】 Donghua University (东华大学)
  • 【Online publication issue】 2012, Issue 07