中文产品评论挖掘关键技术研究

Research on Key Mining Techniques of Product Reviews in Chinese

【Author】 黄永文 (Huang Yongwen)

【Supervisor】 何中市 (He Zhongshi)

【Author information】 Chongqing University, Computer Software and Theory, 2009, PhD

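【Abstract (translated from the Chinese)】 With the rapid growth of the web, there are more and more user-centered product reviews that reflect users' experience and contain their views on product features, functions, and performance. By consulting reviews published by product users, customers can choose the products that suit them best, and manufacturers can improve their products and thereby strengthen their competitiveness, so research on product review mining is becoming increasingly important. This thesis applies machine learning to techniques related to product review mining: short-text classification, the mining of feature-opinion pairs, the optimization of feature-opinion pairs, and the extraction of hierarchical relations among product features. The main results and contributions are summarized as follows.

A product review classification method based on semantic features is proposed. Automatically classifying product reviews yields better research material and reduces the complexity of review mining algorithms, which improves mining efficiency. Because product reviews are generally short, the thesis treats review classification as a short-text problem. First, reviews collected from the web are manually labeled to obtain a training set. Next, the terms ranked highest by the χ² statistic and the semantic content (product features, opinion words, degree words) are extracted as classification features; the amount of semantic content, the semantic content that was not selected, and the length of the review text are added as further features. A support vector machine, which is well suited to binary classification, is then trained on these features to obtain a classifier. Finally, the continually updated reviews on the web are classified, high-quality reviews are extracted, and a review corpus is built. Experiments show that adding semantic content clearly improves classification: precision rises by 9% to reach 80%, a good result for short texts such as product reviews.

Following a semi-supervised learning approach, a method is proposed that combines feature mining and opinion mining to obtain feature-opinion pairs. Because product features and opinion words stand in a modifying relation, the thesis uses semi-supervised learning to mine features (product components, functions, performance) together with the opinion words that express sentiment, which preserves the correspondence between features and opinions. Semi-supervised learning can draw expert knowledge from a small number of labeled samples and can also use a large amount of unlabeled data to improve learning performance and generalization. The thesis therefore takes a small set of manually defined feature-opinion pairs as seeds and mines the review corpus with a pattern feature set built from the words, parts of speech, and modifying relations in review sentences, obtaining the product features and evaluations that users really care about. The mined feature words and opinion words are then used to process reviews that mention multiple features; experiments show that this raises both precision and recall by about 2%. Although precision is not very high when features and opinions are mined jointly, the relatively high recall lets the semi-supervised algorithm discover new information.

To improve the quality of the mining results, a method is proposed that optimizes opinion sequences according to the Maximize Harmonic-Mean (MHM) principle. Semi-supervised learning has the drawback that precision drops sharply as the number of iterations grows. Where precision is low and the extracted feature-opinion pairs contain many errors, the thesis exploits the fact that the harmonic mean is sensitive to extreme values, and more sensitive to very small values than to very large ones. Opinion sequences with a large standard deviation are adjusted by deleting their low-frequency elements, and maximizing the harmonic mean raises precision while preserving recall. Experiments show that precision rises by 17% to 77.3% while recall falls by only 5%.

A method is proposed for obtaining hierarchical relations among product features from product specifications and editorial reviews. It mines product specifications with structured data mining to obtain specification features and their hierarchy, and mines editorial reviews with semi-supervised learning to obtain descriptive features and their hierarchy. Existing review mining systems, after obtaining features and the corresponding opinion words, do not further process parent and child features or different wordings of the same feature. As a result, different expressions of one feature are presented to users as different features, and parent and child features are presented as parallel features. The thesis first applies structured data mining to manufacturers' product specifications to obtain the hierarchy among specification features, then applies semi-supervised learning to the editorial reviews provided by websites to obtain descriptive features and their hierarchy. The descriptive features extracted from a paragraph are then compared with the specification features by similarity, which gives the hierarchical relations between specification features and descriptive features. Finally, the extracted feature-opinion pairs are linked to the feature hierarchy: different expressions of the same feature are merged, parent and child features are grouped, the opinions received by each feature are counted, and the evaluations of the product's features at every level are displayed from top to bottom as a tree.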
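The harmonic-mean property that the third contribution relies on can be checked numerically. The sketch below is only an illustration under stated assumptions: the frequency values are invented, and `drop_rarest_if_dispersed` with its `max_std` threshold is a hypothetical single pruning step, since the abstract does not specify the actual MHM procedure or its stopping rule.

```python
from statistics import harmonic_mean, mean, pstdev

def drop_rarest_if_dispersed(freqs, max_std):
    """Hypothetical single pruning step (an assumption, not the thesis's rule).

    If the frequencies are widely dispersed (population standard deviation
    above max_std), drop one occurrence of the least frequent element;
    otherwise return the sequence unchanged.
    """
    kept = list(freqs)
    if len(kept) > 1 and pstdev(kept) > max_std:
        kept.remove(min(kept))
    return kept

# Invented opinion-word frequencies for one product feature.
freqs = [12, 10, 9, 1]
pruned = drop_rarest_if_dispersed(freqs, max_std=3.0)

am_before, am_after = mean(freqs), mean(pruned)
hm_before, hm_after = harmonic_mean(freqs), harmonic_mean(pruned)

print(f"arithmetic mean: {am_before:.2f} -> {am_after:.2f}")
print(f"harmonic mean:   {hm_before:.2f} -> {hm_after:.2f}")
```

Removing the single low-frequency element moves the harmonic mean far more than the arithmetic mean, which is the sensitivity to small values that the abstract describes. Note that dropping the minimum never lowers the harmonic mean, so any real procedure needs an additional constraint (the abstract mentions preserving recall) to decide when to stop.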

【Abstract】 With the rapid development of the web, more and more product reviews appear online that record customer experience and reflect customers' opinions on product features, functions, and properties. By consulting these reviews, customers can buy the products that suit them best, and manufacturers can improve their products and increase their competitiveness. The study of product review mining is therefore becoming more and more important. In this thesis, machine learning techniques are applied to product review mining, including short-text classification, the mining of feature-opinion pairs, the optimization of feature-opinion pairs, and the extraction of hierarchical relationships among product features. The main contributions of this thesis are summarized as follows.

A product review classification method based on semantic features is proposed. Automatic classification of product reviews provides better research material and reduces the complexity of review mining algorithms, which improves mining efficiency. Because product reviews are generally short, their classification is treated here as a short-text problem. First, the product reviews obtained from the web are manually labeled to form the training set. Then the terms ranked highest by the χ² statistic and the semantic contents (product features, opinion words, degree words) are extracted as classification features; the quantity of semantic content, the semantic contents that were not selected, and the length of the text are added as further classification features. A support vector machine (SVM), which is well suited to binary classification, is then trained on the extracted features to obtain the classifier. Finally, the constantly updated product reviews online are classified, and the good reviews are extracted to build a review corpus.
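The χ² ranking step in the pipeline above can be illustrated with the standard 2×2 contingency-table form of the statistic. This is a minimal sketch, not the thesis's implementation: the toy reviews, the labels `useful` and `noise`, the whitespace tokenization, and the function names `chi_square` and `top_terms` are all assumptions made for the example. Real Chinese reviews would first need word segmentation, and the SVM training step is not shown.

```python
from collections import Counter

def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/class contingency table.

    n11: documents in the class that contain the term
    n10: documents outside the class that contain the term
    n01: documents in the class without the term
    n00: documents outside the class without the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom

def top_terms(docs, labels, positive_label, k):
    """Rank terms by chi-square against positive_label and return the top k."""
    n_pos = sum(1 for y in labels if y == positive_label)
    n_neg = len(labels) - n_pos
    pos_df, neg_df = Counter(), Counter()
    for text, y in zip(docs, labels):
        for term in set(text.split()):  # document frequency, not term count
            (pos_df if y == positive_label else neg_df)[term] += 1
    scores = {}
    for term in set(pos_df) | set(neg_df):
        n11, n10 = pos_df[term], neg_df[term]
        scores[term] = chi_square(n11, n10, n_pos - n11, n_neg - n10)
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

# Invented toy data: two informative reviews and two uninformative ones.
docs = [
    "screen clear battery lasts long",
    "battery weak screen dim",
    "fast delivery thanks",
    "nice seller fast delivery",
]
labels = ["useful", "useful", "noise", "noise"]

selected = top_terms(docs, labels, "useful", k=4)
print(selected)
```

The selected terms would then be combined with the semantic features (product features, opinion words, degree words, their counts, and text length) into one feature vector per review before training the classifier.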
Experiments show that adding semantic content clearly improves the classification of product reviews: precision improves by 9 percent and reaches 80 percent, which is a good result for short texts such as product reviews.

A semi-supervised learning method is adopted for product review mining, and the mining of features and the mining of opinion words are combined in a unified process to obtain feature-opinion pairs. Because features and opinions stand in corresponding modifying relations, features such as product components, functions, and performance are extracted together with the opinion words that express customer sentiment, which retains the correspondence between customer opinion words and product features. A semi-supervised learning method can obtain expert knowledge from a labeled corpus and can also use unlabeled data to improve the performance and generalization ability of the learning algorithm. Therefore, a handful of manually defined feature-opinion pairs are used as seeds, and the words, parts of speech, and modifying relations in review sentences are taken as a pattern feature set to mine the product features and evaluations in which customers are really interested. The reviews with multiple features but a single opinion are then processed with the obtained product features and opinion words; experimental results show that both precision and recall improve by about 2 percent after this processing. Although precision is not high when features and opinion words are mined in a unified process, the high recall helps the semi-supervised learning algorithm mine new information.

The sequences of opinions are optimized with Maximize Harmonic-Mean (MHM) to improve the mining performance.
Because the precision of a semi-supervised learning method decreases sharply as iterations proceed, and because the harmonic mean is easily influenced by extreme values, especially small ones, the sequences with a large standard deviation are adjusted with MHM by deleting their low-frequency elements, which preserves recall while improving precision. Experimental results show that precision improves by 17 percent to 77.3 percent while recall falls by only 5 percent.

A method for extracting the hierarchical relationships of features is proposed. The hierarchical relationships of the specification features are extracted from product specification files with a structured data mining method, and a bootstrapping method is used to extract the hierarchical relationships of the descriptive features from editor evaluations. After identifying the features and the corresponding opinion words, existing review mining systems do not further process features that appear in different expressions or features in a subordinate relationship, so the same feature in different phrasings may be shown as different features, and features in a subordinate relationship may be shown as parallel features. In this thesis, a structured data mining method is applied to manufacturer product specifications to obtain the specification features and their hierarchical relationships, and a semi-supervised learning method is then applied to the editor evaluations on websites to obtain the descriptive features and their hierarchical relationships. The descriptive features extracted from a paragraph are then compared with the specification features by similarity to obtain the hierarchical relationships between them.

Finally, the extracted feature-opinion pairs are connected with the hierarchical relationships among the features. The same feature in different expressions is merged, and features in a subordinate relationship are grouped together. The opinions on every feature are then counted, and the product features at different levels are shown from top to bottom in tree form.
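The final merging and display step can be sketched as follows. This is an illustration under stated assumptions rather than the thesis's system: the synonym table `SYNONYMS`, the parent table `PARENT`, the feature-opinion pairs in `PAIRS`, and all function names are hypothetical, and the example assumes the hierarchy and synonym sets have already been extracted.

```python
from collections import Counter

# Hypothetical inputs, invented for illustration.
SYNONYMS = {"display": "screen", "LCD": "screen", "battery life": "battery"}
PARENT = {"screen": "camera", "battery": "camera", "resolution": "screen"}
PAIRS = [
    ("display", "clear"), ("LCD", "clear"), ("screen", "small"),
    ("resolution", "high"), ("battery life", "short"), ("camera", "good"),
]

def canonical(feature):
    """Map a surface expression to its canonical feature name."""
    return SYNONYMS.get(feature, feature)

def opinions_by_feature(pairs):
    """Merge synonymous features and count the opinion words for each."""
    counts = {}
    for feature, opinion in pairs:
        counts.setdefault(canonical(feature), Counter())[opinion] += 1
    return counts

def children_of(parent_map):
    """Invert a child -> parent map into parent -> list of children."""
    kids = {}
    for child, parent in parent_map.items():
        kids.setdefault(parent, []).append(child)
    return kids

def render(node, counts, kids, depth=0):
    """Return the indented lines of the subtree rooted at node."""
    opinions = sorted(counts.get(node, Counter()).items())
    summary = ", ".join(f"{word} x{n}" for word, n in opinions)
    lines = ["  " * depth + node + (f": {summary}" if summary else "")]
    for child in sorted(kids.get(node, [])):
        lines.extend(render(child, counts, kids, depth + 1))
    return lines

counts = opinions_by_feature(PAIRS)
tree_lines = render("camera", counts, children_of(PARENT))
print("\n".join(tree_lines))
```

Keeping the synonym map separate from the parent map mirrors the two problems the abstract names: different expressions of one feature are merged first, and only then are subordinate features placed under their parents.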

  • 【Online publication contributor】 Chongqing University
  • 【Online publication issue】 2009, Issue 12
  • 【Classification code】 TP311.13
  • 【Times cited】 23
  • 【Downloads】 1521