

Study on Domain Adaptation for Sentiment Classification

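【Author】 Yang Wenrang (杨文让)

【Advisor】 Li Peifeng (李培峰)

【Author information】 Soochow University (苏州大学), Computer Application Technology, 2011, Master's degree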

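【Abstract (translated from Chinese)】 With the rapid growth and spread of the Internet, more and more subjective opinions are appearing online. Traditional topic-based text classification methods can no longer meet the need to analyze and mine such subjective text, so researchers have turned their attention to its sentiment classification. Sentiment classification is a domain-dependent problem: a classification model trained in one domain is usually hard to apply to another. Training a separate model for every domain would require a large amount of labeled data, and obtaining labeled data takes a great deal of time and effort at very high cost. Research on domain adaptation for sentiment classification therefore has significant practical value. The main work and contributions of this thesis on domain-adaptive sentiment classification are as follows: (1) To address the differences in the statistical distributions of features across domains, a new cross-domain feature selection method that incorporates feature similarity computation is proposed. The method selects sentiment features that have similar statistical distributions in the two domains, which improves classification performance. (2) A cross-domain sentiment classification method based on centroid transfer is proposed. The method uses labeled source-domain documents to classify a large number of unlabeled target-domain documents, adds a portion of the high-confidence documents to the training set, and at the same time removes source-domain documents that lie far from the centroid of the target-domain test set. Through iteration, the centroid distance between the two domains gradually shrinks and the difference between the domains decreases. Experiments show that the method significantly improves classification performance. (3) Because documents within one domain may have different features while documents from different domains may share certain similar features, the thesis proposes clustering the documents of the two domains and then classifying the test documents within each cluster separately. This method likewise reduces the differences between domains and improves classification performance.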

【Abstract】 With the rapid development and spread of the Internet, more and more subjective text is appearing online. Traditional topic-based text classification cannot meet the need to analyze such text and identify its semantic orientation, so sentiment classification has drawn increasing attention from researchers. Sentiment classification is highly domain-specific: a classifier trained in one domain usually performs poorly in another. Training a separate classification model for every domain would require a large annotated corpus, and labeling data is time-consuming and expensive, so domain adaptation approaches are valuable for handling cross-domain sentiment classification. This study focuses on domain adaptation for sentiment classification. Our main work and contributions are as follows: (1) To reduce the differences in the statistical distributions of features between domains, we propose a novel feature selection approach that incorporates feature similarity. It selects sentiment features whose statistical distributions are similar in the two domains, which improves classification performance. (2) We propose a novel domain adaptation approach for sentiment classification based on centroid transfer. The approach uses labeled documents from the source domain to classify unlabeled documents from the target domain, adds a subset of the most confidently classified documents to the training set, and at the same time removes source-domain documents that are far from the centroid of the target-domain test set. Iterating these steps gradually narrows the distance between the centroids of the two domains and reduces the differences between them. The experimental results indicate that the proposed approach significantly improves the performance of cross-domain sentiment classification. (3) Based on the observation that documents in the same domain may have different features, while documents in different domains may share certain similar features, we propose a new classification approach. Specifically, the documents of the two domains are first clustered together, and the test documents in each cluster are then classified separately. This approach also reduces the differences between the domains and thus improves the classification results.
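
The iteration described in contribution (2) can be sketched roughly as follows. This is only an illustrative sketch, not the thesis's implementation: the abstract does not name the base classifier, the confidence measure, or the distance function, so the nearest-centroid classifier, the cosine similarity, the margin-based confidence, and the parameters `rounds`, `add_k`, and `drop_k` are all assumptions, and the toy vectors are made up for illustration.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0

def margin(train, x):
    """Assumed classifier: positive if x is closer to the positive-class
    centroid of the training set, negative otherwise."""
    pos = centroid([v for v, y, _ in train if y == 1])
    neg = centroid([v for v, y, _ in train if y == -1])
    return cosine(x, pos) - cosine(x, neg)

def centroid_transfer(source, target, rounds=3, add_k=2, drop_k=1):
    """source: list of (vector, label) with label in {+1, -1};
    target: list of unlabeled vectors. Returns predicted target labels."""
    # Third field marks whether a training document came from the source domain.
    train = [(v, y, True) for v, y in source]
    pool = list(target)                    # target documents not yet in train
    target_centroid = centroid(target)
    for _ in range(rounds):
        if not pool:
            break
        # Step 1: classify the remaining target documents with the current
        # training set and move the add_k most confident ones into it.
        scored = sorted(((margin(train, x), x) for x in pool),
                        key=lambda s: abs(s[0]), reverse=True)
        for m, x in scored[:add_k]:
            train.append((x, 1 if m >= 0 else -1, False))
            pool.remove(x)
        # Step 2: drop the drop_k source documents least similar to the
        # target centroid, never removing the last document of a class.
        from_source = sorted((d for d in train if d[2]),
                             key=lambda d: cosine(d[0], target_centroid))
        for d in from_source[:drop_k]:
            if sum(1 for _, y, _ in train if y == d[1]) > 1:
                train.remove(d)
    return [1 if margin(train, x) >= 0 else -1 for x in target]

# Toy data (hypothetical): the third feature is strong only in the target domain.
source = [([1.0, 0.0, 0.2], 1), ([0.9, 0.1, 0.0], 1),
          ([0.0, 1.0, 0.2], -1), ([0.1, 0.9, 0.0], -1)]
target = [[0.8, 0.0, 0.6], [0.7, 0.1, 0.7],
          [0.0, 0.8, 0.6], [0.1, 0.7, 0.7]]
predictions = centroid_transfer(source, target)
print(predictions)
```

As target documents enter the training set and distant source documents leave it, the class centroids drift toward the target domain, which is the effect the abstract attributes to the iteration. A real system would operate on sparse bag-of-words features and tune the per-round counts on held-out data.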

  • 【Online publication contributor】 Soochow University (苏州大学)
  • 【Online publication year/issue】 2012, Issue 06