基于文本分类技术的文本情感倾向性研究

Text Sentiment Analysis Based on Text Classification

【Author】 Guo Ming (郭明)

【Supervisor】 Zan Hongying (昝红英)

【Author information】 Zhengzhou University, Computer System Architecture, 2010, Master's thesis

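【Abstract (translated from the Chinese original)】 Text sentiment analysis has become a focus of research in recent years, and its range of applications keeps widening: tasks from public opinion monitoring to tracking product word of mouth all depend on it. Building on traditional text classification techniques, this thesis proposes a sentiment analysis model that combines rules with statistical methods and evaluates it on two representative corpora. Corpus 1 is a news corpus with complex domain backgrounds and an extremely imbalanced class distribution; Corpus 2 is a corpus of expert stock commentary with a single domain background. (1) The first study analyzes the sentiment orientation of news text in order to supply sentiment information for automatic news broadcasting. The thesis proposes a method for identifying the central sentence of a text and, working from the extracted central sentences, uses statistical methods to mine potential rules that supplement a manually built rule base, making the rule base more complete and improving the analysis. The experiments combine the rules with support vector machine, Bayesian, and K-nearest-neighbor classifiers, and compare several feature extraction and feature weighting methods. Because the news corpus is extremely imbalanced, purely statistical methods perform poorly on rare classes; combining rules with statistics does not fully solve this problem, but it does improve the results to some extent. The combined model clearly outperforms the purely statistical models, which the author takes as evidence that the approach generalizes well. (2) The second study is built on a vertical search application for the stock domain, which needs to classify expert comments on a given stock as bullish, neutral, bearish, or uncertain. The texts in this corpus are short, highly domain-specific, and strongly colloquial, so general-purpose word segmentation software segments them poorly. The thesis proposes a simple method for locating feature words that meets the experimental needs and is very time-efficient, with time complexity O(n). Because the domain is narrow, a fairly complete rule set is easy to extract; in these experiments the average accuracy of the rules is above 90% throughout and always better than that of the statistical methods. Overall, the combined rule-and-statistics model performs well on the news corpus with its complex background: it clearly improves on purely statistical classification and effectively improves results on rare classes. On the single-domain stock corpus, however, it brings little improvement, which indicates that rule-based methods are better suited to single-domain corpora.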

【Abstract】 The study of text sentiment orientation has become a research focus; more and more scholars work on it, and its applications are constantly expanding. Tasks ranging from public opinion monitoring to tracking product word of mouth cannot do without it. In this paper, we propose a method that combines rules with statistics on the basis of traditional text categorization. Experiments were carried out on two representative corpora. Corpus 1: a news text corpus with a complex background and an extremely imbalanced distribution. Corpus 2: a stock corpus with a single background. (1) We analyze the sentiment orientation of news text to provide sentiment information for automatic news broadcasting. We present a method to determine the main sentence and use statistical methods to extract potential rules from the main sentences, supplementing the manually built rule base in order to improve the sentiment analysis results. In these experiments we use support vector machines, a Bayes classifier, and a K-nearest-neighbor classifier in combination with the rules, and compare a variety of feature extraction and feature weighting methods. Because the news corpus is extremely imbalanced, purely statistical methods perform relatively poorly on rare classes; the combination of rules and statistical methods does not completely solve this problem, but it does improve the experimental results. The results show that the combined model is noticeably better than the purely statistical models, which suggests that combining rules with statistical methods generalizes well. (2) This study is based on a vertical search application in the stock domain. The application needs to classify stock experts' comments on a given stock as bullish, neutral, bearish, or uncertain. Because the texts in this corpus are short, highly domain-specific, and strongly colloquial, common word segmentation software cannot handle them well. This paper presents a simple method for locating feature words that not only meets the experimental requirements but is also very time-efficient, with time complexity O(n). As the domain background is simple, a fairly complete set of rules is easy to extract; the average accuracy of the rules in this part of the experiment is above 90%, and the rule-based method outperforms the statistical methods. The combined classification model obtained good results on the news corpus with its complex background: it is clearly better than the purely statistical methods and effectively improves classification of rare classes. On the single-background stock corpus, however, the combined method brings little improvement, which shows that the rule-based method is well suited to single-background corpora.
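The abstract does not say how the rule output and the classifier output are merged, nor how feature words are located. As a rough illustration only, the sketch below assumes one common arrangement: a single longest-match scan over the unsegmented text (linear in the text length for a fixed maximum word length), a rule decision when the located words agree, and a statistical fallback otherwise. The word list, labels, and the character-level naive Bayes model are all hypothetical stand-ins, not the thesis's rule base or classifiers.

```python
import math
from collections import Counter

# Hypothetical feature words and labels; the thesis's real rule base is not given.
RULE_WORDS = {"买入": "bullish", "增持": "bullish",
              "卖出": "bearish", "减持": "bearish", "观望": "neutral"}
MAX_LEN = max(len(w) for w in RULE_WORDS)

def locate_feature_words(text):
    """Scan the raw text once, longest match first, with no word segmentation.
    Each position tries at most MAX_LEN substrings, so cost is O(n) for fixed MAX_LEN."""
    found, i = [], 0
    while i < len(text):
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in RULE_WORDS:
                found.append(piece)
                i += size
                break
        else:
            i += 1  # no feature word starts here; move one character on
    return found

def rule_label(text):
    """Return a label only when all located feature words agree; otherwise None."""
    labels = {RULE_WORDS[w] for w in locate_feature_words(text)}
    return labels.pop() if len(labels) == 1 else None

class CharNaiveBayes:
    """Tiny multinomial naive Bayes over single characters, add-one smoothing."""
    def fit(self, texts, labels):
        self.priors = Counter(labels)
        self.counts, self.vocab = {}, set()
        for t, y in zip(texts, labels):
            self.counts.setdefault(y, Counter()).update(t)
            self.vocab.update(t)
        return self

    def predict(self, text):
        def score(y):
            c = self.counts[y]
            total = sum(c.values()) + len(self.vocab)
            # Unnormalised log prior is enough for picking the best class.
            return math.log(self.priors[y]) + sum(
                math.log((c[ch] + 1) / total) for ch in text)
        return max(self.priors, key=score)

def classify(text, model):
    """Assumed combination: rules decide first, the statistical model is the fallback."""
    return rule_label(text) or model.predict(text)
```

A possible use: train the fallback with `model = CharNaiveBayes().fit(texts, labels)` and call `classify(comment, model)`. Texts with conflicting or no feature words fall through to the statistical model, which is one way the two components could cover each other's gaps.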

  • 【Online publication contributor】 Zhengzhou University
  • 【Online publication year and issue】 2011, Issue 06