

Research and Analysis on Semantic Orientation of Forum Replies

【Author】 陆彬

【Advisors】 郭捷; 刘功申

【Author information】 Shanghai Jiao Tong University, Computer Application Technology, 2011, Master's thesis

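【Abstract (translated from Chinese)】 With the rapid development of the Internet, online forums have become an important part of the network era. In a forum the main post certainly matters, but most people express their own views by replying to the main posts they care about, so the replies often reflect public opinion better than the main posts do. Accurate sentiment orientation analysis of forum replies requires an understanding of the characteristics of forums, so this thesis first analyzes the characteristics of forum replies, such as the hierarchical relationship between floors and the language features of replies. Taking forum replies as its research object, it then proposes an orientation analysis system that is based on the forum floor structure and takes these characteristics into account. The system first fetches the source code of the forum page to be analyzed and preprocesses it, yielding the hierarchical floor structure of the replies and the text of each floor. It then identifies meaningless replies among the floors. Long replies are additionally checked for relevance to the main post and then classified with a machine learning method. Short replies undergo word segmentation and syntactic analysis, and their orientation is determined with the help of a sentiment lexicon compiled in advance from the language features of forum replies, together with other common lexicons. Finally, from the orientation of each reply and the floor hierarchy built earlier, the system derives and tallies the sentiment orientation of all replies under the main post. Experiments show that the new system's classification accuracy is about 80%, indicating good application prospects.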

【Abstract】 With the rapid development of the Internet, online forums have become a major part of the information age. In a forum the main post is important, but most people express their opinions by replying to the main posts that concern them, so forum replies better reflect the sentiment orientation toward social events. To analyze the semantic orientation of forum replies accurately, it is necessary to grasp the features of forums. This paper first analyzes the features of forum replies, such as the floor structure and the language features of replies. It then presents a new system for predicting the semantic orientation of forum replies based on the forum floor structure and the features of forum language. First, the system extracts the source code of the required forum pages. By analyzing the HTML code of these pages, it builds the forum floor structure and saves the segmented text in floor order. Next, the system determines whether each reply is meaningless. Long replies are also checked for relevance to the main post and then classified by a machine learning method. For short replies, we perform word segmentation and grammatical analysis, and analyze semantic orientation with the help of several word lexicons. Finally, we obtain the semantic orientation of all replies under the post from the orientation of each individual reply and the previously built floor structure. Experimental results demonstrate the effectiveness of the system, with a classification accuracy of about 80%.
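
The last two steps described in the abstract (lexicon-based scoring of short replies, then tallying over the floor structure) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not the thesis's implementation: the toy English lexicons, the `Reply` record, the names `short_reply_orientation` and `tally`, the whitespace tokenizer standing in for Chinese word segmentation, and in particular the rule that a reply to another floor inherits that floor's stance by sign multiplication. The filtering of meaningless replies and the machine learning classifier for long replies are left out.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# Hypothetical toy lexicons; the thesis compiles its own from forum language.
POSITIVE = {"good", "great", "agree", "support"}
NEGATIVE = {"bad", "terrible", "disagree", "oppose"}
NEGATORS = {"not", "never", "no"}


@dataclass
class Reply:
    floor: int                    # position of the reply in the thread
    text: str
    parent: Optional[int] = None  # floor being answered; None = main post


def short_reply_orientation(text: str) -> int:
    """Return +1, -1 or 0 for a short reply using lexicon lookup.

    A negator flips only the sentiment word that immediately follows it.
    """
    score = 0
    negate = False
    for token in text.lower().split():
        word = token.strip(".,!?")
        if word in NEGATORS:
            negate = True
            continue
        polarity = (word in POSITIVE) - (word in NEGATIVE)
        if polarity:
            score += -polarity if negate else polarity
        negate = False
    return (score > 0) - (score < 0)


def tally(replies: list) -> Counter:
    """Count stances toward the main post across all floors.

    Assumed rule: a reply's stance is its own orientation multiplied by the
    stance of the floor it answers, so agreeing with a critic counts as
    negative. The main post itself is treated as +1.
    """
    stance = {}
    for reply in sorted(replies, key=lambda r: r.floor):
        own = short_reply_orientation(reply.text)
        parent_stance = stance.get(reply.parent, 1)
        stance[reply.floor] = own * parent_stance
    return Counter(stance.values())


if __name__ == "__main__":
    thread = [
        Reply(1, "I agree, great idea"),
        Reply(2, "terrible plan"),
        Reply(3, "I agree", parent=2),  # agrees with floor 2, a critic
        Reply(4, "not good"),
        Reply(5, "hello"),
    ]
    print(tally(thread))
```

Processing floors in ascending order means a parent's stance is always known before any reply to it is scored, which is why the floor structure has to be built before the tally. A reply whose parent is neutral is counted as neutral under this rule, which is one of several reasonable choices.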
