
Research on Active-Learning-Based Automatic Corpus Annotation (基于主动学习的语料自动标注方法研究)

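【Author】 Song Hongyan (宋鸿彦)

【Advisor】 Yao Tianfang (姚天昉)

【Author information】 Shanghai Jiao Tong University, Computer Application Technology, 2010, Master's thesis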

【Abstract】 Opinion mining aims to automatically extract useful opinion information and knowledge from subjective texts. Research on Chinese opinion mining requires an annotated corpus of Chinese opinionated subjective texts. Because such a corpus carries many layers of information, including word segmentation, part-of-speech tags, dependency relations, semantics, word concepts, and opinions, the finished annotations are usually complex. To reduce the burden on annotators, improve annotation efficiency and accuracy, and lower the annotation error rate, an automatic annotation tool is needed to assist annotators in their work. This thesis implements an active-learning-based annotation tool for Chinese opinion elements that automatically identifies the topic, sentiment, and opinion holder in a sentence. Compared with classical learning algorithms, active learning requires fewer training examples, is less affected by imbalanced training data, and achieves better classification performance. Experiments demonstrate the effectiveness of active learning for opinion element identification. The thesis also proposes a formula that measures overall system performance by combining three factors: the F-measure of the active learning classifier, the training time, and the number of training examples.
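The abstract does not describe the thesis's algorithm in detail, so the following is only a generic sketch of pool-based active learning with uncertainty sampling, meant to illustrate why such a learner can get by with fewer labeled examples. Everything in it is an assumption made for illustration: the function names (`train`, `predict`, `margin`, `active_learn`), the toy one-dimensional feature, the nearest-centroid classifier, and the `oracle` function that stands in for a human annotator. It is not the tool or the evaluation formula from the thesis.

```python
import random

def train(labeled):
    """Fit a nearest-centroid classifier: the mean feature value per class."""
    by_class = {}
    for x, y in labeled:
        by_class.setdefault(y, []).append(x)
    return {y: sum(xs) / len(xs) for y, xs in by_class.items()}

def predict(model, x):
    """Return the label whose centroid is closest to x."""
    return min(model, key=lambda y: abs(model[y] - x))

def margin(model, x):
    """Gap between the two nearest centroids; a small gap means high uncertainty.

    Assumes the model already contains at least two classes.
    """
    d = sorted(abs(c - x) for c in model.values())
    return d[1] - d[0]

def active_learn(seed, pool, oracle, rounds):
    """Repeatedly ask the oracle to label the most uncertain unlabeled item."""
    labeled, pool = list(seed), list(pool)
    for _ in range(min(rounds, len(pool))):
        model = train(labeled)
        query = min(pool, key=lambda x: margin(model, x))  # most uncertain
        pool.remove(query)
        labeled.append((query, oracle(query)))  # hypothetical annotator call
    return train(labeled), labeled

# Toy data: the hidden rule is "label 1 if x > 0, else 0".
rng = random.Random(0)
pool = [rng.uniform(-3, 3) for _ in range(200)]
oracle = lambda x: 1 if x > 0 else 0
seed = [(-2.0, 0), (2.0, 1)]  # one labeled example per class to start

model, labeled = active_learn(seed, pool, oracle, rounds=10)
```

In this sketch only 10 of the 200 pool items are ever sent to the oracle, and they are the ones closest to the current decision boundary. A real annotation tool would replace the scalar feature with features extracted from sentences and the oracle with an annotator's decision.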
