Research on Event-oriented Knowledge Processing (面向事件的知识处理研究)
【Author】 Fu Jianfeng (付剑锋);
【Advisor】 Liu Zongtian (刘宗田);
【Author information】 Shanghai University, Computer Application Technology, 2010, Ph.D.
【Abstract】 Taking the "event" as a basic unit of knowledge representation and an important means of organizing information has received increasing attention. Research on event-oriented knowledge can support information processing technologies such as automatic summarization and question answering. This thesis studies four topics in depth: the construction of an event-oriented Chinese corpus, event recognition, event argument recognition, and event causal relation extraction. To address the shortcomings of previous work, it proposes the following practical solutions.
1. Corpus construction is foundational work in natural language processing. Because research purposes and objects differ, existing event-oriented corpora adopt different annotation schemes. These schemes mainly focus on certain specific types of events or event arguments, and neglect events in the general sense as well as how people understand and perceive events. Based on a questionnaire survey, this thesis analyzes how people understand the ordinary notion of an "event" in text, examines the annotatability of Chinese events, and proposes a method for building a Chinese event corpus. The method is not restricted to a few event types; it covers all events mentioned in a text. It is also built on syntactic and semantic analysis of Chinese sentences, so it suits the characteristics of the language. Evaluation experiments show that annotation with this method achieves high agreement. We also developed an annotation support tool, collected 200 news reports on emergencies as the raw corpus, and annotated them to produce the Chinese Event Corpus (CEC). The work took 10 months and involved nearly ten people. Compared with the ACE and TimeBank corpora, CEC is smaller in scale, but its annotation of events and event arguments is the most comprehensive.
2. Event recognition is the basis of the event extraction task. Most current approaches use machine learning, which depends on finding effective features to improve recognition performance. This thesis presents an event recognition method based on the fusion of multiple features: context features, part-of-speech features, syntactic features, and semantic features are all included when the feature vector is constructed. The discriminative power of these features is tested and analyzed on two different classifiers. The experiments show that recognition performance improves markedly as effective features are added, and is best when all of the features are combined. The method also outperforms event recognition based on tf×idf.
3. Recognizing event arguments with supervised (classification) learning requires a large, manually annotated corpus as the training set from which knowledge about event arguments is acquired. Such approaches depend heavily on the corpus, and data sparseness often leads to poor results. This thesis presents an event argument recognition method based on semi-supervised clustering and feature weighting, which reduces the dependence on annotated data. The method uses a small amount of labeled data as a seed set to guide the clustering, and assigns each feature a weight according to its contribution to the clustering. In addition, the traditional semi-supervised clustering algorithm (Constrained-KMeans) and the feature weighting algorithm (ReliefF) are modified to suit the event argument recognition task. Experiments show that the method has an advantage when labeled data is scarce and achieves relatively good recognition performance in that setting.
4. Causal relations between events are an important class of semantic relation, and extracting them from text has broad application prospects. Traditional extraction methods are limited to relations that are explicitly marked, lie within a single sentence, and link one cause to one effect. In practice, text also contains many unmarked causal relations, cross-sentence and cross-paragraph causal relations, and relations with one cause and many effects, many causes and one effect, or many causes and many effects. To address this limitation, this thesis presents an extraction method based on cascaded Conditional Random Fields (CRFs). The method casts event causal relation extraction as the labeling of an event sequence and uses a cascaded (two-layer) CRF model to label the causal relations among the events. The first layer labels the semantic role of each event in a causal relation; its output is passed to the second layer, which identifies the boundaries of the causal relations. Corpus analysis and experiments show that the method covers every class of causal relation in text (marked and unmarked; within-sentence, cross-sentence, and cross-paragraph; one cause and one effect, one cause and many effects, many causes and one effect, and many causes and many effects) and achieves good extraction performance on all of them.
【Key words】 Event; Chinese Event Corpus; Event Recognition; Event Argument Recognition; Causal Relation Extraction;
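As an illustration of the seed-guided clustering with feature weights described in item 3 of the abstract, the following is a minimal sketch of the textbook Constrained-KMeans idea combined with a weighted distance. It is not the thesis's modified algorithm: the function name `constrained_kmeans`, the two-dimensional toy features, the argument labels `"time"` and `"location"`, and the weight values are all assumptions made for the example, and the weights are simply supplied by hand rather than computed with ReliefF.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def weighted_sq_dist(a: Vector, b: Vector, w: Vector) -> float:
    """Squared Euclidean distance in which feature i is scaled by weight w[i]."""
    return sum(wi * (ai - bi) ** 2 for ai, bi, wi in zip(a, b, w))


def mean(points: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def constrained_kmeans(points: List[Vector], seeds: Dict[int, str],
                       weights: Vector, max_iter: int = 100) -> List[str]:
    """Cluster `points`; `seeds` maps a point index to its known label.

    Seed points initialise the centroids and keep their labels in every
    iteration; all other points go to the nearest centroid under the
    weighted distance.
    """
    labels = sorted(set(seeds.values()))
    # One centroid per label, initialised from the seeds carrying that label.
    centroids = {c: mean([points[i] for i, l in seeds.items() if l == c])
                 for c in labels}
    assignment: List[str] = []
    for _ in range(max_iter):
        new_assignment = []
        for i, p in enumerate(points):
            if i in seeds:
                new_assignment.append(seeds[i])  # constraint: seeds never move
            else:
                new_assignment.append(min(
                    labels,
                    key=lambda c: weighted_sq_dist(p, centroids[c], weights)))
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        for c in labels:
            # Never empty, because every label owns at least one fixed seed.
            members = [points[i] for i, l in enumerate(assignment) if l == c]
            centroids[c] = mean(members)
    return assignment


# Toy data (hypothetical): feature 0 is informative, feature 1 is noise.
points = [(0.0, 5.0), (0.2, -3.0), (0.1, 9.0),
          (1.0, -4.0), (0.9, 8.0), (1.1, 0.0)]
seeds = {0: "time", 3: "location"}   # the only labeled examples
weights = [1.0, 0.0]                 # hand-set; down-weights the noisy feature
labels = constrained_kmeans(points, seeds, weights)
print(labels)
```

With the noisy feature weighted to zero, the two seeds are enough to separate the six points by their first feature alone. Replacing the hand-set weights with ones estimated from the seed set is where a method such as ReliefF would come in.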