

Breaking Events’ Information Extraction Based on Event Frame

【Author】 Feng Li (冯礼)

【Advisor】 Sheng Huanye (盛焕烨)

【Author Information】 Shanghai Jiao Tong University, Computer Application Technology, 2008, Master's degree

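【Abstract, translated from Chinese】 In the current era of information explosion, news information extraction based on event frames can better meet people's need to obtain useful information online. By analyzing a news corpus, frame structures for three classes of breaking events can be predefined, so that each aspect of an event can be processed in a customized way. Using part-of-speech (POS) tagging of news reports, lookups in a location database, and extraction rules derived from corpus study, aspect information such as the time, location, and outcome of a news event can be extracted effectively. Because news events are complex and develop dynamically, frame-based information extraction has one problem: a statically structured frame limits the aspects that can be extracted. To address this, the thesis introduces a method for detecting new event aspects, which automatically searches for aspects not predefined in the frame. To make full use of POS, word order, and the relations between words in a sentence, the thesis extracts features with a word-pair feature model and uses a paragraph-based LSA clustering algorithm to detect new aspects. Tests of the prototype system on a breaking-event corpus show that the proposed methods are feasible, and that extraction of the elements of breaking news reaches relatively high precision and recall. The results of new-aspect detection capture well both the characteristics of individual events and certain commonalities among events of the same class that the frame does not cover. The experimental results demonstrate the application prospects of this research.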

【Abstract】 In today's age of information explosion, event information extraction based on event frames can better meet the need to obtain useful information from the Internet. By analyzing a news corpus, we predefine event frames for three kinds of breaking news, so that each aspect of an event (called a "flank" in this thesis) can be handled with a customized method. Using POS tagging of news articles, queries against a location database, and rules defined from corpus study, we can effectively extract aspect information such as the time, location, and results of a news event. The complexity and dynamic development of news events cause a problem: the static frame structure restricts the content that can be extracted. To solve this problem, we propose a technique called new-aspect detection for events, which automatically finds aspects not defined in the frame. To take full advantage of POS, word order, and the relations between words in a sentence, we use a word-pair feature model to extract features and a paragraph-oriented LSA clustering algorithm to implement new-aspect detection. Tests of the prototype system on a corpus of three kinds of breaking events show that the methods in this thesis are feasible. The extraction of breaking-news elements reaches high precision and recall. The results of new-aspect detection reveal both the distinctive features of a single event and several points common to events of the same kind that are not included in the event frame. The experimental results demonstrate the application prospects of this research.
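The word-pair features and paragraph-oriented LSA clustering mentioned in the abstract can be illustrated with a minimal sketch. The abstract gives no implementation details, so everything below is an assumption: the function names, the window size, the use of raw pair counts, and the greedy cosine-threshold clustering step are all hypothetical stand-ins, and the paragraphs are assumed to be already tokenized (the thesis would also draw on POS tags, which are omitted here).

```python
import numpy as np


def word_pair_features(tokens, window=3):
    """Ordered pairs (left, right) of words at most `window` positions apart.

    Keeping the order of each pair preserves some word-order information.
    The window size is an assumption, not a value from the thesis.
    """
    pairs = []
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            pairs.append((left, right))
    return pairs


def build_matrix(paragraphs, window=3):
    """Build a paragraph-by-word-pair count matrix and its pair vocabulary."""
    vocab = {}
    rows = []
    for tokens in paragraphs:
        counts = {}
        for pair in word_pair_features(tokens, window):
            idx = vocab.setdefault(pair, len(vocab))
            counts[idx] = counts.get(idx, 0) + 1
        rows.append(counts)
    matrix = np.zeros((len(paragraphs), len(vocab)))
    for r, counts in enumerate(rows):
        for c, value in counts.items():
            matrix[r, c] = value
    return matrix, vocab


def lsa_embed(matrix, k=2):
    """Project paragraphs onto the top-k latent dimensions via truncated SVD."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    k = min(k, len(s))
    embeddings = u[:, :k] * s[:k]
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows unchanged
    return embeddings / norms  # unit length, so dot product = cosine


def cluster_by_cosine(embeddings, threshold=0.8):
    """Greedy single-pass clustering (a stand-in; the thesis's algorithm is not specified).

    Each paragraph joins the first cluster whose seed paragraph has cosine
    similarity >= threshold, otherwise it starts a new cluster.
    """
    seeds, labels = [], []
    for vec in embeddings:
        for cluster_id, seed in enumerate(seeds):
            if float(vec @ seed) >= threshold:
                labels.append(cluster_id)
                break
        else:
            seeds.append(vec)
            labels.append(len(seeds) - 1)
    return labels
```

In such a setup, clusters of paragraphs that do not correspond to any predefined frame slot (time, location, results) would be the candidates for new aspects; how the thesis makes that comparison is not stated in the abstract.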

