

Technique Research of Web Chinese Event Automatic Detection

【Author】 Liu Song (刘嵩)

【Advisor】 Li Bicheng (李弼程)

【Author Information】 PLA Information Engineering University, Signal and Information Processing, 2010, Master's thesis

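【Summary】 With communication and Internet technologies developing rapidly, a current research focus in intelligent information processing is how to use events as the organizing thread: analyzing the elements that make up an event, extracting the event and describing it precisely, and quickly and accurately gathering information of interest from massive Internet data. This thesis studies automatic detection of Chinese events on the Web, covering automatic Chinese event annotation, time information extraction, automatic event extraction, and automatic topic detection based on event extraction. The main results fall into three areas: (1) Time information in Chinese event extraction is studied in detail, and a time information extraction method based on user-defined rules is proposed. To address the single extraction target of traditional time extraction, the method classifies the time information appearing in text in detail and makes the extraction scope explicit. Then, following the patterns in which time expressions appear in text, it uses regular expressions to define separate extraction rules for different kinds of time expression, achieving time information extraction with user-defined rules. Experimental results show that the new method outperforms traditional methods in both precision and recall of time extraction and is an effective time information extraction method. (2) Chinese event extraction is studied. To address the restriction of traditional methods to predefined event types, a trigger-guided self-similarity clustering method for event extraction is proposed. The method departs from the traditional practice of classifying with words as instances, introduces clustering into event type determination, and applies the K-means algorithm to event extraction. Guided by event trigger words, it uses a max-min self-similarity strategy to let the value of K in K-means converge automatically, which optimizes the clustering and completes event type determination. Finally, event arguments are described in detail using the named entities in the text and their positions, which removes the dependence of event extraction on type templates and realizes Chinese event extraction. Experimental results show that the new method outperforms traditional methods in both precision and recall of event extraction and offers a new approach to Chinese event extraction. (3) The application of event extraction to topic detection is studied. Replacing the vector-angle cosine text similarity used in traditional topic detection, a topic detection method based on concept similarity is proposed. The method first analyzes the sample under test and the topic set, extracts event arguments and their descriptive information, and constructs a text vector space model. It then uses HowNet knowledge to compute concept similarity, word similarity, and text unit similarity, completing the concept similarity computation. Finally, topics are detected automatically by comparing similarities. Experimental results show that, compared with traditional topic detection methods, the new method detects clearly defined topics with low miss and false detection rates, and is an effective automatic topic detection method.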
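
Result (1) describes classifying time expressions and writing one regular-expression rule per class. The following is only a minimal sketch of that general idea, not the thesis's rule set: the category names, the patterns in `TIME_RULES`, and the function `extract_times` are all hypothetical. Rules are tried in priority order, and a match is dropped if it overlaps text already claimed by a higher-priority rule.

```python
import re

# Hypothetical rule table (category, pattern), highest priority first.
# These categories and patterns are illustrative assumptions only.
TIME_RULES = [
    ("absolute_date", re.compile(r"\d{4}年\d{1,2}月\d{1,2}日")),
    ("year_month", re.compile(r"\d{4}年\d{1,2}月")),
    ("month_day", re.compile(r"\d{1,2}月\d{1,2}日")),
    ("clock_time", re.compile(r"(?:上午|下午|晚上|凌晨)?\d{1,2}[点时](?:\d{1,2}分)?")),
    ("relative_day", re.compile(r"前天|昨天|今天|明天|后天")),
]


def extract_times(text):
    """Return (expression, category) pairs in order of appearance."""
    claimed = []  # character spans already taken by a higher-priority rule
    found = []
    for category, pattern in TIME_RULES:
        for match in pattern.finditer(text):
            start, end = match.span()
            # Skip a match that overlaps an already claimed span.
            if any(start < e and s < end for s, e in claimed):
                continue
            claimed.append((start, end))
            found.append((start, match.group(), category))
    found.sort()  # restore reading order
    return [(expr, category) for _, expr, category in found]


sample = "2010年5月3日上午9点30分,会议在北京召开,昨天已发布通知。"
for expr, category in extract_times(sample):
    print(category, expr)
```

Ordering the rules from most to least specific keeps a full date from also being reported as a separate year-month or month-day fragment.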

【Abstract】 With the rapid development of communication and Internet technologies, event-based collection of public information from the Internet has become an important research area in intelligent information processing. Researchers urgently need ways to detect and describe events and to collect information of interest, organized by event, from massive Web data quickly and accurately. This thesis studies automatic detection of Chinese events on the Web, which involves automatic Chinese event annotation, time information extraction, automatic event extraction, and event-based Web topic detection. The major contributions are as follows: (1) A method for time information extraction based on user-defined rules is presented. To address the single extraction target of traditional time extraction methods, the time expressions in text are classified in detail and the extraction scope is defined. Different rules are then written for different kinds of time expression, achieving time information extraction with user-defined rules. Experimental results show that the precision and recall of the new method are superior to those of traditional methods. (2) A trigger-guided self-similarity clustering method for event extraction is proposed. First, the traditional approach of classifying events by feature words is replaced with clustering for event type determination, using the K-means algorithm. Second, guided by trigger words, a max-min strategy is adopted to let K in the K-means algorithm converge automatically, which optimizes the clustering and completes event classification. Third, event arguments are described using the named entities in the text and their positions, which removes the dependence on event type templates and completes Chinese event extraction. Experimental results show that the new method outperforms traditional event extraction methods in precision and recall and offers a new approach to Chinese event extraction. (3) An automatic topic detection method is put forward that is based on document concept similarity instead of the feature-word similarity on the vector space model used in traditional topic detection methods. First, the sample and the topic set are analyzed, event arguments are extracted, and a document vector space model is constructed. Second, concept similarity, word similarity, and text similarity are calculated using HowNet. Finally, topic detection is performed based on document concept similarity. Experimental results show that the new method is more effective than traditional methods in precision and recall of topic detection.
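
Contribution (2) lets K in K-means settle on its own through a max-min strategy. The abstract gives no formulas, so the sketch below shows one common reading of that idea, the classical max-min distance seeding rule, under loudly stated assumptions: it uses Euclidean distance on toy 2-D points rather than the thesis's trigger-guided self-similarity, and the names `max_min_centers`, `kmeans`, and the threshold `theta` are hypothetical. A point becomes a new center only while its distance to the nearest existing center exceeds `theta` times the distance between the first two centers, so the number of centers found plays the role of K.

```python
import math


def dist(a, b):
    """Euclidean distance (a stand-in for the thesis's similarity measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def max_min_centers(points, theta=0.5):
    """Choose initial centers by the max-min distance rule; K = len(result)."""
    centers = [points[0]]  # assumption: the first point acts as the seed
    second = max(points, key=lambda p: dist(p, centers[0]))
    ref = dist(second, centers[0])
    if ref == 0:
        return centers  # all points coincide
    centers.append(second)
    while True:
        # Candidate: the point whose nearest center is farthest away.
        cand = max(points, key=lambda p: min(dist(p, c) for c in centers))
        gap = min(dist(cand, c) for c in centers)
        if gap <= theta * ref:
            return centers
        centers.append(cand)


def kmeans(points, centers, iters=20):
    """Plain K-means refinement starting from the given centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        updated = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centers)
        ]
        if updated == centers:
            break
        centers = updated
    return centers, clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (0, 10), (1, 10), (0, 11)]
seeds = max_min_centers(points)
centers, clusters = kmeans(points, seeds)
print(len(centers), [len(c) for c in clusters])
```

In this toy data the three well-separated groups yield three seeds, so no value of K has to be supplied in advance; the choice of `theta` controls how readily new clusters are opened.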


