The Research of Fusion Methods for Hedge Detection (模糊限制信息检测中融合方法的研究)

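【Author】 Zhou Huiwei (周惠巍)

【Supervisors】 Yang Yuansheng (杨元生); Huang Degen (黄德根)

【Author Information】 Dalian University of Technology, Computer Software and Theory, 2012, PhD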

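【Abstract (translated from the Chinese)】 As an important step in biomedical information extraction, hedge detection in the biomedical domain aims to distinguish hedged information from factual information in biomedical literature, so that hedged information is not extracted as fact. In recent years, with the construction of large-scale hedge-annotated corpora, research on hedge detection has made some progress, but the performance of hedge scope detection has not yet reached 60% and remains some distance from practical use. The reason is that hedge scope detection is a complex task that depends on both semantics and syntactic structure, so a single statistical model can hardly meet its demands. Fusion methods can effectively combine the multiple kinds of features, methods, and models used in natural language processing tasks, avoid the one-sidedness of a single model, and deliver accurate and robust processing. This thesis studies fusion methods for hedge detection. Its main contents are as follows.

(1) Hedge scope detection with a composite kernel that fuses structured features and flat features. The work focuses on phrase-based structured representations of hedge scope and uses a convolution tree kernel to capture the structural information of the scope, which reduces the information lost when structured information is flattened. The convolution tree kernel over structured features is then integrated with a polynomial kernel over flat features through a composite kernel. The resulting composite kernel achieves better detection performance than either kernel used alone.

(2) Hedge scope detection that combines statistical and rule-based methods. Combining the two kinds of method fuses a phrase-structure-based and a dependency-structure-based scope detection system. First, a Support Vector Machine (SVM)-based scope detection subsystem is built on phrase structures, and a rule-based scope detection subsystem is built on dependency structures. The outputs of the two subsystems are then introduced as two independent features into a Conditional Random Field (CRF) model for fusion. This fusion makes effective use of both phrase structures and dependency structures, and it combines statistical with rule-based methods as well as SVM learning with CRF learning. The combined method achieves better results than either method used alone.

(3) Hedge scope detection by fusing multiple classifiers. A voting-based method is proposed. Eight base classifiers are built from SVM, CRF, Max-Margin Markov Networks (M3N), and the thesis's combined statistical and rule-based method, each in the forward and the backward parsing direction. Their results are then fused with three voting strategies: majority voting, classifier-weighted voting, and part-of-speech (POS)-weighted voting. All of the voting-based systems achieve stable performance that is better than that of the best base classifier.

The main contribution of this thesis is an in-depth study of fusion strategies for hedge detection, covering the fusion of flat and structured features, the fusion of statistical and rule-based methods, and the fusion of multiple classifiers. It proposes a composite-kernel method for hedge scope detection that fuses structured and flat features; a method that combines statistical and rule-based approaches and makes effective use of phrase structures and dependency structures; and a voting-based multi-classifier method for hedge scope detection. This work effectively improves hedge detection performance in the biomedical domain and offers a useful reference for future research on fusion strategies in natural language processing.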
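
The composite kernel of item (1) can be illustrated with a minimal sketch. The abstract does not say how the two kernels are combined, so everything specific here is an assumption: the linear combination with weight `alpha`, the cosine-style normalization, the decay factor `lam`, the nested-tuple tree encoding, and all function names. The tree kernel below is the standard subset-tree formulation of a convolution tree kernel, which may differ from the variant used in the thesis.

```python
import math

def internal_nodes(tree):
    """Yield every internal node of a parse tree given as nested tuples.

    A node is (label, child, ...); a leaf word is a plain string.
    """
    if isinstance(tree, tuple):
        yield tree
        for child in tree[1:]:
            yield from internal_nodes(child)

def production(node):
    """Grammar production at a node: its label plus (is_internal, symbol) per child."""
    return (node[0],) + tuple(
        (True, c[0]) if isinstance(c, tuple) else (False, c) for c in node[1:]
    )

def shared_fragments(n1, n2, lam):
    """Decay-weighted count of tree fragments rooted at both n1 and n2."""
    if production(n1) != production(n2):
        return 0.0
    score = lam
    for c1, c2 in zip(n1[1:], n2[1:]):
        if isinstance(c1, tuple):  # productions are equal, so c2 is internal too
            score *= 1.0 + shared_fragments(c1, c2, lam)
    return score

def tree_kernel(t1, t2, lam=0.4):
    """Convolution tree kernel: sum shared fragments over all node pairs."""
    return sum(
        shared_fragments(n1, n2, lam)
        for n1 in internal_nodes(t1)
        for n2 in internal_nodes(t2)
    )

def poly_kernel(x, y, degree=2):
    """Polynomial kernel (x . y + 1)^degree over flat feature vectors."""
    return (sum(a * b for a, b in zip(x, y)) + 1.0) ** degree

def normalized(kernel, a, b):
    """Cosine-style normalization so both kernels lie on a comparable scale."""
    return kernel(a, b) / math.sqrt(kernel(a, a) * kernel(b, b))

def composite_kernel(inst1, inst2, alpha=0.5):
    """Linear combination of the two normalized kernels (assumed form).

    Each instance is (parse_tree, flat_vector); alpha is a hypothetical weight.
    """
    (tree1, vec1), (tree2, vec2) = inst1, inst2
    return (alpha * normalized(tree_kernel, tree1, tree2)
            + (1.0 - alpha) * normalized(poly_kernel, vec1, vec2))

# Toy instances: a tiny phrase tree plus an invented three-dimensional flat vector.
a = (("NP", ("DT", "the"), ("NN", "dog")), [1.0, 0.0, 1.0])
b = (("NP", ("DT", "a"), ("NN", "dog")), [1.0, 1.0, 0.0])
print(composite_kernel(a, b))
```

Because a non-negative weighted sum of valid kernels is itself a valid kernel, a function of this shape could be handed to any kernel-based learner that accepts a custom or precomputed kernel.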

【Abstract】 Hedge detection distinguishes factual from uncertain information in biological texts. It is an important step in biomedical information extraction because it prevents speculative information from being extracted as fact. Now that the large-scale annotated BioScope corpus is available, research on hedge scope detection has advanced. However, the performance of hedge scope detection is still below 60%, which leaves a considerable gap between academic research and practical application. Hedge scope detection is a complicated task because it requires semantic analysis of sentences that draws on syntactic patterns, and no simple, reliable method achieves satisfactory performance on it. Every kind of feature, method, and model has its own advantages and limitations, and they complement one another. How to combine the advantages of various features, methods, and models, and avoid the one-sidedness of a single model, in order to build highly accurate fusion-based hedge detection systems has therefore become an important theme in natural language processing. This thesis focuses on fusion methods for hedge detection.

The main work is as follows:

1. An approach to hedge scope detection using a composite kernel that combines structured and flat features. Four phrase-based structured features over a parse tree are explored for hedge scope learning, so that the convolution tree kernel can capture the critical syntactic structure. The convolution tree kernel, which exploits the syntactic structured features, and the polynomial kernel, which exploits the flat features, are combined into a composite kernel. The composite kernel outperforms either of the two individual kernels.

2. A hybrid approach to hedge scope detection based on rules and statistics, which also combines phrase structures and dependency structures. First, phrase structures and dependency structures are each used for hedge scope detection separately. Phrase structures serve as important features in a Support Vector Machine (SVM)-based model, and dependency structures are used by a rule-based method. Then the phrase-based system and the dependency-based system are combined by a Conditional Random Field (CRF)-based model, which simply extends the feature vectors with the scope tags generated by the two individual systems. The combination of rule-based and statistics-based approaches, the combination of phrase structures and dependency structures, and the combination of SVM and CRF all contribute to effective scope detection in the fusion system. Experimental results show that phrase structures and dependency structures are both effective for hedge scope detection and that combining them improves scope detection performance further.

3. A voting technique for detecting hedge scope. First, eight classifiers are constructed from CRF, SVM, Max-Margin Markov Networks (M3N), and the combined rule-based and statistics-based approach, each applied in two directions (forward and backward). Then three voting schemes are applied to voting-based hedge scope detection: (1) majority voting; (2) voting weighted by the accuracy of each component classifier; and (3) POS-weighted voting, weighted by the accuracy of each component classifier on all tokens that share the same POS. The experimental results show that voting can improve on the component classifiers by combining their individual advantages.

This thesis explores fusion methods for hedge detection, including the combination of structured and flat features, the hybrid of rule-based and statistics-based approaches, and the fusion of multiple classifiers. Its major contributions are a phrase-based approach to hedge scope detection using a composite kernel that combines structured and flat features; a hybrid approach based on rules and statistics that also combines phrase structures and dependency structures; and a voting scheme that combines many individual classifiers to exploit the particular advantages of each. This work improves hedge detection performance significantly and offers a reference for future research on fusion methods in natural language processing.
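
The three voting schemes named in the abstract can be sketched as one weighted-vote routine with three different weight functions. This is a minimal illustration under stated assumptions: the abstract does not describe the tag set, the tie-breaking rule, or how accuracies are estimated, so the `IN`/`OUT` tags, the alphabetical tie-break, the fallback to overall accuracy for an unseen POS, and all names and numbers below are hypothetical. The example uses three classifiers instead of the eight described.

```python
from collections import defaultdict

def combine(predictions, weight_of):
    """Fuse per-token tags from several classifiers by weighted voting.

    predictions: one tag sequence per classifier, all of the same length.
    weight_of(k, i): weight of classifier k's vote on token i.
    Ties go to the alphabetically first tag (an arbitrary choice for this sketch).
    """
    fused = []
    for i in range(len(predictions[0])):
        scores = defaultdict(float)
        for k, tags in enumerate(predictions):
            scores[tags[i]] += weight_of(k, i)
        fused.append(max(sorted(scores), key=scores.get))
    return fused

def majority_vote(predictions):
    """Scheme 1: every classifier's vote counts equally."""
    return combine(predictions, lambda k, i: 1.0)

def accuracy_weighted_vote(predictions, accuracy):
    """Scheme 2: each vote is weighted by that classifier's overall accuracy."""
    return combine(predictions, lambda k, i: accuracy[k])

def pos_weighted_vote(predictions, pos_tags, pos_accuracy, accuracy):
    """Scheme 3: each vote is weighted by the classifier's accuracy on tokens
    with the same POS, falling back to overall accuracy for an unseen POS."""
    return combine(
        predictions,
        lambda k, i: pos_accuracy[k].get(pos_tags[i], accuracy[k]),
    )

# Hypothetical outputs of three classifiers on a four-token sentence.
predictions = [
    ["OUT", "IN", "IN", "OUT"],
    ["OUT", "IN", "OUT", "OUT"],
    ["IN", "IN", "IN", "OUT"],
]
pos_tags = ["PRP", "MD", "VB", "."]
accuracy = [0.5, 0.9, 0.3]                           # invented overall accuracies
pos_accuracy = [{"VB": 0.9}, {"VB": 0.2}, {"VB": 0.5}]  # invented per-POS accuracies

print(majority_vote(predictions))
print(accuracy_weighted_vote(predictions, accuracy))
print(pos_weighted_vote(predictions, pos_tags, pos_accuracy, accuracy))
```

In practice the accuracies would be estimated on held-out data, and the fused token tags may still need a post-processing step to form a well-formed scope; neither step is shown here.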
