Research on Definition Extraction in Aviation Domain and its Application (航空领域术语定义抽取关键技术及其应用研究)

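【Author】 Pan Xu (潘湑)

【Advisor】 Gu Hongbin (顾宏斌)

【Author information】 Nanjing University of Aeronautics and Astronautics (南京航空航天大学), Vehicle Operation Engineering, 2011, PhD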

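【Abstract, translated from the Chinese】 Computer Based Training (CBT) systems are an important part of advanced training technology and play an important role in pilot training and maintenance training in civil aviation. Flight CBT is widely used by airlines in China and abroad, and deploying a maintenance CBT system is a prerequisite for second-level (intermediate) maintenance organizations in China. This dissertation addresses the key technologies needed, during CBT system development, to acquire domain knowledge from professional literature by means of term definition extraction, and explores how the extracted definition knowledge can be applied in an intelligent CBT system. The main contributions are as follows:

(1) Construction of an experimental corpus for term definition extraction. A corpus is a prerequisite that every natural language processing study must settle, but no ready-made corpus dedicated to Chinese term definition extraction in the aviation domain exists in China or abroad, so the first task of this work is to build one. Based on the experimental requirements, the scale of the first-stage corpus was fixed, a development specification was drawn up, and supporting software was developed. Detailed statistics on the corpus were also compiled as the basis for the subsequent research.

(2) Choice of the basic method for term definition extraction. Because the research goals differ, methods previously used for question answering and search-engine ranking are not applicable here. Since term definitions are distributed extremely unevenly in the corpus, the balanced random forest method is proposed for definition extraction. When a balanced training set is built, randomly generated synthetic samples cannot effectively consolidate the boundary of regions where minority-class instances are densely distributed; to address this, a resampling strategy defined on the distance distribution information of instances is proposed. Compared with random resampling, it improves the F1-measure and F2-measure of definition extraction.

(3) Improved feature selection for term definition extraction. To address the two problems of imbalanced data distribution and small disjuncts within definition sentences, a feature selection method based on between-class and within-class distribution differences is proposed. It overcomes the weakness of traditional feature selection functions, which rely on word frequency statistics and mainly measure the between-class distribution differences of features. Experiments show that, when applied with the balanced random forest method, it reaches the same F1-measure and F2-measure as traditional filter methods with fewer features.

(4) Definition extraction using multi-level linguistic features. The use of multi-level linguistic features in different subtasks of information extraction is surveyed. In definition extraction, the lack of a quantitative method has prevented linguistic features from being fully exploited. To address this, a combination entropy of feature combinations across levels, built on information entropy, is proposed to measure the effect of such combinations on definition extraction, and it is combined with the feature selection framework above to filter multi-level features. The method offers a computable way to study the role of linguistic features at different levels in definition extraction and to use those features. Experiments confirm its correctness and effectiveness.

(5) Design and implementation of an intelligent CBT assessment system. Existing AIG (Automatic Item Generation) technology is poorly suited to generating items for specialized domains, and its distractors are only weakly confusing. Building on several knowledge representations obtained by processing definition knowledge, an automatic item generation and evaluation system is designed that generates item stems from a sentence-template library and a knowledge-point library and generates distractors from a domain ontology. The approach effectively meets the CBT system's need for automatic assessment and evaluation of professional knowledge, and can substantially reduce the workload of developing item banks and assembling test papers.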
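For reference, the F1-measure and F2-measure cited in the abstract are presumably the standard F-beta score at beta = 1 and beta = 2. The abstract does not spell out the formula, so the usual definition in terms of precision P and recall R is assumed here:

```latex
F_\beta = \frac{(1+\beta^2)\,P\,R}{\beta^2 P + R}, \qquad
F_1 = \frac{2PR}{P+R}, \qquad
F_2 = \frac{5PR}{4P+R}
```

With beta = 2, recall counts more heavily than precision, which is one plausible reason to report it alongside F1 when definition sentences are rare in the corpus.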

【Abstract】 Computer Based Training (CBT) systems, as a part of advanced training technology, play an important role in pilot training and maintenance training in civil aviation. CBT products have been widely used by airlines at home and abroad, and deploying a maintenance CBT system is a prerequisite for intermediate maintenance units. The work in this dissertation centers on the key technologies for obtaining professional knowledge from professional literature using term definition extraction techniques. It also explores how knowledge extracted from professional literature can be applied in the development of intelligent CBT systems. The main contributions of this dissertation are summarized as follows:

Firstly, a corpus is the basic resource of all natural language processing research, but no ready-made corpus is available for the study of term definition extraction at home or abroad. The primary task of this work is therefore to construct a corpus for experiments. According to the experimental requirements, the construction scale and standard of the first-stage corpus are established, and corresponding software is developed. Detailed statistics on the corpus are also compiled as the basis for further study.

Secondly, definition extraction is treated as an imbalanced data classification problem. Because the research purpose differs, solutions for obtaining definitions for question answering or for ranking in search engines do not apply here. In view of the imbalanced distribution of term definitions in the corpus, a method based on balanced random forests is proposed to extract definitions from the corpus. A novel over-sampling strategy based on the distance distribution information of instances is proposed to solve the problem that randomly generated synthetic instances cannot effectively consolidate the regional border of minority class instances when a balanced training set is built. Experiments show that it improves the F1-measure and F2-measure in extracting definitions.

Thirdly, the feature selection method in definition extraction is improved using distance distribution information of instances. In order to address the imbalanced distribution of data and small disjuncts in definition sentences, a new feature selection method is defined based on the between-class and within-class distribution differences of features. The new method overcomes the shortcoming of traditional methods, whose evaluation relies on word frequency statistics. Experiments show that the BRF (balanced random forest) classifier using the new method achieves the same results with fewer features in extracting definitions.

Fourthly, definitions are extracted using multi-level linguistic features. The use of multi-level linguistic features in different sub-topics of information extraction is summarized first. For lack of a quantitative method, multi-level linguistic features cannot be fully used in extracting definitions. A method based on the entropy of feature combinations is proposed to calculate the impact of different combinations in extracting definitions. The method provides a computable way to evaluate linguistic features of different levels in extracting definitions. Experiments show the correctness and validity of this method.

Finally, an intelligent assessment system for CBT is designed and implemented. Existing AIG (Automatic Item Generation) technology is not conducive to generating questions for professional fields, and its distractors are not very confusing. A novel AIG system is designed to solve this problem. The system generates items using a variety of knowledge and sentence templates, and generates distractors using a domain ontology. These resources are obtained from extracted definitions. The new design effectively meets the demand of CBT systems for automatic assessment and evaluation of professional knowledge, and eases the workload of developing item banks and examination papers.
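The balanced random forest idea mentioned in the abstract can be illustrated with a minimal sketch of its sampling step. This follows the commonly described scheme (each tree is trained on a bootstrap of the minority class plus an equal-sized draw from the majority class), not the dissertation's own code or its distance-based resampling strategy. The function name `balanced_bootstrap`, the label strings, and the 5-to-95 toy class ratio are all hypothetical.

```python
import random
from collections import Counter


def balanced_bootstrap(labels, minority_label, rng):
    """Return training indices for one tree: equal draws from each class."""
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    if not minority or not majority:
        raise ValueError("both classes must be present")
    n = len(minority)
    # Bootstrap the minority class, then draw the same number of
    # majority-class instances; both draws are with replacement.
    sample = rng.choices(minority, k=n) + rng.choices(majority, k=n)
    rng.shuffle(sample)
    return sample


# Hypothetical toy corpus: 5 definition sentences among 100 sentences.
labels = ["definition"] * 5 + ["other"] * 95
rng = random.Random(0)

# One balanced sample per tree; a real forest would fit a tree on each.
samples = [balanced_bootstrap(labels, "definition", rng) for _ in range(10)]
counts = [Counter(labels[i] for i in s) for s in samples]
```

Every per-tree sample is balanced regardless of how skewed the corpus is, so no single tree is dominated by non-definition sentences, while the ensemble as a whole still sees many different majority-class instances.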
