
基于农业本体问句分析的问答系统研究与架构设计

The Research of Question Analysis Based on Ontology and Architecture Design for Question Answering System in Agriculture

【Author】 Hu Depeng (胡德鹏)

【Advisor】 Wang Wensheng (王文生)

【Author Information】 Chinese Academy of Agricultural Sciences, Information Technology and Digital Agriculture, 2013, PhD

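【Summary】 Over the past two decades, the rapid development and spread of computer and network technology in agriculture has drawn growing public attention to agricultural applications of information technology, and agricultural information now reaches an ever wider range of users. Agricultural informatization faces new challenges. In particular, how to meet the needs of agricultural users at different levels, and how to deliver agricultural techniques to those users quickly and accurately through information technology, have become urgent problems in building agricultural information services. A question answering system is an integrated information system that draws on artificial intelligence, information retrieval, natural language processing, and information extraction. It offers a simple input interface, analyzes and processes questions that users pose in natural language, and returns a concise answer, which suits the needs of agricultural users. Applying question answering to agricultural information, through retrieval, extraction, and mining of domain information, can address the breadth of knowledge and the structural complexity of agricultural technology and can improve the precision of information access. Following the components of a question answering system, this thesis studies several key problems:

1. The thesis first introduces and analyzes the theoretical foundations and state of development of natural language processing, information retrieval, information extraction, and ontology. Drawing on prior research on question answering, it gives the logical composition of a question answering system and then analyzes the research emphases and difficulties component by component. In light of the current state of agriculture in China, it analyzes the problems facing agricultural information technology and argues that applying question answering to agriculture is feasible.

2. The thesis discusses the construction of an agricultural ontology. First, it studies the basic concepts of ontology and the norms and process of ontology construction. Second, it focuses on methods for extracting concepts and relations during construction. To address the sparsity of ontology relations that arises when an agricultural thesaurus is converted into an agricultural ontology, it proposes a supervised ontology relation extraction method based on mutual information.

3. The thesis studies problems in question analysis. First, it introduces the notion of domain feature words and uses them to describe the relations in the ontology. Second, it proposes an algorithm based on hidden Markov models for recognizing and extracting domain feature words, which makes it possible to analyze the semantic information implied in a question and the domain feature words it contains. Third, it studies question classification, gives an ontology-based method for computing concept similarity, and proposes a question classification method based on the similarity between the feature words of a question and the feature words of each question category.

4. The thesis studies ontology-based information retrieval, focusing on how to construct a document retrieval model based on the agricultural ontology. It gives a method for computing the relevance between a question and a document and proposes a document retrieval model built on the domain ontology.

5. Answer extraction is an important component of a question answering system. The thesis proposes an answer extraction method based on LDA, which consists of the following steps. First, inference by Gibbs sampling is used to compute the model parameters indirectly and obtain the probability distributions of words, which yields the LDA topic model. Second, Clarity is used to measure the similarity between blocks, and segment boundaries are identified at local minima, so that documents are divided into paragraphs. Third, the topic words of each segment are extracted according to the Shannon information of the words and expanded through background word clustering and topic word association to form a topic-word string for each paragraph. Fourth, the similarity between the question and each paragraph topic-word string is computed, and the paragraph with the highest similarity is taken as the answer.

6. The thesis studies the architecture of a question answering system for agriculture and proposes an architecture design method based on cloud computing. The storage layer uses the open-source distributed file system HDFS and the non-relational database HBase. The thesis introduces and analyzes the principles of HDFS and HBase, describes how they are applied within the agricultural question answering system, and, combining the algorithms above, proposes a logical architecture for a question answering system for agriculture.

7. The thesis designs experimental methods for the question answering system and selects evaluation criteria. The main experiments cover domain feature word recognition and question classification in question analysis, ontology-based information retrieval, and the accuracy of answer extraction for the agricultural domain. A data model is designed for each experiment, and the results are analyzed to demonstrate the performance of the proposed methods.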
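The relation-extraction step named above relies on mutual information, but the record does not give the formula or the supervision scheme. As a minimal sketch only, the following Python code scores candidate concept pairs by pointwise mutual information (PMI) over sentence-level co-occurrence. The input format, the function name `pmi_scores`, and the use of PMI itself are assumptions for illustration, not the thesis's actual method.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(sentences):
    """Score co-occurring concept pairs by pointwise mutual information.

    `sentences` is a list of sets of concept terms (a hypothetical input
    format). Returns a dict mapping (term_a, term_b), sorted alphabetically,
    to log2(P(a, b) / (P(a) * P(b))), with probabilities estimated as the
    fraction of sentences containing the term or the pair.
    """
    n = len(sentences)
    term_counts = Counter()
    pair_counts = Counter()
    for terms in sentences:
        term_counts.update(terms)
        # Sorting makes (a, b) and (b, a) count as the same pair.
        pair_counts.update(combinations(sorted(terms), 2))

    scores = {}
    for (a, b), c_ab in pair_counts.items():
        p_ab = c_ab / n
        p_a = term_counts[a] / n
        p_b = term_counts[b] / n
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores
```

Pairs with a high score co-occur more often than chance would predict and could serve as candidate relations to be confirmed by a supervised classifier or by a domain expert. Only pairs that co-occur at least once receive a score.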

【Abstract】 Over the last two decades, telecommunication networks have spread into the countryside in China, and some farmers now use personal computers to access the Internet. How to accommodate the particular interests of users of agricultural information, and how to disseminate agricultural technology information accurately, have become critical challenges for information technology in agriculture.

A Question Answering (QA) system is a hierarchical, comprehensive system that draws on Artificial Intelligence (AI), Information Retrieval (IR), Information Extraction (IE), and Natural Language Processing (NLP). Applying QA to meet the needs of agricultural users by retrieving, extracting, and mining information from the Internet is a feasible solution. This thesis focuses on the key problems of QA. The main work is as follows:

1. The thesis first introduces the basic concepts of NLP, IR, IE, and ontology and outlines their development. Then, building on existing research on QA systems, it analyzes the logical structure of free-text-based QA, focusing on the research methods and the basic framework of QA. The development of agricultural information technology in China is briefly introduced, including the application of QA systems in agriculture.

2. The thesis proposes a novel semi-supervised method for learning domain ontology relations. The key problem is how to enrich the relations between concepts. On the basis of text analysis, it proposes a method for extracting ontology relations using mutual information.

3. Semantic analysis of a question is the key to capturing the user's requirement. To describe the relationships between concepts, the thesis proposes concept-features for representing domain-specific concepts. It proposes a novel algorithm based on hidden Markov models for extracting concept-feature words and analyzes the keys to learning the model structure and estimating the parameters. During processing, the algorithm makes full use of format information such as list separators and special labels to segment the text, and extracts the information of specific fields with the hidden Markov model.

4. IR is one of the main parts of QA. The research here focuses on the information retrieval model. An ontology-based information retrieval model is introduced, which is based on computing the equivalence classes of individuals in the ontology. The ontology is generated using a basic description logic, which offers a suitable tradeoff between the expressivity of knowledge and the complexity of reasoning.

5. Answer extraction is the key problem of QA. The thesis proposes an answer extraction algorithm based on Latent Dirichlet Allocation (LDA). The main steps are as follows. First, the topic-word and document-topic distributions are inferred by Gibbs sampling, from which the LDA model of the text is built. Second, texts are segmented on the basis of LDA models of the corpora and texts: Clarity is taken as the metric for the similarity of blocks, and segmentation points are identified at local minima. Third, the topic words of each segment are extracted according to Shannon information. Words that do not appear explicitly in the analyzed text can be included to express the topics, with the help of background word clustering and topic word association, in an attempt to uncover the meaning behind the words. Last, the similarity between the question and each paragraph is calculated, and the paragraph with the highest similarity is taken as the answer.

6. The architecture of the QA system, which is built on Hadoop and HBase, is described in detail. The thesis introduces the principles and application of the open-source Hadoop distributed file system (HDFS) and the non-relational database HBase, and proposes a method for developing a QA system on top of them. The function of each part of the QA system is presented, and an integrated performance analysis of the system is given.

7. The experimental methods and data models for the QA system are designed, including an analysis of the evaluation criteria. First, the results of the experiments on concept-feature word extraction and question classification are analyzed. Then the recall of the ontology-based information retrieval experiments is described and compared with a keyword-based method. Last, the accuracy of LDA-based answer extraction is analyzed, mainly with respect to agriculture-oriented question classification. The experimental results demonstrate that the methods proposed in this thesis can enhance the performance of question answering systems in agriculture.
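The last step of the answer extraction procedure in item 5 selects the paragraph most similar to the question, but the record does not say which similarity measure is used. The sketch below assumes cosine similarity over bag-of-words counts; the function names `cosine` and `best_paragraph` and the token-list input format are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def cosine(words_a, words_b):
    """Cosine similarity between two token lists, using raw term counts."""
    va, vb = Counter(words_a), Counter(words_b)
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty token list is similar to nothing
    return dot / (norm_a * norm_b)


def best_paragraph(question_words, paragraph_topic_words):
    """Return (index, scores) for the paragraph closest to the question.

    `paragraph_topic_words` is a non-empty list of token lists, one
    topic-word string per paragraph (assumed to be produced by the
    earlier segmentation and topic-word expansion steps).
    """
    scores = [cosine(question_words, words) for words in paragraph_topic_words]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

A caller would tokenize the question, pass it with the per-paragraph topic-word strings, and return the text of the paragraph at the selected index. When several paragraphs tie, the first one wins.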
