基于林业科学数据的语义检索研究

Study on Semantic Retrieval Based on Forestry Scientific Data

【Author】 Zhang Naijing (张乃静)

【Advisor】 Ju Hongbo (鞠洪波)

【Author Information】 Chinese Academy of Forestry, Forest Management, 2013, Ph.D.

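【Abstract (translated from Chinese)】 With advances in technology and changes in thinking, the Web has become one of the main sources from which people obtain information. The amount of information it carries is growing explosively, and while this brings people a wealth of information, it also makes it harder to retrieve exactly what is needed. Giving the Web semantic information and using it as a knowledge-based platform for resource sharing, so that people can obtain information more quickly and conveniently, is an inevitable trend in the Web's development.

The Scientific Data Sharing Project is an important part of building the national science and technology innovation system and a major component of China's basic platform for science and technology development. Forestry scientific data sharing is one part of that project. Over more than ten years of construction and operation, its portal, the Scientific Data Center of Forestry, has steadily deepened and broadened its services, extended its reach, and accumulated a growing volume of data. Given such a large volume of forestry scientific data, helping users find what they need more quickly and conveniently is a goal the platform continues to pursue. To address the problems of traditional information retrieval, this dissertation attempts to mine, from a semantic perspective, the information and regularities hidden behind the data, with the aim of providing users with higher-quality data services.

Semantic information retrieval is a new technology that builds on traditional information retrieval methods and combines them with domain ontology knowledge management, data mining, and natural language processing. This dissertation studies ontology-based semantic information retrieval in depth. Taking the forestry scientific data ontology as its foundation, it proposes a semantic information retrieval model for forestry scientific data and, from a system perspective, analyzes the main techniques involved: the ontology knowledge model, semantic preprocessing of documents, semantic query expansion, and semantic retrieval. The main contents and conclusions are as follows:

(1) Guided by the theory and techniques of ontology construction, an ontology model of forestry scientific data was built. The selection of the concept set, the main relations among the core concepts, and the properties and the relations among properties are described in detail. This provides an important foundation for semantic information retrieval based on the forestry scientific data ontology.

(2) The Semantic Web framework was studied, and methods for maintaining, storing, reasoning over, and querying the forestry scientific data ontology knowledge model were described and analyzed. A comparative study found that TDB persistent storage of the ontology is more efficient than relational-database storage; in the experiments, the former stored the ontology up to 60 times more efficiently than the latter. Likewise, combining Jena and Pellet reasoning for statement (triple) inference over the forestry scientific data ontology was more than 10% more efficient than using either reasoner alone.

(3) Semantic preprocessing of documents was studied. Based on an analysis of existing forestry scientific data, a domain dictionary of more than 70,000 specialized terms was built, which improved the precision of word segmentation. The feature weights of terms in documents were represented in a vector space. A set of feature concepts was extracted from the forestry scientific data ontology and used as cluster centers; with cosine similarity as the distance function, an improved k-means model was used to cluster the documents, and inverted-index methods for the clustered documents were analyzed. Experiments showed a clustering accuracy of 81.4%.

(4) A semantic query expansion method was proposed. User queries are divided into three cases: single keywords, multiple keywords, and question sentences. Single-keyword queries are expanded using an improved semantic similarity measure; multi-keyword queries are expanded by combining semantic reasoning with semantic similarity; for question sentences, an exploratory expansion method combining syntactic analysis with semantic reasoning is proposed. These query expansion methods are the core of semantic information retrieval.

(5) Building on the theory and research above, a semantic information retrieval system for forestry scientific data was designed and developed with the Semantic Web framework, implementing semantic querying of information. The system was compared experimentally with a traditional retrieval model based on keyword matching. The results show that the semantic retrieval method built in this dissertation outperforms the traditional method in both recall and precision.

Research on semantic information retrieval has both theoretical and practical value. Centering on the eight major categories of data currently held by the Scientific Data Center of Forestry, this dissertation studies and explores semantic retrieval of forestry scientific data in depth. Through research on ontology theory, a forestry scientific data ontology was built, which creates the conditions for sharing and reusing knowledge models in the forestry domain. A general method for using an ontology to achieve semantic retrieval of forestry scientific data was also explored. On this basis, and in combination with network computing technology, a semantic retrieval system for forestry scientific data was designed, developed, and evaluated, providing a theoretical basis and technical support for sharing massive forestry scientific data at the semantic level. The system also offers a new approach to forestry scientific data sharing and can serve as a reference for related research on other data sharing platforms.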
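The clustering step in item (3) can be illustrated with a minimal sketch. It assumes plain term-frequency vectors and an ordinary k-means loop seeded with concept vectors. The abstract does not say how the "improved" k-means differs from the standard algorithm, so this is not the dissertation's method. The tokens, seed concepts, and function names are all hypothetical.

```python
import math
from collections import Counter

def tf_vector(tokens):
    """Sparse term-frequency vector (term -> count) for one document."""
    return dict(Counter(tokens))

def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}

def seeded_kmeans(docs, seeds, max_iter=20):
    """k-means with cosine similarity, starting from the given seed centers."""
    centers = [dict(seed) for seed in seeds]
    labels = None
    for _ in range(max_iter):
        # Assign each document to the most similar center.
        new_labels = [
            max(range(len(centers)), key=lambda k: cosine(doc, centers[k]))
            for doc in docs
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Recompute each center as the mean of its members.
        for k in range(len(centers)):
            members = [d for d, lab in zip(docs, labels) if lab == k]
            if members:
                centers[k] = mean_vector(members)
    return labels

# Hypothetical, already-segmented documents.
docs = [
    tf_vector(["pine", "plantation", "growth", "pine"]),
    tf_vector(["soil", "nitrogen", "soil", "moisture"]),
    tf_vector(["pine", "growth", "stand"]),
    tf_vector(["soil", "moisture", "erosion"]),
]
# Hypothetical concept vectors standing in for ontology feature concepts.
seeds = [
    {"pine": 1.0, "plantation": 1.0, "growth": 1.0},
    {"soil": 1.0, "moisture": 1.0},
]
labels = seeded_kmeans(docs, seeds)
print(labels)  # [0, 1, 0, 1]
```

Seeding the centers from concept vectors fixes the number of clusters and removes the random initialization of standard k-means, which is one plausible reason to take cluster centers from an ontology.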

【Abstract】 As technology advances and ideas change, the Web has become one of the main sources of information. The amount of information on the Web is increasing dramatically; this brings people a great deal of information but also makes it difficult to find exactly the information needed. Endowing the Web with semantic information and using it as a knowledge-based resource-sharing platform, so that people can obtain relevant knowledge and information more easily, is a clear trend in the Web's development.

The scientific data sharing project is an important part of the construction of the national scientific and technological innovation system, as well as of the basic platform for national scientific and technological development. Sharing of forestry scientific data is one part of this project. Its portal site, the Scientific Data Center of Forestry, has been developed and expanded considerably over more than 10 years of construction and service. More and more fields have been added and the amount of data has increased greatly. Faced with such a large amount of forestry scientific data, making the search for information faster and easier has become the goal of developing such a system. Focusing on the existing problems in traditional information retrieval, we tried to uncover the information and rules behind the data in terms of semantics, in order to provide users with a higher-quality data service.

Semantic information retrieval is a new technology that combines traditional information retrieval with ontology knowledge management, data mining, and natural language processing. In this dissertation, ontology-based semantic information retrieval was studied and a semantic information retrieval model for forestry scientific data was proposed. The critical technologies, namely the ontology knowledge model, document semantic pre-processing, semantic query expansion, and semantic retrieval, were analyzed systematically. The significant findings included:

(1) An ontology model of forestry scientific data was developed based on the theory and technology of ontology construction. The selection of the concept set and the relationships among the core concepts were described in detail, providing an important foundation for the semantic information retrieval of forestry data.

(2) The Semantic Web framework was studied, and methods of maintaining, storing, reasoning over, and querying the ontology knowledge model of forestry scientific data were analyzed. Results showed that, in comparison with a relational database, TDB persistent storage of the ontology was more efficient, by up to 60 times. Likewise, combining Jena and Pellet reasoning for statement (triple) reasoning over the forestry scientific data ontology was more than 10% more efficient than using either reasoner alone.

(3) Document semantic pre-processing was studied. To improve the accuracy of word segmentation, a domain dictionary of over 70,000 specialized terms was built through analysis of current forestry scientific data. The feature weights of terms in the documents were expressed in a vector space. Feature concept sets were extracted from the forestry scientific data ontology and used as cluster centers. Document clustering was carried out with an improved k-means model, using cosine similarity as the distance measure. Finally, inverted-index methods for the clustered documents were analyzed. Results showed that the clustering accuracy was 81.4%.

(4) A semantic query expansion method was put forward. In this method, queries are first classified into three categories: single keywords, multiple keywords, and question sentences. For single keywords, a modified semantic similarity measure was used for query expansion. For multiple keywords, semantic reasoning and semantic similarity were combined. For question sentences, syntactic analysis and semantic reasoning were combined. These semantic query expansion methods are critical for semantic information retrieval.

(5) Based on the above research, a semantic information retrieval system for forestry scientific data was developed using the Semantic Web framework, and semantic querying of information was implemented. The system was compared with a traditional retrieval model based on keyword matching. Results showed that the semantic retrieval method developed in this study performed better than the traditional retrieval method in terms of both recall and precision.

Research on semantic information retrieval is theoretically and practically important. This dissertation focused on the semantic retrieval of forestry scientific data, using the eight categories of forestry data that currently exist in the Forestry Science Data Center. Based on ontology theory, the forestry scientific data ontology was built, which makes it possible to share and reuse knowledge models in the forestry domain. In addition, a general method for semantic retrieval based on the forestry scientific data ontology was explored. In combination with network computing technology, a semantic retrieval system for forestry scientific data was designed, developed, and evaluated. This provides a theoretical basis and technical support for sharing massive forestry scientific data at the semantic level. The semantic retrieval system offers a new approach to forestry scientific data sharing and can also serve as a reference for other data sharing platforms.
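The comparison in item (5) depends on two standard measures. Recall is the share of relevant documents that are retrieved, and precision is the share of retrieved documents that are relevant. The sketch below shows how query expansion can raise recall on a toy collection. The documents, the related-term table, and the relevance judgments are all hypothetical and are not data or results from the dissertation. The expansion is a simple lookup, not the similarity-based or reasoning-based methods the abstract describes.

```python
# Hypothetical related-term table standing in for ontology knowledge.
RELATED = {"conifer": {"pine", "spruce"}}

# Hypothetical document collection: document id -> set of index terms.
DOCS = {
    "d1": {"conifer", "inventory"},
    "d2": {"pine", "plantation"},
    "d3": {"spruce", "growth"},
    "d4": {"soil", "moisture"},
}

# Hypothetical relevance judgments for the query "conifer".
RELEVANT = {"d1", "d2", "d3"}

def expand(query_terms):
    """Add the related terms of each query term to the query."""
    expanded = set(query_terms)
    for term in query_terms:
        expanded |= RELATED.get(term, set())
    return expanded

def retrieve(terms):
    """Return ids of documents sharing at least one term with the query."""
    return {doc_id for doc_id, words in DOCS.items() if words & terms}

def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

query = {"conifer"}
keyword_result = precision_recall(retrieve(query), RELEVANT)
semantic_result = precision_recall(retrieve(expand(query)), RELEVANT)
print("keyword :", keyword_result)
print("expanded:", semantic_result)
```

In this toy case the plain keyword query finds only the document that literally contains "conifer", while the expanded query also finds the documents indexed under "pine" and "spruce". Expansion can also lower precision if the added terms are too loose, which is why both measures are reported together.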
