
基于本体的Deep Web信息集成关键技术研究

Research on Key Technologies of Ontology-Based Deep Web Information Integration

【Author】 Fang Wei (方巍)

【Advisor】 Cui Zhiming (崔志明)

【Author Information】 Soochow University (苏州大学), Computer Application Technology, 2009, Ph.D.

【摘要】 随着万维网(WWW)的飞速发展,Web尤其是Deep Web蕴含了各种各样的海量高价值信息,并且仍在以惊人的速度增长。Deep Web上的信息具有异构性、自治性和动态性等特点,这些特点决定了传统结构化信息集成方法已不能满足人们的需求。为了方便用户快捷准确的使用Deep Web中高价值信息,基于本体的Deep Web信息集成研究已成为一个非常迫切的问题,具有重要理论意义和广阔应用前景。在对Deep Web信息集成的研究现状和发展趋势进行了深入的分析后。在课题组前期工作的基础上,提出了一种基于本体的Deep Web信息集成方案。该方案包括面向Deep Web不确定知识表示的动态模糊描述逻辑方法、基于最大熵和本体的数据源发现技术、基于质量估计模型的数据源选择方法、以及基于多数据源同步标注的信息抽取和Deep Web语义集成中模糊性本体映射方法等内容。本文的主要研究工作和取得的创新成果包括:(1)一个完整、准确的本体是基于本体的Deep Web信息集成的必要前提。本文根据Deep Web特征半自动构建了Deep Web领域本体,并针对Deep Web本体学习和本体映射过程中存在不确定性知识表示问题,提出了一种面向Deep Web不确定知识表示的动态模糊描述逻辑方法(DFDLs),该方法弥补了传统描述逻辑方法对不确定性知识表示的不足。(2)针对Deep Web数据源的动态性和稀疏分布的特征,提出了一种基于最大熵分类器和领域本体的Deep Web数据源发现方法,该方法首先通过最大熵分类器进行Deep Web查询接口自动判定,然后利用基于本体的Deep Web聚焦爬虫发现Deep Web数据源,该方法使得聚焦爬虫聚焦访问那些可能链接到Deep Web入口页面的链接,从而避免访问下载不必要的页面。(3)通过服务质量可以评价Deep Web数据源的优劣,本文提出了一个基于领域本体的Deep Web数据源质量估计模型,并将其应用于Deep Web数据源选择过程中。采用此模型能够选取最符合用户需求的数据源,达到查询代价更少,效率更高的要求。(4)针对信息抽取过程中存在接口模式和结果模式缺失的问题,提出了一种多数据源间的同步标注方法。从一组Deep Web接口模式和结果模式中高效地学习领域本体知识,通过对本体的实例查询可实现多数据源间的同步标注。并成功应用此方法于Deep Web复杂结果页面抽取过程中。(5)针对基于本体的Deep Web信息集成过程中存在的不确定性模式匹配问题,将模式匹配问题转化为本体映射问题,提出了一个模糊性本体映射框架。在此框架中,运用了多个本体映射策略,从不同方面多个角度对本体特征进行描述,尽可能的发掘可能存在的映射关系,从模糊性角度表述映射过程。该方法为基于本体的Deep Web信息集成提供了一种有效和通用的自动映射策略。(6)Deep Web语义集成原型系统设计,本文根据所研究的关键技术和实际应用需求,设计并实现了一个Deep Web语义集成原型系统,该原型系统具有数据源发现、数据源选择、信息抽取和语义集成等功能。实际应用表明,该系统具有一定实用价值。本项研究工作受到国家自然科学基金项目“面向Deep Web的不完备知识处理的逻辑模型研究”(编号:60673092)、江苏省高技术研究计划项目“面向Deep Web的搜索和挖掘关键技术研究”(编号:BG2005019)、江苏省高校研究生科研创新计划项目“基于本体的Deep Web数据源发现与选择技术研究”(编号:CX08B-099Z)以及2008年苏州大学优秀博士论文选题项目资助(苏大研字[2008]22号)的资助。

【Abstract】 With the rapid development of the World Wide Web (WWW), the Web, and the Deep Web in particular, holds a vast amount of high-value information that continues to grow at an astonishing rate. Information in the Deep Web is heterogeneous, autonomous, and dynamic, and these characteristics mean that traditional structured information integration methods can no longer meet users' needs. To let users access high-value Deep Web information quickly and accurately, ontology-based Deep Web information integration has become a pressing research problem with both theoretical significance and broad application prospects. After an in-depth analysis of the current state and development trends of Deep Web information integration, and building on the earlier work of our research group, this dissertation proposes an ontology-based Deep Web information integration scheme. The scheme comprises a dynamic fuzzy description logic method for representing uncertain Deep Web knowledge, a data source discovery technique based on maximum entropy and ontology, a data source selection method based on a quality estimation model, information extraction based on synchronous annotation across multiple data sources, and a fuzzy ontology mapping method for Deep Web semantic integration. The main research work and contributions of this dissertation are as follows.

(1) A complete and accurate ontology is a prerequisite for ontology-based Deep Web information integration. This dissertation semi-automatically constructs a Deep Web domain ontology according to the characteristics of the Deep Web. To address the representation of uncertain knowledge in Deep Web ontology learning and ontology mapping, it proposes a dynamic fuzzy description logic method (DFDLs) for Deep Web uncertain knowledge representation, which compensates for the inability of traditional description logics to represent uncertain knowledge.

(2) In view of the dynamic nature and sparse distribution of Deep Web data sources, this dissertation proposes a Deep Web data source discovery method based on a maximum entropy classifier and a domain ontology. The method first uses the maximum entropy classifier to identify Deep Web query interfaces automatically, and then uses an ontology-based focused crawler to discover Deep Web data sources. The crawler concentrates on links that are likely to lead to Deep Web entry pages and thereby avoids downloading unnecessary pages.

(3) The quality of a Deep Web data source can be evaluated through its quality of service. This dissertation proposes a Deep Web data source quality estimation model based on the domain ontology and applies it to data source selection. With this model, the data sources that best match the user's requirements can be selected, which lowers query cost and improves efficiency.

(4) To address missing interface schemas and result schemas during information extraction, this dissertation proposes a synchronous annotation method across multiple data sources. Domain ontology knowledge is learned efficiently from a set of Deep Web interface schemas and result schemas, and synchronous annotation across the data sources is achieved through instance queries over the ontology. The method has been applied successfully to the extraction of complex Deep Web result pages.

(5) To address uncertain schema matching in ontology-based Deep Web information integration, this dissertation converts the schema matching problem into an ontology mapping problem and proposes a fuzzy ontology mapping framework. The framework applies several ontology mapping strategies that describe ontology features from different aspects, so as to uncover as many potential mappings as possible, and it expresses the mapping process in terms of fuzziness. The method provides an effective and general automatic mapping strategy for ontology-based Deep Web information integration.

(6) Based on the key technologies studied and on practical application requirements, this dissertation designs and implements a prototype system for Deep Web semantic integration. The system provides data source discovery, data source selection, information extraction, and semantic integration. Practical use indicates that the system has some practical value.

This work was supported by the National Natural Science Foundation of China under Grant No. 60673092, the High-Technology Research Program of Jiangsu Province under Grant No. BG2005019, the Higher Education Graduate Research Innovation Program of Jiangsu Province under Grant No. CX08B-099Z, and the 2008 Excellent Doctoral Dissertation Topic Selection Program of Soochow University under Grant No. SDY Zi [2008]22.

  • 【Online Publication Contributor】 Soochow University (苏州大学)
  • 【Online Publication Issue】 2010, Issue 05