Research on Deep Web Oriented Information Extraction and Integration

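【Author】 Liu Guifeng (刘桂峰)

【Advisor】 Cui Zhiming (崔志明)

【Author information】 Soochow University (苏州大学), Computer Application Technology, 2009, Master's thesis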

【Abstract】 With the convergence of World Wide Web and database technologies, the Web is deepening rapidly. A large amount of information is hidden in Web databases, and users can retrieve it dynamically by submitting queries through query forms; researchers call such resources the Deep Web. Because Deep Web resources are distributed across many different Deep Web sites, they are inconvenient to use, which has motivated data integration systems for the Deep Web. This thesis studies data extraction and integration techniques for the Deep Web, proposes corresponding algorithms and solutions, and finally designs a prototype Deep Web search engine. The main work of this thesis is as follows:
(1) Extracting Web Data Objects from query result pages is the first step of Deep Web data integration. Based on the Document Object Model (DOM), this thesis proposes a method that automatically identifies data regions and extracts Web Data Objects in three stages: preprocessing the HTML pages, extracting a candidate set of Web Data Objects, and removing entries that are not Web Data Objects from that set.
(2) It proposes a method for integrating Web Data Objects with heterogeneous schemas. Built on the vector space model, the method uses clustering to integrate heterogeneous Web Data Objects extracted from different Deep Web sites, and then detects duplicate Web Data Objects, taking attribute discriminability as the basis and similarity as the measure, so that duplicates can be removed.
(3) It analyzes how the organization of massive data affects query response speed, and on that basis proposes a method for organizing a very large number of Web Data Objects. Through incremental clustering, Web Data Objects group naturally according to their own features, and the resulting clusters form a category hierarchy that lays the foundation for fast responses to user queries.
(4) Building on the work above, it designs a prototype search engine for the Deep Web.
The thesis also reports experiments on the proposed methods and techniques; the results show that they are feasible and effective.
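The abstract names the ingredients of items (2) and (3), namely a vector space model, a similarity measure, and incremental clustering, but does not give the algorithms themselves. The following is only a minimal illustrative sketch under stated assumptions, not the thesis's method: records are assumed to be dictionaries of attribute text, each record becomes a bag-of-words term-frequency vector, cosine similarity stands in for the thesis's discriminability-weighted measure, and a single-pass "leader" scheme stands in for its incremental clustering. The names `to_vector`, `cosine`, `incremental_cluster`, and the threshold value are all hypothetical.

```python
import math
from collections import Counter


def to_vector(record):
    """Map a record (dict of attribute -> text) to a term-frequency vector.

    Attribute names are ignored on purpose, so records with different
    schemas (e.g. "title" vs "name") can still be compared.
    """
    tokens = []
    for value in record.values():
        tokens.extend(str(value).lower().split())
    return Counter(tokens)


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def incremental_cluster(records, threshold=0.8):
    """Single-pass clustering: a record joins the first cluster whose
    representative (its first member) is similar enough, otherwise it
    starts a new cluster. The threshold is an assumed value."""
    clusters = []  # list of (representative_vector, members)
    for record in records:
        vec = to_vector(record)
        for rep, members in clusters:
            if cosine(rep, vec) >= threshold:
                members.append(record)
                break
        else:
            clusters.append((vec, [record]))
    return [members for _, members in clusters]


# Hypothetical records from two sites with different schemas.
records = [
    {"title": "Deep Web Data Integration", "author": "Liu"},
    {"name": "deep web data integration", "writer": "liu"},
    {"title": "Cooking with Herbs", "author": "Smith"},
]
clusters = incremental_cluster(records)
```

In this sketch, records that land in the same cluster with a similarity near 1 would be the natural candidates for duplicate removal; applying the same procedure again to cluster representatives is one possible way to build a hierarchy, though the abstract does not say how the thesis does it.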

【Key words】 Deep Web; Data Integration; Data Extraction; Clustering; Search Engine
  • 【Online publication contributor】 Soochow University
  • 【Online publication issue】 2009, Issue 10