
Deep Web分类搜索引擎关键技术研究

The Key Technology Research on Deep Web Directory Search Engine

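【Author】 Gao Ling (高岭)

【Advisor】 Cui Zhiming (崔志明)

【Author information】 Soochow University (苏州大学), Computer Application Technology, 2007, Master's thesis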

【Abstract】 With the rapid growth of the World Wide Web (WWW), Web content has been deepened by a large and varied collection of searchable online databases. This information is hidden behind Web query interfaces and generated dynamically by the sites' back-end databases, so conventional search engines cannot index it because of technical limitations; such information is called the Deep Web. Deep Web information retrieval remains an emerging research field and is attracting growing attention from researchers. To help users obtain and use Deep Web information in a given domain, this thesis proposes a system architecture for a Deep Web directory search engine, analyzes several key problems within that architecture, and proposes corresponding algorithms and models. The main work is as follows: (1) A survey of the scale, distribution, and structure of Chinese Deep Web resources. (2) To address the shortcomings of conventional search engine crawlers on the Deep Web, a focused crawler for the Deep Web is designed, together with a method for determining whether a page contains a Deep Web query interface. (3) An efficient algorithm for acquiring Web database content is adopted: the database content is sampled, the sampled result pages are analyzed, irrelevant information is removed, and a content summary of the Web database is obtained. (4) Following the Yahoo directory, a classification method for Deep Web resources is proposed that combines the query interface pages of Deep Web sites with the database content summaries. Finally, the thesis designs and implements Deep Searcher, a prototype Deep Web directory search engine for Chinese, and evaluates the proposed algorithms through experiments and analysis.
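The abstract does not say how the thesis decides whether a form is a Deep Web query interface. As an illustration only, the sketch below applies a crude, assumed rule: a form is a candidate when it has no password field, at least one submit control, and at least one text field or drop-down list. The names `FormCollector`, `looks_like_query_interface`, and `find_query_interfaces` are hypothetical and are not taken from the thesis.

```python
from html.parser import HTMLParser


class FormCollector(HTMLParser):
    """Collect, for each <form> in a page, counts of its control types."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one dict of counts per <form>
        self._current = None  # counts for the form being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {"text": 0, "password": 0, "select": 0, "submit": 0}
            self.forms.append(self._current)
        elif self._current is None:
            return  # controls outside any form are ignored
        elif tag == "input":
            # An <input> without a type attribute defaults to a text field.
            kind = (attrs.get("type") or "text").lower()
            if kind in ("text", "search"):
                self._current["text"] += 1
            elif kind == "password":
                self._current["password"] += 1
            elif kind in ("submit", "image"):
                self._current["submit"] += 1
        elif tag == "select":
            self._current["select"] += 1
        elif tag == "button":
            # A <button> without a type attribute defaults to submit.
            if (attrs.get("type") or "submit").lower() == "submit":
                self._current["submit"] += 1

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def looks_like_query_interface(counts):
    """Assumed heuristic, not the thesis's method: reject login forms,
    require a submit control and at least one queryable field."""
    return (
        counts["password"] == 0
        and counts["submit"] >= 1
        and counts["text"] + counts["select"] >= 1
    )


def find_query_interfaces(html):
    """Return one boolean per <form> in the page, in document order."""
    parser = FormCollector()
    parser.feed(html)
    parser.close()
    return [looks_like_query_interface(c) for c in parser.forms]
```

A rule this simple would also accept an ordinary site search box, so a real system would need further evidence, such as the form's labels or the pages returned by a trial query.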

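  • 【Online publication contributor】 Soochow University (苏州大学)
  • 【Online publication issue】 2008, Issue 03
  • 【Classification number】 TP391.3
  • 【Times cited】 15
  • 【Downloads】 624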