垂直搜索引擎的抓取技术研究

Crawl Technology Research in Vertical Search Engine

【Author】 Liu Chi (刘迟)

【Advisor】 Chen Gang (陈刚)

【Author information】 Zhejiang University, Computer Application Technology, 2008, Master's degree

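【Abstract (translated from the Chinese)】 A vertical search engine provides valuable information and related services for a specific industry or domain. It is a subdivision and extension of the general search engine, and a new kind of information service that matches the way professional users work. This thesis studies crawling technology for vertical search engines, focusing on the problems encountered when crawling for vertical search: hidden-web crawling, timeliness, and performance and efficiency. It first introduces the architecture of a vertical search crawling system and proposes a distributed framework based on extensible plug-ins; both the distributed design and the plug-in model ease future extension. It then discusses three problems of hidden-web crawling in such a system and, for the problem of deduplicating hidden-web crawl results, proposes a self-learning method for detecting duplicate Chinese addresses. Next, to address timeliness in vertical search, it proposes a query-driven real-time crawling method. It also discusses and compares the crawl modes, crawl strategies, and crawl frequencies that affect a vertical search crawling system; the system in this thesis adopts the steady, continuous mode, timely in-place (replacement) updates, and a combination of real-time crawling and fixed-frequency crawling. Finally, experiments on duplicate detection and timeliness show that the proposed methods achieve better results and a better user experience in practice.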

【Abstract】 A vertical search engine serves a specific domain, providing valuable information and related services. It is a subdivision and extension of the general search engine and a new way of delivering information services that matches how professional users operate. This thesis studies crawling technology for vertical search engines, focusing on the main crawling problems they face: the hidden web, timeliness, and performance and efficiency. We first introduce the architecture of our vertical search crawl system and propose a crawl system framework that is distributed and based on extensible plug-ins; both the distributed design and the plug-in model make future extension easier. We then discuss three problems in hidden-web crawling and propose a self-learning method for eliminating duplicate Chinese addresses in hidden-web crawl results. Next, we develop query-triggered crawling to address the timeliness problem. We discuss and compare the crawl modes, crawl strategies, and crawl frequencies that can affect a vertical search crawl system; our system adopts the steady mode, the in-place update strategy, and a combination of real-time crawling and fixed-frequency crawling. Experiments indicate that our methods for duplicate elimination and timeliness achieve better effectiveness and a better user experience.

【Keywords (translated from the Chinese)】 vertical search; extensible; hidden web; timeliness
【Key words】 Vertical Search; Extensible; Hidden Web; Time-effectiveness
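  • 【Online publication contributor】 Zhejiang University
  • 【Online publication year and issue】 2008, Issue 07
  • 【Classification number】 TP391.3
  • 【Times cited】 2
  • 【Downloads】 524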