
Design and Implementation of a Web Spider in a Vertical Search Engine (垂直搜索引擎中网络蜘蛛的设计与实现)

Design and Realize of Spider in Vertical Search Engine

【Author】 薛建春 (Xue Jianchun)

【Advisor】 段红梅 (Duan Hongmei)

【Author information】 China University of Geosciences (Beijing), Detection Technology and Automatic Equipment, 2007, Master's thesis

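【Abstract (translated from the Chinese)】 With the rapid development of the Internet, the Web has become the largest information repository in the world and provides a good platform for sharing information and resources. However, the sheer volume of web pages and their dynamic nature require search systems to be upgraded continually, and users now place new demands on the timeliness, relevance, and accuracy of the information they obtain. Search systems built for specific domains have emerged as a result. How to obtain useful information from the network more quickly and more accurately is an important problem for network users, and search engine technology addresses exactly this problem. A search engine consists mainly of four parts: a crawler, an indexer, a retriever, and a user interface. The crawler is intended to be an intelligent program that automatically crawls web pages, discovers and extracts information, and builds a local index database that serves user queries. A vertical search engine is a subdivision and extension of the general search engine: it consolidates one specialized category of information in the page repository, extracts the required data field by field, processes it, and returns it to the user in some form. The biggest difference between a vertical search engine and a traditional web search engine is that it extracts the information in web pages in structured form, so the information is classified at extraction time and better fits query needs. This thesis analyzes and discusses in detail, from the perspective of research and design, the technologies related to WWW search engines, and reviews the current state and development trends of search engines in China and abroad. It analyzes the working principle of a search engine and the main functions of its parts and, focusing on two key problems, namely how to evaluate the topical relevance of a page and how to design an efficient crawling strategy, proposes a topic-specific crawler for the book domain, which is the core of the vertical search engine. The main body follows the design flow of a search engine. Starting from the general concepts of HTML page parsing, combined with hyperlink analysis between pages (the HITS algorithm) and following the requirements of the search engine system, it uses a depth-first search strategy to design a web spider suited to acquiring domain-specific page information from small and medium-sized websites, gives the spider's crawling algorithm, and implements the program with C++ Builder. In addition, to ensure that the information in the database is not duplicated, a dedicated duplicate-checking program was designed to keep the resources accurate. On this basis, a database index is combined with the indexing and retrieval tool Lucene to build the index and rank the retrieval results, so that users receive more accurate information that better meets their retrieval needs. This approach can serve as guidance for building other domain-specific search engine systems. Final functional testing of the software shows that the Spider program's algorithm is accurate and stable and does not exhaust local resources; it supports searching a specified site and searching within a given URL range, and it can automatically search for and download specified information.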

【Abstract】 With the rapid development of the Internet, the Web has become the largest database in the world and provides an ideal place for sharing and communicating information. However, the large volume of web resources and their dynamic character require continual upgrading of search systems, and users expect greater timeliness, relevance, and accuracy from search results. Various domain-specific search engines have therefore emerged. How to reach useful information on the network more quickly and more accurately is one of the problems web users face, and search engine technology, which consists of a spider, an indexer, a searcher, and a user interface, is the key to solving it. The spider is intended to be intelligent search software that automatically crawls the Web, discovers and extracts useful information, and builds a local index database that serves user queries. A vertical search engine is a subdivision and extension of the general search engine: it consolidates information from one specific field, extracts the necessary data field by field, processes it, and returns it to the user. The major difference between a vertical search engine and a traditional one is that the vertical engine extracts information from web pages in a structured way, classifying it at extraction time so that it better satisfies search requirements. The thesis analyzes and discusses WWW search engine technology in detail, together with its current state and future trends in China and abroad. It also describes the working principle of a search engine and the main function of each component. It treats two problems as key: how to evaluate the topical relevance of a web page, and how to design an efficient crawling strategy. It then proposes a topic-specific crawler for the book domain, which is the core of the vertical search engine. The main part of the thesis follows the whole procedure of designing the engine. Starting from the general concepts of HTML parsing, combined with the analysis of hyperlinks between web pages (the HITS algorithm) and following the requirements of the search engine, the thesis designs a web spider with a depth-first search strategy that is suited to extracting domain-specific information from small and medium-sized websites. The spider's crawling algorithm is presented, and the program is implemented with C++ Builder. In addition, to avoid duplicated data, a dedicated duplicate-checking program was designed to keep the data accurate. On this basis, a database index is combined with the indexing and search tool Lucene to build the index and rank the search results, so that users receive more accurate information that better meets their requirements. This approach can serve as guidance for building other domain-specific search engine systems. The results of the software function test show that the Spider program's algorithm is accurate and stable and does not exhaust local resources. It supports searching a specified site or within a given URL range, and it can automatically search for and download specified information.
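
To make the crawling strategy in the abstract concrete, the following is a minimal Python sketch of a depth-first spider that stays within a given URL range and skips pages it has already seen. It is an illustration under stated assumptions, not the thesis's C++ Builder implementation: the names `LinkExtractor`, `extract_links`, `crawl_depth_first`, the `fetch` callback, the `max_pages` cap, and the `books.example` pages are all hypothetical, and the in-memory `SITE` dictionary stands in for real HTTP downloads.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute link URLs found in html, with #fragments removed."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def crawl_depth_first(seed, fetch, url_prefix, max_pages=100):
    """Visit pages depth-first, restricted to URLs starting with url_prefix.

    fetch(url) is a caller-supplied function (an assumption of this sketch)
    that returns the page's HTML, or None if the page cannot be retrieved.
    max_pages is an assumed cap that keeps the crawl bounded.
    """
    visited = set()   # URLs already handled, so no page is processed twice
    order = []        # pages in the order they were successfully fetched
    stack = [seed]    # explicit stack gives depth-first order
    while stack and len(order) < max_pages:
        url = stack.pop()
        if url in visited or not url.startswith(url_prefix):
            continue
        visited.add(url)
        html = fetch(url)
        if html is None:
            continue
        order.append(url)
        # Push in reverse so the first link on the page is explored first.
        for link in reversed(extract_links(url, html)):
            if link not in visited:
                stack.append(link)
    return order


# Hypothetical four-page site used instead of real network access.
SITE = {
    "http://books.example/": '<a href="/a">A</a><a href="/b">B</a>',
    "http://books.example/a": '<a href="/a1">A1</a>'
                              '<a href="http://other.example/x">X</a>',
    "http://books.example/a1": '<a href="/">home</a>',
    "http://books.example/b": '<a href="/a">A again</a>',
}

order = crawl_depth_first("http://books.example/", SITE.get,
                          "http://books.example/")
print(order)
```

In this sketch the branch under `/a` is followed to its end before `/b` is visited, the off-site link is discarded by the prefix check, and the links back to already visited pages are ignored. Duplicate detection here works only on URLs; checking for duplicate content, as the thesis's separate program does for database records, would need an additional step.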

  • 【Classification number】TP393.092; TP391.3
  • 【Citation count】22
  • 【Download count】1580