
Design and Implementation of a Chinese Intelligent Search Engine (中文智能搜索引擎的设计与实现)

The Design and Application of Chinese Intelligent Search Engine

【Author】 Gao Qingxia (高清霞)

【Advisor】 Zhang Shujie (张书杰)

【Author information】 Beijing University of Technology (北京工业大学), Computer Applications, 2000, Master's thesis

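【Abstract (translated from the Chinese)】 With the rapid spread and growth of the Internet, search engines have become an indispensable tool for Internet users. By analyzing the characteristics and research status of search engines in China and abroad, this thesis shows why further research on Chinese intelligent search engines is necessary and important.

The thesis systematically describes how the “首信” (“China Info”) search engine was built and reveals how a Web search engine works behind the scenes.

The “China Info” search engine is a multipurpose, adjustable Chinese intelligent search engine for the Internet. It uses a browser/server (B/S) architecture in which the browser and the server cooperate to make the service more intelligent, and it applies natural language processing to page content to improve retrieval performance.

The engine consists mainly of a distributed parallel Spider, a full-text retrieval database, an intelligent information processing module, CGI, and a smart browser (Smart Browser). It supports full-text retrieval, corpus-based concept retrieval, and knowledge-base-based concept retrieval.

The thesis focuses on the design and implementation of Spider, the information-gathering tool of the “China Info” search engine. A Spider (also called a robot or WebAgent) is the data source of an Internet search engine; it determines whether the content of the whole system is rich and whether information is updated in time. The “China Info” Spider uses a Client/Server architecture and is a distributed, parallel searching system. It consists of the server-side TaskManager (TM) and the client-side Gather Agent (GA).

TM is a TCP/IP-based program implemented in Visual C++. Its main functions are:
1) communicating with each GA over TCP/IP (Socket) and the system's communication primitives, and maintaining the linear list of connected GAs;
2) scheduling search tasks, sending them to GAs with a low load (including idle GAs);
3) controlling the search strategy and interacting with the user.

GA is implemented with multi-threading (Multi-thread). Its main functions are:
1) communicating with TM over TCP/IP (Socket) and the system's communication primitives, and reporting its own status;
2) receiving search tasks from TM, namely the ROOT_URL list;
3) fetching Internet pages using a breadth-first algorithm;
4) collecting the pages and saving them to the database in a suitable form.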
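The scheduling rule described for TM (send each search task to a GA whose load is low or zero) can be illustrated with a small sketch. This is not the thesis's Visual C++ implementation: the class names, the `pending` field, and the use of "number of unfinished ROOT_URLs" as the load measure are all assumptions made for illustration, and the socket communication is left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class GatherAgent:
    """Hypothetical record the TM might keep for each connected GA."""
    name: str
    pending: list = field(default_factory=list)  # ROOT_URLs assigned, not yet finished

    @property
    def load(self) -> int:
        # Assumed load measure: number of unfinished tasks.
        return len(self.pending)


class TaskManager:
    """Keeps a linear list of GAs and gives each task to the least-loaded one."""

    def __init__(self):
        self.agents = []

    def register(self, agent: GatherAgent) -> None:
        self.agents.append(agent)

    def dispatch(self, root_url: str) -> GatherAgent:
        if not self.agents:
            raise RuntimeError("no Gather Agent is connected")
        # min() returns the first agent with the smallest load, so idle agents win.
        target = min(self.agents, key=lambda agent: agent.load)
        target.pending.append(root_url)
        return target


tm = TaskManager()
for name in ("ga-1", "ga-2"):
    tm.register(GatherAgent(name))

urls = ("http://a.example/", "http://b.example/", "http://c.example/")
assignments = [tm.dispatch(url).name for url in urls]
print(assignments)
```

With two idle agents, the three tasks alternate between them, because each dispatch raises the chosen agent's load by one.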

【Abstract】 With the development and popularization of the Internet, the search engine has become an essential tool for Internet users. This paper analyses the features and current research status of search engines at home and abroad, and points out the necessity and importance of research on Chinese intelligent search engines.

This paper systematically introduces the design and development of the “China Info” search engine and explains how a search engine works behind the scenes.

The “China Info” search engine is a multipurpose, adjustable Chinese intelligent search engine. With its Browser/Server architecture, it improves its intelligence through cooperation between client and server. It also improves search performance by applying natural language processing to the contents of web pages.

The “China Info” search engine consists of a distributed parallel spider, a full-text search database, an intelligent information processing module, CGI, a smart browser, etc. It supports full-text search, concept search based on a corpus, and concept search based on a knowledge base.

The design and development of the “China Info” spider is the emphasis of this paper. The spider is the data source of an Internet search engine. It determines whether the contents are abundant and whether information is updated in time. The “China Info” spider is a distributed parallel system with a Client/Server architecture. It is composed of Task Manager (TM), the server program, and Gather Agent (GA), the client program.

TM is a program based on the TCP/IP protocol, developed with Visual C++. Its main functions are:
1) communicating with each GA via the TCP/IP protocol (Socket) and communication primitives, and maintaining the GA host list;
2) dispatching search tasks, sending each task to a GA that has a low load or no load;
3) controlling the search strategy and communicating with users.

GA is implemented with multi-thread technology. Its main functions are:
1) communicating with TM via the TCP/IP protocol (Socket) and communication primitives, and reporting its status;
2) receiving the search task from TM, namely the ROOT_URL list;
3) using a breadth-first strategy to fetch web page information;
4) gathering web pages and saving them in the database in a proper form.
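The breadth-first strategy that GA applies to its ROOT_URL list can be sketched as follows. This is a minimal single-threaded illustration, not the thesis's multi-threaded implementation: the function name `crawl_breadth_first`, the injected `fetch` callable, the `max_pages` limit, and the in-memory dictionary standing in for the database are all assumptions.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.links.append(value)


def crawl_breadth_first(root_urls, fetch, max_pages=100):
    """Visit pages level by level, starting from the ROOT_URL list.

    `fetch` is a caller-supplied function: url -> HTML text, or None on failure.
    Returns a dict of url -> HTML that stands in for the database.
    """
    queue = deque(dict.fromkeys(root_urls))  # drop duplicate roots, keep order
    seen = set(queue)
    store = {}
    while queue and len(store) < max_pages:
        url = queue.popleft()  # FIFO order is what makes the traversal breadth-first
        html = fetch(url)
        if html is None:
            continue
        store[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return store


# Tiny in-memory "web" so the sketch runs without network access.
PAGES = {
    "http://site.example/": '<a href="/a">A</a><a href="/b">B</a>',
    "http://site.example/a": '<a href="/c">C</a>',
    "http://site.example/b": '<a href="/">home</a>',
    "http://site.example/c": "leaf page",
}
visited = list(crawl_breadth_first(["http://site.example/"], PAGES.get))
print(visited)
```

Passing `fetch` in as a parameter keeps the traversal logic separate from the network layer; a real GA would supply an HTTP client there and run several such workers in parallel threads.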

  • 【Classification number】TP393.09
  • 【Citation count】3
  • 【Download count】523