

The Research and Implementation of Search Engine Based on LUCENE

【Author】 Gao Lei (高磊)

【Advisor】 Xu Dongping (徐东平)

【Author information】 Wuhan University of Technology, Computer Application Technology, 2007, Master's thesis

【Abstract】 With the continuing development of information technology, Internet technology has also advanced rapidly, and the tool people use most frequently on the Internet every day is the search engine, which has become indispensable for study, work, and leisure. Everyone knows that a search engine can quickly find the material or information one is looking for, but what exactly is a search engine? The search engine usually referred to on the Internet is a full-text search engine: it collects from several billion to more than ten billion web pages, indexes every word (that is, every keyword) in those pages, and builds an index database. When a user searches for a keyword, all pages whose content contains that keyword are returned as search results. After being sorted by a complex algorithm, the results are presented to the user in order of their relevance to the keyword. The thesis first introduces the current state of search engine development. Since the 1990s, in the course of Internet-based informatization, the vastness of network information resources has made it increasingly difficult for people to find the information they need, and most people rely to a great extent on search engines to obtain useful information. As the most typical web information retrieval technology, search engine technology therefore directly affects the quality of the information people obtain. Next, the thesis introduces the characteristics and classification of search engines, discusses search engine principles and techniques such as the web robot, and analyzes the system architecture of the mainstream Google search engine. On this basis, it discusses the open source project Lucene: its history, applications, characteristics, system structure, and index file format. It then studies several key technologies in search engines. Because pages on web sites are updated frequently, many pages become obsolete or cease to exist over time; by analyzing the page-fetching process of the web robot, the thesis proposes an incremental page change model for the web robot. Finally, it analyzes and discusses common Chinese word segmentation algorithms, segmentation ambiguity, and unregistered words.
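The full-text indexing idea described in the abstract (index every word, return all pages that contain the keyword, order them by relevance) can be illustrated with a minimal Python sketch. This is not the thesis's implementation and not Lucene's API: the function names `build_index` and `search`, the sample pages, and the use of raw keyword counts as the relevance score are all assumptions made for illustration. The abstract says only that a "complex algorithm" performs the sorting.

```python
from collections import Counter, defaultdict


def build_index(pages):
    """Build an inverted index: word -> {page_id: occurrence count}."""
    index = defaultdict(dict)
    for page_id, text in pages.items():
        # Whitespace tokenization is a simplification; Chinese text would
        # first need word segmentation, which the thesis treats separately.
        for word, count in Counter(text.lower().split()).items():
            index[word][page_id] = count
    return index


def search(index, keyword):
    """Return ids of pages containing the keyword, most occurrences first.

    Ranking by raw count is an assumed stand-in for a real relevance score.
    Ties are broken by page id so the order is deterministic.
    """
    postings = index.get(keyword.lower(), {})
    return sorted(postings, key=lambda page_id: (-postings[page_id], page_id))


# Hypothetical sample pages, used only to exercise the sketch.
pages = {
    "p1": "lucene is a search engine library",
    "p2": "a search engine indexes pages so that search is fast",
    "p3": "chinese word segmentation",
}
index = build_index(pages)
print(search(index, "search"))
```

Because the index maps each word directly to the pages that contain it, a lookup touches only the postings for the queried keyword instead of scanning every page.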

【Key words】 Search Engine; Lucene; Robot; Chinese Word Segmentation
  • 【Classification number】 TP391.3
  • 【Times cited】 14
  • 【Downloads】 891

