Research on Individuated Vertical Search Engine

【Author】 Li Wenze

【Advisor】 Xu Bin

【Author information】 Henan University, Applied Mathematics, 2007, Master's thesis

【Abstract】 The major Internet search engine providers, such as Yahoo, Baidu, and Google, all offer horizontal search over massive amounts of information. As the Internet continues to grow and evolve, however, ordinary users find that locating the material they need is like looking for a needle in a haystack. Sheer volume of information is no longer the main driver of development; awareness and timeliness are. The key question for the Internet is no longer whether information can be delivered to users quickly and in bulk, but whether users can obtain the information they expect, at the time and place they expect, in the form and at the cost they expect. General-purpose search engines can handle broad horizontal search over large amounts of information, but they struggle to deliver both precision and relevance. Their value lies in navigating large volumes of information, and they offer little guidance to industry users whose information needs are more concentrated and more finely classified. Solving this problem is an opportunity for the development of search and is becoming a focus of competing research efforts. Vertical search, a new search model, emerged against this background.

The work in this thesis falls into two parts. The first part, through theoretical analysis, proposes improvements to the information collection algorithm of vertical search engines. The second part analyzes the core technologies of vertical search engines and designs and implements a prototype system. The body of the thesis presents this work in five chapters.

Chapter 1, the introduction, reviews the history of search engines, identifies the problems that general-purpose search engines currently face, and points to the approach studied in this thesis as a way to address them: the vertical search engine. A comparison with general-purpose search engines in terms of information services and key technologies shows that vertical search engines have substantial advantages and room for development. The chapter closes with a survey of vertical search engine development in China and abroad and a statement of the problems this thesis sets out to solve.

Chapter 2, on overall architecture and information collection, presents the design and workflow of the overall vertical search engine architecture and analyzes the characteristics specific to vertical search engines. It then describes the information collection models in common use and analyzes the core idea and the shortcomings of the prevailing collection algorithm, similarity matching based on the vector space model. Finally, after introducing ontologies, it proposes an approach to building an intelligent information collection strategy on an ontology knowledge base, in order to handle polysemy (one word with several meanings) and synonymy (several words with one meaning) during information collection.

Chapter 3, on the Lucene framework, gives a detailed analysis of Lucene, a leading open-source full-text retrieval framework. It introduces full-text retrieval technology, the origin of the Lucene project, and the structure of the framework, and then covers two key elements of Lucene's indexing and search functions: the inverted index and the scoring mechanism. Core program code for building an index and performing a search is included. The chapter ends with an introduction to Chinese word segmentation and to how tokenization is implemented in Lucene.

Chapter 4, on implementation, combines the open-source crawler Heritrix with the Lucene framework to design and build a prototype vertical search engine for mobile phone product information. The system is implemented in three parts. The first part implements information collection on top of the Heritrix framework and designs a program for structured information extraction. The second part designs a word segmentation tool for mobile phone product information and uses Lucene to index the structured text. The third part designs a query interface based on the MVC architecture and implements the retrieval function of the prototype system. The chapter is intended as a practical reference for implementing vertical search engines.

Chapter 5, the summary and outlook, reviews the work of the thesis and discusses development trends for vertical search engines and several directions for further research.

There is a saying in the search field: "Users cannot describe what they are looking for until they are shown it." A technical expert at Microsoft Research has said: "General-purpose search engines cannot find 75% of content." As one branch of search engine technology, the vertical search engine is the natural result of a shift in what Internet users want from search: from simply wanting comprehensive results to wanting results that are comprehensive, more accurate, and more timely. By collecting or reorganizing the information models and user models of an industry domain in structured form, vertical search engines can provide more numerous, more specialized, and more personalized industry-related services, and so appear smarter and more user-friendly than traditional general-purpose search. The vertical search engine market is therefore both necessary and promising. Vertical search is still a young technology, however, with much left to improve and many breakthroughs still needed, and the research in this thesis is intended to offer practical guidance for its development.

【Key words】 vertical search engine; ontology; Lucene; indexing; information extraction; MVC
  • 【Online publication contributor】 Henan University
  • 【Online publication issue】 2007, No. 06
  • 【Classification number】 TP391.3
  • 【Times cited】 17
  • 【Downloads】 1406