The Study of the Framework of Distributed Intelligent Search Engine Based on Map/Reduce

【Author】 Fu Zhichao (付志超)

【Advisor】 Nie Guihua (聂规划)

【Author information】 Wuhan University of Technology, International Trade, 2008, Master's thesis

【Abstract】 With the rise of the search economy, the performance, technology, and daily traffic of the world's major search engines are drawing growing attention. Enterprises decide whether to place advertising according to a search engine's popularity and daily traffic; ordinary Internet users choose their preferred engine according to its performance and technology; technical staff take representative search engines as objects of study. The rise of the search economy has once again demonstrated the enormous business opportunities the Internet holds. Without search, the Internet would be left with nothing but empty, disordered data and a great deal of gold waiting to be laboriously mined. Information on the Internet now grows exponentially every day, and traditional centralized search engines cannot cope with processing and storing data at this scale. In addition, traditional search engines generally rely on keyword matching and cannot understand the user's search intent, which makes it hard for users to find the information they actually need. Distributed, intelligent search engines are therefore the direction of future development.

From the standpoint of research and design, this thesis analyzes and discusses in detail the theories and technologies related to distributed intelligent search engines. It divides the study of a Map/Reduce-based distributed intelligent search engine framework into three closely related levels: the theory and methods of distributed parallel computing, the principles of search engines, and distributed intelligent search engines.

The main content of the thesis is as follows. It first reviews the current state of search engine development in China and abroad, the existing problems, and the development trends. It then analyzes the working principle of a search engine and the main function of each of its parts, and examines distributed computing theory, grid computing, cloud computing, and the Map/Reduce distributed computing model. The open-source search engine toolkit Lucene and the open-source distributed computing framework Hadoop are analyzed and studied in detail. Building on the Map/Reduce distributed computing model and a semantic dictionary, the thesis studies a distributed intelligent search engine system, and designs and implements IEBSou, a distributed intelligent search engine based on Map/Reduce. The emphasis is on the implementation of the IEBSou system framework: the thesis presents the relationships among the system's modules and analyzes the implementation principle and idea behind each one. It designs the Map/Reduce base framework of IEBSou; designs a unified document processing framework in combination with Lucene; and studies the recognition of personal names and new words in Chinese word segmentation. It proposes a Map/Reduce-based algorithm for eliminating duplicate web pages, and an algorithm that generates recommended search terms by semantic association through the construction of a concept set. With the help of the semantic dictionary, IEBSou semantically expands the concepts behind the user's search keywords and builds a concept set, so that the system can interpret the user's search intent intelligently and improve recall and precision.
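The abstract names a Map/Reduce-based algorithm for eliminating duplicate web pages but does not describe it. The following is only a minimal sketch of how duplicate elimination can be phrased as a map step and a reduce step, not the thesis's algorithm. It assumes exact-content fingerprinting with an MD5 hash of whitespace- and case-normalized text, and it simulates the shuffle phase in a single process rather than running on Hadoop. The names `map_page`, `reduce_pages`, and `run_dedup` are hypothetical.

```python
import hashlib
from collections import defaultdict


def map_page(url, text):
    """Map step (assumed): emit (content fingerprint, url) for one page."""
    # Normalize case and whitespace so trivially different copies hash alike.
    normalized = " ".join(text.lower().split())
    digest = hashlib.md5(normalized.encode("utf-8")).hexdigest()
    yield digest, url


def reduce_pages(digest, urls):
    """Reduce step (assumed): keep one representative URL per fingerprint."""
    yield digest, min(urls)


def run_dedup(pages):
    """Simulate map -> shuffle -> reduce in-process and return the kept URLs."""
    groups = defaultdict(list)  # stands in for the shuffle/group-by-key phase
    for url, text in pages:
        for key, value in map_page(url, text):
            groups[key].append(value)
    kept = []
    for key, urls in groups.items():
        for _, url in reduce_pages(key, urls):
            kept.append(url)
    return sorted(kept)
```

Because every page with the same fingerprint reaches the same reducer, each reducer can decide locally which copy to keep. Detecting near-duplicates rather than exact copies would need a different fingerprint, which this sketch does not attempt.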

【Key words】 Search Engine; Distributed Computing; Map/Reduce; HDFS