Study on Implementation of Enterprise Search Engine Based on Index Cloud (基于索引云的企业搜索引擎实现研究)

【Author】 Chen Xuyi (陈旭毅)

【Advisor】 Zhou Ning (周宁)

【Author Information】 Wuhan University, Management Science and Engineering, 2011, Ph.D.

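【Summary】 As enterprise informatization advances, the data resources inside enterprises are expanding rapidly. Enterprises place higher demands on information management and resource access, so building an internal enterprise search engine is both inevitable and in line with the direction of enterprise information resource management. One of the key technologies in implementing an enterprise search engine is constructing index structures over the enterprise's various information resources; how the index structure is composed largely determines the retrieval performance of the search engine. In designing an internal enterprise search engine, this dissertation adapts traditional index structures and introduces ideas from cloud computing into the index system, proposing a new index framework, the Index Cloud model, and on that basis a new enterprise search engine architecture.
The dissertation first describes the concept and classification of search engines, examines their working principles and technologies, and reviews their development; it then describes the concept and classification of cloud computing and examines its technologies and implementation. It studies index organization in detail: it explains the concept of an index and how index files are organized, and discusses in detail several common index organizations, namely the B-tree, B+ tree, R-tree, and R*-tree. It also introduces ways of composing index entries, including the forward index, inverted index, suffix array, and signature files.
Building on search engine and cloud computing theory, and guided by index theory, the dissertation proposes the Index Cloud model. The model is designed on three basic principles: classified data storage, distributed computing, and parallel processing. It features a high degree of virtualization, high performance, high reliability, strong security, strong scalability, and good generality, which makes it better suited to the needs of enterprise search engines. The dissertation studies the Index Cloud model comprehensively and in depth, giving its definition, principles, and basic characteristics. To address problems of retrieval performance and scalability in search engine index organization strategies, and after comparing the basic strategies, the Index Cloud system adopts a hybrid distributed index organization strategy. For the Index Cloud data structure, it adopts DPIC (Distributed & Paralleling Index Cloud), a framework built on DicB+Tree, a new index tree structure that combines the B+ tree with a dictionary-ordered data structure. Based on DPIC, the core management strategy of the Index Cloud is designed so that system resources are used as fully as possible. The dissertation presents the internal processing architecture of the Index Cloud, the organization of index data, the allocation of index data, the backup of index entry data, and methods for adjusting and rebuilding index data. It also describes in detail how data retrieval tasks in the Index Cloud are analyzed and how distributed scheduling is carried out.
The dissertation systematically reviews the characteristics of enterprise search engines and the state of research on enterprise search engine technology, and analyzes how enterprise search differs from traditional web retrieval in retrieval needs, retrieval methods, retrieval objects, and security. Enterprise search engine systems therefore need to be studied comprehensively in terms of system architecture, index organization strategy, information retrieval algorithms, and task scheduling algorithms, and the dissertation proposes combining enterprise search engines with cloud computing. It goes on to propose an enterprise search engine architecture based on the Index Cloud, introducing its three components (a general storage platform, a general service platform, and a general application platform) and explaining in detail how each is implemented. With relatively low hardware investment, the architecture addresses index file growth, network bandwidth bottlenecks, and disk I/O bottlenecks in full-text search systems, and provides efficient data storage and parallel computing services. A distributed task scheduling design is developed for this architecture; it takes into account both the task load level of index nodes and index term frequency to optimize task allocation and avoid system hot spots, improving the query speed and reliability of the index system.
Using the open-source distributed framework Hadoop and the open-source search engine system Lucene, the dissertation builds an enterprise search engine system based on an Index Cloud prototype and validates its performance experimentally. It discusses in detail how each part of the experimental system is built, and evaluates the prototype in terms of response time, throughput, and load balance, demonstrating its feasibility and good practical results.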
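As an illustration of the forward and inverted index entry layouts mentioned above, the following is a minimal Python sketch. The function names and the whitespace tokenizer are assumptions made for this example; they are not taken from the dissertation, which builds its prototype on Lucene.

```python
from collections import defaultdict


def build_forward_index(docs):
    """Forward index: map each document ID to the list of terms it contains."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    return {doc_id: text.lower().split() for doc_id, text in docs.items()}


def build_inverted_index(forward_index):
    """Inverted index: map each term to the sorted IDs of documents containing it."""
    postings = defaultdict(set)
    for doc_id, terms in forward_index.items():
        for term in terms:
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


docs = {1: "index cloud model", 2: "enterprise search index"}
forward = build_forward_index(docs)
inverted = build_inverted_index(forward)
```

Looking up `inverted["index"]` returns the documents that contain the term, which is the access pattern a query needs; the forward index answers the opposite question of which terms a given document holds.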

【Abstract】 With the rapid development of enterprise informatization, the internal data resources of enterprises are growing quickly. Enterprises therefore place higher demands on information management and resource access, which in turn raises the demand for enterprise search engines. Index structure is one of the core technologies of a search engine and directly influences the performance of the whole engine. For the design of an enterprise search engine, this dissertation applies ideas from cloud computing to the index system and presents a novel index framework: Index Cloud. It also presents a new enterprise search engine architecture based on the design of Index Cloud. The dissertation first gives the concept of a search engine and how search engines are classified, then studies the principles and technologies of search engines and reviews their development. It likewise states the concept and classification of cloud computing and studies its core technologies. It then studies index organization in detail, explaining the concept of an index and how index files are organized, and discussing several commonly used index organizations: the B-tree, B+ tree, R-tree, and R*-tree. Ways of composing index entries, such as the forward index, inverted index, suffix array, and signature files, are also discussed.
Building on search engine and cloud computing theory and on index theory, the dissertation proposes the Index Cloud model. The model is designed on three basic principles (classified data storage, distributed computing, and parallel processing) and offers a high degree of virtualization, high performance, high reliability, strong security, scalability, and versatility, which makes it better suited to enterprise search engine requirements. The dissertation studies the Index Cloud model comprehensively and in depth, giving the definition of the Index Cloud, its principles, and its basic characteristics. To address the retrieval performance and scalability problems of index organization strategies in search engines, and after comparing the basic index organization strategies, the Index Cloud system uses a hybrid distributed index organization strategy. In the Index Cloud data structure, a new B+ tree-based dictionary index tree (DicB+Tree) is used to form DPIC (Distributed & Paralleling Index Cloud), and based on DPIC the core management strategies of the Index Cloud are designed to ensure that system resources are utilized to the fullest extent. The dissertation presents the internal processing architecture of the Index Cloud, the distributed parallel index tree structure, the distribution of index data, index data replication, and methods for index data migration and reconstruction. In addition, it describes the analysis of data retrieval tasks in the Index Cloud and the distributed scheduling process. It then systematically reviews the concept and features of enterprise search engines and the state of research on enterprise search engine technology, and analyzes how enterprise search differs from traditional web search in search needs, retrieval methods, retrieval objects, and security.
Therefore, enterprise search systems need to be studied comprehensively in terms of search engine system architecture, index organization strategy, information retrieval algorithms, and scheduling algorithms, and the dissertation proposes combining enterprise search engines with cloud computing. The design of the Index Cloud model is based on three fundamentals: classified data storage, distributed computing, and parallel processing. It is characterized by virtualization, high performance, high reliability, strong security, easy extensibility, and universality; hence it is better suited to the requirements of enterprise search engines. The architecture of an enterprise search engine based on the Index Cloud is further put forward. The new architecture not only resolves problems that exist in full-text search systems, such as index data inflation, network bandwidth bottlenecks, and disk I/O bottlenecks, but also provides efficient data storage and parallel computing services. A distributed task scheduling model is established for the architecture; it takes the task load level of each index node and the index term frequency into account in order to optimize task allocation, avoid hot spots, and ultimately improve system performance. Finally, a prototype Index Cloud system based on Hadoop and Lucene has been constructed as a platform for validating system performance. Extensive simulation studies have been conducted on response time, throughput, load balance, and precision ratio. The experimental results demonstrate its feasibility and satisfactory practical effect.
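The abstract does not spell out the scheduling algorithm, so the following Python sketch only illustrates the general idea of weighing node load against term frequency. It is a simple greedy heuristic chosen for this example, not the dissertation's method; the function name `schedule_terms`, the node names, and the use of term frequency as a cost estimate are all assumptions.

```python
def schedule_terms(term_freq, replicas, node_load):
    """Greedily assign each query term to the least-loaded node holding it.

    term_freq: term -> estimated cost (assumed proportional to term frequency)
    replicas:  term -> list of node names holding that term's index data
    node_load: node name -> current load, in the same units as term_freq
    Returns (assignment, updated_load) without mutating the inputs.
    """
    load = dict(node_load)
    assignment = {}
    # Place expensive (high-frequency) terms first so they spread across nodes.
    for term in sorted(term_freq, key=lambda t: (-term_freq[t], t)):
        # Ties on load are broken by node name to keep results deterministic.
        target = min(replicas[term], key=lambda n: (load[n], n))
        assignment[term] = target
        load[target] += term_freq[term]
    return assignment, load


assignment, load = schedule_terms(
    term_freq={"cloud": 50, "index": 40, "engine": 10},
    replicas={
        "cloud": ["node_a", "node_b"],
        "index": ["node_a", "node_b"],
        "engine": ["node_b", "node_c"],
    },
    node_load={"node_a": 0, "node_b": 0, "node_c": 30},
)
```

In this example the two high-frequency terms land on different nodes, and the cheap term goes to the already busy `node_c` only because it is still the lighter of its two replicas. That is the hot-spot avoidance the abstract describes, in its simplest form.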

  • 【Online Publication Contributor】 Wuhan University
  • 【Online Publication Issue】 2012, Issue 07
  • 【Classification Code】 TP391.3; F270.7
  • 【Citation Count】 2
  • 【Download Count】 691