Research and Implementation of an Open High-Performance Platform of Full-Text Retrieval (original Chinese title: 一种开放式高性能全文检索平台的研究与实现)

【Author】 Hong Tianyu (洪田玉)

【Advisor】 Zeng Zhiwen (曾志文)

【Author information】 Central South University (中南大学), Computer System Architecture, 2009, Master's degree

【Abstract】 The rapid growth of information has driven the rapid development of search engines. General-purpose search engines such as Google and Baidu have been very successful. However, their technology is strictly confidential, developers cannot seamlessly embed such large general-purpose engines into their own applications, and open-source search engines with good Chinese support are lacking. This thesis therefore studies and implements a new Chinese full-text retrieval platform. The platform offers high performance and a flexible architecture. It can be applied to practical domains with dynamic data environments, and it can also serve as a basis for experimental information retrieval systems. The main research work is as follows.

1. To address the low efficiency and poor flexibility of the traditional forward maximum matching (MM) segmentation algorithm, an improved algorithm is proposed. It uses a dictionary structure based on a hash table and a trie, which raises segmentation speed by about 200%. The algorithm also removes the fixed maximum word length required by traditional forward maximum matching, which makes it more flexible.

2. Because traditional index structures adapt poorly to dynamic data environments, a new index construction scheme is proposed. It consists of: (1) a hierarchical inverted index organization with chained storage, which handles dynamically growing index data well; (2) an index merging strategy based on a dynamic balanced tree; (3) a configurable, bounded exponential allocation strategy, which improves the utilization and allocation efficiency of index memory; (4) a differential compression algorithm based on d-gaps, which reduces index file size by 75% and thereby reduces the number of I/O operations and improves system performance.

3. Based on the segmentation algorithm and index construction scheme above, a full-text retrieval platform with a flexible, extensible architecture is designed and implemented using object-oriented design in C++ and design patterns such as the factory pattern. The platform consists of an index subsystem, a retrieval subsystem, a storage subsystem, a plug-in management subsystem, and a memory management component.

4. A practical commercial search engine is designed and implemented on top of the platform. It lets users search network monitoring data, which records users' Internet access behavior. It builds large-capacity indexes for monitoring data of many types (plain text, HTML, email, Office documents, PDF documents, and so on) and provides high-performance queries based on content classification. The system has been in practical use for more than half a year, and the results obtained support the efficiency of the retrieval platform.
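The two most concrete techniques named in the abstract can be illustrated with a minimal sketch. It is written in Python for brevity (the thesis platform is implemented in C++) and is not the author's implementation. The first part shows forward maximum matching over a trie whose child links are hash maps, one plausible reading of the "hash and trie" dictionary; because the walk simply follows the trie as far as it goes, no fixed maximum word length is needed. The second part shows d-gap compression of a sorted posting list; the variable-byte coding of each gap is an assumption, since the abstract does not say how gaps are encoded. All names (`TrieNode`, `build_trie`, `segment`, `dgap_encode`, `dgap_decode`) are hypothetical.

```python
from typing import Dict, Iterable, List


class TrieNode:
    """One node of the dictionary trie; children are held in a hash map."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_word = False


def build_trie(words: Iterable[str]) -> TrieNode:
    """Insert every dictionary word into a trie, character by character."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def segment(text: str, root: TrieNode) -> List[str]:
    """Forward maximum matching: at each position emit the longest dictionary word."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        node = root
        longest = 0
        j = i
        # Follow the trie as far as the text allows; no fixed word-length cap.
        while j < len(text) and text[j] in node.children:
            node = node.children[text[j]]
            j += 1
            if node.is_word:
                longest = j - i
        step = longest if longest else 1  # no match: emit a single character
        tokens.append(text[i:i + step])
        i += step
    return tokens


def dgap_encode(doc_ids: List[int]) -> bytes:
    """Encode sorted, non-negative doc IDs as gaps (assumed variable-byte coding)."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        gap = doc_id - prev
        prev = doc_id
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # low 7 bits, continuation flag set
            gap >>= 7
        out.append(gap)  # final byte, continuation flag clear
    return bytes(out)


def dgap_decode(data: bytes) -> List[int]:
    """Invert dgap_encode: rebuild each gap, then prefix-sum back to doc IDs."""
    doc_ids: List[int] = []
    value = shift = prev = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += value
            doc_ids.append(prev)
            value = shift = 0
    return doc_ids
```

With a toy dictionary containing 中南, 中南大学, 大学, 全文, 检索, 全文检索 and 平台, `segment("中南大学全文检索平台", root)` prefers the longer entries 中南大学 and 全文检索 over their shorter prefixes. For postings, small gaps between neighbouring document IDs fit in one byte each, which is where the space saving of gap coding comes from; the 75% figure in the abstract is the thesis's own measurement and is not reproduced by this sketch.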

  • 【Online publication contributor】 Central South University (中南大学)
  • 【Online publication year and issue】 2010, Issue 04