
The Optimized Design of Parallel Search System Based on Nutch (original title: 基于Nutch的并行搜索系统的优化设计)

【Author】 陈车前

【Advisor】 董守斌

【Author information】 South China University of Technology, Computer Architecture, 2011, Master's thesis

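【Abstract (translated from the Chinese)】 The open-source Nutch system has greatly promoted the development of search engines for enterprise, campus, and even personal networks. It offers the basic functions of a complete commercial search engine, including a crawler, an indexer, and a searcher, and users can build a dedicated search system on top of it to suit their needs. To increase the volume of data it can handle, Nutch provides a multi-machine search solution. It uses Hadoop to provide powerful distributed crawling, which greatly increases crawl speed, but its support for indexing and searching is insufficient and requires considerable manual work. Moreover, with the rapid growth of the Internet and the massive increase in web page data, performance becomes a bottleneck even for multi-machine search. In view of this, this thesis proposes an optimized parallel search design based on Nutch, providing a high-performance search solution for enterprise retrieval applications. First, the thesis uses Shell scripts and configuration files to control the system's workflow, which eliminates most manual operations, gives Nutch a complete multi-machine indexing and searching solution, and improves the system's automated management and maintainability. Second, it proposes a simple and efficient index partitioning method: URL-based index partitioning. The hash of a page's URL is taken modulo the total number of compute nodes to obtain the number of the node the page is assigned to. On the one hand, the URL alone uniquely determines the node that holds a page, so the method handles dynamic index updates effectively; on the other hand, because URL hash values are highly random, the modulo operation obeys Bernoulli's law of large numbers, so pages are distributed very evenly across the compute nodes. Third, it implements an efficient "static + dynamic" caching mechanism. The static cache is built by extracting a set of popular queries from user search logs and storing their search results in the cache; the thesis proposes selecting popular queries by their public attention and by the stability of their popularity. The dynamic cache stores current search results under a replacement policy. The static and dynamic caches each have strengths and weaknesses, but these are complementary, so combining them raises the cache hit rate more effectively. Experiments show that with caching added, the system substantially reduces the number of times the full search process must run, which considerably improves performance. Finally, to improve the system's stability, the thesis proposes a simple redundant backup mechanism, "one-level cyclic redundant backup": each node backs up the index data of the node numbered immediately before it, and if that preceding node suddenly crashes, the search service on the next node is started in its place. Experiments show that when some nodes crash the system still returns correct search results; performance drops slightly, but the accuracy of the search results is preserved.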
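
The URL-based partitioning rule and the one-level cyclic backup described in this abstract can be illustrated with a short Python sketch. It is not the thesis's implementation: the function names (`primary_node`, `backup_node`, `route_shards`) are hypothetical, and MD5 is only an assumed stand-in for the URL hash function, which the abstract does not specify.

```python
import hashlib
from typing import Dict, Optional, Set


def primary_node(url: str, num_nodes: int) -> int:
    """Assign a page to a node: hash(url) mod num_nodes.

    MD5 is an assumption made for this sketch; any stable,
    well-mixed hash would serve the same purpose.
    """
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def backup_node(node: int, num_nodes: int) -> int:
    """One-level cyclic backup: node i's index is copied to node (i + 1) mod N."""
    return (node + 1) % num_nodes


def route_shards(num_nodes: int, alive: Set[int]) -> Dict[int, Optional[int]]:
    """For each index shard, pick the node that should serve it.

    Shard i is served by node i when it is alive, otherwise by its
    backup node; None means both copies are unavailable.
    """
    routing: Dict[int, Optional[int]] = {}
    for shard in range(num_nodes):
        if shard in alive:
            routing[shard] = shard
        else:
            backup = backup_node(shard, num_nodes)
            routing[shard] = backup if backup in alive else None
    return routing
```

Because the node number depends only on the URL, an updated page always lands on the node that already holds its old index entry. With four nodes and node 2 down, `route_shards(4, {0, 1, 3})` sends shard 2 to node 3, which holds its backup copy.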

【Abstract】 The emergence of the open-source Nutch system has greatly promoted the development of enterprise, campus, and even personal web search engines. Nutch has the basic functions of a complete commercial search engine, including a crawler, an indexer, and a searcher, and users can build their own search systems on top of it. To increase the amount of data that can be searched, Nutch provides a multi-machine search solution. It uses Hadoop to provide powerful distributed crawling, which greatly improves crawl speed, but it does not provide enough support for the indexer and searcher, so human intervention is needed to run those parts across multiple machines. As the Internet develops and the number of web pages grows rapidly, performance becomes a bottleneck even for the multi-machine version of Nutch. In view of this, this thesis proposes an optimized Nutch-based parallel search design that provides a high-performance solution for the multi-machine version of Nutch. First, it uses Shell scripts and configuration files to control the system's operation, which eliminates most manual operations and improves the system's automated management and maintainability. Second, it proposes an efficient index partitioning method that partitions by URL. Because a URL uniquely identifies a web page and the hash code of a URL is highly random, this method both solves the problem of dynamic index updates and divides web pages evenly among the nodes. Third, it implements a "static + dynamic" caching method that effectively reduces the number of times the full search procedure must be executed and thus greatly improves system performance. Here, a static cache is one whose content does not change, and a dynamic cache is one whose content is exchanged frequently under a cache replacement policy. Finally, to improve the stability of the system, it presents a simple redundant backup method, "one-level cyclic redundancy": each node backs up the index of the preceding node, and when the preceding node suddenly fails, the next node automatically starts up to replace it, which ensures the accuracy of the search results.
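
The "static + dynamic" caching idea mentioned in the abstract can be sketched as a two-tier lookup. This is a minimal illustration under stated assumptions rather than the thesis's design: the class name `HybridCache` and the `backend` callable (standing for the full search process) are hypothetical, the static table is assumed to have been built offline from query logs, and LRU is chosen here only as one possible replacement policy, since the abstract does not name the policy used.

```python
from collections import OrderedDict
from typing import Callable, Dict, List


class HybridCache:
    """Two-tier result cache: a fixed static tier plus an LRU dynamic tier."""

    def __init__(self, static_results: Dict[str, List[str]], dynamic_capacity: int):
        # Static tier: results for popular queries, never modified afterwards.
        self._static = dict(static_results)
        # Dynamic tier: recent results, ordered from least to most recently used.
        self._dynamic = OrderedDict()
        self._capacity = dynamic_capacity
        self.hits = 0
        self.misses = 0

    def search(self, query: str, backend: Callable[[str], List[str]]) -> List[str]:
        if query in self._static:
            self.hits += 1
            return self._static[query]
        if query in self._dynamic:
            self.hits += 1
            self._dynamic.move_to_end(query)  # mark as most recently used
            return self._dynamic[query]
        # Miss in both tiers: run the full search and remember the result.
        self.misses += 1
        results = backend(query)
        self._dynamic[query] = results
        if len(self._dynamic) > self._capacity:
            self._dynamic.popitem(last=False)  # evict the least recently used entry
        return results
```

The static tier protects long-term popular queries from being evicted by bursts of one-off queries, while the dynamic tier picks up queries that became popular after the logs were analyzed, which is the complementarity the abstract refers to.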

【Key words】 Nutch; Parallel Search; Index Partition; Cache; Redundant Backup
  • 【Classification number】 TP391.3
  • 【Downloads】 216