节点文献
基于对等网络的大地规模内容检索研究
Peer-to-Ppeer Based Large Scale Content Retrieval
【作者】 陈汉华;
【导师】 金海;
【作者基本信息】 华中科技大学 , 计算机系统结构, 2010, 博士
【摘要】 随着网络技术的迅猛发展和网络应用的迅速普及,互联网日益形成一个巨大的分布式信息库。互联网应用产生的超大规模信息对现有的网络数据管理基础设施提出了新的严峻挑战。互联网信息库的无限扩张性和与生俱来的分布式特性使研究非集中式的数据管理和共享机制成为一种必然趋势。基于分布式技术的大规模内容检索研究具有重要的学术价值和应用价值。对等网络(Peer-to-Peer Network,简称P2P)打破了传统的“客户机/服务器”模式,以“自主、平等”的原则将处于网络边缘的计算、存储、通信、信息等各种资源高效地共享起来,形成分布式的协作网络。对等计算模型凭借其分布式、易扩展、容错性高等优点,日益在互联网信息共享方面显示出巨大潜力。然而,对等网络的分布式、动态性、异构性等特性,又给基于对等网络的大规模内容检索带来了巨大的挑战。首先,虽然分布式哈希表技术使现有的对等网络系统能准确、快速地定位全局数据对象,但分布式哈希映射的精确性与用户查询语义多样性的矛盾,却是构建大规模对等网络内容检索系统带来难以突破的瓶颈;其次,由于缺乏集中的索引服务器,传统集中式信息检索的模型、算法和技术在分布式对等网络环境下无法适用。大规模分布式内容检索系统的核心问题,即如何建立高效的分布式索引以支持大规模网络环境下的复杂内容检索,在国际学术界至今并没有有效解决。基于对等网络的大规模内容检索是一个极具挑战性的开放性课题。本文从这一核心问题出发,通过扩展传统对等网络的概念、结构、资源描述与组织、资源发现与路由、结果融合与排序等,在大规模对等网络内容检索方面作了一系列研究,提出了一套行之有效的新理论、新方法,全面、深入、系统地论述了利用对等网络构建大规模分布式文本内容检索系统的解决方案和关键技术。具体来说,本文主要提出了以下创新性理论或方法:1.分布式集合运算布隆滤波优化理论及其多关键字搜索协议:基于传统的分布式哈希表全局索引,进行多关键字搜索,需要在广域网上进行分布式集合运算,这将给系统带来难以接受的网络开销。本文针对此难题,提出了一套针对分布式集合运算的布隆滤波优化理论,并基于此优化理论设计了一种高效的多关键字搜索协议PWEB。在美国国家标准研究院发布的TREC WT10G大规模文本检索测试集以及主流商业Web搜索引擎的查询日志上对PWEB进行了大规模的模拟测试。实验结果表明,相对现有结构化对等网内容搜索协议,PWEB协议将查询所需的网络流量显著降低了73%,同时将查询延迟降低了41%。2.多维分布式哈希表技术及其全文索引、检索及排序策略:提出一种新颖的多维分布式哈希表技术用于更高效的支持全文索引和检索,并设计了一种分布式多维索引剪枝算法TSS。基于TREC WT10G数据集和主流商业搜索引擎查询日志的大规模实验结果表明,TSS显著地将分布式多维索引空间复杂度从O(2n)降低到了O(nlog n);将查询网络流量降低到现有算法的28%;大规模实验结果同时显示TSS算法获得了与传统集中式信息检索算法相当的检索质量和性能。3.基于语义拓扑的联邦式搜索策略:基于自主开发的P2P文献共享平台SemreX,证实了对等网内容共享网络中的“兴趣局部性”原理,基于此原理提出一种结点内容相似性度量模型,并采用此模型将对等网络中的相似结点聚集起来形成语义覆盖网络,同时进一步探索了如何利用“small world”特性改进语义覆盖网络的拓扑属性。对提出的算法进行的全面仿真测试结果显示基于语义覆盖的SemreX联邦式搜索协议将传统无结构搜索协议的总体性能提高了81.6%。4.难度感知的混合式对等网络搜索协议:通过结合结构化DHT和无结构对等网搜索协议各自的优点,混合对等网搜索策略能有效提高对等网系统的检索效率。混合对等网搜索策略的关键问题是如何高效估计网络中拥有与查询相关数据的结点的数量,并据此选择最优的查询搜索策略。现有研究基于这样的假设:如果网络中与某查询匹配的相关数据很多,则这些数据广泛地分布在网络中,对此查询使用无结构搜索协议更有效;反之,则采用分布式哈希表查找更有效。从“兴趣局部性原理”出发,指出前人的研究假设并不成立,与查询匹配的大量数据往往聚集在少量结点上,而使无结构搜索协议效率显著降低。并进一步提出了一种查询难度感知(Difficulty-aware)的混合搜索协议QRank,它能够根据查询关键字在网络中出现的频率等统计信息有效预测各种搜索策略针对此查询的搜索效率,并智能地选择高效的搜索策略。基于Gnutella网络的真实拓扑和查询跟踪数据对QRank的协议进行了大规模全面的系统仿真测试。实验结果表明QRank混合搜索协议显著地提高了混合对等系统的搜索性能。相对于现有混合搜索协议,QRank将系统查全率提高了21%,将查询延迟降低了26%,同时将查询产生的平均网络流量降低了40%。
【Abstract】 The Internet has become an extremely large-scale distributed information repository with the prevalence of network applications and the rapid development of networking techniques. The extremely large-scale information with unlimited expansion brings new challenges to the current Internet data management infrastructure and makes the topic of buding decentralized data management and sharing infrastructure the inevitable research trend. Large-scale distributed content retrieval has an important academic and practical value.Peer-to-Peer (P2P) network is one of the state-of-the-art distributed systems built over the Internet. It breaks the traditional client-server model by introducing the concept of peer which basically acts as both a client and a server in the network. The P2P model efficiently harnesses resources such as computation, storage, communication, and information residing at the edges of the Internet to form a distributed cooperative network in an independent and equal manner. P2P networks demonstrate many good features such as decentralization, good scalability, and the ability to tolerate faults, hence it indeed has a great potential to become a major infrastructure for sharing information on the Internet.However, P2P networks are distributed, dynamic, and heterogeneous in nature, making it extremely difficult to build a large-scale content retrieval system without centralized repository and index. First, by utilizing distributed hash tables (DHTs), existing P2P data sharing systems can only provide exact-match style search support (such as file-name based search) with poor complex content retrieval capacity. Second, without a cetralized index existing state-of-the-art retrieval models and algorithms extensively used in centralized search engines are not applicable to such a distributed environment. Overall, the key issue - how to build efficient global index that supports complex queries and enables novel distributed retrieval model has not been effectively solved up to now, making large-scale P2P content retrieval a challenging and open research issue.To address this issue, we extend dimensions of current P2Ps including the concept, architecture, resource description and management, routing mechanisms, and result merging and ranking mechanisms. In this dissertation, we focus on the solutions to build scalable content retrieval in large-scale P2P networks, and propose a set of novel theories and enabling techniques.Specifically, the contributions of this dissertation are fourfold.1. Theory of optimizing Bloom Filter (BF) for distributed set operations and multi-keyword search protocol over structured DHTs using optimal BF settings. Multi-keyword searching over DHTs needs distributed set operations over wide area network, raising unacceptable bandwidth cost. To address this issue, we propose a novel Bloom Filter optimizing theory and design PWEB, an effective and efficient multi-keyword search protocol over DHTs based on the theory. We conduct comprehensive large-scale trace-driven simulations to evaluate PWEB design using National Institute of Standards and Technology (NIST) benchmark TREC WT10G large-scale standard testing collection and the query logs from a major commercial Web search engine. Results show that PWEB significantly outperforms existing techniques. It reduces the network traffic by 74%, as well as reduces the search latency by 41%.2. Multi-dimensional distributed hash table technique and global full-text indexing searching and ranking strategies. We extend the traditional DHT structure, and propose a novel multi-dimensional distributed hashing structure to support term-set indexing over distributed environments. A novel index pruning algorithm, TSS, is designed to reduce the index size. We base our design on the observations of the query logs from a major commercial search engines. We show through experiment that the TSS significantly reduces the space complexity of the global index from O(2n) to a formal bound of O(nlogn), reduces the communication cost to 28% of the cost of existing schemes, and achieves performance comparable to traditional popular centralized information retrieval algorithms.3. Federated search over semantic overlay. We design and implement SemreX, a P2P based literature sharing and retrieval system. The system work demonstrates the principle of interest locality in P2P systems. Based on the principle, we propose a novel peer similarity measurement model. Peers with high similarity values are linked together to form the semantic overlay. We further investigate how to leverage the small world topology features to improve the search performance of the semantic P2P overlay. Performance evaluation results show that SermeX federated search protocol based on semantic overlay can greatly improve the overall search efficiency by 81.6% compared to the popular Gnutella protocol, greatly improving the scalability of the unstructured P2P retrieval systems.4. Difficulty-aware hybrid search protocol in P2P networks. By combining structured DHTs and unstructured search protocols, hybrid P2Ps have the potential to effectively improve the performance of P2P retrieval systems. The key issue of hybrid P2P network is how to estimate the number of peers with matched items. Existing studies assume that such popularity can be obtained by estimating the number of relevant items in the distributed networks; that is to say if there are many matched items, they are also widely distributed across the network. In this case, flooding based search protocol will be more efficient, otherwise DHT-based search protocol will be more preferable. Based on the demonstrated principle of interest locality, we argue that the assumption of existing work is invalid and not practical in real world P2P systems. We further propose QRank, a novel difficulty-aware hybrid search protocol. It intelligently selects search strategies based on the ranked query difficulty estimated with the gathered statistical information of keyword popularity in distributed P2P networks. We collect the trace from popular real world P2P systems, and conduct comprehensive large-scale trace-driven simulations to evaluate the performance of QRank design. Results show that QRank hybrid P2P search protocol significantly improve the search performance for hybrid P2P networks. It improves the recall by 21%, reduces the search latency by 26%, and reduces the network traffic by 40%, compared to existing work.
【Key words】 Peer-to-Peer Network; Distributed Hash Table; Bloom Filter; Retrieval;