节点文献

基于对等网络的资源搜索策略的研究

Research on the Resource Retrieval for Peer to Peer Networks

【作者】 徐婕

【导师】 金海;

【作者基本信息】 华中科技大学 , 计算机系统结构, 2007, 博士

【摘要】 近年来,随着Internet的飞速发展、网络带宽的成倍增加以及计算机计算能力的大大提高,对等网络逐渐引起了来自工业界和学术界越来越多的关注。对等网络通过对等和分布式的方式,在网络中不同节点间提供空闲的CPU处理能力,磁盘空间以及网络带宽。除了采用中央索引服务器的集中式对等网络之外,从网络拓扑上,对等网络还可以分为无结构对等网络和基于分布式哈希表的结构化对等网络。与任何大规模的分布式系统一样,对等网络系统成功与否不仅在于其网络结构的合理和有效,而且在很大程度上还取决于其资源搜索策略的灵活性和可扩展性。对等网络的资源搜索策略可以分为两种类型:基于关键字的搜索和基于全文的搜索。无结构对等网络采用类似泛洪的盲目搜索机制,虽然可以支持上面两种类型的查询,但搜索的效率和可扩展性都较低;结构化对等网络依据文档标识符进行查找,可扩展性和查找效率都较高,但是采用内容的哈希(Hash)值作为索引,其索引和内容语义无关,无法真正做到全文检索。对等网络的资源搜索的研究中,逻辑拓扑结构的组织方式和数据索引的放置方式是两个很重要的研究内容。在基于关键字检索的对等网络平台的设计中,首先充分考虑节点网络邻近性特征,通过节点类划分的方法将物理距离较近的节点归为一组,以确保网络中邻近节点间的路由过程大部分能在组内部完成,从而避免了Chord等网络存在的绕路问题,并能够降低系统路由时间开销,减少发送消息的数量,据此提出了使用Landmark策略进行节点类划分的方法;然后,由于网络中的工作负载都是有空间和时间的局部性,而且用户总是趋向于查找自己感兴趣的资源,这些资源通常属于同一个类别,因此根据数据的类别进行数据索引的存放,使得同一类数据索引放在相近的节点上,然后根据类检索表,可以快速的找到同一类别的数据;另外,小世界现象在网络中广泛存在的,将非确定性缓存策略应用于路由表,然后采用SW(Small World)缓存置换策略,使得对等网络能够逐渐收敛于小世界模型;在最后的理论分析和性能测评中,表明了这种策略能够提高系统的查找性能并且可以减少系统的维护开销。在基于全文检索的对等网络平台的设计中,首先考虑了数据索引的放置和定位策略。使用一个平衡树(文档聚类树)来组织对等网络环境中的共享数据,通过调节平衡因子的大小,可以控制和减小文档搜索的时间复杂度;然后,给出一个简单的树节点放置策略,从而保证了系统的负载均衡以及保证了系统的容错性;随后,提出TRES-CORE查询策略,使得每个查询对每个节点只操作一次,降低了分布式环境下的查询时间,避免了查询中的路由绕路问题。另外,使用向量空间模型(Vector Space Model,简称VSM)技术提取全文的数据索引,通常会有成千上万个关键字,对应了成千上万维的特征空间。这些高维的特征集对资源索引的建立是非常有害的。进而,提出了基于粗糙集的文档空间降维技术来提高资源的搜索性能。逻辑拓扑结构的组织方式,是基于全文检索的对等网络平台设计中的另外一个重要研究内容。首先,采用分层的方式给出了一个对等网络资源检索的通用模型,并设计了每层之间的接口。采用分层的结构,可以使得模型具有较强的适应性,当某层策略发生改变时,其它层次的策略可以不变。随后,基于扩展性好、查询效果好、不存在系统瓶颈以及能够支持全文检索的目标,提出了一个半结构化混合模型。在半结构化混合模型中,所有节点根据物理位置分成若干个节点类。每个节点类中存在一个超级节点(Super Peer,简称SP)和若干个普通节点(Ordinary Peer,简称OP)。超级节点采用分布式哈希表(Distribute Hash Table,简称DHT)的方式进行组织,每个节点类中所有的节点以非结构化的方式组织。在这里,半结构是指在系统的构造中同时存在结构化和非结构化的组织方式;混合是指在系统中仍然存在超级节点SP,但此时SP记载用于节点和文档的分类信息,并不维护整个节点类中所有共享文档的索引信息,从而部分解决了SP是系统瓶颈的问题。在最后的实验测试中,可以获知这个设计在能够完成全文搜索的前提下,在资源搜索效率上也有一定的提高。

【Abstract】 With the rapid growth of Internet and computing power, peer-to-peer (P2P) systems have gained much attention from both industrial and academic fields. P2P systems share idle CPU power, free disk space and network bandwidth between different peer nodes in a distributed and equal way. Expect for centralized systems based on an index server, P2P systems can be roughly classified into two categories: unstructured P2P systems and DHT-based structured P2P systems. As for any large distributed system which is used heavily, the effectiveness of P2P systems largely depends on not only its topology structure, but also the versatility and scalability of its retrieval mechanism.The resource retrieval mechanism for P2P systems can be classified into two categories: keyword-based retrieval and fulltext-based retrieval. Resource retrieval mechanisms in unstructed P2P systems are inherently blind, which makes the search inefficient and unscalable. While structured P2P networks can provide search efficiency and scalability by deploying identifier-based retrieval mechanism, they fail to support flexible full-text retrieval just as unstructured P2P systems can do.For the research on the resource retrieval in P2P systems, there are two important research fields: the logical topology structure of network system and the placement of resource index.For the P2P system which supports the keyword-based retrieval, the Landmark scheme is proposed to group all of peer nodes into several clusters based on the physical topology of network firstly, which makes peer nodes in the same cluster have small link latency and peer nodes in the different cluster have long link latency. It can guarantee most of the routing is in the same cluster which can avoid the“reroute”in the Chord system and can reduce the time cost for the routing as well as the number of messages. Then, because it is obvious that P2P system workload has temporal and spatial localities just as that in the web traffic and users always retrieve data of a kind, which they are interested in, the resource index should be stored based on resource semantics which makes the same kind of resources placed in the same cluster. After that, a class cache table is utilized to cache the identifier of peer node where the resource of some kind searched recently stores and the identifier of this kind. If the resource of this kind is researched next, the information of cache table can be used directly. Lastly, because it has been observed that the small world phenomenon is pervasive in the network. A non-deterministic caching scheme is given to reduce maintenance cost for updating the routing cache table. And the SW cache replacement scheme with the small-world paradigm instead of the traditional LRU scheme is proposed to further improve the performance of object lookup. Both theoretical analysis and simulations show this scheme can improve the lookup performance as well as it can reduce maintenance cost under the same size of routing table.For the P2P system which supports the full-text retrieval, the placement scheme of resource index is considered firstly. A height-balanced tree structure DOC-Tree used to organize data objects in vector-format in the P2P system is proposed, which can reduce the time complex of search. The simple strategy for the placement of tree’s nodes is given, which can guarantee both load balance and fault tolerance. After that, TRES-CORE searching scheme is used to reduce the search time in the distribute environment. The resource index is extracted using the vector space model technology, which will result in hundreds or thousands of dimensions in the resource vector space. So a dimension reduction technology based on the rough set is presented to improve the efficiency of search mechanism.The logical topology structure of the network system is another research field for the P2P system which can support the full-text retrieval. Firstly, a general hierarchical model for the resource retrieval of P2P systems is presented and interfaces among each level are also given. Then, a semi-structural and hybrid logical network structure (SSH) is proposed, which can obtain a good scalability for the system and a good efficiency for the search mechanism as well as can also avoid the system bottleneck and support the full-text retrieval. In the SSH model, all of peer nodes are partitioned into several peer clusters according to their physical locations. There are a super peer (SP) and several ordinary peers (OP) in each peer cluster. Super peers are organized by Distribute Hash Table (DHT) and peer nodes in each cluster are organized in the unstructured way. At last, it is known this design can get an improvement for the resource retrieval in the P2P system through the experiment results.

节点文献中: 

本文链接的文献网络图示:

本文的引文网络