节点文献

基于局部网络信息的贪婪式P2P资源定位技术研究

Research on Greedy P2P Resource Location Technologies Based on Local Information of Networks

【作者】 马文明

【导师】 孟祥武;

【作者基本信息】 北京邮电大学 , 计算机科学与技术, 2013, 博士

【摘要】 P2P (Peer-to-Peer)技术目前已广泛应用于资源共享和内容分发服务,在工程和理论方面都成为了最活跃的研究领域之一。随着各种P2P内容共享网络规模的逐渐增大,节点扰动现象越来越严重。单个节点难以获取和维护网络中大部分节点的信息,节点的邻居数量相对于网络规模越来越小。因此,在节点仅掌握局部网络信息的情况下,如何使用贪婪式的定位技术帮助用户快速找到所需要的资源,成为了一个具有挑战性的研究课题。P2P网络资源定位的性能评价标准不仅包括搜索命中率,还涵盖了网络开销、存储开销、路由维护开销、负载均衡等多个方面,而性能的影响因素也涉及到搜索算法、网络结构、复制机制、路由构造机制等多个方面。提高P2P网络的资源定位性能,需要从多种技术角度来进行研究。P2P网络可分为非结构化和结构化两种类型。在节点仅掌握局部网络信息的情况下,非结构化P2P网络的搜索具有严重的盲目性。贪婪式定位方法在网络扰动比较严重、资源变化比较频繁时,各种性能急剧下降。本文针对该问题,从搜索机制、网络结构和复制机制三个方面进行了研究,提出了不同的解决方法,从而使网络能够在节点扰动频繁的环境中,保持较高的资源定位性能。在节点仅掌握局部网络信息的情况下,结构化P2P网络面临的主要问题是如何构造可提高贪婪式定位性能的路由表,以及降低贪婪式多关键字搜索造成的过高的通信量。本文针对当前的主流DHT (Distributed Hash Table)型结构化P2P网络进行了研究,形式化描述了构造路由表规律,并分析其缺点,给出了改进方法。对常见的Kademlia型结构化P2P网络给出了降低多关键字搜索通信量的实现方法。本文主要研究成果如下:(1)在非结构化P2P网络中,贪婪式搜索通常利用节点存储的历史搜索记录作为消息转发的重要依据。在节点仅掌握局部网络信息的情况下,节点频繁的加入和退出以及资源分布变化会导致历史搜索记录失效,降低搜索命中率。本文针对该问题,形式化地分析了随机漫步搜索易转向高度数节点的特性,逆向随机漫步的性质和单独使用逆向随机漫步的缺点,并提出了一种面向非结构化P2P网络的双向随机漫步搜索机制。这种机制采用Piggybacks方式,在正向搜索转发消息的过程中同时执行逆向搜索,将资源的存储信息、请求信息、更新信息嵌入到转发包中,并与转发过程中访问过的节点进行信息交换,沿搜索路径依次传递。本文对该机制的性能进行了理论分析,实验结果表明该方法能够以较低的开销,提高各种资源包括稀有资源的搜索命中率。(2) Q-learning型搜索是一种应用于非结构化P2P网络的典型的贪婪式搜索机制。尽管双向随机漫步搜索机制可以有效地提高搜索命中率,但转发包中嵌入了大量的额外信息,当资源ID及其状态标志等信息占用的字节数比较多时,容易造成过高的通信量。因此,本文提出了基于节点差异化的Q-learning搜索机制。该机制根据节点的在线时长记录和所存储的资源种类,将节点分成三种类型:服务节点、稳定节点(代理节点)、普通节点。这三种类型的节点分别组成核心服务区、代理区和普通区。节点包括两种偏好:本地偏好和服务偏好。不同类型的节点按照服务偏好进行连接,相同类型的节点按照本地偏好进行连接。Q-learning模型根据资源的种类而不是资源本身来建立且仅在核心服务区内执行Q-learning搜索和模型的更新。普通节点或代理节点需要间接或直接选择合适的服务节点来执行Q-learning搜索。本文从多个方面分析了影响Q-learning搜索性能的各种因素。模拟实验结果表明,当网络扰动比较频繁时,在节点差异化的网络结构中,Q-learning型搜索仍然可以保持较高的命中率。(3)当资源的副本数量和资源的请求数量都比较少时,在节点仅掌握局部网络信息的情况下,对该资源副本和资源请求同时进行Flooding式主动复制是提高搜索性能的一种重要方式,然而这种方式很难控制资源副本和请求副本的数量。针对该问题,本文在对多种P2P网络拓扑结构的行为传播进行了分析,并提出了一种分布感知的协同主动复制机制。该机制执行受限的Flooding式主动复制,节点根据网络局部范围内的资源副本和请求副本分布情况,决定是否执行复制和继续进行传播。本文在模拟网络环境中实现了该方法,并对比和分析了不同网络拓扑结构中,影响副本传播范围和数量的各种因素。(4)在DHT型结构化P2P网络中,资源的索引通常发布到一组特殊节点上,这些节点的ID与资源Key值比较接近。使用基于局部网络信息的贪婪式定位时,每个节点寻找或发布同一个资源,最终发现的目标节点集合并不完全相同。因此,提高搜索命中率,就需要最大化相同目标节点数量,而路由表的构造机制起到了至关重要的作用。本文对该情况下的结构化P2P网络路由表构造机制进行了研究。首先假定网络节点数与ID空间大小相等,网络不存在扰动,对路由构造问题给出了形式化描述方法,并将该问题转化为等价的Change-making问题,基于Canonical原则,得出了2k系统为该假定条件下的最优的路由构造方式。然后延伸到实际的网络环境,并得出2k系统依然是合理的构造方式。最后对ID空间过大的2k路由表结构的缺点进行了分析,提出了一种ID冲突允许的路由表构造方法,并通过实验从多个角度对比和分析了这种网络与典型的Kademlia型网络之间的资源定位性能,结果表明这种方法在节点仅掌握少量网络节点信息的情况下,仍保持较高的搜索命中率。(5)由于寻找或发布相同资源,使用基于局部网络信息的贪婪式定位最终发现的目标节点集合不完全相同,为提高搜索命中率,相同的索引信息会分散存储在多个不同节点上。使用多个关键字寻找资源时,需要寻找每个关键字对应的资源,并将每个关键字分散存储在多个节点上的资源信息进行并集操作,然后再将不同关键字的结果信息进行交集操作。由于单个关键字对应的资源数量比较多,并集和交集操作过程会造成巨大的通信量。本文提出通过优化的布隆过滤器进行搜索结果合并,来降低Kademlia型网络多关键字搜索通信量。这种方法利用通信代价和丢失率评估模型,仅根据集合的大小就可获得优化的布隆过滤器参数。实验结果表明,该方法在允许的丢失率范围内,可以极大地降低网络通信量。

【Abstract】 P2P (Peer-to-Peer) networks have been widely used for resource sharing and content distribution services. They have become one of the most active research areas both in the engineering and theoretical fields. In recent years, with the increase of network size, the nodes have much higher churn rate than before. A single node then cannot maintain the information of most of the others’ones in the whole network, and the number of neighbors of each node is far smaller than the network size Therefore, efficiently locating resources using greedy search becomes a challenging research topic, if nodes only know local information of networks.The performance metrics of P2P network resource location include not only the search hit rate, but also the network overhead, storage overhead, routing table maintenance overhead and load balancing. The impact factors of performance include the search algorithms, network structure, replication mechanism, routing table structure and so on. To improve the performance, we should study the location technologies from multiple points of view.The P2P network can be divided into two types:unstructured and structured ones. When nodes only know local information of networks, locating resources in unstructured P2P networks is severely blind. Traditional greedy search algorithms of the unstructured networks suffer serious performance degradation. In this thesis, we study this problem from three aspects:search mechanism, network structure and replication mechanism. Different solutions are proposed, which can keep high resource location performance under high peers’churn rate. For structured P2P networks, we focus on routing tables design and multi-keyword search. In this thesis, we mainly study on the main stream DHT (Distributed Hash Tables) structured networks. We present methods how to design routing tables for increasing the search hit rate, and how to use the optimal Bloom Filters for decreasing the communication cost of multi-keyword search in Kademlia-like networks. The main contributions of this thesis are as follows:(1) In unstructured P2P networks, the greedy search algorithms forward the query messages depending on the historical search records stored on each node. When nodes only know local information of networks, high peers’churn rate makes the historical search records useless, which can decrease the search hit rate. To solve this problem, we formally analyze the characteristic of the random walk search, which has a high probability of forwarding queries to the high-degree nodes. We also analyze the disadvantages of only using the reverse random walk search. And then we introduce our bidirectional random walk search mechanism (BRWS). This mechanism runs in a piggybacks way. It both uses the reverse and forward random walk. The information of the resources’locations and status, requirements is embedded in the query messages, and exchanged along the search paths. The performance of this mechanism is analyzed theoretically. The experimental results show that BRWS can actually improve the search success rate with lower overhead even for the rare resources.(2) Q-learning search is a typical greedy search mechanism used in unstructured P2P networks. Although BRWS can effectively improve the search hit rate, but too much information is embedded in the query message. If the information of resources occupies many bytes, BRWS will cause excessive traffic. Therefore, we present another method to improve the performance of greedy search, which differentiate the peers to construct special network structures. This method divides the nodes into three types according to their online time and resources:service nodes, stable nodes, and ordinary nodes. Different types of nodes make up different areas:core service area, agent area and the ordinary area. Stable nodes are also called agent nodes. Each node has two kinds of preference:local preference and service preference. The different types of nodes are connected according to their service preference, and nodes belonging to the same type are connected according to their local preferences. Q-learning models are established according to the type of resources, not the resources themselves. We perform the Q-learning search and update models only in the core service area. Ordinary nodes or agent nodes need to select the appropriate service nodes, directly or indirectly, to execute Q-learning search. We analyze a variety of impact factors affecting the Q-learning search performance in such a network structure from several aspects. Simulation experimental results show that, even if the networks are disturbed frequently, the Q-learning search can still keep high hit rate.(3) When the number of copies of a specific resource and requests for this resource are relatively small, replicating this resource and the requests using flooding is a direct way to improve the search performance. However, it is difficult to control the number of the copies. To solve this problem, we first analyze the spread behavior of the P2P networks with different topologies, and then propose the distribution aware collaborative active replication mechanism. This mechanism is designed based on limited flooding active replication. Each node decides whether to execute replicating and spreading according to the number of copies in local area, which is obtained by distribution aware methods. We compare and analyze various factors affecting the replication spread behavior in a simulated network environment with multiple different topologies.(4) In DHT based structured P2P network, a resource index is usually published to a special set of nodes, the IDs of which are relatively closer to this resource’s key. When nodes only know local information of networks, the sets of target DHT nodes located by the queries or publications for the same resource have smaller intersection size using the greedy search. To enhance the search hit rate, we should maximum the intersection size, which is mainly determined by the routing table structure. The rules of building routing tables for structured P2P networks are explored. We first analyze a special situation that network size is equal to the ID space with no peers churn, and present formal descriptions for this situation. We transform the way of building routing tables in such a network into a Change-making problem. We find that canonical solutions systems are the best choice to construct routing tables. And then we extend to the actual network environment, and find that canonical solutions systems are still reasonable. Finally, we analyze the drawbacks of canonical solutions systems when the ID space is too large. An ID collision-allowed (ICA) DHT routing table structure is presented, and is compared with the typical Kademlia network from multiple aspects. The experimental results show that ICA DHT based networks can keep high search hit rate with low average degree.(5) The sets of target DHT nodes have small intersection size, when nodes only know local information of networks. To guarantee we can find the resources under high peers churn rate, we should publish the same resource on multiple different nodes. When performing multi-keyword search, we should find all the target DHT nodes responsible for each keyword, and then merge the search results. The results on the nodes responsible for the same keyword are merged by union operations, and then the union results for different keywords are merged by intersection operations. The number of the resources for a keyword, especially a common keyword, is very large. The union and intersection operations will cause huge communication cost. This thesis presents a method of performing union and intersection operations using optimal Bloom filters to reduce the Kademlia network multi-keyword search traffic. This approach optimizes the Bloom filter settings dynamically according to the communication cost and loss rate models. We can get the optimal Bloom filter settings only depending on the sizes of sets. The experimental results show that this method can greatly reduce communication cost under an acceptable loss rate.

节点文献中: 

本文链接的文献网络图示:

本文的引文网络