节点文献

布鲁姆过滤器查询算法及其应用研究

The Researches and Applications on Bloom Filter Query Algorithm

【作者】 谢鲲

【导师】 张大方;

【作者基本信息】 湖南大学 , 计算机应用技术, 2007, 博士

【摘要】 资源交互共享是计算机网络和分布式系统的核心,如何有效的表示信息和查询信息是资源交互共享中最本质的问题。高速发展的计算机网络和计算机系统中,当数据不断膨胀时,数据集合的表示和访问越来越困难。因此,设计精简数据结构支持日益增长的数据存储需求,设计与之对应的算法支持海量数据下的高效查询交互成为当前网络、数据库、分布式系统中资源交互共享的核心问题与严峻挑战。布鲁姆过滤器(Bloom filter)采用一个位串表示数据集合并能有效支持元素的哈希查找,是一种能够简洁的表示集合并支持集合查询的数据结构,广泛应用于数据库、网络和分布式系统中。本文从理论和应用两个方面对布鲁姆过滤器查询算法进行了深入的研究。系统地综述了布鲁姆过滤器查询算法迄今为止的主要研究成果,分析了目前布鲁姆过滤器查询算法的研究现状和缺陷,针对目前算法的不足,提出了分档布鲁姆过滤器查询算法、可扩展布鲁姆过滤器查询算法、联合多维布鲁姆过滤器查询算法、基于布鲁姆过滤器距离的集合变动评估算法,并探讨了布鲁姆过滤器代数运算和集合查询的关系。研究布鲁姆过滤器在分布式系统中的应用,提出了基于布鲁姆过滤器的节点轨迹标签P2P副本一致性维护算法和基于布鲁姆过滤器的混合移动自组织网络服务发现模型。本文的创新性成果主要体现在以下几个方面:1)提出代价敏感的分档布鲁姆过滤器查询算法针对现有的布鲁姆过滤器查询算法没有考虑查询失效代价这一缺陷,本文提出一种新的代价敏感的分档布鲁姆过滤器查询算法。它将元素根据不同的查询代价分为不同的子集,通过考查每档子集最低查询失效率的关系,建立由每档子集最低查询失效率表示的集合最低查询失效总代价目标函数,使用类目标函数梯度遗传算法获得每档的最优哈希函数个数ki,完成集合到向量的映射与查找。仿真实验结果表明,新的分档布鲁姆过滤器查询算法和标准布鲁姆过滤器算法相比,所用的查询计算时间基本相同,而查询失效总代价降低至少40%。2)提出基于H3哈希函数的可扩展布鲁姆过滤器查询算法布鲁姆过滤器可扩展性主要是当集合元素动态增长超出过滤器算法设计的容量时,如何调整布鲁姆过滤器参数,使得布鲁姆过滤器查询算法有较低的查询误判率,同时具有可接受的计算性能,保证过滤器的可用性。本文研究布鲁姆过滤器的可扩展性问题,提出基于H3哈希函数的可扩展布鲁姆过滤器查询算法,当集合元素增长超过布鲁姆过滤器集合容量限制时,通过增加成倍数扩大的布鲁姆过滤器向量来保持很低的误判率,利用H3哈希函数实现可扩展布鲁姆过滤器的设计以及过滤器中元素的插入、查询过程。实验分析表明,新的可扩展布鲁姆过滤器的元素查询误判率永远小于动态布鲁姆过滤器,平均为它的21.3%,且查询时间呈对数增长,解决了现有算法查询时间增长过快问题。3)研究多维布鲁姆过滤器查询算法针对现有多维布鲁姆过滤器查询算法(MDBF)在多维元素表示和查询时分割了各维属性为一体的缺陷,提出一种改进的两步表示和查询的联合多维布鲁姆过滤器查询算法(CMDBF)。CMDBF在MDBF的基础上,增加一个联合标准布鲁姆过滤器CBF,将多维元素整体经过两次哈希运算表示到CBF中。CMDBF的元素表示和查找分两步进行,将MDBF作为元素表示和查找第一步,第二步联合元素各属性值利用CBF进行确认。仿真实验表明,CMDBF相比MDBF查询误判率降低明显。4)探讨布鲁姆过滤器的代数运算布鲁姆过滤器是集合到向量的一个映射,本文探讨布鲁姆过滤器的代数运算和集合查询的关系。定义布鲁姆过滤器的“并”,“交”,“异或”,“补”,“差”代数运算,从理论和仿真实验两方面分析布鲁姆过滤器的代数运算和集合代数运算并集,交集,异或集,补集,差集的元素查询性质。理论分析和实验结果表明,布鲁姆过滤器的“并”,“交”运算能够支持集合并集交集的元素查询,这一结论可以大大简化利用布鲁姆过滤器进行的系统设计。5)提出基于布鲁姆过滤器的节点轨迹标签P2P副本一致性维护算法本文研究P2P系统副本一致性维护算法,从直接更改消息报文角度出发,提出一种基于布鲁姆过滤器的节点轨迹标签无结构P2P副本一致性维护算法。通过在传输消息的报文中添加已接收更新消息的节点轨迹地址链表标签,可在消息传输源节点进行冗余判断来减少冗余消息数目。因为直接存储节点地址轨迹标签算法的消息长度随着消息传输轮数和网络度数增加而不断加大,论文采用布鲁姆过滤器表示地址链表轨迹标签。通过布鲁姆过滤器这种简洁的结构表示地址链表,可以减少添加到报文中的轨迹长度,同时利用布鲁姆过滤器的“并”运算还可以简化传输节点的冗余判断。仿真实验表明:基于布鲁姆过滤器的节点轨迹标签算法可以大大降低冗余消息数目,提高P2P系统的可扩展性。副本节点网络连通性越强,消息数目和传输带宽减少越明显。6)提出基于布鲁姆过滤器的混合移动自组织网络服务发现模型研究服务发现中服务信息的精简存储和查询方法,提出基于布鲁姆过滤器的混合移动自组织网络服务发现模型。模型采用计数式布鲁姆过滤器表示注册服务目录,采用两层混合服务发现体系结构和两级服务信息存储方式。论文详细描述了基于布鲁姆过滤器的服务发布、服务查询、服务取消、服务注册信息的扩散与同步和节点移动时对应在服务协调者节点的相关操作和过程。定义布鲁姆过滤器距离,从分析布鲁姆过滤器的统计特性出发,提出了基于计数式布鲁姆过滤器距离的集合变动评估算法。将距离的评估算法用于服务注册信息的扩散与同步中,用于制定有效的服务注册信息发散和同步更新策略。实验仿真和理论分析表明,距离评估算法评估准确性高,准确率高达99.7%,提出的基于布鲁姆过滤器的混合移动自组织网服务发现模型具有良好的性能。

【Abstract】 Information representation and query are two core problems of exchanging data objects for most application systems. The size of data sets encountered in databases, networks and many other applications has expanded dramatically in recent years. When data sets become very big or expensive to access, querying is a challenge for accessing and representing the set. It becomes increasingly important to handle massive data set using compact data structure with efficient representation and membership query operations. Thus, how to make membership queries by the aid of a compact structure becomes urgent with the data expanding in modern computer systems. This is a common issue for distributed systems that need to share information.A Bloom filter is an excellent data structure that can succinctly represent a data set in order to support membership queries, and filter out effectively any element that does not belong to the set. It is widely used in databases, networks and distributed systems and it has great potential for distributed applications where systems need to share information about available data. This thesis surveys the mathematics behind Bloom filters, some important variations and network-related applications of Bloom filters. The analysis of current research shows that although Bloom filters are now starting to receive significant attention from the algorithmic community and there have been a number of recent results, there may well be further improvements to be found. This thesis concentrates on the algorithm of Bloom filter. Some creative works in this thesis are mainly focused on the following aspects:1) This thesis presents a novel Bloom filter, called Basket Bloom filter (BBF). The BBF differentiates elements in a data set depending on their query invalidation cost, by clustering elements into different baskets. The total query invalidation cost function is defined. In order to minimize the total query invalidation cost, the genetic algorithm is employed to find the optimal number of hash functions for every basket. Simulation results show that, the BBF has 40% lower total query invalidation cost than the standard Bloom filters while the executing time is almost the same.2) To solve the scalability problem of Bloom filters, this thesis presents a new design of a scalable Bloom filter (SBF) for an expanding data set. The SBF keeps a low false positive rate by adding Bloom filter vectors with double length when necessary. The thesis proposes algorithms for element insertion and query operation of SBF by employing the H3 class of universal hash functions. Theoretical and experimental results demonstrate that the new SBF provides false positive rate as low as 21.3% of the dynamic Bloom filter presented before and the querying CPU time increasing with logarithmic rather than linear. Therefore, the proposed SBF outperforms other current scalable Bloom filters significantly.3) Based on the analysis of the current Multi-dimension Bloom filter (MDBF), this thesis find that MDBF separates each attribute from whole element and any attribute’s false positive will result in the whole element’s false positive. This thesis proposes a new Multi-dimension Bloom filter, termed Combine Multi-dimension Bloom filter (CMDBF). CMDBF adds a Combine Bloom filter to represent the whole element comparing with the MDBF. When represent or query an element, CMDBF needs two steps, the first step is hashing or checking every attribute with the MDBF, the second step is further hashing or checking the whole element with Combine Bloom filter for further validation. Simulation results show that the CMDBF outperforms MDBF for large reduction of false positive rate.4) This thesis comprehensively discusses the relationship between algebraic operations on Bloom filters and algebraic operations on data sets. This thesis completely define algebraic operations including OR, AND, XOR, NOT, MINUS on Bloom filter, and study the membership query performance on Bloom filter and data set. Theoretical analyses and simulation results show that the Bloom filter ORed (ANDed) from the original Bloom filters can support element membership query on data set ORed (ANDed) from the original data sets.5) This thesis presents a new trace label based consistency maintenance algorithm in unstructured P2P systems, with the trace label is denoted by Bloom filter. It modifies the message datagram by attaching the address list of peers to which message have been sent. This can help to tell the duplicated message from the source peer by the aid of the attached address list in message datagram. Considering that the address list can become longer with the transmit time lapsing and the degree of P2P increasing, in this thesis we use Bloom filter to denote the address list. The Bloom filter can succinctly represent the address list and simplify the query actions in the list by OR operations. Simulation results show that the new trace label based consistency maintenance algorithm can reduce the number of duplicated message largely. Moreover, the higher the degree of P2P, the more reduction of the number of duplicated messages and bandwidth utilization.6) This thesis presents a new hybrid service discovery scheme for MANET which is based on Bloom filter. In order to represent service information compactly and search service efficiently, the counting Bloom filter is used to represent the published service index. The scheme adopts two-level hybrid service discovery architecture and two-level storages. This thesis gives the main processes of server discovery, such as service publish, service query, service withdrawal, service consistency maintenance among several service coordinators, service update when node moving. The Bloom filter distance is also defined. A variation evaluation algorithm for dynamic data sets based on the Bloom filter distance is presented, called Bloom filter distance evaluation algorithm (BFDE). Experimental and theoretical results show that the BFDE has 99.7% evaluation accuracy rate and the scheme has merits of compact storage and good searching performance.

  • 【网络出版投稿人】 湖南大学
  • 【网络出版年期】2007年 06期
节点文献中: 

本文链接的文献网络图示:

本文的引文网络