节点文献

多布鲁姆过滤器查询算法及其应用研究

Multi-Bloom-Filter Query Algorithms and Their Applications

【作者】 田小梅

【导师】 张大方;

【作者基本信息】 湖南大学 , 计算机应用技术, 2013, 博士

【摘要】 布鲁姆过滤器是一种表示集合的空间高效的有损数据结构,支持快速的数据成员查询,能有效地过滤不属于集合的成员。布鲁姆过滤器被广泛应用于数据库、网络和分布式系统,它在需要共享现有数据信息的分布式应用系统中有巨大的应用潜力。针对布鲁姆过滤器算法和应用的研究已被越来越多的研究团体所重视,涌现出了大量布鲁姆过滤器算法的变种及相关应用的研究论文,而且这种快速发展的势头还将持续下去,必定会出现更多布鲁姆过滤器算法的相关变种及应用研究。通常我们使用布鲁姆过滤器的一般场景是:将集合S表示到布鲁姆过滤器这一精简结构中,在需要查询元素是否属于集合S时,使用布鲁姆过滤器而不是集合S本身进行集合成员查询,节约存储空间及提高查询的时间效率。目前大多数有关布鲁姆过滤器的扩展算法及应用研究主要是针对单个布鲁姆过滤器结构进行的。本文的研究工作则是考察如何使用多个布鲁姆过滤器结构进行相关查询,并将之用于分布式内容分发系统、数据同步系统等。本文对多布鲁姆过滤器查询算法从理论分析和实际应用两个方面进行了深入的研究。首先分析网络高速发展给多布鲁姆过滤器查询算法带来的机遇,指出多布鲁姆过滤器查询算法研究的重要意义。然后,概括了多布鲁姆过滤器查询算法的研究现状和多布鲁姆过滤器查询算法目前的主要研究成果。考虑到单布鲁姆过滤器查询算法在解决分布式数据分发及数据同步等问题时不能完全胜任,本文提出了使用多个布鲁姆过滤器结构进行查询的数个多布鲁姆过滤器查询算法,如双布鲁姆过滤器直接查询算法、计数布鲁姆过滤器代数运算查询算法、使用多个标准布鲁姆过滤器进行查询的数据调和算法及使用多计数布鲁姆过滤器运算的数据调和算法。本文的创新性成果主要体现在以下几个方面:1)探讨双布鲁姆过滤器直接查询法的查询性能探讨直接使用两个集合的布鲁姆过滤器结构查询集合并集、交集、补集、差集或对称差成员的性能问题,即双布鲁姆过滤器直接查询法的性能。理论分析和实验结果表明,双布鲁姆过滤器直接查询法能够较好地支持集合并集、交集、补集、差集及对称差的成员查询问题,其中使用双布鲁姆过滤器直接查询法进行并集查询及交集查询不会产生假阴性,仅有少量假阳性的存在,而双布鲁姆过滤器直接查询法查询补集、差集及对称差则除存在少量假阳性外,还存在少量假阴性,即部分本来属于补集、差集或对称差的元素被判为不属于补集、差集或对称差。2)研究多个计数布鲁姆过滤器向量进行代数运算(简称为计数布鲁姆过滤器代数运算)的性质由于在使用双布鲁姆过滤器直接查询法查询补集、差集及对称差元素时,存在假阴性问题,因此,我们尝试从计数布鲁姆过滤器向量运算的角度寻求能解决前述假阴性问题的方法,探讨两个或多个计数布鲁姆过滤器的代数运算和集合运算的一致性关系,研究使用计数布鲁姆过滤器代数运算进行集合成员查询的性能。理论分析和实验结果表明,计数布鲁姆过滤器的并、交、补、减、异或运算产生的新过滤器依然保持计数布鲁姆过滤器的特征,支持元素的删除操作,不会出现假阴性,能用于集合并集、交集、补集、差集及对称差的成员查询;与双布鲁姆过滤器直接查询法相比,使用计数布鲁姆过滤器代数运算后的过滤器进行补集、差集及对称差成员查询,不存在前述假阴性问题,空间效率能提高一倍,时间效率亦能显著地得到改善。计数布鲁姆过滤器代数运算的使用有利于进一步扩展计数布鲁姆过滤器的应用范围。3)提出基于多标准布鲁姆过滤器运算的精确集合调和方法分布式系统中,集合调和是指分布式节点交换各自节点的数据集合本身或数据集合的某种表示,找出集合的差集元素,进而获得数据集合并集的过程,在这一过程中,节点间花费的通信代价(节点间的消息交换轮数及传输消息位数)越少越好。集合调和问题对于分布式文件分发、闲谈协议、同步与复制协议等分布式计算应用来说,是一个重要的基础构件。本文在分析现有特征多项式插值精确集合调和法的工作原理的基础上,提出了一种基于多标准布鲁姆过滤器运算的精确集合调和方法(BFESR)。使用特征多项式插值调和法(Characteristic polynomial interpolation-based synchronization, CPISync)时有一个前提:插值时的求值点数要大于对称差规模(即|SA-SR|+|SB-SA|)。分布式系统中,对称差规模的上界值往往无法事先获知,从而不能简单地调用CPISync算法完成调和。现有的解决方法通常是使用试探法,即逐次增加求值点数和特征多项式值个数,直至求值点总数超过对称差规模,才能成功插值,实现集合调和。BFESR首先使用两个布鲁姆过滤器的内积运算或本文提出的准交集查询法估算出对称差规模;然后以逐轮增加的求值点和特征多项式值作为插值算法的输入,重复调用插值算法,直至确认成功;最后进行因式分解得到差集元素,进而获取并集完成调和。通常,BFESR调和过程中,调用一次插值算法即能成功。理论分析和实验结果表明,使用布鲁姆过滤器内积运算和准交集查询法估算对称差规模,其准确程度均较高,而且准交集查询法得到的估算值更为接近实际对称差规模。与已有的试探法进行比较,BFESR调和时间和消息交换轮数降低非常明显,尤其是使用准交集查询法估算对称差规模的BFESR方法,其调和效率更高。4)提出基于多计数布鲁姆过滤器运算的精确集合调和方法由于BFESR算法中使用的标准布鲁姆过滤器不支持集合元素的动态更新,若用于更新频繁的P2P网络等分布式系统则需要定时重建标准布鲁姆过滤器,这样会增加系统实现的负担及难度,因此,为解决BFESR调和算法的这一应用局限性,本文提出了一种基于多计数布鲁姆过滤器运算的精确集合调和方法(CBFESR),该方法将集合用计数布鲁姆过滤器表示,利用计数布鲁姆过滤器减运算得到的新过滤器,查询并获得集合中的差集元素,再用差集和自身集合进行集合并运算,完成集合调和。理论分析和在P2P系统中的仿真实验结果表明,CBFESR既具有精确集合调和能得到全部差集元素的优点,也具有近似集合调和仅需单轮消息交换、计算简单的优点。此外,由于计数布鲁姆过滤器支持集合元素的删除操作,因此,CBFESR非常适合应用于数据集合更新频繁的P2P网络等分布式系统。

【Abstract】 A Bloom filter is an excellent data structure that can succinctly represent a data set in order to support membership queries, and effectively filter out the elements that do not belong to the set. It is widely used in databases, networks and distributed systems and it has great potential for distributed applications where systems need to share information about available data. Many research communities have been taking deep researches into Bloom filter algorithms and their applications, and there will be more Bloom filter algorithms and their applications to be found.Usually we use Bloom filter in such a general scene:set5represented in Bloom filter BF(S), when we query whether an element x belong to S, in order to save memory space and to improve query time efficiency, we query BF(S) instead of S1to check whether x belong to S or not. At present most of the variants of Bloom filter algorithm and their applied researches have been conducted on single Bloom filter. The goal of this thesis was to study how to use multi-Bloom-filter query algorithm in P2P systems for content distribution or data synchronization applications.We conducted in-depth study for multi-Bloom-filter query algorithm in theoretical analysis and practical application. At first we analyzed the opportunities brought about to multi-Bloom-filter query algorithms by the rapid development of the network, and found out that the multi-Bloom-filter query algorithms were worth studying. Secondly we summarized the research status of the multi-Bloom-filter query algorithm and main findings of the multi-Bloom-filter query algorithm. Because single-Bloom-filter query algorithms can not be be fully qualified for solving distributed content distribution and data synchronization problems, we presented a few query algorithms using multiple Bloom filters, such as double-separate-Bloom-filter query algorithm, counting-Bloom-filter algebra operation query algorithm, counting-Bloom-filter based set reconciliation algorithm which use counting-Bloom-filter subtraction operation and Bloom-filter based exact set reconciliation algorithm which use multiple Bloom filters to reconcile sets. Some creative works in this thesis are focused on the following aspects:1) This thesis examined the problem of membership queries for sets’union, intersection, complementary set, sets’difference or symmetric differences between sets by using double separate Bloom filters. The theoretical analysis and experimental results show that the double-separate-Bloom-filter query algorithm can support membership queries for sets’union, intersection, complementary set, sets’difference or symmetric differences between sets, of which membership queries for sets’union or intersection have no false negatives, only a few false positives, while membership queries for complementary set, sets’difference or symmetric differences between sets have a small number of false negatives in addition to small amounts of false positives.2) Because membership queries for complementary set, sets’difference or symmetric differences between sets using double separate Bloom filters have a small number of false negatives, we attempted to seek solutions to this problem using algebra operations on counting Bloom filters. This thesis examined the consistence between algebra operations on counting Bloom filters and algebra operations on data sets, and studied the membership query performances of algebra operations on counting Bloom filters. Theoretical analyses and simulations show that the counting Bloom filter which is ORed(ANDed, COMPLEMENTed, SUBTRACTed, XORed) from the original counting Bloom filters can support membership query on the set ORed(ANDed, COMPLEMENTed, SUBTRACTed, XORed) from the original data sets. When using double-separate-counting-Bloom-filter query algorithm to query elements belonged to complementary set, sets’difference or symmetric differences of the two sets, some elements in complementary set, differences or symmetric differences of the sets will be misjudged, while the query method using algebra operations on counting Bloom filters has no false negatives and gain a remarkable improvement in space complexity and time complexity over the method using double-separate-counting-Bloom-filter query method, hence it can be applied to various network fields. For example, counting-Bloom-filter SUB operation can be used in set reconciliation protocol for distributed database or file systems.3) Set reconciliation aims at efficiently reconciling two similar sets held by different hosts while minimizing the communication complexity. Set reconciliation is a fundamental problem in distributed computing, arising in protocols for content delivery, gossiping, synchronization and replication. On the basis of analyzing the existing accurate set reconciliation algorithm using characteristic polynomial interpolation-based synchronization (CPISync), we presented an exact set reconciliation algorithm based on multiple Bloom filters (BFESR). There is a premise when we use CPISync to reconcile data:the number of evaluated points shoud be greater than the size of the symmetric differences between remote sets (|SA-SB|+|SB-SA|). In distributed systems, the size of the symmetric differences between remote sets is often unable to be known in advance, therefore CPISync algorithm can not be simply invoked to reconcile data. Existing solutions to this problem are usually heuristics that iteratively increase the number of evaluated points and characteristic polynomial values until the total number of evaluated points exceeds the size of the symmetric differences, then successful interpolation is achieved. These heuristics have too long reconciliation latency. BFESR has less reconciliation latency. The basic idea of BFESR is to obtain symmetric difference size between remote sets before calling CPISync algorithm, thus characteristic polynomial values for CPISync algorithm are wholesale transferred. At first, BFESR estimates symmetric difference size between remote sets SA and SB using Bloom filters, and calculates the same number of characteristic polynomial values as the estimated size of the symmetric differences between SA and SB, and then interpolates rational polynomials, finally recovers the sets SA-SB and SB-SA through factoring the rational polynomials and gets the union of SA and SB-Theoretical analyses and experimental results show that, compared with the existing methods for exact set reconciliation, BFESR needs only a single round-trip to get the accurate union of the sets in most cases, thus greatly reducing reconciliation latency. BFESR uses Bloom filters’inner-product or quasi-intersection query method to estimate the size of the symmetric difference between sets, both methods have higher estimating accuracy, and the quasi-intersection query method estimates closer to the actual size of the symmetric differences between sets. Compared with existing heuristics for exact set reconciliation, BFESR has very significantly less reconciliation time and fewer message exchange rounds, particularly when BFESR uses quasi-intersection query method to estimate the size of the symmetric differences between sets.4) Because Bloom filters used in BFESR does not support dynamic update of set elements, BFESR is not suitable for data reconciliation in update-intensive distributed systems. To solve this application limitation of BFESR algorithm, we present a new set reconciliation algorithm based on multiple counting Bloom filters, which we call Counting-Bloom-filter based exact set reconciliation(CBFESR). This method represents sets SA and SB as counting Bloom filters, subtracts SA’s counting Bloom filter from SB’S counting Bloom filter, the differences we denote CBF(SB)-CBF(SA), then determines SB-SA elements in SB with CBF(SB)-CBF(SA), and finally performs union operation on SB-SA and SA. Counting-Bloom-filter based set reconciliation combines the advantages of exact set reconciliation and approximate set reconciliation, besides gaining all SB-SA elements with single-round messgage exchange, it can also be applied to update-intensive distributed systems because counting Bloom filters support element deletion operation.

  • 【网络出版投稿人】 湖南大学
  • 【网络出版年期】2014年 07期
节点文献中: 

本文链接的文献网络图示:

本文的引文网络