节点文献

基于主存的高维空间连接及查询算法研究

Research of High-Dimensional Space Join and Query Algorithms Based on Main-Memory

【作者】 刘艳

【导师】 郝忠孝;

【作者基本信息】 哈尔滨理工大学 , 计算机应用技术, 2011, 博士

【摘要】 许多新出现的数据库应用,如CAD、多媒体、医学图像、时间序列、分子生物学和科学数据库,将它们的数据表示为多/高维特征向量。每个特征向量由d个值组成,可以解释为d维空间中的坐标以及一些相关内容的数据。通常使用两个特征向量之间的距离表示原始高维对象之间的(不)相似性。在许多应用中,需要结合两个数据集中的点对,该操作本质上是高维连接(高维数据空间中各种相似连接的统称)。到目前为止,对高维连接的大多数研究主要集中在基于磁盘的大量数据的高维连接。目前计算机可用主存的容量越来越大且价格逐渐低廉,以及对高维连接的有效处理的需求表明,一大类问题的高维连接能够在主存中处理。△-tree是一种新提出的多层索引,能够加速主存环境中的高维查询,已被证实优于其它主存索引。因此,本文以△-tree为基础,设计了几种类型的高维连接及查询算法,以满足不同应用的需求。本文的主要工作可概括为以下5点:(1)研究了高效的主存索引△-tree及相似连接的特性,分析了△-tree中节点的修剪策略。充分利用△-tree的特性,为单个数据集提出了主存自相似连接算法△-tree-Join。充分利用聚类重合的特性,以△-tree中各节点的中心为聚类中心,设计了主存索弓△-tree-S,基于△-tree-S采用自顶向下的模式为两个不同的数据集提出了主存相似连接算法△-tree-Join*,在这两个算法中使用较少的维数计算节点之间的距离及数据点与节点之间的距离,通过该距离过滤掉不必要的节点和数据点,减少计算量,从根本上提高主存相似连接的效率,实验结果表明这两种连接算法均具有较高的连接效率。(2)分析了kNN查询的搜索半径未知的特点,对主存中的高维k最近邻(kNN)查询算法进行了研究。基于△-tree提出了三种用于高维数据的主存kNN查询算法,通过性能分析和实验证明它们比已有的算法具有更高的查询效率,其中以第三种查询算法BU_DF_knn_Search效率最高。该算法首先通过隶属节点的定义确定查询点在△-tree中的隶属叶子节点,并在该节点中进行kNN搜索,利用较优的kNN候选尽快缩小修剪距离,然后按自底向上的顺序对以隶属叶子节点的各祖先节点为根的子树进行深度优先遍历。(3)分析了kNN连接的性质,对主存中的kNN连接算法进行了研究。根据需求提出了主存kNN连接的索引结构△-tree-R,该索引对生成的每个节点进行了编码。基于△-tree-R和△-tree-S提出了主存kNN连接算法,该算法利用△-tree-R中的编码在△-tree-S中进行解码,快速定位位置对应节点,利用其中的近优kNN候选,并采用自底向上和深度优先搜索相结合的遍历策略,快速缩小kNN修剪距离。实验结果显示A-tree-KNN-Join是一种有效的主存kNN连接算法。(4)分析了RkNN查询的解决方法及高维RkNN查询的特点,得出选择自修剪方法解决高维RkNN查询的结论。基于△-tree-R提出了△-Rdknn-tree索引结构,通过在该索引结构上进行△-tree-KNN-Join的自连接算法,预处理数据集中点的kNN距离信息,并将这些距离扩展到索引△-Rdknn-tree的各层节点中,基于该索引给出了主存反向k最近邻(RkNN)查询算法。(5)研究了主存环境中面向集合的RkNN查询即主存RkNN连接问题。为进行RkNN连接的两个数据集分别构建索引△-Rdknn-tree和△-tree,利用△-Rdknn-tree中各节点具有的RyNN搜索距离信息,基于相似连接算法△-tree-Join*设计了主存RkNN连接算法,通过性能分析证明该算法是有效的。本课题提出的所有主存查询和连接算法均经理论和实验证明,具有实用价值,为主存中高维数据的各种查询及连接算法的研究提供了理论和实践基础。

【Abstract】 Many emerging database applications such as CAD, multimedia, medical image, time series, molecular biology and scientific databases represent their data as multi/high-dimensional feature vectors. Each feature vector consists of d values which can be interpreted as coordinates in a d-dimensional space plus some associated content data. The distance between two feature vectors is commonly used to measure the degree of (dis-)similarity between the original high-dimensional objects (with regard to the feature represented). There is increasing need to combine similar tuples (points) of two datasets for many applications.The operation of generating all pairs is in essence high-dimensional join (high-dimensional join is a general designation of all kinds of similarity join).So far most of researches focus on the execution of high-dimensional joins over large amounts of disk based data. The increasing sizes and lower price of main memory available on current computers, and the need for efficient processing of high-dimensional joins suggest that high-dimensional joins for a large class of problems can be processed in main memory. Δ-tree is a novel multi-level index structure, it can speed up the high-dimensional query in main memory environment and has been proven to be an efficient index method. Therefore, in order to meet different application needs, several kinds of high-dimensional joins and queries were designed based on index structure Δ-tree. The main work of this topic can be summarized as follows:Firstly, the efficient main-memory index Δ-tree and the characteristics of similarity join were studied, the pruning strategies of the nodes in Δ-tree were analyzed. Made full use of the properties of Δ-tree, A novel main-memory similarity self-join algorithm Δ-tree-Join was presented for a single dataset. Used the properties of cluster overlap and taked the center of nodes in Δ-tree as the center of clusters, main-memory index structure Δ-tree-S was designed.A main- memory similarity join algorithm called Δ-tree-Join*adopted the top-down join scheme for combining two different database sets was presented based on A-tree-S. The two algorithms computed the distances between clusters and between point and cluster with fewer number of dimensions, so as to filter unnecessary nodes or points, reduce computations and improve main-memory join efficiency fundamentally. Experimental results indicate that these two join algorithms have high join efficiency.Secondly, the feature that the searching radius of kNN query is unknown apriori was analyzed, the algorithm of high-dimensional kNN(k Nearest Neighbor) query in main-memory was studied. Three kinds of main-memory kNN query algorithms based on A-tree for high dimensional data were developed. Preformace analysis and experiments prove that these three algorithms improve query efficiency compare to existing algorithm, among them the third query algorithm named BU_DF_knn_Search has the best efficiency. This algorithm determines the membership leaf node of query point in A-tree through the definition of membership node firstly, and searches kNN in this node, narrows the pruning distance quickly using near optimal kNN candidates, then traverses the subtrees used the ancestors of membership leaf nodes as root nodes by depth first manner from bottom to top.Thirdly, the properties of kNN join was analyzed, the algorithm of kNN join in main-memory was studied. According to requirements index structure A-tree-R for main-memory kNN join was proposed, every generated node in this index was coded. A main-memory kNN join algorithm was presented based on Δ-tree-R and Δ-tree-S, this algorithm decodes in Δ-tree-S by using the code of A-tree-R, fast locates positional correspondence nodes, uses the near optimal kNN candidates in them and adopts the traversal order of combining bottom up and depth first search, narrows kNN pruning distances rapidly. Experiments results illustrate that Δ-tree-KNN-Join is an efficient kNN-join method in main memory.Fourthly, the solutions of RkNN (Reverse k Nearest Neighbor) query and the characteristics of high-dimensional RkNN query were analyzed, the conclusion that choose self-pruning approaches to process high-dimensional RkNN query was drawed. An index structure called Δ-Rdknn-tree was proposed based on Δ-tree-R, precomputed kNN distances of points in the dataset by main-memory kNN self-join algorithm Δ-tree-KNN-Join based on this index and propagated these distances to higher level index nodes. Main-memory RkNN query algorithm based on this index was proposed.Fifthly, the problem of set-oriented RkNN queries in main-memory enviroment namely main-memory RkNN join was studied. Constructed indexes Δ-Rdknn-tree and Δ-tree for the two datasets that were going to join, used the RkNN searching distance information of nodes in Δ-Rdknn-tree,main-memory RkNN join algorithm was designed based on the similarity join algorithm Δ-tree-Join*, performance analysis was conducted for the effectiveness of this algorithm.All the main-memory query and join algorithms in this paper were certificated by theories and experiments, have practical value and lay the theoretical and practical foundation of the research for all kinds of high-dimensional queries and joins algorithms in main memory.

节点文献中: