节点文献

大规模图像库的高维索引技术研究

Research on High-Dimensional Index in Large-Scale Image Databases

【作者】 梁俊杰

【导师】 冯玉才;

【作者基本信息】 华中科技大学 , 计算机软件与理论, 2007, 博士

【摘要】 高维数据的索引机制是大规模图像库的基于内容检索能够达到实时性要求的关键技术。面临“维度灾难”带来的影响,如何通过索引的表示、索引的组织和索引的提取提高高维图像数据的检索效率是高维索引研究的关键问题。本文主要针对索引的表示和索引的组织进行了研究,提出了一系列简单可行的索引方法。在综合研究现有高维索引技术的基础上,针对高维索引表示和组织的关键问题:索引剪枝过滤策略、高维向量近似表示和快速近似检索算法,详细讨论了目前已提出的索引方法的局限性,并设计了改进方法。高维主存索引作为高维索引的未来发展方向,引起了很多学者的关注。对高维主存索引结构的研究可以为基于磁盘的高维索引结构的研究提供新的思路,文中设计了一种新的高维主存索引结构。为了减少大规模图像数据库在检索过程中引入的误中点个数,本文在定义向量排序和活性维等概念的基础上,提出了一种新的索引快速剪枝过滤技术。该技术采用分段处理思想,实现非候选节点以序列方式和以点方式的两阶段剪枝过滤,从而快速排除所有的误中点,尽可能减少距离计算次数,实现大规模高维向量空间的快速范围查询。该技术适用于目前已提出的基于一维转换思想的高维索引结构中,如金字塔技术,可以提高这类索引机制的检索效率。对数据空间的有效划分是高维向量近似表示的前提,结合近似向量表示和一维转换两种索引构造思想,提出基于位码和距离的高维向量压缩表示形式,实现高维向量的二维表示形式。检索时采用两层过滤技术,可以显著减少检索需要访问的数据向量的个数。实验证明,这种两种索引机制相结合的方法取得了比单独的索引技术更好的性能。基于高维向量压缩表示的索引构造思想,提出一种简单有效的KNN检索算法。通过聚类将数据划分成多个子集空间,对每个聚类子集内的高维向量,利用距离和位码定义简化表示形式。KNN搜索时,在不需要计算向量距离的情况下,根据部分维的位码不相同信息的比较,即容易实现的字符串比较,将某些非候选节点迅速过滤,以此减少高维向量距离计算次数。该方法可以大大降低利用索引进行相似性检索的CPU代价,达到快速检索的目的。主存技术的不断进步,使得主存多媒体数据库的实现成为可能。研究表明,主存多媒体数据库系统性能深受处理器缓存未命中的影响,缓存感知型主存索引是提高数据检索效率的有效手段。针对SA-Tree不适用于主存存取的缺点,提出它的变体CSA-Tree。CSA-Tree利用PCA降维技术,将树的各层节点采用不同的维度表示,这样不仅提高缓存空间的利用率,还降低了CPU负载,从而提高了索引查询效率。

【Abstract】 High dimensional index technique is one significant research issue for content-based image real-time retrieval in large-scale image databases. Facing the influence of the“curse of dimensionality”, how to speed up retrieval in image databases with rational design of the expression、organization and extraction of the index is key problem in high-dimensional indexing schemes. The issues of the expression and organization of high-dimensional index are discussed in this paper and a series of simple and feasible index methods are proposed.Based on the research of expression and organization of current index methods, some key issues such as pruning and filtering of index tree, approximation presentation of high-dimensional vector and algorithm for fast similarity retrieval are discussed in detail, especially on their disadvantage, based on which some improvement approaches are proposed. With the development of high-dimensional index, high-dimensional main memory index arouses more and more interest, which techniques will be benefit to the research on the disk-based high-dimensional index. A new high-dimensional main memory index is designed in this paper.With large amount of points in the image databases, the retrieval performance degrades dramatically because the cost of distance computation with high dimensions increases greatly with many false hit points brought in the process of the range query. In this dissertation, a fast pruning and filtering technique is proposed based on the basis of the introduced concepts of vector ordering and active dimension. This method can solve two phases of pruning: one is in a sequence way; the other is in a point way, such that it can reduce the range that needed to search. By incorporating it into the Pyramid-Technique (PT), we can improve the retrieval efficiency of the PT.Efficient partition of data space is the basis of approximation presentation of a high-dimensional vector. Based on the techniques of approximate vector presentation and one dimensional transformation, an optimal compressive presentation of high-dimensional vector is proposed, called bit-code and distance based index. This representation enables two levels filtering. Experiments on large-scale image databases demonstrate a remarkable reduction of the amount of accessed vectors in similarity searches compared with the existing index schemes. To support efficient querying and retrieval on large-scale image databases, we propose a methodology exploring Local Bit-code Difference (LBD). On clustering the data space into a number of partitions, LBD extracts a distance and a simple bitmap representation called Bit Code (BC) for each point in the database with respect to the corresponding reference point. Pruning during KNN search is performed by dynamically selecting only a subset of the bits from the BC based on which subsequent comparisons are performed. In doing so, expensive operations involved in computing L-norm distance functions between high-dimensional data can be avoided.In main-memory databases, the number of processor cache misses has a critical impact on the performance of the system. Cache-conscious indices are designed to improve performance by reducing the number of processor cache misses that are incurred during a search operation. Considering the disadvantage of SA-Tree inefficient for main memory access, we present its variant called CSA-Tree, which is a multi-level structure where each level of the tree represents the data space at different dimensionalities by using Principal Component Analysis. Each level of the tree serves to prune the search space more efficiently as the reduced dimensions can better exploit the small cache line size. Moreover, the distance computation on lower dimensionality is less expensive. We conducted extensive experiments to evaluate the proposed structures against other methods. Our results show that the CSA-Tree is superior in most cases.

节点文献中: