节点文献

云计算系统中索引与查询处理技术研究

Research on Indexing and Query Processing in Cloud Computing Systems

【作者】 王金宝

【导师】 高宏;

【作者基本信息】 哈尔滨工业大学 , 计算机软件与理论, 2013, 博士

【摘要】 云计算是一种新兴的计算模式,它隐藏了计算资源以及计算的执行过程,用户只需要通过浏览器或者应用程序界面提交计算任务或者服务请求,而不必考虑如何构建计算架构;如何组织、调度计算资源;如何使用计算资源完成计算任务或者服务请求。随着越来越多的数据和应用服务从超级服务器迁移到公有云计算系统或私有云计算系统中,如何在云计算系统中有效地进行数据管理并成为一项具有重要意义的研究工作。与传统的数据管理相比,云计算系统中的数据管理需要提供良好的可扩展性以及高效的数据存取能力。查询处理和查询优化也是云计算系统中的数据管理的核心技术,查询性能严重影响用户使用云计算系统的服务质量。索引技术在数据管理系统中能够有效地提高查询处理性能,减少查询使用的CPU时间、磁盘读取等操作,以此提高查询处理性能,云计算系统也需要构建有效的索引结构来提高查询处理性能。综上所述,云计算系统中的查询处理和索引技术是具有重要意义的研究课题。然而现有工作主要针对MapReduce计算框架下的并行数据分析操作,对于其他查询类型还缺乏研究成果。本文的研究工作针对云计算系统中的查询处理技术和索引技术,运用数据管理技术、计算复杂性和算法学的理论和知识,针对云计算系统中不同查询类型的处理算法和不同的数据类型的索引技术进行研究。本文的主要研究工作包括以下方面:首先,本文提出了云计算系统中的多维索引结构。现有研究工作主要针对单个计算节点或者服务器-客户端模型,在大量计算节点构成的云计算系统中,使用单个计算节点中的多维索引将造成系统性能的瓶颈,无法提供良好的可扩展性。本文提出了一种两层的索引结构,在云计算系统中支持多维数据查询,查询过程中索引结构在大量计算节点之间和单个计算节点的外存中同时为查询提供剪枝能力,提高系统的吞吐量。本文给出了该索引结构的构建方法、维护方法,并在索引维护过程中提出优化策略,进一步提高查询处理的吞吐量。本文针对云计算系统中的多维数据提出点查询、范围查询和k最近邻查询的处理算法,包括计算节点之间的分布式算法和计算节点内部的索引选择算法。真实云计算平台中的实验验证了本文提出的云计算多维索引结构的有效性。第二,本文提出了云计算系统中字符串相似性查询的算法。现有的字符串相似性查询技术都针对单个计算节点,在处理大规模字符串数据集合时将导致内存溢出和外存溢出两个问题。针对以上问题,本文提出一种分布式索引结构,在云计算系统中支持字符串相似性查询。为了获取更好的本地查询效率,本文将现有的字符串查询优化技术应用在外存环境中,设计了支持长度过滤器和位置过滤器的外存索引结构,及其构建方法和实现细节。在查询过程中,使用非对称的字符串概要模式,自适应地从查询字符串的数据概要集合中选择一部分元素,用于获取查询使用的倒排表。为了减少查询在系统中使用计算节点的数量,本文设计了基于字符向量的数据划分方法,用于划分字符串数据集合。该方法将相似的字符串划分到相同的计算节点中,并在查询处理过程中确定查询需要访问的计算节点集合。模拟实验结果验证了本文提出的字符串相似性查询算法的有效性。第三,本文提出了云计算系统中空间近似关键字查询算法。现有工作集中于单个计算节点中的索引结构,并提出了内存中的精确算法和外存中的近似算法。然而,由于单个计算节点的CPU计算能力和磁盘带宽有限,内存方法乃至外存方法都无法满足系统对性能的要求。本文设计了一种两层索引结构,支持空间近似关键字查询,提高系统相应查询的吞吐量。本文设计一种新颖的树形索引,在外存中支持空间近似关键字查询,并高效地返回完整的查询结果。本文的全局索引将整个空间划分成多个划分,全局索引维护在各个计算节点的内存中,用于加速查询处理过程。本文给出了全局索引选择方法,用于全局索引的初始化和周期性维护。在查询处理方面,本文给出了基于编辑距离的范围近似关键字查询和最近邻近似关键字查询的算法。在分布式集群中的实验结果验证了本文提出的索引结构的有效性。第四,本文提出了云计算系统中多维聚集查询处理算法。现有云计算中的研究工作MapReduce计算框架缺乏对多维数据中聚集操作的有效支持。另一方面,使用MapReduce计算框架需要启动大量计算节点,造成巨大的系统功耗。针对以上问题,本文提出了云计算系统中的多维聚集操作方案,通过两层索引结构减少参与查询的计算节点以及单个计算节点中聚集操作计算量。本文给出了使用两层索引结构处理多维聚集查询的算法框架,并在该框架中提出了性能优先模式和低功耗模式下的多维聚集查询算法。在两种模式下的多维聚集算法中提出了查询分配问题,并证明了两种模式中查询分配问题都是NP完全问题。本文给出解决两个NP完全问题的近似算法,并证明了两个近似算法的近似比。理论证明分析和模拟实验结果验证了本文提出的多维聚集查询方案的有效性。

【Abstract】 Cloud computing is a recently developping computational framework. Clients ac-cess cloud resources and submit computing tasks through web browsers, with nothingtaken into accounts on how to build computing hardwares. As more and more dataset-s and services are delivered into cloud, building efcient data management systems incloud becomes an essential task for research community. Cloud systems provide scala-bility, which supports large scale data analysis and highly cocurrent transactions. As thatof existing data management systems, query processing is significant since it has large ef-fects on service level agreements. Query processing is also important for various servicesprovided in cloud, including IaaS(Infrastructure as a Service) and so on. Indexes are wellstudied for reducing CPU time, disk access operations for data management systems toimprove query performance. It is also expected to play the same role in cloud systems.As illustrated above, query processing and indexes are essential topics in cloud systems,however, existing works focus on key-value data in MapReduce framework, while otherare still in lack of attention.This thesis focuses on indexes and query processing in cloud systems, and designsindexes together with query processing algorithms based on data management techniques,computational theory and algorithmic technology. A wide range of data types and querytypes are considered in this thesis, and the following achievements are obtained.First, we propose a multi-dimensional index in cloud systems. Existing works main-ly focus on indexes in a single server or server-client structure. Unfortunately, such worksfail to provide efciency since performance bottleneck is introduced. This paper designsa two-layered index structure to prune search space among computing nodes for queryprocessing. The index reduces the number of involved computing nodes while query pro-cessing, and improves I/O efciency inside a single server. The initiation method andmaintenance methods are proposed for the index, together with optimization strategiesfor improving query throuphput. This thesis designs the point query algorithm, the rangequery algorithm and the kNN query algorithm for cloud systems, including distributed al-gorithms among computing nodes, and optimization strategies inside a single computingnode. Second, we target at string similarity search in cloud systems. Existing works fo-cus on query processing within a single server, and it incurs main memory overflow andexternal memory overflow while dealing with big data. For the above problems, we pro-poses a distributed index to support string similarity search in cloud environments. Toprovide efcient searching in a single node, an external memory index is designed, whichadopts multiple filtering techniques and optimizing strategies. The external memory res-ident index supports length filter, positional filter in disks. This paper proposes the indexconstruction method. During query processing, asymmetric q-gram is used to reduce thenumber of inverted lists read from disks. An adaptive algorithm is given to choose in-verted lists, and seek the tradeof between two aspects of query cost. The global indexpartitions the entire string dataset according the content of strings, and a char vector spacepartition method is proposed. In char vector space partition method, similar strings arepartitioned into the same computing nodes, thus the number of computing nodes involvedin a single query is reduced. The partition method is also adopted to determine necessarycomputing node set for a query to access. Simulation results validate the efciency andefectiveness of our proposed index.Third, we propose spatial approximate keyword query algorithms for cloud systems.Existing work targets on single server solutions, and an exact algorithm is given in mem-ory while another approximate algorithm is given for disk resident datasets. However, asingle server fails to provide reasonable throughput due to the limited CPU time and diskbandwidth. Facing the above challenges, this paper gives a two-layered index consistingof global index and local index, which works in a shared nothing cluster for larger querythroughput. This paper designs a novel external memory index as local index, which re-turns exact answer within disks efciently. It is equiped with keyword set signature andmultiple optimizing strategies to reduce I/O cost. The global index partitions the entirespatial space, and each computing node in system maintains a partition. A global indexselection algorithm is given. This paper also provides spaital approximate keyword queryalgorithms based edit distance, including range and the nearest neighbor spatial condi-tions. Experiments in a shared nothing cluster illustrates the efciency and efectivenessof our proposed index and query algorithms.Fourth, we propose multi-dimensional aggregation algorithms for cloud systems.Existing works focus on key-value data in MapReduce framework, and multi-dimensionaldata is not well considered. What is more, MapReduce framework launches a large num- ber of computing nodes for a single query and costs huge amounts of energy. For theabove problems, this paper designs a solution for answering aggregation queries in cloudthrough a two-layerd index, which reduces the number of computing nodes involved in asingle query and eliminates unnecessary computing time in a single server. The construc-tion and maintenance methods are given, together with a query framework. Followingthe given query framework, this paper proposes two aggregation algorithms under perfor-mance priori mode and low-power mode. Two sub-query scheduling problems are definedfor each of aggregation algorithms given above, and they are proved to be NP-Complete.Then, approximate algorithms with reasonable lower bounds are designed. Theoreticalanalysis and simulations validate the efciency and efectiveness of aggregation algo-rithms proposed in this paper.

节点文献中: