节点文献

分布式数据流查询处理若干关键技术的研究

Research on Some Key Techniques of Distributed Data Stream for Query Processing

【作者】 杨颖

【导师】 乐嘉锦;

【作者基本信息】 东华大学 , 控制理论与控制工程, 2006, 博士

【摘要】 随着大规模网络的发展和Web的广泛应用,在网络监控、入侵检测、传感器网络、通讯数据管理、股票分析等应用领域中产生了一种新型数据—数据流(或流数据),如关系元组、传感器读入值、网络性能参数、电话记录和股票交易数据等。与传统数据库应用模型不同,数据流模型具有以下特点:(1)数据连续、实时到达;(2)数据量大、无限制并且难以预测;(3)数据一经处理,除非特意保存,否则不能被再次取出处理,即一次性处理(one-pass),或者再次提取数据的代价昂贵。如何对这些流数据进行存储、查询处理已经成为当前国际数据库研究领域的热点问题。在许多实际应用中,如决策支持系统、查询优化等,用户并不需要获得确切值,而只需要一个近似值。因此,数据流分析和管理的核心是设计一次扫描算法,即在一个远小于数据规模的内存空间里不断更新一个代表数据集的结构—概要数据结构,使得在任何时候都能够根据这个结构快速实时地获得近似查询结果。如果流的长度为N,则概要数据结构的规模大小不超过0(polylog(N)),并且处理流上每一组数据的时间不超过0(polylog(N))。传统数据库中的查询主要是一次查询,即系统根据当前数据集合的快照给出查询结果,并将该结果返回给用户。而数据流的查询为连续查询,即查询随着新数据的到来而不断的返回查询结果。连续查询是数据流上特有的操作,具有长期运行的特点。由于数据流环境中的数据集不是静态的,而是不断有数据插入和更新。用户需要的也不是在某个时刻的静态查询结果,而是对整个数据流的一个动态监测,随着数据流的不断变化持续地产生查询结果。现有的数据流的研究主要为集中式的流数据系统,而数据流的本质是分布式的,越来越多如传感器网络、数据通讯、Internet流量分析和Web日志等的大量数据都来自不同的远程数据源,因此,需要构建分布式数据流查询处理的中间件以支持上述各种应用。P2P技术利用互联网的终端机来建立一个庞大的分布式计算网络,并对迅速涌出的大量信息进行处理。这些计算机(即对等点)在网络中处于同等的地位,各自拥有独立的网络自主权,以解决把所有的计算压力全部加在服务器一端所造成的瓶颈问题。P2P以其可扩展性、通信负载平衡,资源的高利用率以及由基于内容的路由机制所提供的动态变化的适应性等特性成为构建中间件的良好平台,以便在减少网络带宽和网络连接所消耗的计算资源情况下,提供快速有效的数据流查询处理的实时响应。本论文以分布式数据流为主要研究对象,分析了国内外的研究现状,从目前存在的问题和不足出发,研究数据流基于时间变化的特性,监测当前流入的数据,探索数据流变化的表示与建模方法,分析数据进化和变化的趋势,并对未来流入的数据进行预测。在大规模分布式环境中,研究时间和空间复杂度最小的分布式数据流查询处理和挖掘算法。一方面,研究小波分解技术,利用小波系数的近似处理方法构建和维护小波直方图,以获得好的精确度,并且将其扩展到多维直方图的构建和维护,解决传统的直方图技术难以解决的问题,并利用小波系数构造数据流集的概要,建立一个复合索引结构来响应各种查询;还研究小波多分辨分析思想,构造一种小波神经网络模型,解决了传统神经网络中隐层节点数难以确定的问题,初步建立分布式时间序列数据流的预测模型。另一方面,运用草图技术解决在数据流上的聚集查询等难点问题。研究分布式数据流中频繁项的发现算法,通过设置精确梯度来减少通信开销,实现数据流查询的实时响应。同时,以P2P环境的Chord网络结构和协议为平台,研究分布式数据流挖掘和及时响应查询处理的中间件,探索在对等计算系统中提供流数据的近似查询功能所涉及到的数据和查询路由、定位与查找、索引及数据流概要的映射等关键技术问题。具体来说,本论文的主要创新点在于以下四个方面:(1)研究了基于小波技术的分布式数据流的查询处理算法。首先通过离散小波变换理论与DWT分解哈尔小波方法获得小波系数,然后分析了数据流的计算模型,形式化了数据流的查询模型。在此基础上,提出了一种新的方法来构造数据流集的概要,建立一种复合索引结构来处理内积查询和相似查询。此外,还结合小波神经网络WNN良好的时频局部化性质以及神经网络的自学习功能,初步建立适应于时间序列数据流的预测模型。(2)研究了基于草图技术的分布式数据流的聚集查询算法。首先分析了基于草图的近似处理算法,然后利用随机技术,在数据流到达时实时计算数据的伪草图概要。在此基础上,提出新颖的草图分割技术,通过属性值域的智能分割来减小分割后的自联接规模以及为每个分割的独立草图公平地分配存储空间两个方面来保证近似估算质量。(3)研究了大规模分布式数据流中频繁项的发现算法。通过对单个数据流频繁项的发现算法的分析,形式化地定义了基于时间点的分布式数据流频繁项的发现问题。并提出了基于Lossy Counting算法的、分布式的合并算法DMA(Distributed Merging Algorithm)的一种分层结构来发现从叶子结点直至根结点的概要结构,并通过设置精确梯度使网络数量最小及数据中心和网络链接所消耗的计算资源最小来优化分布式系统的通信负载。(4)研究了基于P2P的分布式数据流查询处理的中间件和原型开发。首先利用P2P的特性改进了索引结构的定位查询过程和稳定性。然后,将数据流的概要映射到改进的弦环节点,将基于内容的路由扩展到分布式流索引中,在此基础上,提供连续近似查询,并利用最小边界矩形MBR等优化方法,通过自适应地调整MBR的每一维f的高低边界来改进系统的精确度。在减小中心数据和网络链接所消耗的计算资源的情况下,加快和提高流数据查询和挖掘的效率,及时响应客户的查询请求。本论文的研究依托于国家863项目“基于Web服务的数据库新技术”的子项目“基于Web服务的电子商务”的研究来进行。所有的科研工作是建立在对大量参考文献的阅读理解、理论分析和实验测试的基础上,经实验和分析表明,所提出的算法和基于P2P的中间件具有良好的性能特性,可以为分布式数据流应用提供运行与开发的环境。

【Abstract】 With the development of large network and Web application, a new kind of data—Data stream come into being in the application areas of monitor and sensor networks in-break detecting, communication data management stock analysis and so on. These data sequence are relational tuples、sensor values、network parameters、phone records and share data etc. Different from traditional database application model, data stream model have characteristics as below: (1) Continuous, real time arrival; (2)Vast, unrestrainted and hard estimated; (3) Unless storing especially, they can’t be taken out to handle again, namely one-pass processing. Otherwise, withdrawing data again is very expensive. How to store、query and mine these data streans has become the hot issues in the field of international database currently.In many practical application, such as decision support system, query optimization, the exact values are not necessary to obtain but approximate value. Therefore, the core issue of data stream management and analysis is designing one-pass algorithm, namely the feature structure of data streamsynopsis data structure are unceasingly updated in the least memory than data scale in order to quickly achieve approximation answer in time at all hours. If the length of data stream is N, the size of synopsis structure are not excess to O(polylog(N)), and the processing time of each group of stream are not excess to O(polylog(N)).The primary query in traditional database is one-time queries, namely the system give answer according to the snapshot of data set. But in data streams, the query is continuous and long-run because the query answers are returned incessantly along with new data. The data streams are not static but ceaseless insertion and update. Users need dynamic monitoring along with the changed data streams but not static result of some time.The exist research of data stream are centralized stream system. But the essence of data stream is distributed. More and more practical data in sensor network、communication、Internet flow analysis and Web log etc, are transported from teledata sources. Therefore, we need to construct the middleware of distributed query processing for data streams to support above application.P2P technique is used to build a huge distributed network for the processing of large number of information by internet terminals. These computers (Peers) are equal in independent computing and processing to solve the bottleneck issue of single server. P2P can build the good platform of middleware, which validly provide real-time query answer quickly, by means of its good performance of scalability、load balance、high efficiency and dynamic adaptability based on content-indexing in order to reduce the overhead of network bandwidth and links.Focusing on distributed data streams, this thesis analyzes the domestic and international present research. From the current existent problem and shortage, we study the data stream’s characteristic based on time variety, monitor the real time streams, investigate on the presentation and modeling methods for stream’s variety, mine the data evolution with the trend of the variety, and forecast the future trend of data streams. In large-scale distributed environment, we study the distributed approximate algorithms for query processing and mining with minimum complexity of time and space. On one hand, we utilize the wavele coefficients by DWT to construct histogram with good precision, and extent to multi-dimension histogram to solve the traditional difficulty, and build a composite indexing structure with data stream synopsis, at the same time, we build wavelet-neural network model with the wavelet resolution to solve the tradition difficulty of hidden number of nodes, and elementaryly establish the forecast model of time sequences. On the other hand, we solve the difficulty of aggregate query in data streams by utilizing sketch tenique. We study the algorithm of frequent items in distributed data streams and set the precision gradient to reduce the load of communication. Meanwhile, with the platform of Chord network and protocol in P2P system, we study the middleware of distributed query processing and mining data streams, investigate on key functions and techniques such as query routing、lookup and location、indexing and mapping of data stream synopsis in the peer computing system. That is concrete to say, this thesis has four central innovations as follow:(1) Studying the query processing algorithm of distributed data streams based on wavelet.The first step is according to the theory of discrete wavelet transition (DWT), we decompose Haar wavelets in order to obtain the important wavelet coefficients.Then, we analyze the computing model of data streams, formalize the stream’s query model. Based on above research, we provide a kind of new method to construct synopsis data structure of data streams, establishing a compound indexing construction to handle the inner product queries and similarity queries. In addition, combining the good performance of time-frequent localization in wavelet neural network (WNN) and self-study in neural network (NN), we establish the forecast model of time sequence.(2) Studying the query processing algorithm of distributed data streams based on sketching technique. Firstly, through analyzing the approximate sketch-based processing methods, and utilizing the random techniques, we computing the pseudo sketching synopisis to the on-time arrival streams. Based on above research, we provide a novel technique of sketching partition through intelligent division for the value range of attribute and reduction of the self-join scale to fairly allocate memory space. In result, we ensure the good estimated quantity of approximate query.(3) Studying the mining algorithm of frequent items in large scale distributed data streams. Firstly, we formally define the mining algorithm of frequent items base on time point. Then, we take forward to the Distributed Merging Algorithm (DMA), a hierarchical structure to discover the stream synopsies from all leaves to root node, through setting the precision gradient to optimize the communication load and overhead of distributed system.(4) Studying the query processing middleware of distributed data stream based on P2P system. The first step is utilizing the P2P characteristic to improve the procedure of location searching and stabilization.Then, the data stream synopisis are mapping to the improved Chord node and content-based routing are extended to distributed stream index. Based on above research, the continuous approximate query are providing to improve the efficiency of query and mining processing for data stream. In meanwhile, the optimization method of Minimum Bounded Rectangle (MBR) is introduced to reduce center data and minimize the computing resource of network link. Through adjusting adaptively the high and low bounding of MBR, The precision of system are improving to achieve the query answer in time.By means of the sub-project "Web Services-based E-commerce" in 863 national project "Web Services-based database new technology", the research work in this thesis utilize large number of domestic and international references、theory analysis and experiment test. Estimated experiment and analysis illustrate that above algorithms and the middleware have good performance, which can be maked use of in running and development environment of data stream applications.

  • 【网络出版投稿人】 东华大学
  • 【网络出版年期】2007年 05期
节点文献中: 

本文链接的文献网络图示:

本文的引文网络