Dynamic Load Management of Distributed Data Stream Processing System

(Original Chinese title: 分布式数据流处理系统动态负载管理研究)

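【Author】 Ouyang Lin (欧阳琳)

【Advisor】 Guo Qingping (郭庆平)

【Author information】 Wuhan University of Technology, Traffic Information Engineering and Control, 2009, Ph.D.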

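【Abstract (translated from the Chinese original)】 In recent years, data volumes have grown steadily in application areas such as network routing, intrusion detection, sensor networks, stock analysis, traffic management, mobile communications, environmental monitoring, health monitoring, RFID-based item tracking, e-commerce transaction records, and the digital battlefield, posing a major challenge to traditional data processing techniques. In these applications the data volume is huge, data arrive continuously and without bound, processing has real-time requirements, and the data are time-sensitive, so traditional data processing is ill-suited to them. Data stream management technology, which has emerged in recent years, offers a promising approach for this class of applications, and the area has attracted close attention from researchers.

Load management is one of the key technologies for keeping a data stream processing system running normally and for improving its performance. Existing data stream processing systems suffer from limited scalability and narrow applicability, and are hard to adapt to distributed data stream processing. This dissertation discusses the basic structure, working principles, implementation methods, and characteristics of data stream processing systems. Targeting the system overload caused by variation in the number of data tuples entering the system per unit time, it explores how to build a load management model for data stream processing systems and studies in depth the key techniques of load balancing, load shedding, and join operations over distributed multiple data streams. The main work and results are as follows:

(1) vRing, a hierarchical overlay network based on an extension of Chord, is proposed. vRing extends Chord and exploits network proximity to form a layered overlay network, which provides a suitable underlying network for load balancing. On top of vRing, a hierarchical distributed dynamic data stream load balancing algorithm, vDDSLB, is proposed. When a node becomes overloaded, the algorithm first migrates load among nodes in the same sub-domain. Only when balancing within the sub-domain still cannot meet the need does it balance load across the whole system. Most load migration stays within the corresponding sub-domain, which reduces data latency and system overhead.

(2) LPBDLS, a distributed load shedding algorithm based on linear programming, is proposed. Existing load shedding algorithms focus mainly on processing within a single node, and load shedding across nodes has received little study. LPBDLS is a distributed load shedding algorithm that, in addition to the CPU constraint, treats network connections as an important constraint, improving system throughput when network connections are limited.

(3) DMS-Join, a distributed join query algorithm for multiple data streams, is proposed. Because data stream systems are naturally distributed, distributing the join operation suits multi-stream systems better than centralized processing. In a distributed environment the network transmission bandwidth is fixed, and the proposed algorithm can carry out distributed joins over multiple data streams effectively under this network transmission constraint.

(4) A vRing-based framework for a dynamic distributed load management system is proposed. Built on vRing, the extension of the Chord overlay network, the framework uses vRing's network proximity and layered structure to realize a hierarchical, dynamic, distributed load management design.

The research on load management techniques in this dissertation provides theoretical support and a practical reference for handling load problems in distributed data stream systems, and is significant for further improving data stream system performance and for deeper application research.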
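The local-first migration rule described in contribution (1) can be illustrated with a minimal sketch. This is not the dissertation's vDDSLB implementation: the `Node` class, the capacity and load numbers, and the "largest operator to the peer with the most spare capacity" heuristic are all assumptions made for illustration. Only the two-phase order (same sub-domain first, whole system second) comes from the abstract.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    """Hypothetical processing node; all fields are illustrative assumptions."""
    name: str
    subdomain: str
    capacity: float
    operators: Dict[str, float] = field(default_factory=dict)  # operator id -> load

    @property
    def load(self) -> float:
        return sum(self.operators.values())

    @property
    def spare(self) -> float:
        return self.capacity - self.load


def _migrate(src: Node, candidates: List[Node]) -> List[Tuple[str, str, str]]:
    """Move operators off src (largest first) while src is overloaded.

    The placement heuristic is an assumption: each operator goes to the
    candidate with the most spare capacity, if it fits there.
    """
    moves = []
    for op, cost in sorted(src.operators.items(), key=lambda kv: -kv[1]):
        if src.load <= src.capacity:
            break
        target = max(candidates, key=lambda n: n.spare, default=None)
        if target is not None and target.spare >= cost:
            del src.operators[op]
            target.operators[op] = cost
            moves.append((op, src.name, target.name))
    return moves


def balance(overloaded: Node, nodes: List[Node]) -> List[Tuple[str, str, str]]:
    """Two-phase balancing: same sub-domain first, then the rest of the system."""
    local = [n for n in nodes
             if n is not overloaded and n.subdomain == overloaded.subdomain]
    moves = _migrate(overloaded, local)
    if overloaded.load > overloaded.capacity:  # local phase was not enough
        remote = [n for n in nodes if n.subdomain != overloaded.subdomain]
        moves += _migrate(overloaded, remote)
    return moves


# Usage: node "a" is overloaded (15 > 10) and a peer in its sub-domain can help.
a = Node("a", "d1", 10, {"op1": 6, "op2": 5, "op3": 4})
b = Node("b", "d1", 10, {"op4": 5})
c = Node("c", "d2", 10)
print(balance(a, [a, b, c]))  # [('op2', 'a', 'b')]
```

In this example the overload is resolved entirely inside sub-domain `d1`, so the remote node `c` is never contacted, which is the behavior the abstract credits with lower latency and overhead.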

【Abstract】 Recently, there has been much interest in building stream processing applications in areas such as stock markets, network monitoring, security surveillance, financial analysis, online transactions, health monitoring, RFID-based object tracking, sensor applications, and pervasive environments. In these applications, data are typically unbounded, continuous, huge in volume, fast-arriving, time-varying, and bursty. Traditional data processing, which handles snapshot queries well, cannot satisfy the requirements of these data stream applications. In recent years, researchers have begun to pay more attention to data stream management technologies, such as the construction and optimization of data stream management systems (DSMS) and data stream mining.

Load management is one of the key technologies for ensuring regular service and improving the performance of a DSMS. Existing DSMSs cannot satisfy the requirements of distributed data stream processing because of their low scalability. In this dissertation, we study the basic structures, principles, implementations, characteristics, and main application fields of DSMSs. To deal with overload problems caused by variation in the input data rate, we discuss how to construct a load management system and study in particular the key technologies of load balancing, load shedding, and distributed multiple data stream join operations. The main work and contributions are the following:

(1) A hierarchical overlay network (vRing) is proposed, followed by a load balancing algorithm (vDDSLB) built on it. vRing is an extension of Chord that uses network proximity information to form a hierarchical overlay network. vDDSLB is a hierarchical load balancing algorithm constructed on top of vRing. When a node becomes overloaded, vDDSLB first balances load within the node's sub-domain. If this local load balancing cannot satisfy the requirement, it launches global load balancing. Because most of the scheduling work happens within the sub-domain, system performance increases and data tuple latency decreases.

(2) A load shedding algorithm (LPBDLS) based on linear programming is proposed. Existing load shedding algorithms focus on the query network located on a single node. LPBDLS is an inter-node, distributed load shedding algorithm. It takes into account not only the CPU constraint but also the network bandwidth constraint. System throughput increases, especially in environments where network bandwidth is tightly constrained.

(3) A distributed multiple data stream join algorithm (DMS-Join) is proposed. Because data streams are inherently distributed, it is better to place the join operators on different nodes than on a single node. Our algorithm can achieve higher performance under the bandwidth constraint.

(4) A dynamic distributed load management system is proposed based on the hierarchical vRing overlay network. It takes advantage of the hierarchical structure and the network proximity of vRing to construct a hierarchical load management system.

The study of load management technology in this dissertation provides theoretical and practical support for DSMSs, and may have an important effect on improving their performance.
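To make the idea behind contribution (2) concrete, one possible shape for a load shedding linear program with both CPU and bandwidth constraints is given below. This is an illustrative assumption, not the formulation used by LPBDLS, which the abstract does not state. All symbols are hypothetical: $x_i \in [0,1]$ is the fraction of stream $i$ that is kept, $r_i$ its input rate, $w_i$ its weight in the throughput objective, $c_{ij}$ the CPU cost per tuple of stream $i$ on node $j$, $C_j$ the CPU capacity of node $j$, $b_{il}$ the bandwidth per tuple of stream $i$ on link $l$, and $B_l$ the capacity of link $l$.

```latex
\begin{aligned}
\max_{x}\quad & \sum_{i} w_i \, r_i \, x_i \\
\text{s.t.}\quad
  & \sum_{i} c_{ij} \, r_i \, x_i \le C_j && \text{for every node } j \quad \text{(CPU constraint)} \\
  & \sum_{i} b_{il} \, r_i \, x_i \le B_l && \text{for every link } l \quad \text{(bandwidth constraint)} \\
  & 0 \le x_i \le 1 && \text{for every stream } i
\end{aligned}
```

Under this sketch, dropping the bandwidth rows recovers a CPU-only shedding plan, which may ask links to carry more than they can when bandwidth is the scarcer resource.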
