基于并行计算的数据流处理方法研究

Research on Processing Methods of Data Stream Based on Parallel Computing

【Author】 Zhou Yong (周勇)

【Supervisor】 Cheng Chuntian (程春田)

【Author Information】 Dalian University of Technology, Computer Application Technology, 2013, PhD

【摘要】 Mining high-volume, high-velocity data streams has become a research focus of big-data processing in the international academic community. Compared with statically stored data, stream data are acquired continuously in real time and can be scanned only once. For rapidly time-varying streams, limited memory cannot hold the entire stream, so accurately and incrementally mining continuous trends and discovering hidden correlations pose great challenges for real-time stream analysis and processing; moreover, processing latency has become an acute bottleneck constraining data stream mining. To address these problems, this thesis studies integrated parallel computing models and algorithms for trend and correlation analysis of data streams, organically combining stream mining with high-performance computing on the CPU (Central Processing Unit) and GPU (Graphics Processing Unit) to realize efficient, dynamic, and continuous stream processing. The main research contents are summarized as follows:

1. To improve the limited predictive ability for nonlinear, non-stationary time-series data streams, an Online-HHT analysis method based on the HHT (Hilbert-Huang Transform) is studied and, combined with RBF (Radial Basis Function) neural network theory, a time-series stream model suitable for online prediction is developed. By introducing multi-threaded parallel processing on the CPU, a chain-style rewritable sliding window is designed for reading and writing the time-series stream, and algorithms for parallel prediction of fine-grained segments and for synthesizing the segment results are implemented (a minimal sketch of this windowing scheme follows this abstract). Online-HHT retains HHT's adaptive time-frequency analysis capability while computing faster, and the intrinsic mode components it extracts reduce the input complexity of the RBF prediction structure, greatly improving trend prediction on time-series streams. Experimental comparisons with other methods show that the proposed method handles short-term trend prediction of data streams at higher speed and is applicable to online prediction.

2. To avoid the excessive space complexity of pattern trees in frequent-item mining over data streams, a mining method based on a nested sliding window genetic algorithm, NSWGA (Nested Sliding Window Genetic Algorithm), is proposed. The algorithm partitions the stream within the sliding window into nested windows, processes them in parallel with an MPI-based genetic algorithm, and improves the way the initial population is obtained, thereby mining frequent patterns in the nested windows quickly. As the stream flows, expired data are removed periodically and the latest frequent itemsets in the sliding window are updated, achieving incremental maintenance, improving execution efficiency, and discovering frequent items rapidly.

3. To address processing latency and efficiency problems caused by resource constraints, the GPU parallel computing architecture is studied, and a general GPU-based stream processing model is proposed according to the characteristics of stream data attributes and the demand for high-performance processing. Based on the SIMT mode of the GPU and a sliding-window model built on basic windows, a stream processing structure with coarse-grained and fine-grained levels of parallelism is presented: the stream is divided into data blocks of suitable granularity, on which synopsis data structures and various mining algorithms are processed in parallel. Coarse-grained parallelism parallelizes task division, while fine-grained parallelism parallelizes extraction of the stream synopsis and runs the compute-intensive thread grids for stream mining on the GPU, achieving efficient data exchange and high-performance parallel algorithms. On this general model, a GPU-based parallel quantile computation method, GSQ (GPU Stream Quantiles), is proposed: GPU kernels hash the data blocks in parallel to build a synopsis histogram, from which stream quantiles are queried. Experiments verify substantial improvements in processing bandwidth, response time, and speedup.

4. To overcome the real-time constraints on resources and execution order that limit correlation analysis of multiple data streams on the CPU, a four-layer sliding window framework that spans the bus and coordinates the CPU and GPU is proposed for parallel processing of multiple streams. The streams are mapped entirely into GPU memory, an SID index is built for each stream, and basic sub-window offsets enable parallel operations at different levels. Multi-level parallel processing suited to multiple streams is constructed using fine-grained s→Thread and medium-grained s→Block mappings, yielding GSSCCA (GPU Single-Dimensional Stream Canonical Correlation Analysis), a parallel correlation analysis algorithm for single-dimensional multiple streams. Experiments verify that the algorithm achieves good accuracy and greatly improves computation speed.

5. For high-dimensional multiple data streams that record complex real-time information with many attributes, the constraints on resources and execution order are more complex, in terms of both accuracy and performance, than for single-dimensional streams. To address this, the mathematical model of correlation analysis for high-dimensional multiple streams is studied in depth, and a GPU-based processing model, implementation architecture, and parallel computing method, GMSCCA (GPU Multi-Dimensional Stream Canonical Correlation Analysis), are proposed. Using data-cube and dimensionality-reduction techniques, the computation can be completed quickly and accurately under limited computing resources and high-efficiency requirements, striking a good balance between high performance and approximation accuracy.
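
The following is a minimal, illustrative Python sketch of the windowing idea in contribution 1: a fixed-size, rewritable sliding window is split into fine-grained segments that are processed by CPU threads in parallel and then merged. The names (ChainSlidingWindow, predict_segment, parallel_predict) are hypothetical, and a one-step linear extrapolation stands in for the Online-HHT plus RBF predictor of the thesis.

```python
# A minimal sketch (not the thesis implementation) of the chain-style
# rewritable sliding window: new values overwrite the oldest slots in place,
# the window is split into fine-grained segments, each segment is predicted by
# a CPU thread in parallel, and the partial results are merged.
from concurrent.futures import ThreadPoolExecutor
from typing import List

import numpy as np


class ChainSlidingWindow:
    """Fixed-size, rewritable window over a time-series stream (assumed layout)."""

    def __init__(self, size: int, n_segments: int):
        self.buf = np.zeros(size)
        self.size = size
        self.n_segments = n_segments
        self.head = 0      # next slot to overwrite
        self.filled = 0    # number of valid samples currently in the window

    def push(self, value: float) -> None:
        # Overwrite the oldest slot instead of shifting the whole buffer.
        self.buf[self.head] = value
        self.head = (self.head + 1) % self.size
        self.filled = min(self.filled + 1, self.size)

    def segments(self) -> List[np.ndarray]:
        # Return the window contents in arrival order, cut into fine segments.
        ordered = np.roll(self.buf, -self.head)[self.size - self.filled:]
        return np.array_split(ordered, self.n_segments) if self.filled else []


def predict_segment(segment: np.ndarray) -> float:
    """Per-segment predictor stand-in (Online-HHT + RBF in the thesis).

    A one-step linear extrapolation is used here purely as a placeholder."""
    if len(segment) < 2:
        return float(segment[-1]) if len(segment) else 0.0
    return float(segment[-1] + (segment[-1] - segment[-2]))


def parallel_predict(window: ChainSlidingWindow) -> float:
    # Each segment is predicted in its own thread; the partial predictions are
    # then merged (here: averaged) into one short-term trend estimate.
    segs = window.segments()
    if not segs:
        return 0.0
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(predict_segment, segs))
    return float(np.mean(parts))


if __name__ == "__main__":
    win = ChainSlidingWindow(size=64, n_segments=4)
    for t in range(200):  # simulated stream
        win.push(np.sin(0.1 * t) + 0.05 * np.random.randn())
    print("short-term trend estimate:", parallel_predict(win))
```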

【Abstract】 Mining large volumes of data at high speed has attracted significant attention worldwide, and high-performance methods are in great demand for continuous data stream mining. Compared with its static counterpart, this type of dynamic data is acquired sequentially and continuously in real time and can be scanned only once. Limited computational and storage resources pose great challenges to the accuracy and online capability of trend and correlation analysis on data streams, and processing latency has become an acute bottleneck restricting data stream mining. This thesis focuses on parallel computing models and algorithms for trend and correlation analysis on data streams; these models and algorithms run efficiently on both the CPU (Central Processing Unit) and the GPU (Graphics Processing Unit). The main research contents are summarized as follows:

Firstly, we present a new online analysis method derived from the classical Hilbert-Huang Transform (HHT) to process nonlinear and non-stationary time-series data streams, and combine it with radial basis function (RBF) neural networks for online prediction on the streams. We design a chain-style rewritable sliding window for reading and writing the time-series stream; it divides the data into segments that are predicted in parallel by CPU multi-threading and then merges the segment results into the final output. Online-HHT not only provides adaptive time-frequency analysis of the stream but also computes faster, and the decomposed components it produces reduce the input complexity of the RBF neural networks. Compared with existing methods, the proposed method handles online short-term trend prediction of time-series streams at higher speed.

Secondly, we propose a genetic algorithm with nested sliding windows (NSWGA) to replace the complicated pattern trees widely used in frequent-item mining over data streams. The algorithm uses nested sliding windows to segment the stream, leverages MPI-based parallel processing to efficiently discover frequent patterns in the nested windows, and improves how the initial population is generated. It achieves incremental maintenance of frequent itemsets by incorporating new data and removing expired data, and enables high-efficiency processing within a limited storage buffer.

Thirdly, we build a GPU-based generic processing framework for data streams to tackle processing delay and efficiency issues; the framework adapts to the characteristics of stream data and meets high-performance requirements. Using the SIMT execution model of the GPU and a sliding-window model built on basic windows, we construct a parallel computing architecture over stream blocks with two granularity levels. Coarse-grained parallelism is responsible for parallel control of the divided tasks, while fine-grained parallelism, organized as compute thread grids, extracts the synopsis data used by the various parallel mining algorithms; together they achieve efficient data exchange and high-performance parallel algorithms. Within this generic framework we present a parallel quantile computing method named GSQ (GPU Stream Quantiles): GPU kernels apply hash functions to the stream blocks to generate synopsis histograms, from which stream quantiles are then queried (a CPU sketch of this histogram synopsis follows the abstract). Experimental results show significant improvements in processing bandwidth, response time, and speedup.

Fourthly, we address the constraints of memory resources and execution order that limit correlation analysis of multiple data streams on the CPU. We propose a four-layer sliding window framework for multiple data streams that spans the bus and coordinates the CPU and GPU: the streams are mapped entirely into GPU memory, an SID index is created for each, and basic sub-window offsets enable parallel operations at different levels. On top of this, we construct the parallel correlation algorithm GSSCCA (GPU Single-Dimensional Stream Canonical Correlation Analysis) using fine-grained s→Thread and medium-grained s→Block parallel computing (the underlying canonical-correlation computation is sketched after the abstract). Experimental results show that the algorithm achieves high accuracy and much faster computation.

Fifthly, high-dimensional data streams, whose records carry complex real-time information in many attributes, face more complex resource and execution-order constraints, in terms of both accuracy and performance, than single-dimensional streams. To address this, based on a study of the underlying mathematical model, we present GMSCCA (GPU Multi-Dimensional Stream Canonical Correlation Analysis), together with a GPU model and implementation architecture for high-dimensional multi-stream correlation processing. Using a data-cube pattern and dimensionality-reduction techniques, the method completes the computation quickly and accurately under limited computing resources and high-efficiency requirements, and strikes a balanced compromise between high performance and approximation accuracy.
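
Below is a hedged CPU sketch of the synopsis-histogram idea behind GSQ: each incoming stream block is bucketed into a shared histogram, and quantiles are answered from the cumulative counts. The thesis performs the bucketing with hash functions inside GPU kernels; here plain numpy bucketing stands in for those kernels, and the class name and bucketing scheme are illustrative assumptions rather than the thesis algorithm.

```python
# A hedged CPU sketch of the GSQ idea: every stream block is bucketed into a
# shared histogram (the thesis does this with hash functions inside GPU
# kernels; numpy bucketing stands in for those kernels here), the per-block
# histograms accumulate into a synopsis, and quantiles are answered from the
# cumulative counts. The class and bucketing scheme are illustrative
# assumptions, not the algorithm from the thesis.
import numpy as np


class HistogramQuantileSynopsis:
    def __init__(self, lo: float, hi: float, n_buckets: int = 1024):
        # Values outside [lo, hi] are simply ignored in this sketch.
        self.edges = np.linspace(lo, hi, n_buckets + 1)
        self.counts = np.zeros(n_buckets, dtype=np.int64)

    def add_block(self, block: np.ndarray) -> None:
        # One "kernel launch" worth of work: bucket a whole block at once.
        self.counts += np.histogram(block, bins=self.edges)[0]

    def quantile(self, q: float) -> float:
        # Walk the cumulative histogram to the bucket holding the q-quantile
        # and return that bucket's midpoint as the approximate answer.
        cdf = np.cumsum(self.counts)
        idx = int(np.searchsorted(cdf, q * cdf[-1]))
        return float(0.5 * (self.edges[idx] + self.edges[idx + 1]))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    synopsis = HistogramQuantileSynopsis(lo=-5.0, hi=5.0)
    for _ in range(100):  # 100 simulated stream blocks
        synopsis.add_block(rng.normal(size=10_000))
    print("approximate median:", synopsis.quantile(0.5))  # close to 0 for N(0,1)
```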
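
The sketch below shows the standard canonical correlation computation that underlies GSSCCA/GMSCCA for two windowed streams, using plain numpy on the CPU in place of the thesis's s→Thread / s→Block GPU kernels. Only the leading canonical correlation is computed, via the SVD of the whitened cross-covariance; the function name and the ridge regularization are assumptions made for illustration.

```python
# A hedged numpy sketch of the canonical correlation computation underlying
# GSSCCA/GMSCCA: plain CPU linear algebra stands in for the thesis's
# s→Thread / s→Block GPU kernels. Only the leading canonical correlation of
# two windowed streams is computed.
import numpy as np


def leading_canonical_correlation(X: np.ndarray, Y: np.ndarray,
                                  eps: float = 1e-8) -> float:
    """X: (n, p) window of one multi-attribute stream; Y: (n, q) window of another."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = Xc.T @ Xc / (n - 1) + eps * np.eye(X.shape[1])  # regularized covariances
    Cyy = Yc.T @ Yc / (n - 1) + eps * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(C: np.ndarray) -> np.ndarray:
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    # Singular values of the whitened cross-covariance are the canonical correlations.
    M = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)
    return float(np.linalg.svd(M, compute_uv=False)[0])


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    t = rng.normal(size=512)  # shared latent signal drives both streams
    X = np.c_[t + 0.1 * rng.normal(size=512), rng.normal(size=512)]
    Y = np.c_[t + 0.1 * rng.normal(size=512), rng.normal(size=512)]
    print("leading canonical correlation:", leading_canonical_correlation(X, Y))
```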
