High-Performance Data Stream Pattern Discovery Algorithms and Their Applications (高性能数据流模式发现算法及其应用研究)

【Author】 Zhou Qian (周黔)

【Advisor】 Wu Tiejun (吴铁军)

【Author information】 Zhejiang University, Control Science and Engineering, 2008, PhD

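【Abstract, translated from the Chinese】 With advances in sensor technology and network computing, data streams have become a pervasive form of data, widely used in network monitoring, environmental monitoring, industrial control, and financial analysis. These applications share common characteristics: the data must be analyzed continuously in real time or near real time, and the data volume is very large and arrives at high speed as a stream. The traditional "store first, then process" data mining model struggles with such high-rate, transient data streams, so mining them poses entirely new challenges for data mining. Data streams contain many kinds of patterns, and discovering them quickly and effectively is the core problem of many practical applications. In recent years, pattern discovery in data streams has become one of the most challenging research topics in data mining. This dissertation aims to improve the performance of pattern discovery algorithms by introducing robustness mechanisms and incremental forgetting mechanisms, and applies the algorithms to the analysis of industrial production processes in order to improve product quality. The main contributions are:
1) A real-time trend extraction algorithm for data streams that combines the incremental recursive least squares method for regression parameter estimation, taken from system identification, with the generalized likelihood ratio test. For each arriving stream element, the algorithm determines the parameters of a linear regression model incrementally, uses the generalized likelihood ratio test to decide segment boundary points, and outputs the trend of the stream as automatically determined segments. Compared with existing trend extraction algorithms, it is both faster and more accurate.
2) A data-driven, robust algorithm for online detection of pattern changes in data streams. The algorithm first samples the stream with two adjacent time windows of a given length. It then uses support vector data description to map the two sampled subsets into a normalized high-dimensional feature space and builds, for each subset, the minimum enclosing hypersphere of its image (excluding outliers). Finally, it computes the cosine of the angle between the center vectors of the two hyperspheres as a measure of the similarity of the two subsets and uses it to detect pattern changes. The algorithm needs no prior knowledge, is unaffected by outliers, and is therefore robust.
3) An outlier detection algorithm based on recent-biased dynamic least squares support vector regression (RBDLS-SVR). Because the stream is modeled with RBDLS-SVR, the SVM learning problem is reduced to solving a system of linear equations, and an incremental forgetting mechanism tracks the dynamics of the stream with high accuracy. This avoids the drawback of standard SVR modeling on data streams, where adding or removing a single sample requires solving the whole problem again. The algorithm is fast and accurate, and detects outliers in data streams effectively.
4) A recent-biased clustering algorithm for data streams based on a tilted time window. The algorithm first divides the data in the sliding window into equal-length, non-overlapping blocks called basic windows, then extracts the features of each basic window with the Haar wavelet transform. By varying the number of wavelet coefficients kept for each basic window, it preserves more detail for recent data: the more recent a basic window, the more coefficients are kept, and the older it is, the fewer. Finally, it defines a recent-biased distance between data streams and uses it to perform recent-biased clustering over the tilted time window. The algorithm is fast and performs recent-biased clustering of data streams efficiently.
5) An account of how data stream pattern discovery is applied in a real production process. For complex iron and steel production data, the proposed algorithms are used for two mining tasks: outlier detection and abrupt change detection. Theory and practice indicate that the proposed algorithms have broad prospects for analyzing large-scale industrial process data.
In summary, this dissertation studies high-performance data stream pattern discovery algorithms and their application to industrial production processes. The algorithms complement or improve on existing data stream pattern discovery methods. Both theory and experiments indicate that, compared with existing algorithms, they have clear advantages in performance (processing speed, accuracy, and robustness).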
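Contribution 1 pairs an incremental least-squares fit with a boundary test. The sketch below is not the dissertation's algorithm. It shows a standard recursive least squares (RLS) update for a straight-line model, and it substitutes a plain residual threshold for the generalized likelihood ratio test, whose exact form the abstract does not give. The names `RecursiveLine`, `segment`, `delta`, `threshold`, and `min_points` are hypothetical and chosen only for illustration.

```python
class RecursiveLine:
    """Recursive least squares fit of y = a + b*t, updated one sample at a time."""

    def __init__(self, delta=1e6):
        # theta = [a, b]; P is the 2x2 inverse-covariance estimate.
        # A large delta means a weak prior pulling theta toward zero (assumed value).
        self.theta = [0.0, 0.0]
        self.P = [[delta, 0.0], [0.0, delta]]
        self.n = 0

    def predict(self, t):
        return self.theta[0] + self.theta[1] * t

    def update(self, t, y):
        """Fold in one sample in O(1) time; return its prediction error."""
        x = (1.0, float(t))
        P = self.P
        Px = [P[0][0] * x[0] + P[0][1] * x[1],
              P[1][0] * x[0] + P[1][1] * x[1]]
        denom = 1.0 + x[0] * Px[0] + x[1] * Px[1]
        gain = [Px[0] / denom, Px[1] / denom]
        err = y - self.predict(t)
        self.theta = [self.theta[0] + gain[0] * err,
                      self.theta[1] + gain[1] * err]
        # P <- P - gain * (P x)^T, valid because P stays symmetric.
        self.P = [[P[i][j] - gain[i] * Px[j] for j in range(2)]
                  for i in range(2)]
        self.n += 1
        return err


def segment(stream, threshold=1.0, min_points=3):
    """Return the start index of each linear segment.

    Stand-in boundary rule (an assumption, not the GLR test): start a new
    segment when a sample deviates from the current line by more than
    `threshold`, once the line has seen at least `min_points` samples.
    """
    boundaries = [0]
    model = RecursiveLine()
    for i, y in enumerate(stream):
        t = i - boundaries[-1]  # time measured from the segment start
        if model.n >= min_points and abs(y - model.predict(t)) > threshold:
            boundaries.append(i)
            model = RecursiveLine()
            t = 0
        model.update(t, y)
    return boundaries


if __name__ == "__main__":
    rising_then_falling = list(range(10)) + list(range(8, -1, -1))
    print(segment(rising_then_falling))
```

Each update touches only a 2x2 matrix, so the per-sample cost is constant and no past samples need to be stored, which is the property that makes this kind of estimator suitable for streams.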

【Abstract】 With the rapid development of sensor and network technology, applications such as network traffic management, environmental monitoring, industrial control, and financial analysis generate large volumes of stream data. These applications share several distinguishing features: the need for real-time or near-real-time continuous analysis, huge data volumes, and high arrival rates. Traditional "store, then analyze" data mining models are ill-equipped for high-rate, transient data streams, so mining data streams poses many new challenges. Data streams contain many patterns, and discovering and identifying them efficiently is the core problem of many applications. In recent years, pattern discovery in data streams has become one of the most challenging research topics. To improve the performance of pattern discovery algorithms for data streams, this dissertation introduces robust and incremental mechanisms and applies the resulting algorithms to industrial process analysis. The main contributions are as follows:
1) A real-time trend extraction algorithm for dynamic data streams, obtained by combining an incremental recursive least squares algorithm for regression parameter estimation with the generalized likelihood ratio test for change-point detection. To segment the stream automatically and extract its trend, the algorithm estimates the parameters of a linear regression incrementally and detects boundary points with the generalized likelihood ratio test. Compared with the best existing algorithms in the field, it is markedly faster and extracts trends more accurately.
2) A robust, data-driven algorithm for online change detection in data streams. First, the stream is sampled with two adjacent windows of a given length. The sampled data are then projected into a normalized high-dimensional feature space, and a minimum hypersphere model is constructed for each window's sample set (with outliers removed). Finally, change is detected by computing the cosine of the angle between the centers of the two hyperspheres. The algorithm is robust and requires no prior knowledge.
3) A data stream outlier detection algorithm based on recent-biased dynamic least squares support vector regression. Modeling with this method reduces the learning problem to solving linear equations, and an incremental and decremental learning mechanism tracks the dynamics of the stream accurately. The algorithm overcomes a shortcoming of standard support vector regression, which must be recomputed whenever a sample is added or deleted. It is both fast and accurate, and detects outliers in data streams efficiently.
4) A recent-biased clustering algorithm for data streams based on a tilted-time window. First, the algorithm divides the sliding window into equal-length, non-overlapping data blocks (basic windows). It then extracts the features of each block with the Haar wavelet transform and preserves the detail of recent data by varying the number of wavelet coefficients kept per block: the more recent the block, the more coefficients are preserved, and vice versa. Finally, it defines a recent-biased distance between data streams and uses it to implement recent-biased clustering based on the tilted-time window. The algorithm achieves markedly higher computational speed and efficiency.
5) An application of the proposed pattern discovery algorithms to a real industrial process. Based on the characteristics of complex process data from iron and steel making, two pattern discovery tasks are implemented: outlier detection and pattern change detection. The results show that the proposed algorithms are promising for analyzing data generated by complex industrial processes.
In sum, this dissertation studies several high-performance pattern discovery algorithms and their applications; they improve on and supplement existing algorithms. Theory and simulation results show that, compared with existing algorithms in the same field, the proposed algorithms perform better in accuracy, computational speed, and robustness.
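The feature extraction in contribution 4 can be illustrated with a short sketch. It assumes an orthonormal Haar transform, a block length that is a power of two, and a Euclidean distance over the retained coefficients. The abstract does not give the dissertation's actual distance definition or retention schedule, so the function names and the `keep` schedule `(1, 2, 4)` below are hypothetical.

```python
import math


def haar(block):
    """Orthonormal Haar transform; returns coefficients coarsest first."""
    n = len(block)
    assert n > 0 and n & (n - 1) == 0, "block length must be a power of two"
    approx = [float(v) for v in block]
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Finer-scale details end up later in the output list.
        details = [(a - b) / math.sqrt(2) for a, b in pairs] + details
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx + details


def recent_biased_features(window, block_len, keep):
    """Split `window` (oldest sample first) into basic windows and keep
    only the first keep[i] Haar coefficients of the i-th basic window."""
    blocks = [window[i:i + block_len]
              for i in range(0, len(window), block_len)]
    assert len(blocks) == len(keep)
    return [haar(b)[:k] for b, k in zip(blocks, keep)]


def recent_biased_distance(x, y, block_len, keep):
    """Euclidean distance over retained coefficients (assumed definition)."""
    fx = recent_biased_features(x, block_len, keep)
    fy = recent_biased_features(y, block_len, keep)
    return math.sqrt(sum((a - b) ** 2
                         for bx, by in zip(fx, fy)
                         for a, b in zip(bx, by)))


if __name__ == "__main__":
    keep = (1, 2, 4)                           # oldest block keeps fewest
    flat = [0.0] * 12
    old_wiggle = [1.0, -1.0] + [0.0] * 10      # fine detail in oldest block
    new_wiggle = [0.0] * 8 + [1.0, -1.0, 0.0, 0.0]  # same detail, newest block
    print(recent_biased_distance(flat, old_wiggle, 4, keep))
    print(recent_biased_distance(flat, new_wiggle, 4, keep))
```

With this schedule, a fine-scale fluctuation in the oldest basic window is discarded and does not affect the distance, while the same fluctuation in the newest basic window does. That asymmetry is what the abstract means by biasing the comparison toward recent data.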

  • 【Online publication contributor】 Zhejiang University
  • 【Online publication year and issue】 2009, Issue 07