节点文献

数据流上的异常检测

Anomaly Detection over Data Streams

【作者】 秦首科

【导师】 周傲英;

【作者基本信息】 复旦大学 , 计算机软件与理论, 2006, 博士

【摘要】 近年来,由于来自实际应用中的需求的推动,数据流上的异常检测技术的研究已经受到了学术界和工业界的越来越多的重视。数据流上的异常检测在金融风险分析、通信网监测、网络流量管理、趋势分析、Web日志分析、网络入侵检测、传感器网络管理等领域具有广泛的应用。例如,为了调节电信网络的性能,需要对电信网中的数据流进行监测,其检测异常的准确性对电信网络的正常运行是至关重要的。同样的应用场景也存在于高速公路上的交通管理,相关趋势的分析和预测,网页点击流的分析,信息系统的入侵检测以及传感器网络的管理等。在这些场景中,有相当一部分应用需要及时地对任务进行处理,以获得尽可能短的响应时间。然而,传统的数据库技术是用来管理静态数据集的,其很难直接被用于对动态数据的实时监测和挖掘。因此,为了实时地监测数据流,需要采用伸缩性强的异常检测算法在有限的时间内处理大量的数据流。在对数据流的处理中,最大的挑战就是要在有限的内存空间,需要顺序的单遍扫描算法,并且要实时返回精确的结果。本文综述了国际上关于数据流异常检测的研究成果,在分析了现有研究工作的基础上,提出了异常的定义和数据流上异常检测要研究的问题,以及异常检测系统的框架。在对数据流上异常检测的研究过程中,本文的主要贡献有如下三个方面:1.提出了自适应的突变的定义,自适应的突变更加全面地概括了数据流上的突变信息,并且排除了颠簸数据的干扰。根据该定义,本文又提出了三种突变检测方法,其中包括False Positive,False Negative和综合突变检测算法,这些算法能够保证以高于用户指定的准确率检测数据流上的突变,而且这些算法既能检测单调聚集函数值的突变,又能检测非单调聚集函数值的突变。突变检测算法所依赖的是本文提出的倒置桶序列的直方图(简称IH)。这种新颖的直方图技术具有较小的时间复杂度O(n((log n+log R)/(log(1+δ)))和空间复杂度O(n((log n+log R)/(log(1+δ))),并能为突变检测提供准确的聚集查询支持,因此与现有直方图技术相比更加适用于数据流上的突变检测。2.提出了基于单调搜索空间的突变检测算法。首先,提出了数据流上的单调搜索空间的构建算法及改进后的构建算法,从而对实际应用中的近似分形数据进行了分形变换,使得滑动窗口的错排序误差errMS为0。其次,基于单调的搜索空间设计了突变检测算法。该算法能将突变检测处理时间复杂度从O(m)降为O(log m),m为需要被检测的滑动窗口数目。最后,分析并给出了基于单调搜索空间的突变检测方法的误差界限,使得本文提出的突变检测算法具有理论上的误差上限的保证。3.提出了基于分段分形模型的无参数异常检测算法。首先,本文提出了最优的分段分形模型以及数据流上的近似最优分段分形模型。利用近似最优的分段分形模型为长为n的数据流建模的时间复杂度为O(n log n),空间复杂度为O(log n)。第二,提出了基于分段分形模型的突变检测算法,该算法在分段分形模型具有理论误差界限保证的前提下,能够准确地检测数据流上的突变。第三,本文提出了无参数的异常检测算法。该算法能够在最合适尺寸的滑动窗口上检测异常的情况,不需要用户设定任何参数,也不需要使用训练数据。使用该算法在数据流上检测异常的时间复杂度仅为O(n),空间复杂度仅为O(1)。综上所述,本文针对现有异常检测中存在的三类问题,分别提出了从问题定义、概要数据结构到异常检测算法的完整方案,并提出了以本文技术为核心的异常检测系统框架。理论分析和实验结果表明,与已有的研究成果相比,本文给出的异常检测方法具有较高的精度和较低的时间、空间复杂度,更加适用于数据流的应用场景:金融风险分析、通信网监测、网络流量管理、趋势分析、Web日志分析、网络入侵检测、传感器网络管理等。

【Abstract】 Recently, real-time monitoring and online data mining over data streams have been attracting more and more research attention in database research community because of its wide variety of real world applications. For instance, it is very important to monitor the data streams for telecommunication network performance tuning, highway traffic management, trend-related prediction analysis, web-click streams analysis, information system intrusion detection, and sensor networking, etc. Some of these tasks are mission critical which require fast response.Conventional database techniques were initially invented to manage static data collections, which can not be applied directly to monitor or mine dynamic data streams. Hence a robust and efficient anomaly detection is needed to process huge amount of information in a constrained time period, in order to online monitor the data stream. The most important constraints or requirements for the data stream processing include limited memory availability, sequentially linear scanning, and online accurate result report over evolving data streams. However, existing methods on anomaly detection suffer several problems. Our major contributions in this paper are as follows.1. A novel concept of adaptive aggregation burst is introduced for precisely modelling real-life data streams. This burst model empowers users to detect double-side relative bursts dynamically without disturbed by bumps. We propose a framework for accurately detecting such bursts under false positive and false negative constraints. The algorithms are developed based on a space and time efficient histogram, IH, which can be maintained with O(n(logn+logR/log(1+δ))) time and O(logn+logR/log(1+δ)) space. Thus IH is more suitable for burst detection than other popular histograms.2. A monotonic search space based burst detection method is proposed. Firstly, the monotonic search space building algorithms are proposed for transformingreal life data. With the fractal transforming, the miss sorting error errMS can be guaranteed 0. To construct such a monotonic search space for multi-window burst detection, a novel approach is presented, which could decrease the time overhead of query processing from O(m) to O(logm), where m is the number of windows being monitored.3. Propose a parameter-free anomaly detection algorithm in this paper. We first introduce the piecewise fractal model (PFM) for data stream processing, which consumes O(n log n) time and O(logn) space to model a n length stream. Based on this model, a burst detection algorithm is proposed which can provide accurate burst detection. A novel PFM based method is proposed for anomaly detection. This algorithm consumes only O(1) memory and O(n) time without involving training process.This paper give a survey on the state of art anomaly detection over data streams. Based on the analysis of existing research works, a more general definition of anomaly and existing problems are proposed. This paper also propose a system architecture for anomaly detection over data streams which integrates the novel anomaly detection techniques proposed in this paper. Theoretical analysis and experimental results show that this algorithm can achieve higher precision with less space and time complexity as compared with the existing methods, and it could be concluded that the proposed algorithm is suitable for burst detection over data streams.

  • 【网络出版投稿人】 复旦大学
  • 【网络出版年期】2007年 02期
节点文献中: 

本文链接的文献网络图示:

本文的引文网络