节点文献

基于遗忘特性的数据流概要结构及其应用研究

Research on Amnesic Synopsis of Data Stream and Its Applications

【作者】 陈华辉

【导师】 施伯乐;

【作者基本信息】 复旦大学 , 计算机软件与理论, 2008, 博士

【摘要】 随着计算机网络和各类电子设备应用的越来越广泛,越来越多的数据以连续的流的形式出现,如网络路由信息,传感器网络采集的实时信号,证券交易、信用卡交易、商场购物交易等的实时记录,因特网网站点击流,电信网络的电话呼叫业务记录,聊天室、短信等的实时文本流等,均产生连续不断的各类数据。这些数据被称为流数据或数据流。因为数据流和传统数据库等系统中处理的数据的巨大差别,迫使研究人员对数据流模型和处理方法进行深入研究。数据流处理的关键是应用单趟数据扫描算法,建立流数据的概要结构,以便随时能根据该结构提供数据流的近似处理结果。数据遗忘是数据流的一种重要特性,在数据流概要结构构造中应充分考虑这种遗忘特性。本文工作利用这种遗忘特性,提出了一种基于数据流遗忘特性的概要结构的框架,称为分层遗忘概要(Hierarchical AmnesicSynopses,简称HAS)。应用HAS结构,可将原来不考虑遗忘特性的概要结构构造方法改造为结合了数据流遗忘特性的方法。本文工作将HAS结构应用于直方图、抽样、小波、sketch、随机投影等主要的数据流概要结构中,并给出了几个典型应用。本文主要贡献包括:(1)提出了一种数据流概要结构的通用框架,HAS结构。该框架嵌入了数据流的遗忘特性,并且具有遗忘速度和重构误差控制的能力。利用该框架,可将现有的多种典型数据流概要结构改造成具有数据流遗忘特性处理能力。(2)实现了基于小波数据压缩的HAS结构(W-HAS),提出了小波概要的归并方法,并讨论了在基于误差平方和(sse)和基于最大绝对误差(max_abs)两种误差度量标准下的W-HAS,以及如何进行W-HAS中的重构误差控制的方法。(3)讨论了基于加权随机抽样的HAS结构(WS-HAS),分别对有放回和无放回加权随机抽样设计了WS-HAS概要结构的维护算法。(4)提出了结合HAS结构和直方图数据压缩方法的H-HAS结构,讨论了等宽直方图下的H-HAS结构的实现,用动态规划方法实现了最优直方图下的H-HAS结构。(5)基于数据流的W-HAS结构,讨论了数据流之间的近似距离和聚类中心的计算,并进而提出了适合并行多数据流的K-means聚类方法:W-HAS-clustering。同时,利用数据流的遗忘特性,应用随机投影,构造了基于随机投影的数据流分层概要结构RP-HAS,并设计了规范化后数据流的RP-HAS结构维护的方法。提出了基于RP-HAS结构的适合并行多数据流的聚类方法RP-HAS-clustering。(6)讨论了高维数据流中HAS结构的实现,并将其应用到数据流的分类和聚类中。(7)提出了一种基于sketch的数据流概要结构EFM sketch,并用EFM sketch来估算集合的相似度。在HAS结构的基础上,应用EFM sketch分析数据流上数据的相似度和演化。

【Abstract】 With the increasingly widespread use of computer networks and electronic equipments, many real-life applications data appeared as dynamic evolving data streams which are continuous and unbounded in nature. Such applications include stock markets, network traffic monitoring, sensor networks, internet, network security, data mining, financial monitoring, and many more. Traditional data base techniques can hardly be applied to process such high-speed and unbounded data stream directly. So researches need to work out novel processing techniques over data streams.Maintaining a synopsis data structure dynamically from data stream is vital for a variety of streaming data applications, such as approximate query or data mining. In many cases, the significance of data item in streams decay with age: this item perhaps conveys critical information first, but, as time goes by, it gets less and less important until it eventually becomes useless. This feature is termed amnesic. The dissertation proposed a Hierarchical Amnesic Synopses (HAS) which includes the amnesic feature of data stream in the generation of synopses. HAS can provide a better approximate representation for data streams with amnesic feature than conventional data stream synopses.Our major contributions in the dissertation are as follows.1. The dissertation proposed HAS structure to utilize the amnesic feature of data stream. HAS structure has the ability to control the amnesic speed and the reconstruction error.2. Discrete Wavelet Transform is often used in construction of synopses for streaming data. We proposed a Wavelet-based Hierarchical Amnesic Synopses (W-HAS). To maintain W-HAS online for evolving data streams, the paper first explored the merger process of two wavelet decompositions, and then implemented the addition of data nodes in W-HAS structure based on the merger process. Using the addition of data nodes, W-HAS grows dynamically and hierarchically. The construction methods of W-HAS under sum of squared error (sse) and maximum absolute error metrics are discussed. Further, W-HAS with error control is also explored.3. We proposed Weighted Sampling based Hierarchical Amnesic Synopses (WS-HAS). The construction method of WS-HAS for weighted random sampling with and without replacement is discussed.4. We proposed Histograms Based Hierarchical Amnesic Synopses (H-HAS). The construction methods of H-HAS using Equi-width and V-optimal histograms are discussed.5. Clustering is useful in analyzing the paralleled data streams. Using the W-HAS and the RP-HAS (Random Projections based HAS), respectively, a fast computation of approximate distances between streams and the cluster center can be implemented, and an efficient online version of the classical K-means clustering algorithm is developed.6. We proposed HD-HAS (High Dimensional HAS).7. We introduced a novel sketch, EFM sketch. EFM sketch can be used to estimate the similarity of two sets. Based on HAS structure, we discussed the analyzing method of evolvement in data stream.

【关键词】 数据流数据遗忘概要结构小波聚类分析
【Key words】 data streamsamnesicSynopsesWaveletsclustering
  • 【网络出版投稿人】 复旦大学
  • 【网络出版年期】2009年 08期
节点文献中: 

本文链接的文献网络图示:

本文的引文网络