节点文献

数据流相似性查询及模式挖掘研究

Similarity Query and Pattern Mining on Data Streams

【作者】 郭建奎

【导师】 朱扬勇;

【作者基本信息】 复旦大学 , 计算机软件与理论, 2008, 博士

【摘要】 随着数据挖掘研究领域的不断拓宽和研究内容的不断深入,人们发现应用中越来越多的数据是以流的形式产生的,例如网络流,网页点击流,交通流以及传感器网络数据等等。分析和挖掘这类数据日益成为一个热点问题。其中,分析流数据间的相似性和模式发现成为重要的研究内容。研究数据流的相似性查询对于完善数据流查询、改进数据流系统等都有着重要的应用价值,并且对于在数据流上进行分类、聚类等也有着指导意义。当前在数据流环境下相似性查询和模式发现的研究工作没有充分考虑数据流数据自身的特点,往往假定内存空间无限或者不满足增量更新。另一方面,据我们所知,目前还没有相关工作系统地解决相似性查询的问题。基于此,本文着重研究数据流环境下相似性查询及模式发现问题,主要包括如下三个关键方面:(1)基于Lp距离,提出系统解决数据流环境下相似性查询的技术。在数据流环境下,基于Lp距离函数,本文系统的提出了一个解决相似性查询的框架,用以解决数据流环境下相似性查询。在充分分析数据流数据的特点后,提出一种新颖的数据结构SDS-Tree(the Same-DirectedSlope Tree)来分层表示数据流对象,实现对原始数据流的表示。基于Lp距离,本文证明SDS-Tree的有效性,并且进一步给出一个相似性判别中更为有效的粒度。基于有效的SDS-Tree结构,文章分别给出有效处理单一固定窗口下的相似性查询算法ASQFSW(Algorithm for SimilarityQueries in Fixed Sliding Window)以及滑动窗口下的增量相似性查询算法IASQSW(Incremental Algorithm for Similarity Queries in SlidingWindows)。特别,IASQSW算法找到了窗口滑动时数据流数据变化的一个上界,根据该上界,算法只需更新有限的SDS-Tree结点,就能够完成窗口滑动时的相似性查询。详细的理论分析以及大量的实验评估表明,我们给出的技术和方法显著优于目前的研究方法。(2)针对Lp距离无法解决时间弯曲的现象,为提高在数据流环境下相似性查询的准确度,提出了基于DTW距离的相似性查询的技术。在数据流环境下,使用Lp距离无法解决时间弯曲现象。为提高在一些应用场合中相似性匹配的准确度,基于DTW距离,提出了一个解决相似性查询的算法ESDS(Estimating Similarity on Data Streams)。算法根据数据流数据的变化特性提出了数据分段的思想,每段数据仅用三个数值(最大值,最小值和差异值)来表示原始数据流的特性。为保证数据特征提取的有效性,根据数据变化的规律,提出了振荡数据的概念,并给出了判断数据流中是否存在振荡数据的算法judgeSurge。为保证对振荡数据的处理不会影响到数据流的特性,进一步提出了有效振荡和最大有效振荡幅度的概念,设计了求解有效振荡数据的算法judgeValidSurge和求解最大有效振荡幅度的算法calMaxScope。算法基于特征数据,设计了新的DTW距离函数,基于动态规划算法,设计了在数据流环境下进行相似性判别的算法ESDS。详细的理论分析以及大量的实验评估表明,文章给出的技术和方法具有很高的准确度和效率。(3)针对Web流数据,设计了两个Web流模式挖掘算法。数据流间的相似性查询在Web数据流中有重要的应用价值。文章首先分析了Web流数据的特征,然后着重研究了Web流数据的模式发现问题。在充分分析经典算法WAP-mine的缺陷后,首先针对WAP树结构设计了一个自顶向下挖掘的算法TD-WAP-mine。算法避免了在挖掘频繁模式过程中每次需要构造大量中间数据,而直接对原始的WAP树进行挖掘,节省了生成中间数据的代价,在支持度比较小或者原始Web流数据过于大的情况下,TD-WAP-mine表现出更好的性能。其次,针对WAP树存在数据冗余情况,提出了压缩WAP树的概念,在不影响挖掘结构的前提下,设计了压缩WAP树算法,并且直接对WAP树投影,设计了一个自顶向下的挖掘算法TAM-WAP,在大规模实验集上的实验表明,TAM-WAP算法表现出更好的性能和伸缩性。

【Abstract】 As the domain of data mining is becoming more and more widely,and the content of its is becoming deeper and deeper,people find that many data are produced with the manner of streams,such as Web streams,Web click streams,traffic streams and sensor datas,etc.Studying data stream has become a hot research field in data mining society,and has many important applications.Estimating similarity on data streams and finding pattern hidden in the stream are two important tasks.Much work has been done about how to estimate similarity and patterns on data streams.However,there exists lots of work to be done.We focus on the features of data in the stream context and propose a framework to deal with similarity evaluation.And furthermore we address the problem by analyzing Web data,finding information and mining patterns hidden in them with two algorithms of finding Web access patterns.The main contributions of this thesis are as follows.We propose an efficient technology,i.e.,ETSEDS(an Efficient Technology for Similarity Evaluation on Data Streams),to process similarity queries on data streams under Lp-norm.ETSEDS technology captures the main characteristics of stream data,and exploits a novel tree structure,i.e. SDS-Tree(the Same-Directed Slope Tree),to stratify and construct stream data on sliding windows.Moreover,based on the SDS-Tree structure,we propose two efficient algorithms,i.e.ASQFSW(Algorithm for Similarity Queries in Fixed Sliding Window) and IASQSW(Incremental Algorithm for Similarity Queries in Sliding Windows),to process the similarity queries on single-fixed sliding window and general sliding windows respectively. Specially,IASQSW algorithm can realize similarity queries by updating a spot of nodes on SDS-Tree structure.Furthermore,we present detailed theoretical analyses and extensive experiments that demonstrate our algorithms,which are both efficient and effective.We propose a new algorithm ESDS(Estimating Similarity on Data Streams),which can estimate similarity efficiently on data streams under the time warping distance.In order to evaluate the efficiency of our algorithm,we present a simple but efficient method of Segment to denote the original stream data and the Segment method just need three data to express the stream,which are the max data,min data and the difference between them.To improve the efficiency of Segment,we study a new type of stream which is called Surge Stream.We also propose a new method to judge whether a stream is a Surge Stream.However,Surge Stream sometime will become very regularly in the original stream.So we propose the concept of Valid Surge Stream and Max Valid Surge Scope and design algorithms which can be used to find them respectively.With the help of the above concept,our Segment algorithm can detect the features of the original streams efficiently.In computing the distance of DTW between data streams by using dynamic programming,we also design a new distance of DTW which can compute the similarity on data streams efficiently.The experiments on many real and synthetic data sets show that our algorithm can evaluate the similarity on data streams efficiently and not be studied in the previous research.Discovering interesting Web access patterns from Web logs is a Web usage-mining problem with many practical applications.For this problem, some conventional algorithms,such as GSP,HPrefix and WAP-mine,can be used to mine WAP from Web logs.WAP-mine Algorithm mines Web access patterns by storing the original Web access sequence database and then generates the frequent patterns by recursively mining the intermediate Web access pattern trees.Though WAP-mine scans the database only twice,it needs to recursively build intermediate data and will suffer strongly from such building activities especially on low support thresholds.Because in the process of mining frequent Web access pattern,WAP-Mine will generate much intermediate data,at the lower support the efficiency is very low.To address this issue,we propose two new algorithms based on the top-down manner for mining Web access pattern.First,based on the WAP-tree structure, which is proposed by WAP-mine,we propose a new method TD-WAP-mine which can be used to find Web access patterns in a Top-Down manner. TD-WAP-mine can efficiently find frequent patterns at the low support because it doesn’t need generate too much the intermediate WAP-tree. Second,we give a compressed structure of WAP according to the features of WAP and design a new method of how to find the compressed WAP-tree.We also give a new strategy which is called Projection method to find frequent patterns.Instead of stubbornly building intermediate data for each step of mining process,our algorithm selectively builds intermediate data according to the features of current area to be mined.The experimental results on various real world and artificial datasets show that our two algorithms greatly reduce the efforts to build intermediate data and in general offers a better performance than WAP-mine.

  • 【网络出版投稿人】 复旦大学
  • 【网络出版年期】2009年 02期
节点文献中: 

本文链接的文献网络图示:

本文的引文网络