节点文献

时间序列中的知识发现

Knowledge Discovery in Time Series

【作者】 万里

【导师】 廖建新;

【作者基本信息】 北京邮电大学 , 计算机科学与技术, 2009, 博士

【副题名】基于频繁模式发现的分类和聚类方法研究

【摘要】 随着数据库技术、因特网、电信技术等信息技术的飞速发展,时间序列数据在现实生产和生活的各个领域中广泛存在(如电信运营、金融市场、工业过程、科学实验、医疗、气象、生物信息等),且存储规模呈现爆炸式增长。如何从海量时间序列数据中发现能够帮助人们决策且以前不知道或不易知道的模式、信息和知识是人们现阶段最急切的需求,也是时间序列数据挖掘研究的核心问题。目前时间序列数据挖掘的研究尚处于起步阶段,很多研究问题还极富挑战性,很多挖掘算法还有待扩充和完善。本文从预测、分类、聚类、搜索和频繁模式发现五个方面对时间序列数据挖掘的研究现状进行了综述,对目前各个研究方向的主要方法进行总结和评价,在频繁模式发现和动态复杂网络社区划分两个方面进行了深入研究。最后在总结全文的基础上,指出有待本文深入研究的问题。本文的创新性工作主要包括以下内容:(1)提出一种频繁模式挖掘算法FPM(Frequent Pattern Mining),该算法充分考虑频繁模式在时间序列中出现次数和分布,能从时间序列数据中挖掘出只在“某个”时间段内频繁出现的“异常”事件序列和在“整个”时间序列中频繁出现的“序列”事件。基于这些不同分布的频繁模式扩展MAMC (Mixed memory Aggregation Markov Chain)模型得到FMAMC (Frequent pattern based Mixed memory Aggregation Markov Chain)模型。FMAMC描述了时间序列中不同类型频繁模式之间的时序关联关系。实验表明,FPM算法时间性能优于PrefixSpan (Prefix-projected Sequential pattern mining)和WinMiner算法,FMAMC模型能够比MAMC模型更准确的预测时间序列中的事件。(2)提出一种基于频繁模式的时间序列分类框架,该框架分为特征提取、特征选择和分类模型训练三个阶段:首先利用本文提出的MNOE (Mining Non-Overlap Episode)算法挖掘时间序列中的非重叠频繁模式,基于非重叠频繁模式提出EGMAMC(Episode Generated Mixed memory Aggregation Markov Chain)模型描述时间序列。然后,根据似然比检验原理,从理论上推导出频繁模式在时间序列中出现的次数和EGMAMC模型是否能显著描述时间序列之间的关系;根据信息增益定义,选择能显著描述时间序列并且信息增益大于给定阈值的频繁模式作为时间序列特征输入传统分类算法训练分类模型。实验表明,选择频繁模式作为特征进行分类的结果优于不选择频繁模式作为特征的结果。(3)提出特征流的概念描述频繁模式实例在时间序列中的分布情况,根据特征流的频谱特征将相应的频繁模式分为三种类别,并分别与时间序列中隐藏的不同类型的事件相对应。提出EDPA (Event Detection by Pattern Aggregation)算法,将紧密关联的频繁模式聚合在一起,每个聚类中的频繁模式构成一个事件。实验表明,选择显著非重叠频繁模式输入EDPA算法进行事件探测的准确率高于选择其它类型频繁模式。(4)提出一系列基于极大团的复杂网络静态和动态社区划分方法:首先提出一种复杂网络中极大团挖掘算法CLIM (CLIque Mining),该算法利用复杂网络中节点聚集系数高的特点设计剪枝策略。实验表明,CLIM算法计算大规模复杂网络和节点聚集系数较高的随机网络中极大团的时间性能明显优于Improved BK (Bron-Kerboscht)算法。基于极大团及其重叠关系定义社区核心和附属节点组成社区。提出一种重叠社区划分算法CDPM (Clique Directed Percolation Method)。CDPM算法采用作者提出的结构轮廓系数SSC(StructureSilhouette Coefficient)衡量复杂网络社区划分质量,SSC越大,社区划分越优,算法最终输出使得SSC最大的社区划分。实验表明,CDPM算法和CPM (Clique Percolation Method)、GN (Given-Newman)算法相比能够更准确的划分社区,其F-measure和VAD (Vertex Average Degree)值更高。同时考虑复杂网络链接结构和节点附属属性信息,提出信息图社区划分算法JCCM (Joint Clustering Coefficient Method),首先采用启发式方法计算出由极大团结构重叠而成的社区核心,然后算法采用本文提出的JCC (Joint Clustering Coefficient)系数为目标函数将社区核心和附属节点聚合在一起,采用不同的距离函数度量附属节点到社区核心中不同地位节点间的距离。实验表明,JCCM算法划分信息图中社区的效果优于只考虑网络结构信息或节点属性信息的算法。在静态社区划分算法的基础上,定义相邻时刻静态社区间的演化关系并提出一种新的动态社区划分算法DCI(Dynamic Community Identification)。DCI算法首先利用本文提出的基于最短描述长度原则(MDL:Minimum Description Length)的划分算法划分静态网络社区。然后,将相邻时刻静态社区及其成员间的重叠关系抽象为一个二分图,将动态社区演化问题抽象为二分图划分问题。提出一种基于MDL的算法划分二分图。实验证明,DCI算法能够比GS (GraphScope)、GC (GroupColoring)算法更加准确的划分出动态网络中的动态社区,并且具有较好的时间性能。(5)将信息在社会网络中的传播过程抽象为节点按一定转移概率依次出现的半马氏过程。提出“空状态”的概念扩展多维半马氏过程,在此基础上提出社会网络信息流模型。社会网络中的成员和模型状态空间中的状态一一对应,社会成员间信息交互的概率即模型中状态间的转移概率,转移概率的大小由社会成员自身特性和所属的社会网络子结构所决定。提出基于社会网络结构的回报函数,并构造有偿半马氏模型计算用户价值。在信息流模型和有偿半马氏模型的基础上,综合考虑社会网络和用户自身偏好对资源选择的影响,提出协同过滤算法SMRR (Semi-Markov and Reward Renewal)。实验表明,SMRR算法的预测准确率优于传统的余弦算法和RIF(Rate Information Flow)算法。

【Abstract】 With the rapid development of database, internet and telecom technique, time series data appears widely in real world life, such as telecommunication business, financial market, manufacture process, science experiment, medical care, meteorology, bio-information etc. and the scale of data is huge and exploding. How to help users to discover hidden pattern, information and knowledge from large time series data in order to support decision is a raring demand. It’s also the core research topic in time series data mining. Recently, researchers just start to concern time series data mining, many topics are challenging, many algorithms need to be extended and perfected.This paper reviewed time series data mining from prediction, classification, clustering, search and frequent pattern discovery five aspects. The main methods in each area are summarized and reviewed, then we focus on the research topics in frequent pattern discovery and community identification in dynamic complex networks deeply. Finally based on the conclusion of this paper, we proposed some research directions in future. The contributions of this paper are as follows:(1) Proposed a frequent pattern discovery algorithm FPM (Frequent Pattern Mining), this algorithm considered not only the supports but also the distributions of frequent patterns. It can discover the "exceptions" frequently appears in a special time segment and the sequential event frequently appears in the whole time series. Based on these frequent patterns with different distributions, we extend MAMC (Mixed memory Aggregation Markov Chain) model to FMAMC (Frequent pattern based Mixed memory Aggregation Markov Chain) model. FMAMC described the temporal correlations among different types of frequent patterns in time series. Experiments demonstrate FPM performs better than PrefixSpan (Prefix-projected Sequential pattern mining) and WinMiner algorithms, FMAMC model can predict time series more accurately than MAMC model.(2) Proposed a frequent pattern based time series classification framework, it included feature extraction, feature selection and classification three phases:the proposed MNOE algorithm extracts non-overlap frequent patterns from time series and proposed EGMAMC (Episode Generated Mixed memory Aggregation Markov Chain) model based on these non-overlap episode to describe time series. After that, according to ratio likelihood test, we induced the support that can guarantee the significance of non-overlap episode. Finally, according to the definition of entropy, we selected the significant non-overlap episode whose entropy of distribution exceed a specified threshold to train classification model. Experiments demonstrate that select frequent patterns as features can improve classification models effectively.(3) Proposed the concept of feature flow to describe the distributions of frequent patterns in time series, we categorized the frequent pattern into three types according to their spectrum of feature flows and map different types of frequent patterns to different types of event hidden in time series. We proposed EDPA (Event Detection by Pattern Aggregation) algorithm to aggregate tightly correlated frequent patterns, each cluster of frequent patterns represents an event. Experiments demonstrate that inputting maximal significant non-overlap frequent patterns into EDPA algorithm to detect event can get more accurate result than selecting other types of frequent patterns.(4) Proposed a family of static community identification algorithms based on maximal cliques and a dynamic community identification algorithm:Firstly, this paper proposed an algorithm CLIM (CLIque Mining) to discover maximal cliques in complex networks. We designed the prune strategy of CLIM based on the observation that the clustering coefficient of most of the vertices in complex network is large. Experiments demonstrate CLIM performs better than Improved BK(Bron-Kerboscht) algorithm on complex network and random graphs with high clustering coefficient. Based on the maximal cliques found by CLIM, this paper defined a community consists of community core and peripheral vertices and proposed a community identification algorithm CDPM (Clique Directed Percolation Method) based on overlapping maximal cliques. CDPM utilized the proposed SSC (Structure Silhouette Coefficient) as objective function to identify communities in complex networks. The value of SSC is larger the corresponding community identification is better. Experiments demonstrate that CDPM algorithm performs better than CPM (Clique Percolation Method), GN (Given-Newman) algorithms evaluated by F-measure and VAD (Vertex Average Degree). By considering link structures and attributes attached on vertices in complex networks, this paper proposed a JCCM (Joint Clustering Coefficient Method) algorithm to identify communities in informative graph. Firstly, it utilized heuristic method to calculate community cores formed by overlapping maximal cliques. Then JCC is used as the objective function to aggregate community cores and peripheral vertices together, different distance functions are used to measure the distance between vertices in different positions. Experiments demonstrate JCCM performs better than existing algorithms which do not consider structure information and attributes attached on vertices cooperatively. Based on the static community identification algorithms, by defining evolving relationships among static communities at adjacent time points, this paper proposed a dynamic community identification algorithm DCI (Dynamic Community Identification). DCI utilized a MDL (Minimum Description Length) based method to identify communities in static complex networks. Then, we abstract the communities in adjacent complex networks and their overlapping relationship as bi-graphs and reduce the evolutions of communities in adjacent static complex networks into a bi-graph partition problem. A MDL based method is also proposed to partition bi-graphs. Experiments demonstrate that DCI algorithm can identify communities more accurately and run faster than GS (GraphScope), GC (GroupColoring) algorithms.(5) Abstract the process of information transition in a social network as a semi-Markov process model in which vertices appear following some transition probability. We proposed "idle state" to extend multi-dimensional Markov process, and proposed social network information flow model based on it. Each state in state space of information flow model represents an actor in a social network, transition probability in information flow model is the probability that two actors exchange information. This transition probability is determined by the characteristic of actors and the sub-structures of social network actors involved in. We proposed reward function based on social network structure and construct reward Markov model to calculate customer value. Based on information flow model and reward Markov model, we proposed a collaborative filtering algorithm SMRR (Semi-Markov and Reward Renewal) by considering personalized behavior patterns of actors and the influence of interacting social actors together. Experiments demonstrate SMRR algorithm performs better than traditional cosine algorithm and RIF (Rate Information Flow) algorithm evaluated by accuracy.

节点文献中: 

本文链接的文献网络图示:

本文的引文网络