时间序列时序关联规则挖掘研究

The Research of Temporal Association Rules Mining of Time Series

【Author】 周勇

【Advisor】 向蓉美

【Author information】 Southwestern University of Finance and Economics (西南财经大学), Statistics, 2008, PhD

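【Abstract (translated from the Chinese)】 A temporal association rule of a time series is a time-constrained association between local trends of the series. Because these local trends occur in temporal order, the association is itself temporal. Time series are dense, subject to random fluctuation, and massive in volume, so the temporal association rules hidden in them can be obtained only through data mining.

Mining temporal association rules from time series is a systematic undertaking made up of several steps: time series preprocessing, time series compression, similarity measurement of time series patterns, and the acquisition, interpretation, and evaluation of temporal association rules. Research on the methods for each step is still incomplete, mainly in the following respects:

(1) In identifying outlier noise data, statistics-based methods have difficulty obtaining the distribution parameters of the sample, wavelet-transform-based methods alter the original time series, and likelihood-ratio-based methods are computationally expensive.

(2) In classical temporal association rule mining, a sliding window of given length and step discretizes the time series into a pattern sequence, frequent patterns are then extracted, and strong temporal association rules are finally generated. Because the window length and step are chosen by hand, the compression result is highly subjective and the mining result is correspondingly uncertain.

(3) Similarity measurement of time series patterns is the basis for obtaining frequent patterns in a pattern sequence and determines the acquisition of temporal association rules. At present, both the meta-pattern monotonic distance and the meta-pattern vector distance are flawed in how they represent meta-patterns, so the measurement of meta-pattern similarity is problematic. Moreover, existing methods cannot use a distance to measure the similarity of two sequence patterns of different lengths.

Temporal association rules of time series have great practical value, but, as noted above, current mining methods are imperfect. This dissertation therefore focuses on improving and refining the methods for mining temporal association rules from time series, proposing theoretical models and empirical analyses, in order to obtain more reliable temporal association rules from time series and thereby give decision makers better support.

The dissertation is organized around the mining steps and has eight chapters. Each chapter first surveys the theory and the state of research, in China and abroad, relevant to its step, then analyzes the open problems, then proposes improvements, and finally supports them with empirical analysis. The main contents are:

(1) Time series preprocessing. Preprocessing is the first step of mining temporal association rules from time series: how to clean noise data from the series. This part first defines noise data in time series, then surveys existing methods for identifying outlier noise data and analyzes their strengths and weaknesses, and finally proposes an outlier identification method based on the relative rate of change of the data.

(2) Time series compression. Compression is the second step: how to convert a time series into a pattern sequence. This part first analyzes the necessity, purpose, and significance of data compression in temporal association rule mining, then surveys existing compression methods. On this basis it proposes an evaluation framework for time series compression methods and compares the existing methods, selects a compression method suited to temporal association rule mining, and finally improves how the segmentation points of the selected method are determined.

(3) Similarity measurement of time series patterns. Measuring the similarity between time series patterns is one of the important components of temporal association rule mining. Only when the similarity between patterns is measured well can frequent patterns and temporal association rules be obtained from the pattern sequence. The dissertation argues that existing methods for measuring the similarity of two meta-patterns have drawbacks. Since measuring sequence pattern similarity involves two patterns of different lengths, it applies a method for measuring the distance between two points of different dimensions to sequence patterns and proposes a dynamic time warping distance measure for sequence pattern similarity.

(4) Acquisition of temporal association rules. The third step of mining is how to obtain frequent patterns from the pattern sequence and then generate strong temporal association rules. In ordinary temporal association rules, the frequency of an object or event is determined by the number of times it occurs. Because time series patterns differ from one another, however, the number of occurrences of a pattern cannot determine its frequency; frequency should instead be determined by the number of patterns similar to it. In view of this, the dissertation proposes a layered method for acquiring temporal association rules and analyzes it empirically.

(5) Similarity of time series. The dissertation studies time series similarity from two angles. The first is the similarity of univariate time series: it surveys research in China and abroad, analyzes the open problems, proposes a graphic similarity method that takes the temporal nature of time series into account, and analyzes the method's advantages and disadvantages. The second is the similarity of multivariate time series: it analyzes why such a measure is needed and where the difficulties lie, and proposes two measures, one based on the matrix norm and one based on comprehensive attributes.

(6) Mining platform. The platform for mining temporal association rules from time series is developed in Java and has six modules. It implements data loading, time series preprocessing, time series compression, similarity measurement of time series patterns, acquisition of temporal association rules, evaluation of temporal association rules, and time series similarity measurement. It is used both to analyze the improved method for each step empirically and to mine temporal association rules from time series.

The research follows the steps of temporal association rule mining, from the first step, time series preprocessing, to the last, the interpretation and evaluation of temporal association rules. At each step the dissertation reviews existing research, derives the theoretical models involved, and proposes improvements. Because time series similarity plays an important role in time series data mining, the dissertation treats it separately.

The main innovations are:

(1) For time series preprocessing, the dissertation proposes a method for identifying outlier noise data based on the relative rate of change of the data. Time series generally contain noise data, which strongly affects the mining of temporal association rules and must therefore be removed before mining. Because time series compression does not tolerate outlier noise data, and because outliers affect both the segmentation of the series and the representation of its patterns, identifying and deleting outlier noise data is one of the important tasks of preprocessing. Whether a data point is an outlier depends on how sharply it jumps relative to the surrounding data. The dissertation uses the relative rate of change of the time series data as the criterion for this jump and proposes a new outlier identification method.

(2) For similarity measurement of time series patterns, the dissertation proposes a weighted distance for measuring the similarity of two meta-patterns and a dynamic time warping distance that can measure the similarity of two sequence patterns of different lengths. In temporal association rule mining, neither the meta-pattern monotonic distance nor the meta-pattern vector distance is suitable for obtaining frequent patterns. The dissertation therefore proposes a weighted distance for meta-patterns based on the characteristics of time series patterns and, on that basis, a dynamic time warping distance for measuring the similarity of two sequence patterns.

(3) For the acquisition of temporal association rules, the dissertation proposes a layered acquisition method. The time constraint of a temporal association rule and the lengths of its antecedent and consequent determine how the rule is acquired. To reduce the difficulty of acquisition, the antecedents are divided by length, which leads to the layered acquisition method. Because frequent patterns are defined differently, the method differs from ordinary acquisition methods; and because it considers antecedents of every length, it has advantages that other acquisition methods lack.

(4) For measuring the similarity of two time series, existing measures for univariate series ignore the fact that a time series is a function of time, so the dissertation proposes a graphic similarity method for two univariate time series. For multivariate time series, which are stored as matrices, it proposes two similarity measures: one based on the matrix norm and one based on comprehensive attributes.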
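The relative-rate-of-change idea in innovation (1) can be illustrated with a minimal Python sketch. The abstract does not state the actual criterion, so everything concrete here is an assumption made for illustration: the function name `find_outliers`, the rule that a point is flagged when it deviates from both of its neighbours in the same direction by more than `threshold` in relative terms, and the default threshold of 0.5.

```python
def find_outliers(series, threshold=0.5):
    """Return indices of points that jump away from both neighbours.

    Illustrative assumption, not the dissertation's stated rule:
    a point is flagged when its relative change with respect to the
    previous point and to the next point both exceed `threshold`
    and have the same sign (an isolated spike or dip).
    """
    flagged = []
    for i in range(1, len(series) - 1):
        prev, cur, nxt = series[i - 1], series[i], series[i + 1]
        if prev == 0 or nxt == 0:
            continue  # relative change is undefined against a zero neighbour
        change_vs_prev = (cur - prev) / abs(prev)
        change_vs_next = (cur - nxt) / abs(nxt)
        same_direction = change_vs_prev * change_vs_next > 0
        if (same_direction
                and abs(change_vs_prev) > threshold
                and abs(change_vs_next) > threshold):
            flagged.append(i)
    return flagged


spike = find_outliers([10.0, 10.2, 30.0, 10.1, 9.9])
smooth = find_outliers([10.0, 10.1, 10.2, 10.3])
```

Comparing each point only with its immediate neighbours needs no distribution parameters, which is the difficulty the abstract attributes to statistics-based methods; the threshold itself would still have to be chosen for the data at hand.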

【Abstract】 Temporal association rules of time series are time-constrained associations among local changes of a time series. These local changes occur in temporal order, so the associations are themselves temporally ordered. Time series are dense and subject to stochastic fluctuation, and the temporal association rules among local changes are hidden in large data sets, so the rules can be obtained only through data mining.

Mining temporal association rules from time series is a systematic process that can be divided into time series data pre-processing, time series data compression, time series pattern similarity measurement, the acquisition of temporal association rules, and the interpretation and evaluation of temporal association rules. Research on mining methods for temporal association rules has made considerable progress but is far from complete. The main shortcomings are as follows.

(1) In recognizing outliers, statistics-based methods have difficulty obtaining the sample's distribution parameters, wavelet-transform-based methods alter the original time series, and likelihood-ratio-based methods are computationally expensive.

(2) In classical temporal association rule mining, the time series is discretized into a sequence of patterns by a sliding window of given length and step. Frequent patterns are then extracted, and strong temporal association rules are finally generated. Because the length and step of the sliding window are chosen by hand, the compression result, and hence the mining result, is highly uncertain.

(3) Similarity measurement of time series patterns is the basis for acquiring frequent patterns in a pattern sequence and also determines the acquisition of temporal association rules. The meta-pattern monotonic distance and the meta-pattern vector distance are both flawed in how they represent the meta-pattern, so the similarity measurement of meta-patterns is problematic.
Moreover, existing methods for measuring sequence pattern similarity cannot measure the similarity of two sequence patterns of different lengths.

Temporal association rules of time series have practical value, but the existing mining methods are flawed. The dissertation therefore focuses on improving and refining the methods for mining temporal association rules from time series, offering theoretical models and empirical analysis, in order to obtain more reliable temporal association rules from time series and to support decision-making.

The dissertation is organized around the steps of mining temporal association rules. For the shortcomings of each step, the author summarizes the relevant existing research, then proposes solutions and carries out empirical analysis. The dissertation is divided into eight chapters; the main contents are as follows.

(1) Time series data pre-processing. Pre-processing is the first step of mining temporal association rules, namely how to clean noise data from the time series. In this part, the author first defines noise data, then reviews existing methods for recognizing outliers in time series and analyzes their advantages and disadvantages, and finally proposes an outlier recognition method based on the relative rate of change of the time series data.

(2) Time series data compression. Compression is the second step of mining temporal association rules, namely how to transform a time series into a sequence of patterns. The author first analyzes the necessity, purpose, and significance of compressing data when mining temporal association rules, then reviews existing compression methods and proposes an evaluation system for time series data compression.
After a comparative analysis, the author chooses a compression method that favors mining and finally improves how its segmentation points are determined.

(3) Time series data similarity measure. Similarity measurement of sequence patterns is an important part of mining temporal association rules of time series. Only if the similarity among patterns is properly measured can frequent patterns in the pattern sequence and temporal association rules be acquired successfully. The existing methods for measuring the similarity of two meta-patterns have disadvantages. Because measuring the similarity of sequence patterns involves two patterns of different lengths, the author applies a method for measuring the distance between points of different dimensions and proposes a dynamic time warping distance measure for sequence patterns.

(4) Acquisition of temporal association rules. The third step of mining temporal association rules is how to obtain frequent patterns from the pattern sequence and then build strong temporal association rules. In ordinary temporal association rules, an object or event either occurs or does not, and its frequency depends on the number of times it occurs. Because time series patterns differ from one another, frequency cannot be determined by the number of times a single pattern occurs, but by the number of patterns similar to it. Given this particularity of time series patterns, the author proposes a layered method of acquiring temporal association rules and analyzes it empirically.

(5) Similarity of time series. The dissertation examines the similarity of time series from two aspects. On the one hand, it studies the similarity of univariate time series. Based on a review of existing research on time series similarity, the author proposes the graphic similarity method for measuring the similarity of time series and analyzes the method. On the other hand, the dissertation studies the similarity of multivariate time series.
The author first analyzes the necessity of this research and then its difficulties, and finally proposes two ways to measure the similarity of multivariate time series, one based on the matrix norm and one based on comprehensive attributes.

(6) Mining platform for temporal association rules of time series. The mining platform uses Java as its development language and has six modules. It provides functions such as loading data, time series data pre-processing, time series data compression, time series pattern similarity measurement, the acquisition of temporal association rules, and the interpretation and evaluation of temporal association rules. The dissertation supports every improvement with empirical analysis and also implements the mining of temporal association rules from time series.

Drawing on the theory of mining temporal association rules, the dissertation carries out systematic research on every step, from the first step, time series data pre-processing, to the last step, the interpretation and evaluation of temporal association rules. In every step, the author reviews the existing research, examines the relevant theoretical models, proposes improvements, and supports them empirically. Because the mining of multivariate time series is a hot issue, the author discusses it in the last part.

The innovations of the dissertation are summarized as follows.

(1) In time series data pre-processing, the author proposes a method for recognizing outlier noise data based on the relative rate of change of the data. Time series usually contain noise data, which affects the mining of temporal association rules and should therefore be removed before mining. Because time series compression is intolerant of outlier noise data, and because outliers affect the segmentation of time series and the representation of time series patterns, identifying and deleting outliers is one of the important tasks in time series data pre-processing.
Whether a datum is an outlier depends on how sharply it jumps relative to the surrounding data. The author uses the relative rate of change of the time series data to judge the degree of this jump and thereby proposes a new method for recognizing outlier noise data.

(2) In time series pattern similarity measurement, the author proposes a weighted distance method to measure the similarity of two meta-patterns, and also a dynamic time warping distance method to measure the similarity of two sequence patterns of different lengths. In mining temporal association rules, neither the meta-pattern monotonic distance method nor the meta-pattern vector distance method is suitable for acquiring frequent patterns when measuring the similarity between two meta-patterns. In view of the characteristics of time series patterns, the dissertation proposes a weighted distance for meta-patterns and, on that basis, a dynamic time warping distance method that can measure the similarity between two sequence patterns.

(3) In the acquisition of temporal association rules, the author proposes the layered method. The time constraint of temporal association rules and … of association rules determine the difficulty of acquiring temporal association rules. To reduce the difficulty, the antecedents of temporal association rules are divided by length and then mined; this is the so-called layered mining of temporal association rules. Because frequent patterns are defined differently, the method differs from other mining methods. At the same time, because the method considers antecedents of different lengths, it has advantages that other methods lack.

(4) When measuring the similarity between two time series, because existing measures for univariate time series ignore that a time series is a function of time, the dissertation proposes the graphic similarity measure.
Meanwhile, for measuring the similarity of multivariate time series, considering that multivariate time series are stored as matrices, the dissertation proposes two methods: one based on the matrix norm and one based on comprehensive attributes.
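The point made in item (3) and innovation (2), that dynamic time warping can compare two sequence patterns of different lengths, can be shown with a short Python sketch of the textbook dynamic time warping recurrence. It is an illustration under assumptions, not the dissertation's method: each meta-pattern is reduced to a single number (for example a segment slope), and the local cost is the absolute difference, standing in for the weighted meta-pattern distance whose definition the abstract does not give. The names `dtw_distance` and `cost` are hypothetical.

```python
def dtw_distance(a, b, cost=lambda x, y: abs(x - y)):
    """Classic dynamic time warping distance between two sequences.

    `a` and `b` may have different lengths. `cost` is the local
    distance between two elements; the default absolute difference
    is a placeholder for a meta-pattern distance.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # d[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = cost(a[i - 1], b[j - 1])
            d[i][j] = local + min(
                d[i - 1][j],      # a[i-1] also aligned with an earlier b element
                d[i][j - 1],      # b[j-1] also aligned with an earlier a element
                d[i - 1][j - 1],  # both sequences advance together
            )
    return d[n][m]


same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
shifted = dtw_distance([1, 2, 3], [2, 3, 4])
```

Because one element may be aligned with several elements of the other sequence, the three-element and four-element patterns in the first call align at zero cost, which a point-by-point distance cannot express for sequences of unequal length.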
