节点文献

不确定时间序列的相似性匹配问题研究

Research on Uncertain Time Series Similarity Matching

【作者】 左彦飞

【导师】 刘国华;

【作者基本信息】 东华大学 , 计算机软件与理论, 2012, 硕士

【摘要】 时间序列,就是按照时间先后顺序排列的记录序列。相似性匹配是时间序列的聚类、异常检测、模式发现等任务的基础操作之一。目前对时间序列相似性匹配的研究主要针对确定性数据,随着物联网、隐私保护等技术的发展,不确定时间序列将大量涌现,时间序列的相似性匹配技术面临新的挑战。在不确定时间序列的情况下,两条序列之间的距离也是不确定的,所以无法直接利用确定性时间序列的相似性匹配方法。为了解决不确定时间序列相似性匹配问题,我们建立了一种描述不确定时间序列的数据模型,在该模型下,不确定时间序列在每一时刻的数据点均由一个取样点(sample observations)的集合组成,并且每个取样点出现的概率相等,即服从离散型均匀分布;并且,时间序列中不同时刻的点相对独立。在此模型下,两条不确定时间序列之间的真实距离是由大量的可能距离(以一定的概率值出现)组成的,并且这些可能距离的数量为指数大小。所以,直接计算所有的可能距离的效率将非常低。因此,在所提出模型的基础上,本文提出了两种不确定时间序列相似性匹配算法:α-PRQ(均值法)和k-PRQ(聚类法)。(1)α-PRQ根据查询序列和数据库中所存储的时序数据是否为确定性数据,将不确定时间序列相似性查询分为三种不同的类型;然后,对于每种类型,通过均值法(averaging method)从不确定序列中提取出一条确定性序列来代表原序列,然后,采取确定性时间序列相似性匹配方法进行查询。(2)k-PRQ此算法主要通过两个步骤进行剪枝以降低计算复杂性:1)通过聚类减小取样大小(sample size),以聚类后的每一个簇为单位计算距离,从而大大降低了计算复杂度。2)通过预先计算出小于给定阈值ε的距离个数的上界与下界,就能够得到这些距离出现概率的上下界,从而通过概率的上下界过滤掉不必要的计算,减少计算量。实验表明,我们提出的两个不确定时间序列相似性匹配算法具有较好的性能和准确性。

【Abstract】 A time series is a sequence records according with the chronological order. Similarity matching is one of the underlying operations for time series clustering, outlier detection and pattern discovery tasks.Currently, study of time series similarity matching mainly focuses on deterministic data. With the development of the Internet of things and privacy protection technology, uncertain time series will be in large numbers and time series similarity matching technology is facing new challenges.In the case of uncertain time series, the distance between the two sequences is uncertain, so the way of similarity matching on deterministic time series cannot use directly.In order to solve the problem of uncertain time series similarity matching, we have established a data model to describe the uncertain time-series. Under this model, the data point at each time slot was built up by the set of one sample observations. Each sampling point has the same probability of occurrence, that is uniformly distributed and different time points of the time series is relatively independent. In this model, the true distance between two uncertain time series are consisting of a large number of possible distance (with a certain probability value). Therefore, on the basis of the model proposed by this paper, two algorithms have be proposed for uncertain time series similarity matching:a-PRQ (mean method) and k-PRQ (cluster method).(1) a-PRQAccording to the query sequence and time series data stored in database are whether deterministic, The uncertain timing sequence similarity query is divided into three different types; Then, for each type, by the means method (averaging method) extracted from the sequence of uncertainty out of a deterministic sequence to represent the original sequence take the deterministic time series similarity matching the query. (2)k-PRQThis algorithm is mainly through a two-step pruning to reduce the computational complexity:1) Through the cluster to reduce the sample size (sample size) to calculate the distance to each cluster after clustering as a unit, thereby greatly reducing the computational complexity.2) Pre-calculated a given thresholdε, from the number of upper and lower bounds, we can get the distance to the probability of the upper and lower bounds through probability of the upper and lower bounds, it filters out unnecessary calculations and thus reduce the computational complexity.The experiments show that the two uncertain time series similarity matching algorithm has better performance and accuracy.

  • 【网络出版投稿人】 东华大学
  • 【网络出版年期】2012年 07期
节点文献中: