节点文献

交易序列数据挖掘研究

Research on Transaction Sequence Data Mining

【作者】 汤春蕾

【导师】 朱扬勇;

【作者基本信息】 复旦大学 , 计算机软件与理论, 2011, 博士

【摘要】 交易序列数据描述的是在各类交易过程中商品或证券价格随时间的变化规律,分析这些数据能为商家或投资者制定营销策略或选择价值投资方法提供量化依据,由此交易序列数据挖掘技术成为当前研究和应用的热点。交易序列数据挖掘的目的是识别商品或证券交易价格变化规律,主要任务有分类、聚类、关联分析和异常检测等,还可以进行各种扩展的数据分析与挖掘,如允许有时间间隔约束的关联规则、数据有缺失值存在的模式分析等。目前,针对交易序列数据的大量研究使用的是其他序列数据挖掘与分析方法,比如将其离散时间的序元序看作连续的、使用时间序列结构化或非结构化模型与各种复杂算法相结合的方法,又如忽略其数值型序元值、使用特征构建成事件序列进行频繁模式挖掘方法;再如将其数值型的序元值进行字符表示、使用字符序列模式查找的方法。这些研究方法存在以下两方而问题:一方而,没有同时兼顾交易序列数据本身固有的离散时间序和数值型元素值两大特性;另一方而,没有利用可用的经济与金融领域知识。兼顾交易序列本身原有特性并有效找到各种符合领域意义的频繁相似模式,能使数据分析与挖掘结果更有效。本文从交易序列基本模式出发,定义了5种交易序列原子模式(包括:趋平模式、头部模式、底部模式、增长模式和下降模式)及其关联关系,即交易序列复合模式,着重研究了交易序列模式挖掘、交易序列模式查询与预测和基于交易序列模式的聚类三方面问题,主要研究成果如下:(1)针对交易序列模式挖掘问题,在原子模式快速查找及其TOP K频繁项挖掘两个算法的基础上,提出了一种频繁的交易序列复合模式挖掘算法。频繁的交易序列复合模式是由多种满足一定时间约束及其周期循环关系的交易原子模式频繁集组成的,在此项挖掘任务中,由于候选原子模式空间是呈指数级增长的,因而效率问题成为一个瓶颈。首先,根据领域知识定义了5种交易序列原子模式,提出了一种伸缩距离函数的序列模式通用相似性度量及其趋势融合和对称使用距离函数的计算方法,将“缩放”相似的4种交易序列原子模式(除趋平模式)分别转化为相似性无向图进行谱聚类;然后,在以结果簇近似代替最大团的基础上,引入时间约束代替趋平模式找到由各种交易序列原子模式频繁集构成的频繁复合模式。在真实股票交易序列集上,采用多种相似性计算方法比较得到算法准确性,并且所求得的频繁复合模式有较好的应用解释。(2)针对交易序列模式查询问题,提出了两种有效的相似性查询算法。在现实应用中,交易序列有一种重要的相似性——“缩放”相似性,这是交易序列模式在时间维度上的“弹性”拉长或缩短但会保留在数值维度上整体变化趋势的一种相似性。因而如何定义合理的相似性度量来捕捉这种相似性是一个需要解决的重要问题。针对序列间的细微变化,先对待查询的序列进行单调区间的“融合”处理,然后根据各区间的长度和幅度比例进行序列模式的候选产生,最后使用伸缩距离函数作为相似性度量进行计算并返回最后结果;针对交易序列的价格区间变化,先将所有序列进行规范化,在改进伸缩距离函数定义的基础上进行计算并得到查询结果。实验结果表明,“趋势融合”和“价格融合”两种相似性查询算法都能找到在总体形状上与给定序列模式“放大”或“缩小”的所有模式结果。(3)针对交易序列预测问题,提出了一种有较高准确率的序列模式趋势预测算法。预测是根据给定交易序列数据集,对给定待查序列的后续时间进行数值属性上的估计。由于数据变化的复杂性,在交易序列中进行趋势预测比精确预测更有意义,因而提高对给定序列趋势预测准确度成为预测问题的关键。基于“价格融合”的相似性查询,本文使用Parzen窗密度和KNN的估计两种方法分别证明了将查询结果候选集的TOP k个结果的后续长度为τ的模式加权平均,能近似替代全部查询结果,进而综合出预测结果。在真实股票交易序列集上的实验结果表明,趋势预测有较高的准确率。(4)针对交易序列聚类问题,提出了一种考虑时限约束目标函数的聚类算法。交易序列进行聚类选择何种对象进行很关键。在一定时问范围内,总体呈增长或下降趋势更能反映商品或证券的价格规律,因而从原始的交易序列中提取了这种反映局部信息的增长或下降模式进行特征创建并进行聚类的意义大于直接使用原始交易序列。首先,从商品或证券价格及其变化趋势等角度研究了交易序列集的内在结构,定义了一种反映价格变化趋势的增长或下降模式及其错位组合距离和角度向量距离两种递进的相似性度量,在此基础上,设计了一个考虑时限约束的目标函数进行先划分再层次合并的聚类研究。实验结果表明,在时限约束的条件下,增长或下降模式这种特征提取方式及其模式间的两种距离函数能较好地产生聚类结果,并且这些聚类结果能得到较好地解释。

【Abstract】 Transaction Sequence Data describes the way prices of commodities or securities change with time in various transactions. The analysis of such data could provide the quantitative basis for merchants or investors in making marketing strategies or choosing investing methods, thus making the mining technology for Transaction Sequence Data a topic of great interest to both research and application.The mining of Transaction Sequence Data aims to identify the price changing patterns of commodities or securities and its major tasks include classifying, clustering, association analysis, anomaly detection and so on. It can also conduct a variety of expanded data analysis and mining, such as association rules allowing for time constraints, pattern analysis for data with missing values and so on.At present, extensive research on Transaction Sequence Data has adopted other sequence data mining and analyzing approaches, such as the one which treats the discrete chronological order as continuous and combines time-series structured or non-structured patterns with various complex algorithms or the one which neglects the numeric sequence elements and constructs characteristics into sequence of events for the mining of frequent patterns. There is also one approach which denotes numeric sequent elements in characters and search through character sequence patterns. Such research approaches are confronted with the following two types of problems. On one hand, they fail to give equal weight to the discrete chronological order and numeric value elements, two intrinsic features of Transaction Sequence Data. On the other hand, they do not utilize the available domain knowledge in the areas of economy and finance. To effectively locate various frequent patterns that fit the domain significance with the innate features of Transaction Sequence factored in will make the results of data analysis and mining more valid.Departing from basic Transaction Sequence Patterns, this paper defines five Transaction Sequence Primitive Patterns, i.e. Flat Pattern, Top Pattern, Bottom Pattern, Growth Pattern and Decline, and their correlations, i.e., Transaction Sequence Composition Patterns, with emphasis on three issues, namely, the mining of Transaction Sequence Patterns, the inquiry and prediction of Transaction Sequence Patterns and clustering based on Transaction Sequence Patterns. Below are the major findings of this paper:(1) In response to the mining of Transaction Sequence Patterns, this paper proposes an algorithm for the mining of Frequent Transaction Sequence Composition Patterns based on the algorithms for primitive pattern quick search and the mining of their TOP K frequent itemsets.Frequent Transaction Sequence Composition Patterns are composed of multiple Transaction Primitive Pattern Frequent Sets which satisfy certain time constraints and a cycling relationship. For this mining task, as the space of candidate primitive patterns grows exponentially, the efficiency issue poses as a bottleneck.Five Transaction Sequence Primitive Patterns are firstly defined according to the domain knowledge. A universal similarity measure for sequence patterns employing a stretchable distance function is put forward as well as its trend merging and symmetric calculation. Thus, Transaction Sequence Patterns (Flat Pattern excluded) with "stretchable" similarities are converted into similarity undirected graphs for spectral clustering. Then, based on the substitution of result cluster approximation for Maximal Cliques, time constraints are introduced to replace Flat Patterns so as to locate Frequent Composition Patterns composed of multiple Transaction Sequence Primitive Pattern Frequent Sets. Several similarity calculations are compared in real stock transaction sequence sets to obtain the calculation accuracy and the Frequent Composition Patterns achieved have good application interpretations.(2) In response to the inquiry of Transaction Sequence Patterns, this paper proposes two effective similarity inquiry algorithms.In practical applications, Transaction Sequence Patterns embody an important type of similarity, called "stretchable" similarity, which causes Transaction Sequence Patterns to lengthen or shorten "strecthably" in the time dimension but maintain the overall trend in the value dimension. To define appropriate similarity measures to capture such similarities is an important issue demanding solutionsRegarding the slight variations between sequences, firstly "merging" is conducted on monotonous intervals of sequences to be inquired, then candidates of sequence patterns are generated according the ratios of length to amplitude of each interval, and lastly calculation is done employing the stretchable distance function as the similarity measure with final results returned. For the changes in price intervals of Transaction Sequence, firstly all sequences are normalized, then calculation is done based on the improved definition of stretchable distance function and inquiry results are obtained. As shown by the experiment results, both "trend merging" and "price merging", the two similarity inquiry algorithms, can find all the pattern results whose overall shapes are the "enlargement" or "reduction" of the given sequence patterns.(3) In response to the prediction of Transaction Sequence Patterns, this paper proposes a sequence pattern trend prediction algorithm with high accuracy.Prediction, based on given Transaction Sequence Datasets, estimates the numerical attributes of successive intervals of given sequences to be inquired. Due to the complexity of data variation, conducting trend prediction in Transaction Sequence is more meaningful than accurate prediction and thus to increase the accuracy of trend prediction for given sequences is key to prediction.Based on the similarity inquiry of "price merging", utilizing the methods of Parzen window density and KNN estimation, this paper proves that to conduct weighted average on the patterns with a length of succeeding the TOP K results of the candidate sets of inquiry results can approximately replace all the inquiry results and subsequently produce the prediction results in a comprehensive way. The results of experimenting with real stock transaction sequence sets show that trend prediction has a high accuracy level.(4) In response to Transaction Sequence clustering, this paper proposes a clustering algorithm which factors in the goal function with time constraints.The choice of objects is crucial to the clustering of Transaction Sequence. Within a certain time span, an overall growing or declining trend can more faithfully reflect the price patterns of commodities or securities. Therefore, to extract, from the original Transaction Sequence, such growth or decline patterns with local reflection for characteristic creation and clustering is more valuable than the direct use of the original Transaction Sequence.Firstly, the inner structure of Transaction Sequence Sets is studied from the perspectives of commodity or security prices and their trends, and then a growth or decline pattern reflecting price trends is defined along with two progressive similarity measures, i.e. shifted window combined distance function and angle vector distance function. On this basis, a goal function with time constraints taken into account is designed for research which firstly studies partitioned clustering and then hierarchical clustering. The experiment results show that, under time constraints, such a characteristic extraction method for Growth or Decline Pattern and the two distance functions between patterns can effectively produce clustering results, which can be satisfactorily interpreted.

  • 【网络出版投稿人】 复旦大学
  • 【网络出版年期】2012年 08期
节点文献中: 

本文链接的文献网络图示:

本文的引文网络