节点文献

数据流模式挖掘算法及应用研究

Pattern Mining Algorithms Over Data Streams

【作者】 王乐

【导师】 冯林;

【作者基本信息】 大连理工大学 , 计算机应用技术, 2013, 博士

【摘要】 随着各行业对数据越来越重视和信息技术的快速发展,产生的数据越来越全面,同时数据量也在快速的增长;并且各行业又要求能及时对已产生的数据进行挖掘和分析,这使得数据流挖掘技术愈发重要。由于数据流具有海量性、实时性和动态变化性的特点,这就要求数据流上的挖掘算法有较高的时空效率。尽管数据流上数据挖掘技术取得了一定的进展,但是挖掘算法的时空效率仍然是当前数据挖掘领域中的研究焦点之一。本文主要研究了数据流模式挖掘算法,包括传统数据集类型中的频繁模式挖掘以及大数据集下的频繁模式挖掘、不确定数据流中的频繁模式挖掘、和高效用模式挖掘。本文首先对已有的频繁模式和高效用模式挖掘算法进行了回顾,详细的介绍了算法Apriori和FP-Growth等;然后在对典型的挖掘算法和最新研究成果进行分析研究的基础上,深入研究了传统数据中的频繁模式挖掘、不确定数据上的频繁模式挖掘和具有效用值的数据中的高效用模式挖掘算法。本文取得了如下的创新性研究成果:(1)在传统数据的频繁模式挖掘算法研究中,提出新的尾节点数据结构和一种最多两次MapReduce的并行挖掘算法。针对数据流中的频繁模式挖掘问题,采用尾节点和尾节点表来提高窗口内数据更新的时间效率和维护的空间效率;并通过提高窗口内频繁模式挖掘算法的时间效率,进而提高数据流中模式挖掘的整体时间效率。针对大数据下的数据流频繁模式挖掘问题,首先通过一次MapReduce找到局部频繁模式做为候选项集,然后通过给出的剪枝策略对候选项集进行剪枝,最后进行第二次MapReduce对候选项集中剩余项集进行支持数统计;在多数情况下,该算法不需要第二次MapReduce就可以有效的挖掘到所有的频繁模式。(2)在不确定事务数据的频繁模式挖掘算法研究中,提出具有更高压缩率的树结构来改进不确定数据集及数据流上的频繁模式挖掘算法。首先利用数组来存储事务项集的概率,然后将事务概率在数组中的索引和事务项集映射到一棵树上,从而可以有效的降低维护不确定数据集的树节点个数。在此基础上,结合滑动窗口技术,同时给出两种新的树结构分别来维护窗口中数据和挖掘过程中的子数据集,保证在挖掘的过程中使窗口中事务项集的信息不会从树上丢失;从而使频繁模式挖掘算法的时空效率得到较大的提升。另外,本文还提出一种新的具有权重的频繁模式挖掘模型和算法;该模型主要是将项的权重值引入到频繁模式的挖掘过程中,将权重值大的模式考虑到挖掘结果中。(3)在高效用模式挖掘算法研究中,提出避免使用高估效用值的不产生候选项集的挖掘算法。首先本文提出一个新的树结构来维护事务项集及效用值信息,通过该树结构可以得到项集的准确效用值,而不是高估效用值,从而保证不通过候选项集就可以挖掘到所有的高效用模式,因此可以提高算法的时空效率。在此基础上,结合滑动窗口技术,同时给出一个新的树结构维护窗口中数据,可以使算法通过一遍数据集扫描,在不产生候选项集的前提下就可从数据流中挖掘高效用模式。相对KDD会议和TKDE期刊上最新发表论文UP-Growth算法,新提出的算法的时间效率提高1到2个数量级。

【Abstract】 Along with the rapid development of information science and the explosive accumulation of industrial data, data mining and analyzing is attracting more and more concern from all areas of industry. The demand for timely processing and analyzing of ongoing data makes it an important research topic that focuses on the mining techniques on data streams. Because data streams are massive, real-time and volatile, mining algorithms on data streams should be more efficient on both space and time. A number of valuable academic works have been performed on data mining over data streams, yet the improvement of mining efficiency is still a scientific hotspot that needs further research.This dissertation focuses on pattern mining algorithms over data streams, including frequent pattern mining over traditional datasets, frequent pattern mining over big data datasets, frequent pattern mining over uncertain data streams, and high utility pattern mining. Firstly we reviewed existing algorithms on frequent pattern mining and high utility pattern mining, with detailed introduction for Apriori and FP-Growth algorithm; then based on the study of classical mining algorithms and state of the art research works, performed in-depth researches on frequent pattern mining over traditional datasets, frequent pattern mining over uncertain datasets, and high utility pattern mining over datasets. Innovative research achievements of this thesis are presented as the following.(1) In the study on algorithms of mining frequent pattern over traditional datasets, this thesis proposes a tail-node data structure and a parallel algorithm with no more than two rounds of MapReduce. For the problem of frequent patter mining over data streams, tail-node&tail-node table are utilized in a sliding window approach to improve the time efficiency of data updating within the window, as well as the space efficiency of data structure maintenance; together with strategies to improve the time efficiency of within-window frequent pattern mining process, the overall performance of the mining algorithm is improved.For the problem of mining frequent patterns over streams of big data:firstly, one round of MapReduce is performed to find local frequent patterns as candidate itemset; secondly, a pruning strategy is utilized to prune the candidates; finally a second MapReduce is performed to count the support numbers of the remaining itemset; in most cases, only one round of MapReduce is needed for our algorithm to discover all frequent patterns.(2) In the study on algorithm of mining frequent pattern over uncertain transaction datasets, a tree structure with higher compression ratio is proposed to improve the efficiency of the mining algorithms. The probability values of transaction itemsets are stored in arrays and then mapped to a tree to reduce the number of nodes needed for maintaining transaction dataset; together with the sliding window approach, two new tree data structures are proposed to maintain the within-window data and sub datasets created in the process of mining, ensures no loss of information during the mining process, and significant improvement of time&space efficiency of the mining algorithm. Furthermore, this thesis also proposes a new weight based frequent pattern mining model; this model takes into consideration the item’s weight factor of a transaction itemset in the mining process, and includes high weight patterns into the mining result.(3) In the study on algorithms of mining high utility patterns, new mining algorithms are proposed without over-estimated utility values and without generating candidate itemsets.An innovative tree structure is proposed for maintaining transaction itemsets and their utility information; accurate utility values instead of over-estimated values can be retrieved from this tree, so high utility patterns can be discovered without generating candidates, and the algorithm efficiency is improved.Based on this algorithm and the sliding window method, together with a new tree structure for maintaining the within-window data, the proposed algorithm mines high utility patterns from the data stream without generating candidate itemsets by one scan of dataset. Compared with the state of art algorithm UP-Growth published in KDD and TKDE, the proposed algorithms can be one or two orders of magnitude better in terms of running time.

节点文献中: 

本文链接的文献网络图示:

本文的引文网络