节点文献

数据流频繁模式挖掘关键算法及其仿真应用研究

Research on Key Algorithms for Mining Frequent Patterns in Data Streams and Their Application in Simulation System

【作者】 敖富江

【导师】 黄柯棣;

【作者基本信息】 国防科学技术大学 , 控制科学与工程, 2008, 博士

【摘要】 系统仿真技术综合集成了计算机技术、网络技术、图形图像处理技术、信息处理技术、自动控制技术等多个领域的知识,是系统分析和研究的重要手段。数据挖掘技术是获取仿真数据中隐藏知识的有力工具。随着仿真系统复杂程度的提高和规模的增大,仿真时间越来越长、仿真所产生的数据量越来越大。这使得仿真数据具有数据流的特征。因此有必要采用数据流挖掘技术处理仿真数据。数据流是一种连续、高速、无限、时变的有序数据序列。数据流的特征对数据流的挖掘提出了严峻的挑战。传统面向静态数据集的算法无法直接用于挖掘数据流,而现有数据流挖掘算法存在时空效率不高的缺陷。因此,针对仿真中常用的数据挖掘任务,研究时空效率高效的相应数据流挖掘算法具有重要意义。关联规则挖掘是仿真中最常用的一类数据挖掘任务,而频繁模式挖掘是生成关联规则的关键步骤。为此,论文研究了数据流中频繁模式挖掘的关键算法,重点研究了数据流中最大频繁项集、频繁闭项集和Top-K最频繁项集的挖掘算法,以及基于频繁闭项集的数据流分类算法和基于Top-K频繁模式的高维数据流聚类算法。论文最后研究了如何将数据流挖掘算法快速集成到不同的仿真系统中,着重考虑了数据流挖掘算法资源在仿真中的重用。论文的主要研究工作及创新包括以下六个方面:(1)提出了一种数据流最大频繁项集挖掘算法。相对于完全频繁项集和频繁闭项集,最大频繁项集的数目最少,挖掘最大频繁项集的算法具有较高的时空效率。为此,论文研究了数据流中最大频繁项集的挖掘技术,旨在提供一种能够在任意时刻都快速维护数据流滑动窗口中最大频繁项集的算法。主要研究内容包括三个方面。首先提出了一种面向数据流的最大频繁项集剪枝技术,即子集等价剪枝技术。接着,提出了一种最大频繁项集单遍挖掘算法FPMFI-DS。其中,FPMFI-DS算法中应用了子集等价剪枝技术以降低算法的搜索空间大小,从而提高算法效率。最后,基于FPMFI-DS算法,提出了一种能够在线更新挖掘数据流滑动窗口中最大频繁项集的算法FPMFI-DS+。实验表明,对于稠密数据集子集等价剪枝技术能够缩小约40%的搜索空间;FPMFI-DS算法的挖掘速度快并具有良好的可扩展性;FPMFI-DS+算法更新挖掘速度快并具有良好的稳定性。(2)提出了一种数据流频繁闭项集挖掘算法。频繁闭项集的数目介于完全频繁项集和最大频繁项集之间,并保存了所有项集的支持度信息。因此挖掘数据流中的频繁闭项集既具有较高的时空效率,又保证了信息的完全性。为此,论文提出了一种频繁闭项集挖掘算法FPCFI-DS。该算法能够在有限的存储空间中高速挖掘数据流滑动窗口中的频繁闭项集,并且能够在任意时刻都维护数据流当前窗口中的频繁闭项集。实验表明,FPCFI-DS算法的时空效率显著优于同类经典算法Moment。(3)提出了一种数据流Top-K最频繁项集挖掘算法。Top-K最频繁项集挖掘的优点是不需要用户指定最小支持度阈值,仅指定需要寻找的项集数目k。已有Top-K最频繁项集挖掘算法存在初始项目数目过多、初始边界支持度过高的问题。为此,论文首先提出了一种基于混合搜索方式的高效Top-K最频繁项集挖掘算法MTKFP。该算法综合利用宽度优先搜索和深度优先搜索挖掘Top-K最频繁项集。然后基于MTKFP算法,提出了一种基于Chernoff不等式的数据流Top-K最频繁项集挖掘算法MTKFP-DS。实验表明,MTKFP算法所获得的初始项目数目至少低于已有算法70%,初始边界支持度高于已有算法,从而MTKFP算法的性能优于已有最好算法1倍以上;MTKFP-DS算法适合于对数据流数据的挖掘。(4)提出了一种基于频繁闭项集的数据流分类算法。相对于某些传统分类算法,基于关联规则的分类具有更高的精度。此类算法通常采用频繁项集作为生成类关联规则的依据。但挖掘频繁项集易遭受组合爆炸问题,从而影响算法效率;另外,数据流的出现也对分类算法提出了新的挑战。为此,论文提出了一种高效的基于频繁闭项集的数据流分类算法CBC-DS。在该算法中,设计了高效的频繁闭项集单遍挖掘算法和有效的分类器构建方法。实验表明,CBC-DS算法的平均分类精度比经典算法CMAR高1.09%左右,分类速度快于CMAR算法。(5)提出了基于Top-K频繁模式的高维数据流聚类算法。高维数据聚类是聚类问题中的研究难点。基于密度和基于网格的综合方法能够较好地解决该问题,该方法的关键在于发现高密单元格。传统方法采用挖掘频繁项集的方式发现高密单元格,该方式的不足是需要用户指定最小密度阈值,而且不利于发掘稀疏子空间中的高密单元格。为此,论文分别提出了基于Top-K最频繁项集、基于N-most interesting项集和基于Top-K项目的高维数据流聚类算法。这些算法不需要用户指定最小密度阈值。第二种算法有利于特定维的子空间分组的高密单元格发掘,第三种算法有利于特定子空间的高密单元格的发掘,从而解决稀疏子空间中高密单元格的发掘。实验表明,所提出的算法适用于对高维数据流的聚类。(6)研究了数据流挖掘技术在仿真中的应用。论文提出了基于数据流挖掘技术的仿真应用框架。并且为了能够将数据流挖掘算法快速集成到基于HLA体系结构的仿真系统中,采用模块化开发思想设计了通用性强的数据流挖掘构件和通用数据流挖掘成员,以提高算法资源的重用性。并以“导弹突防仿真系统”为例,介绍了通用关联规则挖掘成员的设计思想。

【Abstract】 System simulation takes advantages of a lot of technologies from other domains, such as computer technology, network technology, graphics and image processing technology, information processing technology, automatic control technology and so on. It is a significant step for system analysis and research. Data mining technology is a powerful tool for discovering hidden knowledge from simulation data. With the increasing of complexity and scale in simulation system, simulation time becomes much longer and simulation data amount becomes much larger, which suggests simulation data exhibits the characteristics of data streams. So it is necessary to process simulation data with data stream mining technology. A data stream is an ordered sequence of data items with the characteristics of continuity, high data rates, infinity, and the distribution of data items changing with time. Those facts bring tremondous challenges to data-stream mining. Traditional data mining algorithms aiming at static datasets can’t be used to mine data streams directly, neither do they have the time and space efficiency. Thus, it is important to research data stream mining algorithms having higher time and space efficiency, and to aim at resolving data mining tasks often used in system simulation.In all data mining tasks, Association rule mining is the one mostly used in simulation. Mining frequent patterns is the key step to generate association rules. So, the research focuses on the key algorithms for mining frequent patterns in data streams. Particularly, three important aspects are researched and implemented, including the algorithms for mining maximal frequent itemsets, closed frequent itemsets, Top-K most-frequent itemsets in data stream, a classification algorithm based on closed frequent itemsets for data stream and a clustering algorithm based on Top-K frequent patterns for high dimensional data stream. Lastly, this paper also discusses how to quickly integrate the data stream mining algorithms into various simulation systems, with the emphasizing on the reusability of the data stream mining algorithms in simulation systems. The main innovative achievements of this research can be summarized as follows:(i) An algorithm for mining maximal frequent itemsets in data stream is proposed. The number of maximal frequent itemsets is much less than that of frequent itemsets or closed frequent itemsets. So, mining maximal frequent itemsets can get higher time and space efficiency. Thus, this paper researches on the technology of mining maximal frequent Itemsets in data stream, aiming at presenting an algorithm that can maintain maximal frequent itemsets in the current sliding window over data streams at any time. The contributions of the research lie in the following three aspects. First, the paper presents a novel pruning technique for mining maximal frequent itemsets in data steam, namely Subset Equivalence Pruning. Second, the paper proposes an one-pass algorithm for mining maximal frequent itemsets, called FPMFI-DS, in which Subset Equivalence Pruning is used to reduce the size of searching space, thereby to improve the efficiency of the algorithm. Lastly, based on FPMFI-DS algorithm, FPMFI-DS+ algorithm is proposed which can mine maximal frequent itemsets in sliding window over data streams in online updating fashion. Experiments show that for some dense datasets, the size of searching space can be reduced by about 40 percent by Subset Equivalence Pruning, FPMFI-DS achieves high performance and good scalability, and FPMFI-DS+ has high updating-mining speed and good stability.(ii) An algorithm for mining closed frequent itemsets in data stream is proposed. The number of closed frequent itemsets is less than that of frequent itemsets, but is more than that of maximal frequent itemsets. Closed frequent itemsets also contain the support of all frequent itemsets. So, mining closed frequent itemsets in data stream is efficient which ensures the perfectibility of information. The paper presents an algorithm for mining closed frequent itemsets, called FPMFI-DS, which can efficiently mine closed frequent itemsets over a stream sliding window with limited memory space, and maintain exact closed frequent itemsets in current window at any time. The experimental results show that the algorithm FPCFI-DS exhibits more tremendous potential than that of the state-of-the-art algorithm Moment in terms of time and space efficiency.(iii) An algorithm for mining Top-K most-frequent itemsets in data stream is proposed. One of the advantages of mining Top-K most-frequent itemsets is that users don’t need to specify a minimum support, instead of specifying an integer, k, which is the number of itemsets required. The existing algorithms for mining Top-K most-frequent itemsets have the problems of massive initial items and higher initial border support. To solve these problems, the paper presents an efficient mixed-searching-based algorithm for mining Top-K most-frequent itemsets, MTKFP. The MTKFP algorithm adopts both breadth-first searching and depth-first searching to mine Top-K most-frequent itemsets. Moreover, based on MTKFP algorithm, the paper presents a Chernoff-based algorithm for mining Top-K most-frequent itemsets in data stream, MTKFP-DS. The experimental results show that the number of initial items of MTKFP algorithm is 70 percent lower than that of the existing algorithms, but the initial border support of MTKFP algorithm is higher than that of the existing algorithms, consequently the performance of MTKFP algorithm is superior to that of the best existing algorithm by over one time; results also show that MTKFP-DS algorithm is suitable for mining data streams.(iv) A closed-frequent-itemsets based classification algorithm for classifying data stream is proposed. In contrast to some traditional classification algorithms, the algorithms based on association rules have higher classification precision. These algorithms generally generate classification association rules by frequent itemsets. As mining frequent itemsets often suffers from the problem of combination explosion, the efficiency of algorithm is low. Moreover, the emergence of data streams has posed new challenges to those classification algorithms. To solve these problems, the paper proposes an efficient closed-frequent-itemsets based classification algorithm, CBC-DS, for classifying data stream. In CBC-DS, an efficient one-pass algorithm for mining closed frequent itemsets and an effective method for constructing classifier are designed. The experimental results show that the average precision of CBC-DS is about 1.09 percent higher than that of CMAR algorithm, but CBC-DS is much faster than CMAR.(v) The Top-K-frequent-patterns based clustering algorithms for clustering high dimensional data stream are proposed. Clustering high dimensional data is a more difficult problem in the research domain of Clustering. The synthetical method with density-based clustering and grid-based clustering can be used to solve the problem effectively. Traditional method adopts the procedure of mining frequent itemsets to identify dense units. It has two deficiencies. First, it requires user to specify a minimum density threshold. Second, it identifies dense units with the same minimum density threshold for all subspace, so that the units in sparse subspace can’t be identified as dense units. To solve these problems, the paper presents three clustering algorithms for clustering high dimensional data stream. These algorithms base on Top-K most-frequent itemsets, N-most interesting itemsets and Top-K frequent items, respectively, and don’t require user to specify a minimum density threshold. The second algorithm is in favor of identifying the dense units in subspace group with specific dimension. The third algorithm is in favor of identifying the dense units in specific subspace. So, the problem of identifying the dense units in sparse subspace is solved. The experimental results show that these algorithms are suitable for clustering high dimensional data stream.(vi) The application of data stream mining technology in simulation system is researched. A simulation application framework based on data stream mining technology is proposed. Moreover, to simplify the process of integrating the data stream mining algorithms into the HLA-architecture-based simulation system, the paper designs universal data stream mining component and general data stream mining federate based on the idea of modularized development so as to improve the reusability of the algorithms. Lastly, the paper introduces the general federate for mining association rules with the example of Missile-Breakthrough simulation system.

节点文献中: 

本文链接的文献网络图示:

本文的引文网络