节点文献

最大频繁项集挖掘算法及应用研究

Algorithms of Maximal Frequent Itemset Mining and Their Applications

【作者】 王卉

【导师】 李庆华;

【作者基本信息】 华中科技大学 , 计算机软件与理论, 2004, 博士

【摘要】 频繁项集的挖掘是数据挖掘中的一个基础和核心问题,具有广泛的应用领域。由于它是数据挖掘过程中最耗时的部分,挖掘算法的好坏直接影响数据挖掘尤其是关联挖掘的效率和应用范围。因此,最大频繁项集挖掘算法的研究具有重要的理论和应用价值。在对数据挖掘中的核心问题,即频繁项集的挖掘算法及其并行化技术,进行深入研究的基础上,围绕最大频繁项集的挖掘算法和应用,研究了高效的挖掘最大频繁项集的串行算法和并行算法,并将最大频繁项集挖掘算法应用于入侵检测。频繁项集的挖掘是一个搜索问题,剪枝优化技术是提高频繁项集挖掘效率的一个重要手段。文献中在频繁项集挖掘算法中用到的剪枝优化策略可归纳为:根部剪枝、频繁扩展和不扩展三种策略。在分析与研究传统剪枝策略的基础上,提出了新的剪枝策略——多步回退剪枝策略。多步回退剪枝策略在发现一个最大频繁项集后最多可一次回退k层(k为所发现的这个最大频繁项集的长度),最好情况下可将要扩展的节点数量从降低为。与文献中深度优先搜索中逐层回退策略相比,可大幅度削剪搜索空间,达到提高解决问题效率的目的。最大频繁项集的挖掘是频繁项集挖掘中的重要研究分支。在分析了现有最大频繁项集挖掘算法的基础上,针对其不足,提出了一个改进的挖掘最大频繁项集的算法MinMax(Mining Maximal)。MinMax采用了垂直的数据库表示形式,按照自顶向下深度优先的策略对项集空间进行搜索,采用了多步回退剪枝、根部剪枝、频繁扩展和不扩展等多种剪枝优化策略,大幅度削剪了搜索空间。提出了频繁项的不频繁度的概念,通过对频繁项进行适当的排序发挥了各种剪枝优化技术的优势。垂直的数据库表示形式使得项集的支持度计算可以通过简单的集合交集运算来完成,从而避免了对数据库的多次扫描。实验和分析表明,在长模式密集的情况下,MinMax的性能优于目前同类算法。并行处理是提高解决问题效率的有效办法,在研究了挖掘最大频繁项集挖掘的并行化策略地基础上,基于分布存储结构,将算法MinMax并行化,提出了挖掘最大频<WP=4>繁项集的并行算法P-MinMax(Parallel MinMax)。为了异步执行MinMax,减少处理机之间的制约和等待,P-MinMax基于前缀关系划分等价类,以等价类长度的指数函数为权值,并利用因子项集的完全包含关系在处理机之间贪心分配等价类,根据等价类的需要相应地划分和复制数据库记录,使各处理机得以异步计算,达到了较好的负载平衡、较高的剪枝效率和较少的数据库记录复制,减少了算法的执行时间。分析和实验表明, P-MinMax有较好的可扩展性,其性能优于已有同类算法。 从以数据为中心的观点来看,入侵检测问题实际上是一个数据分析问题。用以入侵检测的数据是主机的审计轨迹数据和网络的审计轨迹数据,这些审计数据中记录了系统和网络上发生的所有活动。基于此种思想,提出了一个基于最大频繁项集的入侵检测系统模型MMID(Mining Maximal for Intrusion Detection)。模型中,针对入侵检测的特点,设计了新的最大频繁项集的挖掘算法MinMax_for_IDS。通过挖掘训练数据中的最大频繁项集建立系统和用户的正常行为模型以及攻击模型,用一个滑动窗口来检测是否有不被正常行为模型覆盖的频繁模式发生,以此达到检测入侵的目的。实验表明,MMID对在短时间内频繁发生的攻击类型有较高的检测速度和精度。

【Abstract】 Progress in digital data acquisition, distribution, retrieval and storage technology has resulted in the growth of massive databases. One of the greatest challenges that organizations and individuals encountered is how to convert their collections of rapidly expanding data into accessible and actionable knowledge. The attempts to overcome these hurdles have gathered researchers from different areas, such as statistics, machine learning, and databases, which resulted in a new research field, so called Data Mining.Frequent itemset mining plays a crucial role in many data mining applications. It occurs in the discovery of association rules, strong rules, correlations, multidimensional patterns, and many other important discovery tasks. Frequent itemset mining dominates the time complexity of the discovery algorithms. Most existing work focuses on mining all frequent itemsets. However, as any subset of a frequent itemset is also frequent, the size of the set of all the maximal frequent itemsets is much smaller than that of the entire frequent itemsets. It has been observed that it suffices to mine only the set of maximal frequent itemsets instead of every frequent itemsets. There has been some recent research work that applied data mining techniques to an intrusion detection system for discovering new types of attacks. An in-depth analysis of data mining methods used in an intrusion detection system for discovering new types of attacks is presented and some open problems are addressed as well. This dissertation work focuses on maximal frequent itemset mining and its application. A number of new algorithms for mining maximal frequent itemsets with novel pruning strategies and parallelization solutions are presented for more efficiently , and a system model for intrusion detection based maximal frequent itemset mining is proposed for improving the accuracy and performance of an intrusion detection system. This thesis research makes contributions to both data mining and intrusion detection fields.Frequent itemset mining is a time-consuming task. Pruning strategies are widely used to improve the efficiency of searching. However, most of them still explore unnecessary redundant nodes. The classical pruning strategies are thoroughly studied and categorized into three main classes of root pruning, frequent extending and non-extending. Based on the in-depth analysis, a novel and powerful pruning strategy, called multi-level backtrack strategy, is presented in this dissertation. Compared with all the previous pruning strategies, which backtrack step by step, the multi-level backtrack strategy can backtrack best up to k levels when a k-length maximal frequent itemset has been found. The number of nodes need to be extended can be best reduced from to .Maximal frequent itemset mining is an important branch. However, it has not received much attention so far. A novel algorithm for mining maximal frequent itemsets, called MinMax (Mining Maximal), is presented in this dissertation. MinMax employs a vertical database layout scheme. Along with depth first search strategy, it uses a number of optimization techniques to prune the search space such as the multi-level backtrack pruning, root pruning, frequent extending and non-extending pruning, thus cutting down the search space efficiently. By introducing a new concept of no-support of a frequent item, the frequent items are ordered in the way that they lead to a maximal frequent itemset pruning as much as possible. <WP=6>The vertical database layout scheme makes it available to count the support of itemsets simply by set intersection operations instead of scanning database repeatedly. The experiment results show the effective performance, which are better than previous work especially in database with intensive long patterns.Parallel techniques are widely used to improve the efficiency of mining algorithms. A novel and powerful parallel algorithm, P-MinMax (Parallel MinMax), is proposed in this dissertation. Based on its serial version MinMax, the new algorithm decom

节点文献中: 

本文链接的文献网络图示:

本文的引文网络