节点文献

最大频繁项集挖掘算法的研究

Research on Mining Algorithms of Maximal Frequent Item Sets

【作者】 颜跃进

【导师】 陈火旺; 李舟军;

【作者基本信息】 国防科学技术大学 , 计算机科学与技术, 2005, 博士

【摘要】 随着信息技术尤其是网络技术的快速发展,人们收集、存储和传输数据的能力不断提高,导致数据出现了爆炸性增长。与此形成鲜明对比的是,对人们决策有价值的知识却非常匮乏。知识发现与数据挖掘正是在这一背景下诞生的一门新科学。 关联规则是数据挖掘当前研究的主要模式之一,它用于确定数据集中不同域或属性之间的联系,找出有价值的多个域之间的依赖关系。频繁项集挖掘是生成关联规则的关键步骤,其效率问题是关联规则挖掘中的一大难点和热点。频繁项集挖掘可分为完全频繁项集挖掘、频繁闭项集挖掘和最大频繁项集挖掘三类。论文基于数据集和最大频繁项集的不同表示结构,从剪枝策略、尾项集的项排序策略和超集存在判断方法等角度对最大频繁项集的挖掘问题进行了深入的分析和研究。 位图是—种有效的数据集和项集的表示结构。论文基于位图提出了深度优先挖掘算法DFMfi。算法DFMfi充分利用位图的字节特性,优化了项集的匹配和合并操作,并首次在其中引入了基于局部最大频繁项集的超集存在判断方法。论文证明了算法DFMfi的正确性,并通过实验说明其在运行时间上少于同类算法。 近几年来,数据集的另—种压缩表示结构—FP-Tree结构越来越受到研究者们的青睐,论文第二部分研究基于FP-Tree结构的最大频繁项集挖掘问题,其中使用FP-Tree表示数据集及其投影,并利用MFI-Tree保存已有最大频繁项集。分析和实验说明已有算法中的超集存在判断为耗时操作,针对这种情况,论文在单棵MFI-Tree表示下基于最大频繁项集投影提出一种新的超集存在判断方法,并证明了多棵MFI-Tree表示下存在一种简单的超集存在判断方法,二者均可有效降低超集存在判断的时间开销。相应于两种超集存在判断方法,论文分别提出了算法FPMFI和FIMFI。在算法FIMFI里,论文分析了尾项集的项排序策略对压缩搜索空间的影响,提出了一种高效的、基于FP-Tree和MFI-Tree信息的尾项集项排序策略。通过使用新的前瞻剪枝方法,算法FIMFI拓展了前瞻剪枝的范围,加大了前瞻剪枝成功的可能性,尽可能地压缩了搜索空间。此外,FPMFI算法中的非冗余子树结构是寻求高效数据集压缩结构的一次尝试。实验表明,在稠密数据集上,这两个算法相对于同类算法均具有一定的优越性。其中FIMFI算法比同类算法中性能最优的FPMax~*算法平均快30%-40%。 论文最后提出一种能同时压缩表示数据集和最大频繁项集的新的数据结构—CFP-Tree,基于CFP-Tree结构定义了最大化子集,并提出了CfpMfi算法。通过其与FPMax~*

【Abstract】 With the development of information technology, especially the emerging of the network technology, our abilities to collect, store and transfer data have been improved dramatically. Comparing to the explosive growth of data, the needs for decision-relevant knowledge are not satisfied yet Knowledge discovery and data mining technology is an important approach to address this problem.As one of the main patterns in the field of data mining, association rules are used to determine the relationships among the attributes or objects, to find out valuable dependencies among the fields. The efficiency of mining frequent item sets is the key problem in association rules generating. Frequent item sets can be divided into three types: complete, closed and maximal. This dissertation focuses on the mining of maximal frequent item sets. The prune strategy, the ordering policy of tail item set and superset checking are studied thoroughly based on representation of datasets and the sets of maximal frequent item sets.Bitmap is an efficient representation structure of datasets and the sets of maximal frequent item sets. A depth-first mining algorithm-DFMfi which is based on bitmap is proposed in the dissertation. By utilizing the byte characteristic, DFMfi can optimize the mapping and unifying operations on the item sets. Moreover, for the first time a method based on bitmap which uses local maximal frequent item sets for fast superset checking is employed. DFMfi’s correctness is proved. And experimental comparison with previous works indicates that DFMfi can obviously accelerate the generation of maximal frequent item sets.After near 10 years research and development, researchers begin to pay more attention to another compress representation structure of datasets, namely FP-Tree. The second part of the dissertation thus concentrates on the mining of maximal frequent item sets based on FP-Tree. Analysis and experimental results show that the superset checking operation is time-consuming and is frequently used in the mining process. Based on the observations, a new superset checking method based on projections of maximal frequent item sets is presented for the single MFI-Tree. And a simple method for fast superset checking is proposed for multiple MFI-Trees. Two algorithms, FPMFI and FIMFI, which are based on single and multiple MFI-Trees, are proposed respectively. In FIMFI, the item ordering policy of tail item sets is discussed and a new efficient ordering policy based on the

节点文献中: 

本文链接的文献网络图示:

本文的引文网络