节点文献

基于位表的关联规则挖掘及关联分类研究

Research on Association Rules Mining and Associative Classification Based on Bit Table

【作者】 董杰

【导师】 韩敏;

【作者基本信息】 大连理工大学 , 控制理论与控制工程, 2009, 博士

【摘要】 随着人们利用信息技术生产和搜集数据的能力大幅度提高,数据资料的规模急速膨胀。如何快速有效的从海量数据中发现隐藏的、预先未知的信息和知识显得尤为重要,数据挖掘是解决这一问题的有力工具。关联规则(Association Rules)获取是数据挖掘研究的一个重要领域,从某种意义上来讲,关联规则挖掘就是数据挖掘的本质。近年来相关的研究与应用一直占有重要的比例并得到了迅速发展。研究如何快速有效的从海量数据库中挖掘蕴含其中的关联规则,并将挖掘得到的关联规则合理利用,具有十分重要的理论和实际意义。本文在分析现有挖掘算法中存在问题的基础上,提出基于位表的完全频繁项集和事务间频繁闭项集的挖掘算法,并进一步研究关联规则在分类问题中的应用,利用其解决遥感影像分类问题。本文的研究工作可概括为如下三个方面的内容:1、研究事务内完全频繁项集的快速挖掘算法。现有的完全频繁项集挖掘算法多基于Apriori算法,称为Apriori类算法。其在生成候选集须逐个比较两个项集的前n-1项,并且在计算支持度需对全部或部分数据库进行逐条扫描,占用大量的计算时间和I/O操作,成为此类算法的主要瓶颈。针对以上问题,本文首先提出了位表(BitTable)数据结构及其相应的二进制操作。利用其对事务数据库进行压缩,同时通过二进制“与”、“或”操作快速计算候选项集的支持度,改善了低效率的数据库扫描操作;利用其对候选项集和频繁项集进行横向压缩,可直接生成候选项集,避免了逐项比较的复杂操作。该数据结构及其操作可以直接应用于现有的Apriori类算法中,有效地改善此类算法的效率问题。在位表数据结构的基础上,本文进一步提出了基于位表的关联规则挖掘算法BitTableFI。对常用数据集的仿真试验表明了该算法的有效性。2、研究事务间频繁闭项集及其快速挖掘算法。相对于事务内频繁项集,事务间频繁项集能够有效的揭示各属性在不同时刻的关联性,是事务内频繁项集的扩展。然而事务间频繁项集的数量随滑动时间窗口的增大而迅速增加,造成挖掘效率降低,利用闭项集来表示事物间频繁项集能够在不丢失信息的情况下有效的减少项集的数量。本文通过分析事务内频繁闭项集和事务间频繁闭项的内在关系,提出了一种利用事务内频繁闭项集生成事务间频繁闭项集的算法。算法采用分割和条件数据库技术,有效的避免了生成庞大的扩展事务数据库,利用扩展的位表结构压缩事务从而提高支持度的计算效率。此外,采用动态排序和哈希技术极大地减少了频繁闭项集的测试次数。该算法为挖掘事物间频繁闭项集提供了一种有效而快速的算法。3、研究模糊关联分类算法,并利用其解决遥感影像分类问题。关联分类将挖掘获取的频繁项集应用于解决分类问题,将关联规则的挖掘和应用问题紧密结合。将模糊方法引入到关联分类问题中,能够较好的解决规则的“尖锐边界“问题。然而,现有的模糊关联分类算法多采用固定模糊隶属度函数对连续型属性进行模糊划分,没有考虑数据本身的特性。基于此,本章提出一种基于自适应区间划分的模糊关联分类算法—FARC(Fuzzy association rules classification),利用模糊c均值聚类算法根据数据本身的特点自适应地建立模糊区间,并在挖掘模糊关联规则时采用了新的剪枝策略,极大地减少了候选集的数量。新的规则权重度量方法能够更好的利用多模糊关联规则进行分类。对UC Irvine Machine Learning Repository测试数据的实验表明,FARC不仅是具有高精度的分类精度,同时具有对训练样本数量的不敏感性,在训练样本减少的情况下仍能保持较好的分类精度,是一种有效的分类方法。同时,本文将模糊关联分类算法引入遥感图像分类问题的研究中,在实际遥感分类问题中,训练样本往往较难获取,训练样本的不足会导致分类精度的下降,本文提出的FARC算法能够较好的适应训练样本较低情况下的分类问题,从而能够很好的应用于实际遥感分类问题。

【Abstract】 With the increase of human ability of using information technology to produce and collect data,the scale of data inflates rapidly.It is very important to discover the hidden and unknown knowledge in the databases.Data mining is a powerful tool to solve these problems. Association rules mining is an important filed of data mining.In a sense,association rules mining is the essence of data mining.The research and application of it occupy an important proportion of data mining research and have been developed rapidly.The research on how to mine the association rules from the massive databases efficiently and use them reasonably is of great theoretical and practical significance.Based on the analysis of current mining algorithms,an inter-transaction frequent itemsets mining algorithm and an intra-transaction frequent itemsets mining algorithm are proprosed to solve the problem of remote sensing image classification.In this dissertation,the research work can be summarized as the following three aspects:1.The research on the fast mining algorithm of complete frequent itemsets.Most of modern complete frequent itemsets mining algorithms are based on the Apriori algorithm, called Apriori-like algorithms.When generating candidate itemsets,they need to check if any two itemsets have the same n-1 items and when counting the support,the whole or part of the databases needs to be scanned one by one,which wastes a lot of CPU time and I/O operations. The two problems are the main bottlenecks of the Apriori-like algorithms.According to the two problems,the dissertation proposes a special data structure named BitTable and its bitwise operation.BitTable is adopted to compress databases and generate candidate itemsets quickly by the bitwise And/Or operation to avoid scaning databases.It also horizontally compresses the candidate itemsets and frequent itemsets,and generates candidate itemsets directly to avoid the operation of comparing each item.This data structure can be applied in Apriori-like algorithms directly and improve their performance effectively.Moreover,an association rules mining algorithm named BitTableFI is proposed based on BitTable.The experiment results demonstrate the effectiveness of the BitTableFI algorithm.2.The research on inter-transaction frequent closed itemsets and its fast mining algorithm. Compared with intra-transaction frequent itemsets,the inter-transaction frequent itemsets can effectively reveal the relevance of various attributes at different moments,and are the expansion of intra-transaction frequent itemsets.However,the amount of inter-transaction frequent itemsets increases rapidly with the increase of sliding time window,which will reduce the efficiency of the mining algorithm.It can effectively reduce the amount of itemsets without loss of information to utilize closed itemsets to represent inter-transaction frequent itemsets.This dissertation proposes an inter-transaction frequent closed itemsets mining algorithm,by analyzing the internal relation between the inter-transaction and the intra-transaction frequent itemsets.The proposed algorithm adopts division and condition database technology to avoid the generation of huge extended database,utilizes the extended BitTable to compress the transaction and improves the counting efficiency of the support. Dynamic ordering and hash table decrease the testing times of the candidate closed inter-transacation itemsets.Simulations show that the algorithm is a fast and efficient inter-transaction frequent closed itemsets mining algorithm.3.The research on fuzzy associative classification and its application on remote sensing image classification.Associative classification utilizes association rules to solve the classification problem.Fuzzy concept is introduced to associative classification,which can avoid the problem of "sharp boundary".However,most of fuzzy associative classification algorithms adopt the fixed membership function to generate fuzzy sets,without considering the intrinsic characteristic of data.To address this issue,the dissertation proposes a fuzzy associative classification algorithm FARC based on the adaptive interval partition.According to the intrinsic characteristic of data,FARC employs fuzzy c-means to partition continuous attributes,adopts new jointing and pruning technique to avoid generating unuseful candidate itemsets and introduces a weighted parameter to score the fuzzy association rules.The experiments on UCI datasets show that the method proposed in this dissertation not only has a higher classification accuracy,but also is insensitive to the variation of amount of the training data set.In this dissertation,the fuzzy associative classification is introduced to the research on remote sensing image classification.However,in the actual remote sensing applications, training data is hard to obtain,which affects the classification accuracy of traditional classifiers greatly.The proposed algorithm FARC can effectively overcome the problem of lacking training data set in the actual remote sensing classification and get high classification accuracy.

节点文献中: 

本文链接的文献网络图示:

本文的引文网络