节点文献

高效用关联规则的挖掘

High-utility Association Rule Mining

【作者】 余光柱

【导师】 邵世煌;

【作者基本信息】 东华大学 , 控制理论与控制工程, 2008, 博士

【摘要】 关联规则的挖掘就是要发现大量数据中项集之间的关联或相关联系,它是数据挖掘研究的重要内容之一,在科学研究、电信网络、市场与风险管理、客户关系管理(CRM)、存货控制、军事等方面得到了广泛应用。但是,传统的关联规则以支持度衡量项集的重要性,会丢失一些支持度不高但效用很高、用户很可能感兴趣的规则。本文研究的高效用关联规则弥补了传统关联规则无法表达项集效用的不足,能反映用户偏好,更好地满足决策需求。本文主要研究高维大数据集中高效用关联规则的挖掘算法,弥补了现有的基于效用关联规则挖掘算法不能有效处理高维大数据集的不足。文中还结合效用与支持度的特点,提出了基于效用与支持度的关联规则挖掘问题及算法,可发现更多的用户感兴趣的规则。本文的主要研究有:(1)提出了一种新的在高维大数据集中挖掘高效用长项集的算法Inter-transaction。该算法基于行枚举,通过长事务的交集运算,直接得到长项集,不必从短项集逐步扩展得到长项集。在高维数据集中,长事务间共同项目很少,事务进行交集运算后变短的速度很快,因此这种行枚举方法具有很好的收敛性。Inter-transaction算法还把划分的方法引入到效用挖掘中,仅扫描数据库两次,能很好地适应高维大数据集环境。同时,由于采用了新的剪枝策略,避免了大量的候选集的生成、检验。(2)提出了一种双向搜索高效用项集的混杂算法。现有的基于效用的关联规则挖掘算法采用类似Apriori的搜索策略,需要多次扫描数据库。当模式很长且数据集很大时,I/O负担太重。本文提出了一种从上下两个方向搜索高效用项集的混杂算法。该算法把发现所有高效用项集的任务分解为发现高效用长项集和高效用短项集两个相对容易解决的子问题,然后再选择不同的算法完成挖掘任务,避免了从短项集逐步扩展到长项集的冗长过程。(3)提出了一种优化长事务交集运算的方法。我们提出的挖掘高效用长项集的算法同时以水平项目向量(Horizontal item-vector,简称HIV)和水平项目列表(Horizontal item-list,简称HIL)两种格式存储事务,并利用HIL格式数据提供的信息减少比特级逻辑“与”运算的次数,使逻辑“与”运算的次数等于HIL格式数据的长度,与比特向量(HIV格式)的长度无关。这种以空间换时间的方法解决了事务交集运算的性能随比特向量长度的增长而降低的问题,保证了在高维环境下的高性能。这种优化方法也可有效提高垂直挖掘算法挖掘频繁长模式的效率。(4)提出了基于效用与支持度的关联规则挖掘问题。支持度与效用分别反映了项集的统计特性与语义特性,但人们对事物的兴趣度(或事物对人们的重要性)不但取决于事物本身的客观因素(如项集的支持度),与人们的主观因素(如人们对效用的不同理解)也密不可分。为克服单个度量(支持度或效用)的不足,本文提出了一种衡量项集重要性的新的度量:激励。项集的激励定义为支持度与效用的乘积,反映了用户获得某种效用的可能性或以某种可能性可获得多大的效用。在基于效用与支持度的关联规则挖掘中,高激励项集的挖掘避免了那些支持度不高但效用较高、或效用不高但支持度较高的项集的丢失,能发现更多的用户感兴趣的规则。(5)论证了激励具有两个重要的数学性质:上界特性和事务权重激励向下封闭特性。根据这两个特性,设计了两种挖掘高效用频繁集的算法HM-Miner和HM-Two-Phase-Miner。两种算法都采用了类似Apriori的自下而上的搜索方式,适合于短模式数据集的挖掘。HM-Miner利用激励的上界特性剪枝,HM-Two-Phase-Miner则利用事务权重激励向下封闭特性剪枝。(6)给出了一个高效用关联规则挖掘的应用系统,并用于购物篮分析中。该系统能同时输出关联规则(项集)的支持度、效用与激励,以比较基于支持度的关联规则与高效用关联规则挖掘的区别与联系。实际挖掘结果表明,高效用关联规则的挖掘能发现一些基于支持度关联规则无法发现的有趣模式,帮助商家找出高效用商品组合,促进高利润商品的销售。经过数据的转换处理,该系统还可应用于其他领域。例如,在网页分析中,把网页被访问的次数与浏览时间作为评价网页受欢迎程度的尺度,将网页挖掘问题变成高效用项集的挖掘问题。

【Abstract】 The association rule mining(ARM) aims at extracting interesting correlations between items,and is one of the most important tasks of data mining.ARM has been widely used in many fields such as scientific research,telecommunication networks, market and risk management,customer relationship management,inventory control, military and so on.However,traditional ARM,which uses support to measure the significance of a rule,can not discover some rules that have a low support but a high utility,and these rules maybe very important to users.High-utility association rule mining(HUARM),which is the main content of this paper,can overcome the shortcomings of traditional ARM in reflecting the semantic importance of an itemset.It can reflect users’ preference,providing a better support to decision making.In this paper,we studied utility-based association rule mining in large high dimensional data, and our approach solved the problem that existing algorithms for utility mining can not handle large high dimensional datasets.By integrating the advantages of support and utility,we also proposed a new task of utility and support based association rule mining (USBARM),which can find more interesting rules.The main contributions of the paper include:(1) Proposed a novel algorithm,Inter-transaction,to mine high-utility long itemsets from large high dimensional data.The algorithm is based on row enumeration, which can find high-utility long itemsets directly by intersecting long transactions, without extending short itemsets step by step.In high dimensional data,there are few common items between long transactions,and the intersection of multiple long transactions usually leads to a very short itemset(intersection transaction),so this kind of row enumeration method has a good convergence.By adopting the partition method, the algorithm needs to scan the database only twice,and can work well in large high dimensional data.In the meantime,new pruning strategies are used,and large number of candidates can be filtered off. (2) Proposed a hybrid algorithm which search high utility itemsets from two directions.Existing algorithms for utility mining adopt an Apriori-like bottom/up search and need to scan the database multiple times.If the data is very large and contains many long patterns,the I/O burden will be too high.This paper proposed a hybrid algoritm which searches high utility itemsets both from the top and from the bottom.The hybrid algorithm firstly decomposes the mining task into two easy parts (mining short high utility itemsets and mining long high utility itemsets),and then choose different algorithms to finish the two subtasks separately,without the expensive process of extending short itemsets step by step to find long itemsets.(3) Proposed a method to optimize the intersection operation of long transactions. Our algorithm stores a transaction in both HIV format and HIL format,and uses the information contained in HIL format to reduce the number of bit-wise "AND" operation.In this way,the number of "AND" operation in HIV format is equal to the number of non-zero figures in HIL format,having nothing to do with the length of the data in HIV format.The optimization method can solve the problem that the performance of the intersection of long transactions decreases rapidly with the increase of the length of bit-vectors,and can guarantee a high performance in high dimensional data.The optimization method of time-space trade-offs can also improve the the efficiency of vertical algorithms for frequent itemsets mining.(4) Proposed the task of utility and support based association rule mining (USBARM).Support and utility can reflect the statistical importance and semantic importance of an itemset respectively.However,whether an itemset is interesting to a person not only depends on its objective factor such as the support of the itemset,but also on person’s subjective factor such as the preference of the user.In order to overcome the shortcomings of individual measure(support or utility),we proposed a new measure,motivation,to measure the importance of an itemset.Motivation is defined as the product of utility and support,and it shows the possibility of a user’s obtaining a certain utility or the utility a user can obtain in a certain possibility.In USBARM,all high motivation itemsets can be found,without missing itemsets whose motivatin is high but the support(or utility) is less than a user-defined threshold. USBARM can find more interesting patterns.(5) The paper also proved that motivation has two important properties:upper bound property and transaction-weighted downward closure property.Based on the properties,we designed two algorithms,i.e.,HM-Miner and HM-Two-Phase-Miner,to discover all high utility frequent itemsets which satisfy the support threshold,utility threshold and motivation threshold respectively.Two algorithms adopt Apriori-like bottom/top search,and are suitable for short pattern datasets.HM-Miner uses the upper bound property to cut down search space,whereas HM-Two-Phase-Miner uses transaction-weighted downward closure property to achieve the same goal.(6) Developed a real application system to mine high utility association rules.The application system can output the support,utility and motivation of every association rule(itemset) discovered so that users can compare the difference between support-based association rule mining and high utility association rule mining.We applied our application system to shopping basket data and the mining result show that high utility ARM can discover some interesting patterns missed by traditional ARM, and can help trade people find combinations of some high utility commodity, promoting the sales of high margin productions.By proper data preprocessing,the application system can also be used in other fields.For example,in webpage analysis, the frequency of visits of a page and the time spent on the page can be used to measure the popularity of the page.If we view the time as the utility of one page,we can convert the web mining task into high utility itemsets mining task.

  • 【网络出版投稿人】 东华大学
  • 【网络出版年期】2009年 10期
节点文献中: