节点文献
不确定数据流中频繁数据挖掘研究
Study on Frequent Data Mining from Uncertain Data Streams
【作者】 汤克明;
【导师】 陈崚;
【作者基本信息】 南京航空航天大学 , 计算机应用技术, 2012, 博士
【摘要】 随着计算机技术与通信技术的快速发展,传感器网络、Web服务和RFID技术得到了广泛应用,从而使得不确定性数据管理得到广泛的重视.在许多现实的应用中,例如经济形势预测、金融信息分析、生态环境监测、网络安全监控、物流管理等等,不确定数据流扮演着关键角色.在这些应用中,传统的数据管理技术却无法有效地管理新型的不确定数据流,这就引发了学术界和工业界对研发新型的不确定数据流管理技术的兴趣.因此,不确定数据流上的数据挖掘已经成为当前数据挖掘领域的研究热点.当前对于不确定数据流上的挖掘主要集中在不确定数据流上的聚类、不确定数据流上的频繁模式挖掘、Skyline查询、数据世系分析、异常分析等.本文在深入研究国内外的各种不确定数据流挖掘技术的基础上,讨论了目前国内外有关不确定数据流频繁数据挖掘的研究现状.由于不确定数据流上的频繁数据挖掘是不确定数据流上的关联规则、分类、聚类等挖掘的基础,在不确定数据流挖掘中具有重要的地位.因此,本文在不确定数据流上频繁数据挖掘方面进行了深入的研究,提出了有效的频繁数据挖掘算法.本文的主要工作有:(1)提出了一种基于滑动窗口的不确定数据流中频繁项查询算法SWBUFIM.本文根据频繁项的本质特性以及马尔科夫不等式,给出了两个裁剪规则,用于对不确定数据流进行预处理,裁剪掉不可能成为频繁项的元组.在此基础上我们:一方面利用动态规划方法计算期望概率,保证在时间内完成期望概率的计算;另一方面,根据不同数据项相互独立性原理,针对不同数据项开辟子滑动窗口,并且根据数据项的组合数目进行行列划分来处理频繁项挖掘问题,并在动态规划方法的基础上,进一步改进期望概率计算方法,只需要动态规划滑动窗口中前k-1项即可保证在时间内有效地完成期望概率的计算.实验结果表明,所提出的查询算法SWBUFIM具有较快的处理速度,其空间复杂度随着处理数据规模的增加成线性增长.(2)提出了一种基于滑动窗口的不确定数据流中top-k查询算法MPTopKTS.本文针对top-k查询的定义,根据不确定数据流及其滑动窗口的特性,研究基于滑动窗口top-k查询问题,提出了所有可能世界中元组集成员相对得分值高并且具有最大出现概率的top-k元组集(MPTopKTS)的查询算法.该算法基于滑动窗口建立概要表,然后在每一时刻对概要表进行修改,有效地减少了top-k查询问题的复杂性;能够在查询准确性与查询开销之间取得平衡,较小的计算开销获得高质量的近似结果.实验结果表明,所提出的查询算法在时间与空间复杂性方面优于其他类似的算法.(3)提出一种基于滑动窗口的不确定数据流中频繁闭项集的采样挖掘算法MFCIFUDS.本文针对不确定数据流频繁闭项集的挖掘问题,首先使用采样的方法,基于随机采样概率,把由不确定数据组成的事务转换成由确定性数据组成的事务,再利用基于确定性数据模型的频繁闭项集挖掘技术完成不确定数据流中频繁闭项集的挖掘任务.本文不但从理论上证明了基于采样技术利用确定性数据挖掘算法解决不确定数据挖掘问题的可行性,而且提出了一种改进频繁模式树生成与修改技术,有效地提高了基于FP-tree频繁模式树的频繁闭项集挖掘速度.实验结果表明,所提出的查询算法MFCIFUDS有较高的挖掘精度和处理速度.(4)提出了一种基于滑动窗口的不确定数据流中频繁数量区间模式的挖掘算法MFIPatFUS.不同于处理常规二进制项集事务不确定数据流,数量区间事务不确定数据流使用数量区间来表示事务属性,其不确定性在于属性数量区间范围的波动性,数量区间分布体现某种分布概率.本文借鉴常规的基于频繁模式树的不确定数据流频繁模式挖掘算法,设计一种频繁数量区间模式生成树FIPatTree,用于捕获不确定数据流中所有事务的数量区间信息.我们把原始数量区间边界值作为基元素,根据基元素的分布情况建立基数量区间,从而一方面基于基数量区间对原始数量区间进行重新划分;另一方面根据基数量区间数值范围在原始数量区间中所占比例决定其基数量区间概率.算法MFIPatFUS采用滑动窗口模型,使用FIPatTree树作为概要数据结构,事务属性以基数量区间结点保存在FIPatTree树中.建立树的过程类似常规频繁模式生成树的建立过程,不同点在于当属性基数量区间与出现概率均相同时,结点方可共享.对于共享结点设立频次与局部概率统计数值,为了方便遍历与修改,增设了与FIPatTree树相关联的属性索引与基数量区间索引.基于频繁数量区间模式生成树FIPatTree的频繁数量区间模式挖掘过程采用基于投影基与条件树的递归挖掘方法.实验结果表明,所提出滑动窗口模型的挖掘算法MFIPatFUS对处理数量区间事务组成的不确定数据流频繁数量区间模式挖掘是有效的.
【Abstract】 With the rapid development of computer and communication technology and wide application ofwireless sensor networks, Web service and RFID, uncertain data management has gained a lot ofattention. Uncertain data management plays an important role in many practical scenarios, such aseconomy situation prediction, financial information analysis, ecological environment observing,network security monitoring, logistics management, etc. But the traditional data managementtechnology cannot handle such new type of uncertain data streams effectively. Therefore, designingnew management technology for uncertain data streams draws significant interest from industry andacademia. Thus, the data mining on the uncertain data streams has already become a research hotpot.The data mining on the uncertain data streams mainly focus on the clustering, frequent patternmining, skyline query, data lineage analysis, outlier analysis. Based on the thorough understanding ofrelated work, this paper discusses the state-of-the-art of the frequent data mining of the uncertain datastreams. As the basis of several mining tasks such as association rule, classification and clustering,frequent data mining had an important position in the uncertain data streams mining. Therefore, thispaper investigates the frequent data mining on the uncertain data streams deeply and proposeseffective algorithms for frequent data mining algorithms. The main contributions of the paper are asfollows:(1) We propose a sliding window-based algorithm SWBUFIM for frequent item query on theuncertain data streams. According to the characteristics of frequent item and Markov’s inequality, wegave two prunning rules to omit the items which cannot become frequent on the uncertain datastreams. We employ the dynamic programming method to compute expected probability intime. Since different data items are mutually independent, we present the model which will open subsliding window for different data item and handle the frequent item mining problem by the processingdivisions according to the number of combinations. In addition to using the dynamic programmingmethod, we improve the probability computing algorithm to efficiently compute the expectedprobability in time by only processing k-1rows in the sliding window.(2) We propose a sliding window-based top-k query algorithm MPTopKTS for the uncertain datastreams.According to the characteristics of uncertain data streams along with its sliding window, weinvestigate the top-k query problem and propose the query algorithm for the Top-k tuple sets with therelatively high score value and maximal probability of occurrence. To reduce the time complexity oftop-k query algorithm, this algorithm builds synopsis tables based on sliding window. We alsoadvanced a effective method to update the the synopsis tables in each time step. This algorithm canalso balance between the query accuracy and the time cost, namely, it can gain the high qualityapproximate result at the price of minimal computing overhead. The experiment results demonstratethat our query algorithm is more efficient than the previous work in time and space complexity.(3) We propose a sample mining algorithm MFCIFUDS for detecting the frequent closeditemsets in uncertain data streams based on sliding window.Based on the character of uncertain datastreams frequent closed itemsets, we first transfer the transactions comprised by uncertain data intothe one with certain data by random sampling, and then detect the frequent closed itemsets byemploying the mining technology for the certain data streams. We theoretically prove the feasibility of the algorithm based on the sample technology. In addition, we also propose an improved algorithm forconstructing and updating the frequent pattern tree to speed up the frequent closed itemsets mining.The experiment results demonstrate that our query algorithm, MFCIFUDS has high mining accuracyand proceeding speed.(4) We propose a frequent quantitative interval pattern mining algorithm MFIPatFUS foruncertain data streams based on sliding window. Instead of the regular binary item set transactionuncertain data streams, in the quantitative interval transaction uncertain data streams value of eachtransaction attribute is a quantitative interval. Here, the quantitative interval’s uncertainty is reflectedby the fluctuation of its range, and its distribution demonstrate some type of distribution probability.Utilizing the basic idea of uncertain data streams frequent pattern mining algorithm based on regularfrequent pattern spanning tree, we present an algorithm based on a frequent quantitative intervalpattern spanning tree FIPatTree which is used to capture the quantitative interval information in all ofthe uncertain data streams transactions. FIPatTree uses the boundary values of the initial interval asbase elements, and then establishes quantitative intervals according to the base element distribution.On the other hand, the probability of each base quantitative interval can be computed by theproportion of base numerical range in the original numerical range. Algorithm MFIPatFUS employsthe sliding window model and utilizes FIPatTree as the synoptic data structure. The transactionattributes are stored in FIPatTree where each node represents a base quantitative interval. Theconstruction of tree is similar to that of the regular frequent pattern spanning tree with the exceptionthat nodes can share only when base quantitative intervals and occurrence probabilities of theattributes are identical. We set frequency and partial probability statistics for the shared nodes. Wealso set indexes of the attributes and the base quantitative intervals on the FIPatTree so as to updateand traverse on the FIPatTree. The experiment results demonstrate that the sliding window modelbased mining algorithm MFIPatFUS can effectively detect the frequent quantitative interval patternmining for the uncertain data streams with quantitative interval transactions.