节点文献

数据流频繁模式挖掘关键算法及其应用研究

Research on Key Algorithms for Mining Frequent Patterns in Data Streams and Their Application

【作者】 毛伊敏

【导师】 杨路明; 李宏;

【作者基本信息】 中南大学 , 计算机应用技术, 2011, 博士

【摘要】 随着计算机技术的高速发展和信息技术的广泛应用,数据流已在商务管理中的性能检测、网络流量管理中的异常检测及报警、零售业中的事务处理等领域中得到广泛的应用。数据流的分析和挖掘已成为一个热点研究问题。其中,数据流频繁模式的挖掘是数据流挖掘中最基本的问题之一,因此数据流频繁模式挖掘的研究更具有挑战意义。现行的基于数据挖掘技术的入侵检测系统不仅对新的攻击或特征未知的入侵无能为力,而且检测的实时性和准确性均达不到实际应用的需求。研究高效的、实时性强的数据流频繁模式挖掘算法并将其应用于入侵检测系统中,将会推动入侵检测走向实用,因此,基于数据流挖掘技术的入侵检测系统的研究在理论上与实际应用上都具有重要意义。针对现有最大频繁项集挖掘算法中存在压缩存储结构复杂、结点维护量大、时空消耗偏大等问题,本文提出了一种基于前缀模式树的最大频繁项集挖掘算法MMFI-DS.该算法设计的压缩存储结构——SEFI-tree结构简单,捕获数据流重要信息元素的能力强,结点维护开销小,采用自顶向下和自底向上双向搜索策略,可尽早剪掉较短非频繁项集的超集和较长最大频繁项集的子集,减少项目集支持度计算,降低了算法开销。针对现有闭频繁项集挖掘算法存在搜索空间大、产生中间结果多、时间效率不高等问题,提出了一种基于二进制位的闭频繁项集挖掘算法A-NewMoment。该算法采用二进制位图技术设计了BV-DFIlist数据结构来记录数据流信息及闭频繁项集,提出“不需扩展”与“向下扩展”搜索策略,快速挖掘出频繁1-项集所产生的支持度为最大、最长闭频繁项集外的其余闭频繁项集,避免生成大量中间结果,提高了算法的时间效率;研究“动态不频繁剪枝策略”从存储闭项集的HTC表中快速删除非闭频繁项集和“动态不搜索策略”维护所有发生变化的闭频繁项集,降低闭频繁项集的维护代价,提高算法的时间效率。针对现有的算法精确度不够高,结点维护代价高,时空效率低等问题,提出了挖掘Top-K闭频繁项集算法MTKCFI-SWo该算法设计了结构紧凑的前缀模式树CFP-tree来压缩存储数据流滑动窗口内的有效信息,结点存储信息量少,降低结点维护代价,提高算法的时间效率。CFP-tree无需固定滑动窗口尺寸大小,在任意滑动窗口尺寸大小的情况下及时捕获新增数据流信息。采用指针操作技术,不需遍历整个CFP-tree,从CFP-tree中删除大量不频繁项目,提高算法的时间效率。采用“动态确定”挖掘阈值与剪枝阈值,提高算法的精确度与时间效率。为提高现有的入侵检测系统实时在线挖掘效果和检测精度,论文提出一种基于数据流最大频繁项集的入侵检测系统模型MMFIID-DS。该系统设计各种剪枝策略,挖掘经过训练学习后的正常数据集、异常数据集和当前检测数据流的最大频繁项集,建立系统的正常行为模式、异常行为模式和用户行为模式,将误用检测和异常检测两种入侵检测方法有机地结合,实时在线检测入侵,达到提高检测精度和系统响应速度的目的。

【Abstract】 With the rapid development of the computer technology and wide application of information technology, data stream has been widely used in performance detection of business management, abnormal detection and alarming of network flow management as well as the transaction processing in retail industry. Analysis and mining of data stream have become a hot issue for research, of which frequent pattern mining of data streams is one of the most fundamental. Consequently, the research in this field is more challenging.Current intrusion detection system based on data mining technology can neither detect new and unknown attacks, nor meet the actual requirements of application in terms of real-time response and accuracy. Research of data stream frequent pattern mining algorithm with high efficiency and real-time response by applying it into IDS will put the intrusion detection into practice. Therefore, research of IDS based on data stream mining technology is significant both in theory and practical use.In consideration of the existing problems such as complexity in compression storage structure, mass maintenance of nodes and higher consumption of space and time, a typical algorithm for mining maximal frequent itemsets based on prefix pattern tree, called MMFI-DS, is then presented. The compression storage structure designed by this algorithm, SEFI-tree, is simple in structure and has a strong competence in capturing key information of data streams with less expenditure in maintenance of nodes. By adopting the top-down and bottom-up approaches, the shorter supersets of infrequent items and the longer subsets of maximal frequent itemsets can be promptly pruned to reduce calculation of itemsets support and expenditure of the algorithm.In consideration of the existing problems such as larger search space, redundancy in intermediate results and inefficiency in time consumption, a closed frequent itemsets mining algorithm called A-NewMoment based on binary bit is presented. By adopting bitmap technology on the basis of binary bit, the algorithm has designed BV-DFIlist data structure to record data stream information and closed frequent itemsets with presentation of "WSS" and "CSS" search strategies for rapid mining of closed frequent itemsets other than the longest closed frequent itemsets with maximum support produced by frequent1-itemset. Thus, time efficiency is greatly improved by avoiding massive intermediate results. Studying "DNFIPS" for rapid deletion of no closed frequent itemsets from the HTC in storage closed itemsets as well as the "DNSS" for maintenance of all changed closed frequent itemsets can reduce the maintenance cost of closed frequent itemsets and improve time efficiency of this algorithm.In consideration of the existing problems such as inaccuracy of the current algorithm, higher maintenance cost of nodes and inefficiency in time and space, a mining Top-K closed frequent itemsets algorithm MTKCFI-SW is presented. The algorithm has designed a compact prefix tree called CFP-tree to compress effective information in sliding windows for storage of data stream, which features in less information storage and lower maintenance cost for improvement of time efficiency. The CFP-tree, which is capable of promptly capturing newly added data stream information, does not need to fix the size of sliding windows. By adoption of pointer operations, large quantities of infrequent itemsets can be deleted from the CFP-tree to improve time efficiency of this algorithm without traversing the whole CFP-tree. Besides, the adoption of "DD" of mining threshold and pruning threshold also improves the accuracy and time efficiency of the algorithm.For the purpose of improving the real-time online mining effect and detection precision of current intrusion detection system, the dissertation has presented an intrusion detection system model MMFIID-DS based on maximal frequent itemsets over data stream. By designing different pruning strategies and mining of trained normal and abnormal data sets as well as maximal itemsets of the current detection data stream, the system has established normal behavior and abnormal behavior and as well as user behavior patterns, with organic combination of misuse detection and anomaly detection for the purpose of achieving online detection intrusion as well as improving detection precision and response speed of system.

  • 【网络出版投稿人】 中南大学
  • 【网络出版年期】2012年 12期
节点文献中: 

本文链接的文献网络图示:

本文的引文网络