节点文献

基于树结构的精简序列模式挖掘算法研究

Research on Condensed Sequential Pattern Mining Based on Tree Structure

【作者】 解玉洁

【导师】 任家东;

【作者基本信息】 燕山大学 , 计算机软件与理论, 2010, 硕士

【摘要】 现有的序列模式挖掘算法能有效地在大型数据库中挖掘出完整的序列模式集,然而在很多实际应用中,用户更希望找出感兴趣的、更简洁的模式,而不是所有的模式。本文主要研究了如何挖掘精简序列模式,如何有效的增量挖掘精简序列模式,以及如何精确的挖掘重复间隔精简序列模式等问题,这些问题的研究在顾客购物分析,交易分析,Web页面的访问模式预测,DNA序列分析,软件行为模式分析中具有重要的意义。本文首先设计了一种基于改进前缀树的最大序列模式挖掘算法CSMS,算法利用纵向、横向结合搜索位置信息表的序列扩展匹配方法找到潜在最大序列模式,同时,把每个找到的潜在最大序列模式存储在改进的前缀树PStree中,最后通过对PStree进行剪枝,得到由最大序列模式组成的前缀树MPStree。该算法具有较好的时间效率和扩展性。其次,提出了一种基于重复链接WAP-Tree结构的闭合重复间隔序列模式挖掘算法MRCGP,算法首先为频繁1项集构建一个位置信息表,然后通过搜索位置信息表找到所有由不同项组成的2序列模式,最后构建一个重复链接WAP-Tree维护所有的频繁项集,通过逐步挖掘已存在模式的投影树,得到所有的闭合重复间隔序列模式集。该算法的性能优于CloGSgrow。最后,设计了一个面向软件漏洞特征提取的闭合序列模式挖掘算法MSPT和更新算法UMSPT。算法MSPT首先搜索半频繁和频繁2模式,然后为半频繁和频繁项构建一个漏洞序列树,利用投影技术,逐步找到半闭合和闭合序列模式。算法UMSPT插入新的序列到漏洞序列树,搜索树中新插入的分支找到新序列中的闭合和半闭合模式。最后通过检查已存在模式的包含关系以及支持度信息得到更新数据库中的所有半闭合和闭合序列模式集。本文使用现实数据集进行挖掘,通过实验对本文所提出的CSMS算法、MRCGP算法、MSPT算法以及UMSPT算法进行验证。

【Abstract】 The existing sequential pattern mining algorithm can efficiently mine the complete set of sequential pattern in a large database. However,in many application, the user may want to find the more succinct patterns, rather than all of the patterns. The paper mainly study on how to mine compress sequential patterns,how to mine compress sequential patterns incrementally, and how to mine compress repetitive gapper sequtnial pattern, the study of those patterns have an import significance in analysis of customer purchasing lists, analysis of web log access, analysis of DNA sequences, credit card usage histories traces, program execution traces, and the analysis of any other sequences corrlete to the time.Firstly, an efficiently maximal sequential patterns mining algorithm CSMS is presented. This algorithm makes use of the method of matching sequences extension based on a positional table, at the same time, we build a PStree to maintain the candidate sequential patterns, we obatian the final maximal sequential patterns through pruning for the PStree. The algorithm CSMS has better time efficiency and scalability.Secondly, a closed repetitive gapped sequential pattern mining algorithm based on repetitive linked WAP-Tree is proposed. The algorithm build a postitional table for all the frequent items, then searching the postitional table for all the 2 sequential patterns which are composed of the different items, algorithm also construct a repetitive linked WAP-Tree, through mining the project tree of the existing patterns in RLWAP-Tree gradually, we can get all of the closed repetitive gapped sequential patterns.At the end, a close sequential pattern mining algorithm for discovery the feature of software bugs MSPT and an incremental algorithm UMSPT is proposed. Algorithm MSPT searches for the semi-frequent and frequent 2 patterns based on the positional information of the items, than an BStree is constructed to maintain the semi-frequent an frequent items, through the technology of projection, we can find the semi-closed and closed sequential patterns. Algortithm UMSPT inserts the new bugs sequences to the BStree, through searching the new branch to find the new semi-closed and closed sequential patterns. At last, all of the final semi-closed and closed sequential patterns can be obtained by checking the inclusion relation of the existing patterns. Algorithm MSPT and UMSPT has better time efficiency.

  • 【网络出版投稿人】 燕山大学
  • 【网络出版年期】2012年 02期
节点文献中: 

本文链接的文献网络图示:

本文的引文网络