节点文献
序列模式挖掘的并行算法研究
Research on Parallel Algorithm for Sequential Pattern Mining
【作者】 王宇;
【导师】 周丽娟;
【作者基本信息】 哈尔滨理工大学 , 计算机应用技术, 2007, 硕士
【摘要】 序列模式挖掘是指从序列数据库中发现相对于时间或者其它顺序的高频率序列。其最初动机是想通过在带有交易时间属性的交易数据库中发现频繁项目序列以发现一个时间段内的客户购买行为规律。近年来序列模式挖掘已经成为数据挖掘的一个重要方向,其应用范围已不局限于交易数据库,在DNA分析等尖端科学研究领域、Web访问等新型应用数据源等众多方面都得到了针对性研究。文中结合R.Agrawal和R.Srikant提出的序列模式挖掘的有关定义和R.Wille提出的概念格理论,提出了频繁概念的定义;根据序列模式并行挖掘的需要提出了搜索空间划分理论,其中包括搜索空间、子搜索空间、等价子搜索空间和最大子搜索空间的定义。序列模式挖掘的数据有如下特点:数据量大,数据分布存储。已有的大部分序列模式挖掘算法没有综合考虑到数据的上述特点。本文针对序列模式挖掘数据的这些特点,结合并行理论,提出了一个分布式并行算法SPP(Sequential Pattern Parallel)。本算法遵循模式缩减的原则,利用分治策略实现并行操作,在每台处理器上运用搜索空间划分理论和频繁概念构造频繁项集,运用图深度优先搜索方法构造频繁序列。算法只需扫描数据库两遍,不需要生成候选序列,大大减少了数据库访问时间,提高了序列模式挖掘的效率。不过本文采用是静态负载平衡,还有待改进。基于自己设计的随机数据生成程序和不同的消息结构,本文在具体的并行环境中模拟了算法SPP,并在单机中实现了AprioriAll算法。实验证明,相比于AprioriAll,SPP算法具有良好的加速比和效率。
【Abstract】 Sequential pattern mining is the mining of frequent seqences related to time or other orders from the sequence database. Its initial motivation is to discover the laws of customer purchasing in a time section by finding the frequent sequences. In recent years, sequential pattern mining has become an important direction of data mining, and its application field has not been confined to the transanction database and has extended to new data sources such as Web and advanced science fields such as DNA analysis.Combining with the relevant definitions put forward by R.Agrawal and R.Srikant and Concept Lattice Theory brought forward by R.Wille, this paper proposes the term of frequent concept. To fulfill the needs of sequential pattern parallel mining, there proposes search space partition theory which includes the definitions of search space, sub search space, equivalence sub search space and maximum search space.The data of sequential pattern mining has characteristics as follows: mass data amount and distributed storage. Most existing sequential pattern mining algorithms havn’t considered the above-mentioned characteristics synthetically. Contraposing the traits mentioned above and combining the paralle theory, this paper put forward a new distributed parallel algorithm SPP(Sequential Pattern Parallel). The algorithm abides by the principal of pattern reduction and utilizes the divide-and-conquer strategy for parallelization. The first parallel task is to construct frequent itemsets applying frequent concept and search space partition theory and the second task is to structure frequent sequences using the depth-first search method at each processor. The algorithm only needs to access the database twice and doesn’t generate the candidated sequences, which abates the access time and improves the minging efficiency. But this algorithm adopts the static loading balancing, so it is also needed to improve.Based on the random data generation procedure and different information structure designed, this paper simulated the SPP algorithm in a concrete parallel environment and implemented the AprioriAll on a computer. The experiments demonstrated that compared with AprioriAll, the SPP algorithm had excellent speedup factor and efficiency.
- 【网络出版投稿人】 哈尔滨理工大学 【网络出版年期】2008年 01期
- 【分类号】TP301.6
- 【被引频次】1
- 【下载频次】208