Research on the Application of a Frequent Sub-tree Mining Algorithm in Web-log Mining
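【Author】 Liu Zhencheng (刘振诚);
【Supervisor】 Xu Liping (徐丽萍);
【Author information】 Huazhong University of Science and Technology (华中科技大学), Computer Software and Theory, 2007, Master's degree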
【Abstract】 With the rapid development of the Internet, and especially the widespread use of Web sites, the Web has become the richest and densest source of information in the world. Increasingly mature data mining technology provides the technical basis for mining Web data. As the extension of data mining to the analysis and processing of Web data, Web mining has naturally become one of the more active research topics in the data mining field.
Web mining comprises Web content mining, Web structure mining, and Web usage mining, which respectively mine the content of a site's pages, the structure within and between pages, and information about how users use the site. Frequent pattern mining is one of the core tasks of data mining, and researchers have studied frequent item and sequential pattern mining in considerable depth. However, emerging fields such as bioinformatics, digital libraries, and e-commerce now require mining frequent substructures from complex structured data. In particular, mining frequent sub-trees from a database of ordered labeled trees can provide important knowledge for Web-log mining applications such as Web user behavior pattern analysis, Web user classification, and clustering.
Mining frequent sub-trees from a labeled tree database is an important research direction in frequent sub-tree mining. Previous studies show that sequential pattern mining algorithms based on the pattern growth method are highly efficient on large databases. The Scalable Frequent sub-Tree Mining algorithm (SFTM) applies the pattern growth method to frequent sub-tree mining in ordered labeled tree databases and, on that basis, improves the pruning of the search space tree. SFTM is applied to Web-log data through Webloger, a Web-log mining tool designed and implemented with a frequent sub-tree mining algorithm at its core.
Within the framework provided by Webloger, SFTM is compared experimentally with conventional algorithms on both synthetic and real datasets. The results show that SFTM is effective and efficient and that its search space converges better than that of the conventional algorithms; on Web-log data in particular, it has a certain advantage over traditional algorithms.
【Key words】 Data mining; frequent pattern; frequent sub-tree; sequence database; labeled ordered tree;
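To make the notion of a frequent sub-tree in an ordered labeled tree database more concrete, the following is a minimal Python sketch of support counting. It is not the SFTM algorithm, whose pattern growth and pruning steps the abstract does not detail. The tree encoding, the function names, the choice of induced (parent-child) matching, and the sample browsing sessions are all assumptions made for illustration.

```python
# An ordered labeled tree is encoded as (label, [children]).
# This encoding and all names below are illustrative assumptions.

def matches_at(pattern, node):
    """True if `pattern` occurs as an induced ordered sub-tree rooted at `node`."""
    p_label, p_children = pattern
    n_label, n_children = node
    if p_label != n_label:
        return False
    # Map the pattern's children, left to right, onto a subsequence of the
    # node's children; taking the leftmost match each time is sufficient.
    i = 0
    for p_child in p_children:
        while i < len(n_children) and not matches_at(p_child, n_children[i]):
            i += 1
        if i == len(n_children):
            return False
        i += 1
    return True

def occurs_in(pattern, tree):
    """True if `pattern` matches at the root of `tree` or at any descendant."""
    if matches_at(pattern, tree):
        return True
    return any(occurs_in(pattern, child) for child in tree[1])

def support(pattern, database):
    """Number of trees in `database` that contain `pattern` at least once."""
    return sum(1 for tree in database if occurs_in(pattern, tree))

def is_frequent(pattern, database, min_support):
    """A pattern is frequent if its support reaches the chosen threshold."""
    return support(pattern, database) >= min_support

# Hypothetical user sessions from a Web log, each rebuilt as a navigation tree.
sessions = [
    ("home", [("products", [("item", [])]), ("cart", [])]),
    ("home", [("products", [("item", []), ("review", [])])]),
    ("home", [("cart", []), ("products", [])]),
]

browse = ("home", [("products", [("item", [])])])
products_then_cart = ("home", [("products", []), ("cart", [])])

print(support(browse, sessions))              # 2
print(support(products_then_cart, sessions))  # 1
```

The second pattern is found only in the first session because sibling order matters: the third session visits `cart` before `products`, so it does not count. A mining algorithm would generate candidate patterns and prune the search space instead of testing hand-written patterns as done here.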