节点文献

基于频繁模式树的XML数据挖掘

XML Data Mining Based on Frequent Pattern Tree

【作者】 吴雅双

【导师】 张东站;

【作者基本信息】 厦门大学 , 计算机软件与理论, 2009, 硕士

【摘要】 数据挖掘是指从大量的、不完全的、有噪声的、模糊的数据中提取出隐含在其中的、人们事先不知道的但又潜在有用的知识的半自动化的方法,它是解决“数据丰富、信息贫乏”的有效方法。XML是由SGML发展而来的一种简单、灵活的文本格式。它已经成为Internet上数据描述和交换的标准,越来越多的数据以XML文档进行存储,在这些数据中隐含着大量的知识信息与各类模式,因此,人们迫切需要一些有效的方法来从中提取出一些潜在的、有价值的知识,这就是XML挖掘。但是,作为一种树形的半结构化数据,XML非常复杂且具有异构性,它不能轻易地被映射到关系模型,这样,传统的面向关系型数据的挖掘方法如Apriori算法等,并不能直接应用到XML挖掘上。因此,研究一种有效的针对XML的数据挖掘方法成为数据挖掘领域和XML技术领域的一项重要课题。本文首先介绍了传统的数据挖掘基本理论、XML的基本理论、XML的特点以及XML有关技术规范。其次介绍了频繁子树挖掘的相关概念和现有的一些频繁子树挖掘算法。接着在分析了现有频繁模式树挖掘算法FREQT和Freqttree的基础上,提出了一种新的频繁模式树挖掘算法—PDOM算法。PDOM算法采用最右路径扩展的思想,然后利用递推式的候选节点集更新技术来压缩候选节点集,使产生的候选模式数量大大减少,并且在计算候选模式树的支持数时,采用增量式技术,提高算法效率。通过定理证明了PDOM算法的正确性,并对其进行了实验分析。最后,考虑到XML的树形结构,提出了基于频繁模式树的XML文档分类算法—BFPC算法。BFPC算法基于XML内容和XML结构两方面。它首先利用tf*idf权值法提取XML文件中非结构的信息即XML内容的特征代表,接着利用PDOM算法提取各个类别的频繁模式树,作为该类别的结构特征,并赋予每个模式树一定的权值。同时,本文还提出了一种模式树匹配算法—PMatch,通过最右匹配集来实现模式树的匹配。最后测试阶段,利用PMatch算法以及关键字匹配,计算测试文档的得分,判断该文档所属的类别。通过实验证明,BFPC算法有较高的查准率。

【Abstract】 Data mining is defined as a non-trivial process of extracting valid, novel, potentially useful, and ultimately understandable patterns from a large number of incomplete, noisy and ambiguous data. It is an efficient method of resolving the problem of "data rich-information poor".XML is a simple, very flexible text format derived from SGML. XML has become the standards for data representation and exchange over the Internet. More and more datas are stored in XML format, and a lot of information and various of patterns are hidden in the datas. Hence, there have been increasing demands of efficient methods that extract potential and valuable d information from XML data, namely XML data mining.However, as a kind of semi-structured data, XML data are a huge amount of complex and heterogeneous data modeled by trees, and cannot be easily mapped into a relational framework. Thus, we cannot directly apply to XML data traditional data mining methods for relational databases, such as Apriori. Hence, it is a important challenge to develop efficient and scalable methods for XML data mining.This paper first introduces the basic theory of the traditional data mining, the basic theory of XML, the features of XML documents and technical specifications related to XML.Second, it introduces the concepts related to frequent subtree mining, and some of the existing frequent subtree mining algorithm.Third, it proposes a novel algorithm PDOM, based on the analysis of the FREQT and Freqttree algorithm, which are the frequent subtree mining algorithm. The algorithm adopts the technology of the rightmost expansion. Then it uses a method of recursive updating the set of candidate nodes to reduce the number of candidate nodes. Thus, the number of the candidate patterns is small. And, it adopts incremental method to compute the support of candidate pattern trees, which improves the efficiency of algorithm. It proves the correctness of the algorithm PDOM through theorem, and analysis the algorithm through the experiment.At last, taking into account the tree structure of xml, the paper proposes an algorithm named BFPC for classifying xml documents based on frequent pattern tree. It makes use of both document content and structure. First, it use tf * idf method to extract the representative of characteristics from the non-structural information that is the content of xml in other words. Then, it uses the frequent pattern tree algorithm to extract frequent pattern tree of each class to be the representative of the class and give some weights to each frequent pattern tree. Simultaneously, we propose a pattern tree match algorithm-Pmatch to implement pattern tree match by rightmost match set. In the testing phase, it uses Pmatch algorithm and keyword matching method to calculate the scores of the test document, and judges which class it belongs to. Experimental results show that BFPC algorithm has higher accuracy.

  • 【网络出版投稿人】 厦门大学
  • 【网络出版年期】2009年 12期
节点文献中: