Research on Web Text Mining Based on XML and Association Rule Mining Algorithm
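【Author】 Wang Yan (王燕);
【Advisor】 Su Yong (苏勇);
【Author information】 Jiangsu University of Science and Technology, Computer Application Technology, 2011, Master's degree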
【Abstract】 In recent years, with the development of computer technology and the spread of the Internet, the volume of data held on web servers at every level has grown enormously and the types of data have become increasingly varied. How to make effective use of this data and mine it for information of value to different fields has become an active research topic. Traditional database and data mining technologies have advanced rapidly and continue to mature, but because Web data is semi-structured or unstructured, these technologies run into many difficulties when applied to mining Web data.

XML is a semi-structured data model, and as it has developed it has come into wide use for representing information on the Internet. XML is extensible, platform-independent, and flexible, and it has strong expressive power for data, which has given it a growing role in representing and exchanging information. Effectively extracting valuable information from large volumes of XML data has therefore become an urgent problem.

The Apriori algorithm is the classic algorithm for association rule mining and has been highly influential in that field. However, because it scans the database repeatedly and consumes considerable space, many improvements to it have been proposed. Existing XQuery-based Apriori algorithms still leave room for improvement. For example, when the volume of XML data is very large, related data may be stored across multiple documents that have no inherent connection to one another. Current association rule algorithms mainly mine a single XML document, so the algorithm must be modified before it can mine multiple documents.

This thesis combines XQuery, the query language for XML, with association rule mining to implement an XQuery-based Apriori algorithm, and studies association rule mining over multiple XML documents. The algorithm is improved by introducing the XQuery collection function, which can access a collection of multiple XML documents, so that multiple documents can be mined without reducing mining efficiency. The improved algorithm is applied in an XML-based Web text mining model, which verifies its feasibility and effectiveness.
【Key words】 XQuery; Apriori; XML documents; association rules; data mining;
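The approach summarized in the abstract, pooling transactions from several XML documents and then running Apriori over the pooled set, can be illustrated with a minimal sketch. This is not the thesis's implementation: the thesis works in XQuery with the collection function, whereas this sketch uses Python's standard library purely for illustration. The `<transactions>/<transaction>/<item>` layout, the sample documents, and the function names `load_transactions`, `apriori`, and `rules` are all assumptions introduced here.

```python
import xml.etree.ElementTree as ET
from itertools import combinations

# Hypothetical sample documents; the element layout is an assumption, not the thesis's schema.
DOC_A = """<transactions>
  <transaction><item>xml</item><item>xquery</item><item>apriori</item></transaction>
  <transaction><item>xml</item><item>xquery</item></transaction>
</transactions>"""
DOC_B = """<transactions>
  <transaction><item>xml</item><item>apriori</item></transaction>
  <transaction><item>xquery</item><item>apriori</item><item>xml</item></transaction>
</transactions>"""


def load_transactions(documents):
    """Pool transactions from several XML documents (loosely analogous to collection())."""
    transactions = []
    for text in documents:
        root = ET.fromstring(text)
        for node in root.iter("transaction"):
            items = frozenset(i.text.strip() for i in node.findall("item") if i.text)
            if items:
                transactions.append(items)
    return transactions


def apriori(transactions, min_support):
    """Return {itemset: support count} for every itemset with count >= min_support."""
    def count(candidates):
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    frequent = count({frozenset([i]) for t in transactions for i in t})
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: unions of frequent (k-1)-itemsets that contain exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: keep a candidate only if all of its (k-1)-subsets are frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        frequent = count(candidates)
        result.update(frequent)
        k += 1
    return result


def rules(frequent, min_confidence):
    """Derive rules lhs -> rhs with confidence = count(lhs and rhs) / count(lhs)."""
    out = []
    for itemset, n in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                confidence = n / frequent[lhs]
                if confidence >= min_confidence:
                    out.append((lhs, itemset - lhs, confidence))
    return out
```

Calling `apriori(load_transactions([DOC_A, DOC_B]), min_support=3)` treats the two documents as one transaction set, which is the role the abstract assigns to the collection function. The repeated passes over `transactions` inside `count` correspond to the repeated database scans that the abstract names as Apriori's main cost.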