基于权重边集比较法的XML语义聚类研究
Research of XML Semantic Clustering Based on Weighted Edge Set Comparison Algorithm
【Author】 Liu Lei (刘磊);
【Advisor】 Zheng Yongqing (郑永清);
【Author information】 Shandong University, Computer Software and Theory, 2010, Master's thesis
【Abstract】 XML (eXtensible Markup Language) is simple, extensible, interoperable, and open, and it is rapidly becoming a technology-independent standard and transmission format for data exchange. Compared with HTML, XML is more flexible: it can mark up unstructured text as well as highly structured, regular data (such as data in a database). With the rapid growth of XML data on the Web, helping users retrieve large volumes of XML data quickly and effectively, and find the information they need, has become an urgent problem. Document clustering is an effective aid to information retrieval, so XML document clustering has become an active research topic. The key to clustering XML documents is the measure of similarity between documents. Because XML documents are semi-structured text whose information is partly carried by the document structure, not every text similarity algorithm is suitable for XML.

Current methods for computing XML document similarity are mainly element comparison, edge set comparison, and tree edit distance. Element comparison is simple and fast, but it considers only the number of nodes and ignores the structural complexity of the XML document tree, so its clustering results are not very satisfactory. Tree edit distance takes both structural complexity and node similarity into account and gives good clustering results, but its time complexity is high. Edge set comparison falls between the two in performance.

This thesis therefore extends edge set comparison and proposes a weighted edge set comparison algorithm. It simplifies the XML labeled tree into a summary tree by eliminating nested and repeated nodes from the document tree, and it combines semantic information to measure the similarity between XML documents. Once the similarities between XML summary trees have been obtained, the documents are clustered with a partitional clustering method. Building on the classic edge set comparison algorithm, the thesis makes two contributions: (1) It proposes the concept of weighted edge set comparison. Each edge of the XML summary tree is assigned a weight according to its structural complexity and its level in the tree, which gives more importance to structure and hierarchy in XML. (2) It uses semantic information to compute the similarity of directed edges in XML summary trees, obtains the set of semantically equivalent edges, and from that set determines the similarity between two XML summary trees, which improves clustering precision. Experimental results show that the semantics-based weighted edge set comparison method gives good clustering results.
【Key words】 Data Mining; XML Clustering; Edge Set Comparison Algorithm; Semantic Similarity
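
The abstract does not state the weight formula or the semantic matching procedure, so the following Python fragment is only a rough sketch of what a weighted edge set comparison might look like, not the thesis's actual algorithm. Several parts are assumptions made for illustration: the `SYNONYMS` table stands in for the semantic information, the weight `decay ** depth` stands in for the level-based weighting, repeated and self-nested elements are collapsed by keying edges on label pairs, and the score is a weighted Jaccard ratio.

```python
import xml.etree.ElementTree as ET

# Hypothetical synonym table; a stand-in for the semantic information
# mentioned in the abstract (the thesis's real source is not given).
SYNONYMS = {"writer": "author", "heading": "title"}


def canon(tag):
    """Map a tag name to a canonical label (assumed semantic matching)."""
    tag = tag.lower()
    return SYNONYMS.get(tag, tag)


def weighted_edges(xml_text, decay=0.5):
    """Return {(parent_label, child_label): weight} for one document.

    Keying edges by label pair collapses repeated siblings into one edge,
    and edges whose two labels are equal (direct self-nesting) are skipped.
    The weight decay ** depth is an assumed formula that makes edges
    near the root count more than deep edges.
    """
    root = ET.fromstring(xml_text)
    edges = {}
    stack = [(root, 0)]  # (element, depth of that element)
    while stack:
        node, depth = stack.pop()
        parent = canon(node.tag)
        for child in node:
            label = canon(child.tag)
            if label != parent:
                key = (parent, label)
                # Keep the largest weight, i.e. the shallowest occurrence.
                edges[key] = max(edges.get(key, 0.0), decay ** depth)
            stack.append((child, depth + 1))
    return edges


def similarity(a, b):
    """Weighted Jaccard similarity between two weighted edge sets."""
    keys = set(a) | set(b)
    shared = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    total = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    return shared / total if total else 1.0


if __name__ == "__main__":
    d1 = weighted_edges("<book><title/><author/><author/></book>")
    d2 = weighted_edges("<book><heading/><writer/></book>")
    d3 = weighted_edges("<order><item><price/></item></order>")
    print(similarity(d1, d2), similarity(d1, d3))
```

A pairwise similarity matrix built with `similarity` could then be passed to a partitional method such as k-medoids, which is one possible reading of the clustering step described in the abstract.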