节点文献

TJDirect:一种基于IFED编码的XML文档小枝模式匹配算法

TJDirect: An Holistic Twig Pattern Matching Algorithm of XML Document Based on IFED Encoding

【作者】 石隽锋

【导师】 陶世群;

【作者基本信息】 山西大学 , 计算机应用技术, 2007, 硕士

【摘要】 XML已成为Web上数据表示、集成和交换的标准,它的格式简单、自我描述能力强,实现了内容、结构和表现三者的分离,更适合于数据表示和交换。近年来,XML在各个领域得到了广泛的使用,Web上已经涌现了大量的XML数据。如何对XML文档进行有效地查询成为一个重要的研究课题。XML文档的结构查询通常被转化为两个节点列表之间的结构连接操作。结构连接主要有两种思想:一种是路径分解法,将查询的路径表达式分解成多个简单路径表达式,进行匹配,然后将结果连接起来。这种方法不可避免的要产生无用的中间结果。另一种是小枝模式匹配方法,将查询表达式用树的形式建模,称为“小枝模式”(twig pattern),再将小枝模式到XML文档树中进行匹配,这种方法使得算法的I/O代价、CPU时间复杂度和空间复杂度降低,提高了查询的效率。最近提出的基于ExtendedDewey编码的小枝模式匹配算法——TJfast能较有效地处理XML文档查询,但是ExtendedDewey编码不支持动态插入,并且TJfast算法的执行效率有待进一步提高。本文针对以上两个不足进行改进,主要工作如下:(1)提出了IFED(Insert-Friendly ExtendedDewey)编码方法,它从ExtendedDewey编码的基础上改进而来,既支持有效的查询,又支持动态插入。(2)TJfast算法的I/O代价和CPU时间复杂度与n个输入列表的大小和最后匹配结果大小的总和线性相关,但是,输入列表中有些不参与连接的节点是可以被跳过的。本文提出了TJDirect算法,它采用一种更为灵活的小枝模式匹配方法,跳过了一些不参与连接的节点,提高了算法的执行效率,并且算法中不再用集合保存匹配的节点,使得空间复杂度降低。(3)通过实验比较了基于IFED编码的TJDirect算法和基于IFED编码的TJfast算法的执行时间,结果显示TJDirect算法的执行效率较高。(4)为XML文档编码,考察IFED编码的存储空间超出ExtendedDewey编码的比值,实验证明IFED编码的存储空间不是很大。

【Abstract】 XML has already become the standards of the expression, integration and exchange of the data on web. It has simple form and strong self-describing ability. Beside, it realizes the separation of the content, structure and expression, so that it is more adapt to data expression and exchange. During recent years, XML is widely used in various fields, and its data has abundantly appeared on web. How to query XML document become an important research topic.Querying XML document is usually transformed to join two nodes’ list. There are two ideas about joining: one is to decompose it to several simple paths, match them and then join the results. The method inevitably produces useless result. The other is to match holistic twig, that is, to convert query path to a tree structure named twig pattern and then find its match in XML document tree. The method reduces algorithms’ I/O and CPU complexity and space complexity. It makes query more efficiently. TJfast: a holistic twig matching algorithm proposed recently based on Extended Dewey encoding can efficiently deal with XML document query, but ExtendedDewey encoding doesn’t support dynamic insertion and the performance of TJfast need improve.In this paper, we make improvement aiming at the two shortcomings mentioned above. The main contributions are summarized as follows:(1)We propose a new encoding scheme named IFED(Insert-Friendly ExtendedDewey),which improved from ExtendDewey. It supports both efficient query and dynamic insertion.(2)Algorithm TJfast has I/O and CPU time complexities linear in the sum of sizes of the n input lists and the output list. But some nodes in input lists which do not contribute to final results may be skipped. In this paper, we propose a more flexible twig pattern matching algorithm named TJDirect, which skip some nodes which won’t be joined. Algorithms’ performance is improved. Furthermore, the algorithm doesn’t use sets associated with matching nodes, so the space complexity is reduced.(3)We compare the execution time of TJfast based on IFED and TJDirect based on IFED through experiments and conclude that TJDirect is more efficient than TJfast.(4)We encode for XML documents to make a survey of the space of IFED encoding and ExrendedDewey encoding and make a conclusion that the space of IFED is not very large.

【关键词】 XML文档查询小枝模式编码结构连接
【Key words】 XML documentqueryholistic twig patternencodingstructural joins
  • 【网络出版投稿人】 山西大学
  • 【网络出版年期】2008年 05期
  • 【分类号】TP312.2
  • 【下载频次】77
节点文献中: