节点文献

XML结构索引技术及查询优化研究

Study on Structural Index Technology and Query Optimization for XML

【作者】 郭松涛

【导师】 朱征宇;

【作者基本信息】 重庆大学 , 计算机软件与理论, 2003, 硕士

【摘要】 为了实现XML的查询优化,近年来人们相继提出了很多索引技术和连接算法[12,13,14,15,16,23,24]。这些索引主要是根据边标签和元素值建立的。然而有的索引不包含所有的元素结点,因而在进行查询时许多路径仍需要检测;有的在向前或向后遍历时产生了大量的冗余数据,从而造成查询代价较大。另外,在所提出的算法中,尽管有的算法,如MPMGJN算法[23]优于标准的RDBMS连接算法,但是该算法为匹配基本的结构关系,特别是在父子关系情况下,执行了大量不必要的计算和占用了大量的I/O资源;有的算法虽然代表了结构连接算法的先进水平,如Stack-Tree-Desc[24]连接算法,但是它没有利用索引结构而是顺序浏览输入列表。这样,必然浪费I/O资源,影响连接的速度。针对以上情况,本文做了以下几个方面的工作:① 由于采用传统的Numbering Schema方法来表示XML文件结构不便于元素更新,本文在改进的基础上提出了Sparse Numbering Schema方法。与传统方法相比,其优点在于:由于在插入新结点时不需要重新计算其结点的start和end值,树结构更新效率得到提高;树的创建只需遍历一次文档,进一步地节省了建树开销;此外,它还能为索引提供一个相对持久和稳定的参考。② 鉴于目前关于Numbering Schema存储方法的研究较为少见,本文针对Sparse Numbering Schema进行研究,给出了在关系数据库中的存储方法。该存储方法不仅有利于根据start值快速建立索引,而且可以节省存储空间。③ 本文将关系数据库中B+树索引技术与Sparse Numbering Schema相结合,提出了一种新的XML文件索引结构——B~+树结构索引,它对XML查询中连接操作和元素定位操作的优化有着重要作用。进而,通过引入指针对该索引进行改进,提出了一种带有Sibling Pointer的B~+树结构索引(简称B~+-SP)。利用这种索引可以克服元素查找总是从树的根部开始进行的缺陷。④ 基于B~+-SP索引,本文还研究给出了Anc-Desc-B~+-sp连接算法。经理论分析,其算法的时间复杂度O(|A|+log|A|)比没有采用该索引的Stack-Tree-Desc算法[24]的时间复杂度O(|A|+|D|+|outlist|)明显降低,因|D|≥|A|,故|D|+|outlist|>>log|A|。经初步实验表明,本算法是一个有效、快速的连接算法。⑤ 在XML查询中,影响查询时间的另一个重要因素是对涉及的XML数据源的定位问题。为解决XML数据源的快速定位问题,本文提出了一种分布式XML数据源定位系统框架,协作式XML搜索引擎(CXSE)。CXSE通过基于站点选择搜索和对XML数据源计分等方法来缩短收集时间,来实现对XML数据源的快速、准确定位。特别地,当在XML查询中同时涉及多个XML数据源时,该并行搜索技术也能起到一定的效果。

【Abstract】 Various index techniques and join algorithms [12,13,14,15,16,23,24] have been recently proposed, in order to realize query optimization for XML. The indices are built on the tags and the element values. Nevertheless, some indices do not contain all element nodes, many paths need to still be examined in the query; other indices produce redundant data in the preorder or postorder traversal, this makes the cost of query much more. In the proposed join algorithms, although some algorithms such as MPMGIN algorithm [23], outperform standard RDBMS join algorithms, they perform a lot of unnecessary computation and I/O for matching basic structural relationships, especially in the case of parent-child relationships; other algorithms such as the Stack-Tree-Desc algorithm [24], represent the state-of-the-art in structural joins, however, they do not utilize indexed structures but sequentially scan the input lists. Thus, I/O’s can be wasted for scanning element that do not participate in the join, and join speed can be influenced.According to this situation, the main works and contributions in the paper are as follow:① As it is inconvenient to update that the conventional numbering schema is used to represent the structure of XML document, a sparse numbering schema based on improvements is proposed in this paper. By comparing with the conventional method, the sparse numbering schema has some merits as follow: the values of start and end do not recomputed, when a new node is inserted, the updating efficiency of tree structure is improvement, and the XML documents are traversed only once, when the schema is constructed, this further decrease the cost of building tree, and the schema can provide a durable conference for index.② As the storage approach of the numbering schema is scarce, this paper proposes a new approach that the sparse numbering schema is stored in the relational database. By utilizing this storage approach, the indices can be easily built on the start column, and the storage space can be mostly reduced.③ This paper refers to the indexing technologies of B~+-tree in DBMS, and combine it with the sparse numbering schema, and proposes a new indexed structure — B~+-tree structural index. It is very important for the optimization of join operation and element location in XML query. By introducing the pointers for further improving the indexed structure, this paper proposes a B~+-tree structural index with sibling pointer (B~+-sp for shorten). This structural index can avoid the defect of always traversing B~+-tree from the boot.④ Based on B~+-sp, this paper proposes Anc-Desc- B~+-sp join algorithm. It is theoretically analyzed that the time complexity of the join algorithm (O(|A|+log|A|)) is obviously less than that of the Stack-Tree-Desc algorithm (O(|A|+|D|+|outlist|) [24], because of |D|≥|A|, |D|+|outlist|>>log|A|. Experiment results primarily prove that the join algorithm is more efficient and quick join algorithm.In XML query, the other important factor of influencing the query time is the problem of the location of XML data source. In order to resolve the problem, this paper proposes a distributed XML data source location system frame, called Cooperative XML Search Engine (CXSE). CXSE can shorten collection time by searching based site selection and<WP=7>⑤ scoring Web document. Accordingly, CXSE really realize to quickly and correctly locate the URL of XML document needed. Moreover,the retrieval system is available to several XML data source in XML query.

  • 【网络出版投稿人】 重庆大学
  • 【网络出版年期】2004年 02期
  • 【分类号】TP311.1
  • 【被引频次】4
  • 【下载频次】294
节点文献中: 

本文链接的文献网络图示:

本文的引文网络