Research on Key Technologies of XML Information Retrieval

【Author】 Wen Yanlong (温延龙)

【Advisor】 Yuan Xiaojie (袁晓洁)

【Author information】 Nankai University, Computer Science and Technology, 2012, PhD

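【Abstract (Chinese)】 With the wide adoption of XML technology, XML has become the standard format for representing and exchanging data on the Web, and large volumes of XML data keep emerging in every field. How to retrieve this data effectively is a pressing and actively studied problem in the database and information retrieval fields. Traditional information retrieval techniques have produced many fruitful results for unstructured data. XML data, however, is semi-structured: it has both structure and content, which poses new challenges for information retrieval research. Researchers have reached a consensus that combining database techniques with information retrieval techniques is a sound way to address XML retrieval, and this has opened up new approaches to the problem. Building on an in-depth analysis of the state of XML retrieval research, this thesis takes the modes of XML retrieval as its main thread, combines database and information retrieval techniques, and studies several key technologies: XML keyword search, XML content and structure search with vague structural context, and XML full-text search based on a relational database. The specific innovations and contributions are as follows.

An XML keyword search method based on candidate fragment semantics is proposed. The method first selects candidate nodes according to the number of attribute types contained in each node of the XML document tree and the number of its descendant nodes, then creates candidate fragments centered on the candidate nodes and treats these fragments as the most basic semantic units for answering XML keyword queries. It then builds an inverted index over the candidate fragments. When answering a keyword query, it returns, depending on the characteristics of the XML dataset and the user's choice, either the set of candidate fragments that contain all keywords or the set of candidate fragments that stand in an ancestor-descendant relationship. Experimental results show that using candidate fragments as the basic semantic unit returns results of moderate granularity that are fairly complete and practically meaningful, with satisfactory retrieval efficiency.

An XML retrieval method with vague structural context is proposed. The method defines the structural constraints in queries and documents as structural contexts, and represents XML queries and XML documents as sets of structured terms. For context similarity, it takes into account the maximal matching part between two contexts, the level weight of each element, and the level similarity between elements, and proposes a method for computing the similarity between a query context and a document context. To support XML content and structure search effectively, the vector space model is extended and a content and structure retrieval algorithm with vague structural context is designed. Experimental results show that the method performs well in both retrieval efficiency and result quality.

A relational-database-based XML full-text search method, ReXFT, is proposed. ReXFT adopts NXRel, a model-mapping-based XML storage scheme that naturally reflects the logical model of XML data on top of the relational model. An XML full-text indexing scheme based on full-text element nodes is proposed, which lets users define their own full-text index paths. ReXFT accepts queries in the form of the XML full-text search standard recommended by the W3C, so its query syntax conforms to the international standard. Taking into account the hierarchical nature of XML data together with the logical relationships, distance, and frequency of search terms, a result scoring method based on text cover density is proposed. Experimental results show that ReXFT handles XML full-text search effectively.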
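The candidate-fragment keyword search described above can be illustrated with a minimal Python sketch. It is not the thesis's algorithm: the selection thresholds `min_child_types` and `min_descendants`, the use of distinct child tags as a stand-in for "attribute types", and all function names are assumptions made for illustration. It only shows the general shape of the pipeline: select candidate nodes, index their fragments with an inverted index, and return the fragments that contain every keyword.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

SAMPLE = """<bib>
  <book><title>XML Retrieval</title><author>Wen</author><year>2012</year></book>
  <book><title>Database Systems</title><author>Yuan</author><year>2010</year></book>
</bib>"""

def candidate_nodes(root, min_child_types=2, min_descendants=3):
    """Select elements with enough distinct child tags and descendants.

    Both thresholds are hypothetical; the thesis's actual criteria are not
    given in the abstract.
    """
    picked = []
    for elem in root.iter():
        child_types = {child.tag for child in elem}
        descendants = sum(1 for _ in elem.iter()) - 1  # iter() includes elem itself
        if len(child_types) >= min_child_types and descendants >= min_descendants:
            picked.append(elem)
    return picked

def build_index(fragments):
    """Build an inverted index: term -> set of fragment ids containing it."""
    index = defaultdict(set)
    for frag_id, elem in enumerate(fragments):
        text = " ".join(elem.itertext()).lower()
        for term in re.findall(r"\w+", text):
            index[term].add(frag_id)
    return index

def search(index, keywords):
    """Return the ids of fragments that contain all keywords."""
    postings = [index.get(k.lower(), set()) for k in keywords]
    return sorted(set.intersection(*postings)) if postings else []

root = ET.fromstring(SAMPLE)
fragments = candidate_nodes(root)   # the two <book> elements
index = build_index(fragments)
print(search(index, ["XML", "Wen"]))
```

In this toy document the `<bib>` root is skipped because it has only one kind of child, so each `<book>` becomes its own fragment and a query is answered at book granularity rather than with the whole document.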

【Abstract】 With the rapid spread of XML technology, XML has become the standard format for data representation and data exchange on the Web, and huge numbers of XML documents now exist in many domains. How to retrieve XML data efficiently and effectively has become a hot research topic in the database and information retrieval communities. Traditional information retrieval techniques offer rich solutions for unstructured data retrieval, but XML data is semi-structured, with both content and structure, and brings new challenges to information retrieval research. Retrieving XML data by combining database and information retrieval techniques has emerged as a novel research direction.

This thesis analyzes the state of research on XML information retrieval, considers solutions that combine database and information retrieval techniques, and addresses several crucial problems in XML data retrieval, including XML keyword search, XML content and structure search with vague structural context, and XML full-text search based on a relational database. The main contributions and innovations are as follows.

This thesis proposes an approach to keyword search over XML documents based on Candidate Fragment semantics. The method first filters candidate nodes according to the number of descendants and the number of attribute types of XML tree nodes, and then constructs candidate fragments centered on the candidate nodes. After indexing these candidate fragments with an inverted list, the method answers user queries either with candidate fragments that contain all keywords or with candidate fragments in an ancestor-descendant relationship, depending on the characteristics of the XML dataset. Experiments show that Candidate Fragment semantics provide users with compact, meaningful results of appropriate size and perform well on XML keyword search.

This thesis proposes an approach to retrieving XML data with vague structural context. User queries and XML documents are represented as sets of structural terms. Context resemblance is computed from the level weight of each element in a context, the level similarity between elements of the longest matched context, and other factors. The Vector Space Model is extended to answer XML content and structure search. Experiments show that the method performs well on XML content and structure search.

This thesis proposes an XML full-text search method based on a relational database, named ReXFT. ReXFT maps XML data into relational storage using NXRel, which naturally reflects the logical model of XML data. ReXFT allows users to create XML full-text indexes on user-defined paths based on full-text element nodes. ReXFT adopts the W3C Recommendation for XML full-text search as its query submission form, so that queries conform to the international standard. ReXFT scores search results with a cover density ranking scheme that takes into account the logical relationships between search terms, their distance, their frequency, and other factors. Experimental results show that ReXFT performs well in processing XML full-text search.
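To make the idea of context resemblance more concrete, the following Python sketch scores how well a query path matches a document path. It is a simplified stand-in, not the thesis's formula: it uses only the longest common subsequence of element names, normalized by the longer path, and ignores the level weights and level similarities the abstract mentions. The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

def context_similarity(query_path, doc_path):
    """Similarity in [0, 1] between two element paths such as '/bib/book/title'.

    Assumption: matched elements / length of the longer path. An exact match
    scores 1.0 and paths with no shared element score 0.0.
    """
    q = [step for step in query_path.split("/") if step]
    d = [step for step in doc_path.split("/") if step]
    if not q or not d:
        return 0.0
    return lcs_length(q, d) / max(len(q), len(d))

# A query for /book/title still partially matches a deeper document path.
print(context_similarity("/book/title", "/bib/book/title"))
```

In a vector space setting, a score of this kind could scale the weight of a term when the query context and the document context differ, so that structurally close matches rank above distant ones instead of being rejected outright.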

  • 【Online publication contributor】 Nankai University
  • 【Online publication issue】 2014, Issue 06