节点文献
XML内容路由关键技术研究
Research on Key Techniques of XML-based Content Routing
【作者】 王桐;
【导师】 刘大昕;
【作者基本信息】 哈尔滨工程大学 , 计算机应用技术, 2006, 博士
【摘要】 随着信息高速公路的发展,互联网上出现了大量采用事件-驱动模式的应用,如主动服务中的发布订阅系统、基于内容的XML路由、XML文档分发以及新闻传递等。这类应用中,信息以XML流的形式由一系列生产者经过事件代理传递到另一些消费者手中;消费者通过过滤引擎进行订阅。由于仅与XML的内容本身有关,而与信息在何处发布无关,这种路由方式常被称作内容路由。然而,现有的内容路由技术在高效匹配算法、对异构事件处理等方面尚存一些问题。扩展标记语言XML作为一种数据表示和交换的标准,具有自描述性、可扩展性、利于异构数据交换等诸多优点。本文以XML为事件模型、XPath作为多用户订阅模型来研究内容路由的若干关键技术。本文提出了一种基于hedge文法的HXFA机来处理XML发布流事件,并给出了HXFA机的过滤优化算法及算法正确性分析。最后,将多个HXFA机合并作为系统的过滤引擎。从算法的效率和可扩展性方面进行实验分析,提出的方法优于著名的内容过滤引擎YFilter。分析了现有XML相似性模型的优缺点,针对这些模型的不足,扩展了向量空间模型,提出了基于语义和支持度的层次路径模型,并给出其生成算法及复杂度分析。模型首先挖掘文档集中频繁出现的路径,通过文档中的语义信息来合并重复节点、路径,同时对文档特征向量进行维数规约。最后给出基于语义和支持度的距离测度方法。该方法兼顾了XML文档的结构信息和语义信息两个方面的相似性。与树编辑距离模型相比,不但每个文档具有“类原型”描述,而且在时间开销上有较大优势。根据H path模型,提出一种基于改进粒子群优化的XML文档聚类方法。首先将文档集映射到粒子群模型问题空间,然后利用粒子群聚类方法进行聚类,最终权衡了时间和准确性两方面因素,进一步提出混合的粒子群聚类方法,增强了聚类收敛程度和准确程度。尽管提出的模型在提取时已进行了数据归约,然而对于冗余的、异构的XML文档而言,高维灾难问题仍然存在。针对此问题,提出一种独立分量分析的预分类方法。该方法首先对文档矩阵进行维数归约,随后在独立分量张成的空间中进行聚类分析。采用本方法有两个优点:第一,去除相关冗余,挖掘更具有区分能力的特性并尽量刻画潜在的数据分布,从而增加聚类准确性。第二,通过有效降低向量空间的维数,大大压缩了搜索空间规模,减小开销。最后,提出了一个支持异构事件处理的XML发布/订阅系统体系结构。该系统反应了本研究中提出的内容路由技术是如何应用的。
【Abstract】 With the development of the Internet techniques, there are lots of Event-driven Applications such as Content-based publish/subscribe system, selective dissemination of information, content-based XML routing and news distribution. In these applications, a stream of XML documents is sent from a set of data producers to a set of data consumers. Consumers subscribe to the data by means of filters, and then receive a copy of all contents that satisfy the filters. This style of routing is called content-based routing, because the contents are routed based on their contents, and not based on any destination address. However, the existing content-based technologies suffer many problems on the efficient filtering method and the support to the heterogeneity events.XML has become the de facto standard of data exchange over the Internet, due to that XML is characterized by self-described, scalable and convenient for exchange. In this thesis, supposed that XML as publishing events, whereas XPath as multiuser subscriptions, some key techniques of content routing were focused on.In order to deal with the XML publishing events, a novel HXFA method is presented with the optimized rewriting method and then, the theoretical analysis is given. Finally, thousands of HXFAs are combined as a filtering engine. The proposed method shows the satisfactory results compared with the YFilter engine on efficiency and expansibility.The advantages and defects of the existing XML similarity models are analysed, based on which this paper extends the Vector Space Model and proposes a novel H_path model using ontology and supports. In addition, the constructing algorithm and complexity analysis are given. The approach at first extracts the frequent sequences from the document collection as the features, and then judges the semantic features between the tags of XML documents using ontology. Furthmore, the model combines repeated node and path through the semantic features. Finally, the distance calculation based on supports is put forward. Compared to the tree edit script model, this model has not only the description of each document but also the priority of the time expense.Based on the H_path model, the clustering method using improved PSO is given. Firstly, the document collection is mapped into the problem space of the particle model. Then, the CIP method is applied for clustering. Furthermore, weighing the time and accuracy factors, the mixed clustering method based on PSO is applied into the XML category to improve the clustering constringency and accuracy.When extracted from a large scale of heterogeneous documents collection, the H_paths have been dimension-reduced to some extent; however, the high dimensionality curse still exists. Aimed at the problem, a novel preprocessing strategy is proposed. Independent Component Analysis is applied to reduce the dimensionality of document matrix. Then, document vectors are clustered on this reduced Euclidean Space spanned by the independent components. It has two merits: the method can at first delete the correlative redundancy and find the underlying latent variables of XML structures to improve the quality of the clustering, and secondly reduce dimensionality to compress the search space with low cost.Finally, the architecture of Publish/subscribe System is presented. We can also find how these proposed key techniques works in this system.
【Key words】 XML; Content Routing; Publish/subscribe; Particle Swarm Optimization; Hedge Automata;