节点文献

对等计算系统中的相似查询处理研究

Similarity Search in Peer-to-Peer Systems

【作者】 徐林昊

【导师】 周傲英;

【作者基本信息】 复旦大学 , 计算机软件与理论, 2005, 博士

【摘要】 对等计算(peer-to-peer computing,简称P2P)已经成为了计算机科学领域的研究热点。在对等计算系统中,每个节点都是完全自治的,拥有相同的责任,扮演着双重角色—既可以是客户机(服务消费者),也可以足服务器(服务提供者),而且任意一个节点都可以随意地加入或退出系统。因此,对等计算系统是一个完全动态的、没有任何集中控制的分布式系统。对等计算模型具有许多潜在的优势,如扩展性强、鲁棒性好、资源可用性高等特点,特别适用于具有地理分布、资源异构、扩展性要求高、局部自治等特征的分布式系统。因而,对等计算模型推动了“以主机为中心(host-centric)”的传统互联网向“以数据为中心(data-centric)”的未来互联网的发展,被学术界和工业界公认为是重构基于互联网应用的关键技术之一。虽然,学术界已经取得了不少对等计算环境下的查询处理研究成果,但仍然存在着许多有待研究与解决问题。本文研究了对等计算环境下的相似查询问题,探索了对等计算环境下的基于路由索引、数据空间划分、协作缓存和概率模型的相似查询处理技术,旨在为现有的对等计算系统提供基于语义或者相似度的查询处理功能。本文的主要贡献有如下四个方面:1.将多维数据空间中的相似查询处理(similarity search)技术引入到无结构(unstructured)对等计算系统中,利用近似向量(vector approximation)技术和路由索引(routing index)技术,为系统中的每个节点建立基于近似向量的路由索引,使得用户查询能够准确地路由到并且有效地查询拥有相关数据资源的节点,实现无结构对等计算系统中的相似查询处理。另外,利用无结构对等计算系统中的网络自配置(self-reconfiguration)特性,通过动态调整节点在网络中的位置,使得与相似查询相关的节点保持位置邻近,进一步提高了系统的查询处理性能。仿真实验表明,该方法对无结构对等计算环境下的相似查询处理非常有效。2.将数据空间划分(space partitioning)技术引入到结构化(structured)对等计算系统中,通过选定的代表点(reference point),将整个数据空间划分成没有任何重叠(overlap)的数据子空间。通过将代表点线性化,在节点、代表点和数据子空间三者之间建立起一一映射关系。利用传统的高维索引技术和基于分布式散列表(distributed hash table,或DHT)的资源查找和定位机制,使得高维数据空间中的相似查询处理在结构化对等计算系统上得以实现。此外,通过维护数据子空间之间的物理邻近(physical proximity)特征,降低了系统的查询路由代价;通过调整数据子空间的粒度,达到均衡系统负载(load balance)的目的。仿真实验表明,该方法能够有效地适应数据维度的增长和系统规模的扩展。3.针对关系查询处理,探索了基于协商(negotiation)的协作缓存技术(collaborative caching),提出了一种基于网络传输代价的查询代价模型,用于评价不同查询计划的执行代价。在对等计算环境下,一个查询计划的执行代价可以被分解为子查询计划的执行代价。结合代价模型,利用协调重叠网络(collaborative overlap network),通过查询请求节点(requester)和协调节点(coordinator)之间的协商,确定协作缓存的逻辑查询表达式和参与数据缓存的查询请求节点,实现了对等计算环境下的基于语义的查询处理。仿真和真实实验表明,该方法能够确定较优的数据缓存放置策略,降低系统的查询处理开销。尤其是在单个节点仅能贡献有限的存储资源的情况下,该方法的优势更为明显。4.针对基于主题(topic)的对等计算文件共享系统,研究了一种基于概率的相似查询处理技术。该技术的核心思想是利用概率模型(probabilistic model)描述共享主题之间的语义重叠度(overlap)以及节点对主题的信息覆盖度(coverage),为节点建立起概率路由信息。相似查询处理算法以每个节点已有的概率信息为基础,依据推导出的邻居节点对查询主题的覆盖度,决定主题查询的搜索路径。此外,利用查询反馈的信息,通过更新路由查询的节点上的概率信息,使得这些节点能够为将来的主题查询选择更准确的查询搜索路径。模拟实验表明,该方法能够利用基于自反馈的概率更新算法,逐步改善查询处理的效果,提高查询处理的效率。总之,本文详细地介绍了四种相似查询处理方法的算法设计与实现,以及测试结果。这些方法是对现有对等计算环境下的查询处理技术的有益补充和改进。本文的研究工作建立在对当前已有技术的详尽分析与理论研究,以及大量的实验测试的基础上。实验和分析表明,与当前对等计算环境下的查询处理技术相比,上述方法在查询效率和资源利用率等方面具有优势。

【Abstract】 Peer-to-peer computing (P2P) has become an extremely popular topic in computer science. In a P2P system, each peer is fully autonomic, has equal responsibility and plays the role of both a service costumer and a service provider. Moreover, peers can enter or leave the P2P network at any time. Thus, a P2P system is a type of fully dynamic distributed system without any central administration. The P2P computing paradigm has many advantages, such as, scalability, robustness, resource availability etc, and is especially suitable for distributed applications with properties of geographical distribution, heterogeneous resource, scalability and local autonomy. Hence, the P2P computing paradigm is driving the evolution from the host-centric Internet to the data-centric future Internet, and is thought of as a promising technology to reconstruct the future Internet-based applications.Despite of current achievements on P2P-based query processing, there are still a lot of problems need to be studied. To this end, this thesis is devoted to the issue of similarity search in P2P systems. It studies routing index, space partitioning, collaborative caching and probabilistic model-based similarity search techniques, in order to support semantic or similarity-based query processing methods above existing P2P systems. Major contributions of this thesis include:1. Similarity search in multi-dimensional data space is introduced to unstructured P2P systems. A simple yet effective routing index structure, called EVA-Index, is designed by combining both vector approximation and routing indexing techniques together. With the aid of EVA-Index, each peer can process similarity query with its local dataset, and route queries to promising peers with the desired data objects. Furthermore, each peer can reconfigure its neighboring peers to keep the relevant peers close by so as to improve system performance. Simulation shows that the proposed approach is efficient and effective to similarity query processing in unstructured P2P environments.2. An efficient space partitioning strategy is introduced to a structured P2P system. The whole data space is first partitioned based on a set of pre-generated reference points. Each reference point has an associated subspace and any pair of subspaces does not overlap with each other. As such, through building one-to-one mapping between nodes and reference points (and hence subspace), similarity search in high-dimensional space can be implemented by using the traditional high-dimensional indexing technique and distributed hash table-based resource location mechanism. Moreover, the routing cost of similarity search can be greatly reduced through capturing physical proximity of sub-spaces, and the load balance at nodes can be obtained by tuning the granularity of subspaces. Simulation shows the efficiency of the proposed method can be kept in term of dimensionality and network size increasing.3. The technique for collaborative cache, based on negotiation, is studied for supporting SQL query processing, and a cost model for evaluation of the network transmission cost of query plans is given. The cost of the query plans are estimated by using the cost of sub-query plans. The cost model is combined with collaborative overlap network, through negotiation between requesters and coordinators, to determine the logical expression and physical maintenance nodes of data cache. Thus, based on collaborative caching technique, the P2P system can implement semantic-based query processing. Simulation and real experiments show that the proposed method usually obtains near-optimized cache plans and reduces the cost of query processing, especially in the case that a single node can contribute limited storage space.4. A similarity search technique based on probabilistic information is investigated for the P2P file sharing application with hierarchical topics. This approach uses probabilistic model to describe the overlap between topics and the coverage of nodes to topics, and hence builds routing indices at nodes. Based on existing probabilistic information at nodes, the similarity search algorithm can deduce the coverage of neighboring nodes to the queried topic, so that a promising routing path can be determined. Further, using feedback of the previous queries, nodes that were responsible for routing topic queries can update their probabilistic information to guide future ones more efficiently. Simulation shows the proposed approach can gradually improve the search efficiency and effectiveness in a feedback-based way.In summary, this thesis gives a detailed description of the algorithm, implemen-tation and experimental evaluation of the above four similarity query processing approaches. These approaches are a kind of useful complement to and improvement on current query processing methods in the P2P environment. The work is based on complete survey and theoretical analysis along with extensive experimental evaluation. The experiments and analysis show that, compared with existing query processing methods for P2P systems, the approaches mentioned above have advantages in efficiency of both query processing and resource utilization.

【关键词】 对等计算相似查询索引缓存概率
【Key words】 peer-to-peer computingsimilarity searchindexcacheprobability
  • 【网络出版投稿人】 复旦大学
  • 【网络出版年期】2007年 02期
  • 【分类号】TP393.01;TP301
  • 【被引频次】1
  • 【下载频次】251
  • 攻读期成果
节点文献中: