节点文献
若干典型网格应用的容错及性能研究
Achieving Fault-Tolerance and High-Performance in Grid Applications
【作者】 陈益峰;
【导师】 何炎祥;
【作者基本信息】 武汉大学 , 计算机软件与理论, 2004, 博士
【摘要】 网络技术的飞速发展和可用带宽的高速增长极大改变了我们计算、通讯和协作的方式。计算网格的出现有望为普通用户提供一个近乎与用电一样简单易行的高性能计算机资源访问方式。着眼于大规模资源共享和动态问题求解,网格从概念上使得创建比当今分布并行应用更为复杂的应用程序成为可能,甚至包括先前未能求解的计算任务。然而,基于如下理由,很难给出一个满足各种网格应用容错和性能要求的通用解决方案:网格的规模、异构性和动态性降低了网格进程通信、协作和同步的效率,使得高效可靠的可扩展网格应用设计变得更加复杂;网格应用基本上依赖标准的网格协议在网格资源上执行计算任务,但网格协议采用的“沙漏”设计原则只能保证网格协议的通用性,而不对容错计算提供有效支持;不同的网格应用具有不同的性能目标,各应用的性能目标取决于其自身特征,难以使用通用的解决方案。因此,本文重点解决如下三类典型网格应用的容错或性能问题。这些应用在科研和商业领域均对网格技术发展具有重大推动作用,而且在广义上在如何高效执行协作与同步、有效利用网络资源和显著降低网络延迟等方面具有相同的性能目标。耗时计算:随着系统规模的增大,容错成为一个关键问题。没有容错机制的支持,网格应用程序几乎不可能成功执行,尤其对于那些需要数天甚至数月计算时间的耗时应用程序。在松耦合的计算环境中,容错技术不宜频繁使用系统范围的同步和协作,以保证应用程序具有较小的运行开销和良好的可扩展性。为简化应用程序的创建,容错功能必须与应用程序的逻辑功能分离,并抽象为通用服务对应用程序员开放。此外,对于耗时计算,可扩展的检查点设置及回滚恢复技术可能比带投票及一致意见的任务复制技术节约计算资源,因而更为可取。网格信息检索:在网格环境下,存在大量地理上分布的可用数据资源。这些数据资源通常分布在多个管理域,并具有不同的组织形式。考虑到数据资源的广泛性、异构性和松耦合性,从网格高效检索有用知识的难度远大于传统分布式系统中的信息检索。因此,网格信息检索系统要求具有更强的减少带宽消耗、降低网络时延、消除网络不稳定性影响及并行利用网格计算资源的能力。内容传递服务:作为当前存储、维护并提供文档访问的高效解决方案,内容传递服务日益得到信息提供者的青睐。信息提供者常常使用内容传递服务在因特网上复制或镜像内容,建立内容传递网络,从而解决流行网络站点的可扩展性、可用性及性能问题。内容传递服务对网格技术发展提供有力的商业支持,并将逐渐采用类似网格的基础设施来平衡服务器的负载、消除数据在网络上的重复传输、减小请求的响应时间、增加系统吞吐量、并保证预期的服务质量。然而内容传递网络的性能极大地受其代理放置策略的影响。尤其在网格环境中,大量可用的计算资源显著扩大了资源分配的决策空间,使得内容分布服务的代理放置问题变得更加重要。总之,本文的研究目标在于提出解决耗时网格应用容错、提高网格信息检索效率和改善内容传递服务性能的相关技术,其主要贡献如下:提出了一个新的、基于移动Agent的自适应、可扩展容错技术MACR。该技术分离容错功能与应用程序逻辑功能,因而便于与网格应用集成。MACR灵活运用移动Agent技术的移动特性,使用移动Agent协调本地进程之间的通信和协作。MACR综合了异步检查点设置方法和同步检查点设置方法的优点,降低了程序正常运行期间的额外开销,消除了回滚恢复期间可能发生的多米诺效应。MACR通过动态调整算法参数,具有自动适应系统规模和网络状态变化的能力。给出了基于移动Agent的混合检查点设置及回滚恢复算法,严格证明了算法的正确性,系统分析了算法的性能和可扩展性,提出了解决算法性能和可扩展性问题的技术,并利用仿真实验研究了算法的有效性和可扩展性。提出了一个用于提高网格信息检索性能的移动Agent系统的分层结构。基于该结构,提出了一个新的、用以规划移动Agent最优迁移的遗传算法。该算法保证检索任务消耗最少的网络资源,同时保证任务在用户指定的完成时间内结束。该算法采用树结点的前序遍历序列以及各个结点对应的子女数组成的序列共同对Agent迁移树进行编码。理论分析和仿真实验证明,该编码保证编码空间和题解空间之间存在一一对应关系,并具有编码及解码简单、遗传操作相对容易进行、较好解决问题空间的探索和局部信息利用之间的平衡、以及快速收敛等优点。提出了一个新的、旨在提高内容传递服务性能的代理放置策略LDASP。LDASP以最小化系统通信开销并同时保证最大化系统吞吐量为目标,求解最优的代理放置方式。与通信网络中的资源分配问题现有求解策略不同,LDASP通过模拟内容传递网络的请求路由机制考虑了代理服务器的负载分布及处理能力约束,从而保证系统具有最低的资源消耗、最大的吞吐能力和良好的负载均衡。提出了高效的贪婪算法和近似动态规划算法,用以求解树型网络条件下简化的LDASP问题。理论分析和仿真实验证明了二者的正确性、有效性和计算复杂性。
【Abstract】 The rapid expansion of the Internet and the dramatic increase in available network bandwidth has qualitatively changed how the world computes, communicates, and collaborates. The emergence of computational grids provides a promising infrastructure that can potentially make high-performance computing power accessible to general users as easily and seamlessly as electricity from an electrical power grid. Due to the focus on large-scale resource sharing and innovative problem solving, grids enable applications that are conceptually more complex than current parallel and distributed applica-tions, including the applications that could not be solved before. However, it is not a trivial task to develop a generic solution that can offer fault tolerance and high performance for the wide spectrum of grid applications due to the following reasons:The scale, the heterogeneity, and the dynamic nature of a grid make it difficult to efficiently support grid-scale interprocess communication, coordination, and synchronization, thus signifi-cantly complicating the design of efficient, reliable, and scalable grid applications;Grid applications basically rely on the standard grid protocols for executing their computations on grid resources, but the hourglass design principle makes them too generic to support any fault tolerance mechanisms;Different grid applications have distinct performance goals inherently determined by the appli-cation’s characteristics so that they are hard to be addressed via a generic approach.Therefore, this dissertation aims to address the fault-tolerance and/or high-performance issues for the following three typical grid applications that motivate the development of grid technology in both scientific and commercial communities. In a broad sense, these issues share the same goals in effi-ciently performing coordination and synchronization, effectively utilizing network resources, and re-markably overcoming network latency.Computation-intensive applications. With increases in system scale, fault-tolerance has be-come such a critical issue that without its support, grid applications have little chance to finish successfully, especially for those that require days or even months to execute. Frequent sys-tem-wide synchronization and coordination should be avoided since it will result in high over-head and bad scalability in a loosely coupled computing environment. To simplify the con-struction of the applications, fault-tolerance function needs to be separated from application logic functionality and be abstracted as a generic service opened to application programmers. Also, for these computation-intensive applications, a scalable checkpointing and rollback re-covery scheme may be preferable to task replication with voting and consensus for its possi- bility in saving computational resources.Grid information retrieval. In emerging grids, vast volumes of geographically distributed data resources will be pooled together and are available for being integrated into applications. These data resources usually cross over different administrative domains and are structured in distinct ways. Given that data resources are loosely coupled and heterogeneous, the difficulty in effectively retrieving useful knowledge from a grid is more highlighted than doing that from a traditional distributed information system. Therefore, grid information retrieval sys-tems are expected to be enabled to reduce bandwidth utilization, overcome latency, tolerate network disconnection, and utilize the parallelized computational resources in grids.Content delivery services. Information providers who seek for a cost-effective solution today are increasingly choosing Web content delivery services to store, maintain, and provide access to their documents. To address the scalability, availability, and performance problem of their popular Internet sites, information providers commonly use delivery services to replicate or mirror their content over Internet, as exemplified by the prevalence of Content Delivery Net-works (CDNs). Surely, this kind of services is currently a driving force that propels the development of grid technology, and will eventually use a grid-like structure to alleviate server workload, eliminate redundant data traversal over the network, reduce response latency, increase system throughput, and most importantly, deliver desired QoS. However, the CDN performance is dramatically affected by its surrogate placement policy. In a grid environment, the optimal surrogate placement problem for content delivery services becomes a more serious issue as the vast available computing resources drastically enlarge the decision space.In a word, the goals of this dissertation are to develop techniques to achieve fault tolerance for computation-intensive grid applications, efficiency for grid information retrieval, and high-perform-ance for content delivery services in grids. The major contributions are summarized as follows:A novel, mobile agent enabled fault tolerance technique, MACR, that can be easily integrated into grid applications through the separation of concerns, is proposed. The importance of MACR is justified by our finding that no bridging mechanism exists that can integrate or reuse multiple underlying fault tolerance protocols deployed in different administrative domains without modifying them. MACR takes advantage of both the independent and the coordinated checkpointing approaches by exploring the appealing features of mobile agent technology and using mobile agents to carry out coordination tasks for cooperating local processes; as a result, it can not only be made adaptable to the system scale and the network state changes, but also can guarantee low runtime overhead, while at the same time eliminating the possibility of the undesirable domino effect. The feasbility of MACR is demostrated by designing a hybrid checkpointing and rollback algorithm using mobile agents as an aid. The correctness of the al-gorithm is formally proved and the performance and scalability issues are systematically dis-cussed. Techniques are proposed to address the scalability issue and the potential performance bottleneck of the proposed algorithm, and simulations are performed to show the effectiveness and scalability of the proposed protocol and its improving techniques.A flexible layered architecture of mobile agent system that aims to achieve high-performance for grid information retrieval is presented. Based on this architecture, a new genetic algorithm is proposed to schedule the optimal migration of mobile agents with the objective to minimize the total communication cost subject to a user-defined limit of task completion time. The pro-posed genetic algorithm encodes the agent migration trees by the pre-ordered traversal se-quence of tree vertices, together with the children number sequence of corresponding tree ver-tices. Through theoretical analysis and experimental simulations, this encoding is proved to be able to guarantee a one-to-one correspondence between the encoding space and the problem space, and has the advantages of simplicity for encoding and decoding, ease for GA operations, better equilibrium between exploration and exploitation, and fast convergence to the optimal or near-optimal solution.A new surrogate placement strategy, LDASP, is proposed to enhance performance for content delivery services. LDASP aims to address surrogate placement in a manner that minimizes the communication cost while ensuring at the same time the maximization of system throughput. This work differs from existing work on the resource allocation problem in communication networks in that it considers load distribution and processing capacity constraints on surro-gates by modeling the underlying request-routing mechanism, thus guaranteeing a CDN to have minimum network resource consumption, maximum system throughput, and better load balancing among surrogates. An efficient and effective greedy algorithm and an approximate dynamic programming algorithm are developed for a simplified version of the LDASP prob-lem in tree networks. The correctness, efficiency, and effectiveness of the proposed algorithms are systematically analyzed through formal proofs and experimental simulations.
【Key words】 grid computing; fault tolerance; high performance; mobile agent; checkpointing and roll-back recovery; information retrieval; content delivery service;