

Research on Survivability Enhancing Techniques of Grid Applications

【Author】 Wang Shupeng (王树鹏)

【Advisor】 Yun Xiaochun (云晓春)

【Author information】 Harbin Institute of Technology, Computer System Architecture, 2007, Ph.D.

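【Abstract (translated from the Chinese)】 The emergence and development of grid technology gives users a large pool of computing resources for running large-scale applications, which creates great opportunities. In a dynamic, complex grid system, however, malicious attacks or hardware faults can cause grid resources to fail, and the failure rate is far higher than in traditional distributed environments. This poses a serious challenge for executing applications in a grid: tasks assigned to remote sites may be unable to run properly because grid resources fail. Large-scale tasks are especially exposed, since they occupy many resources and run for a long time, so resource failures may prevent them from ever completing. This dissertation therefore studies how to keep grid applications executing continuously, without interruption, in a complex and dynamic grid environment. It applies the idea of system survivability to the grid, proposes the concept of grid application survivability, studies a survivability analysis method and a survivability life-cycle model for grid applications, and analyzes the key technologies that support the survivability model. This work has practical significance for advancing grid technology and making it usable in practice. Taking the survivability of grid applications as its research goal, the dissertation examines the following aspects in depth. It first introduces the background of grid computing environments and grid applications and analyzes the challenges that applications face when executing in a grid; it then reviews the state of research on grid security and system survivability; finally it clarifies the significance and necessity of studying grid application survivability. On this basis it presents the system models used in the study (a grid model, a failure model, and an application model), defines grid application survivability, studies a survivability analysis method for grid applications, presents a survivability life-cycle model based on offline prevention and online reconfiguration, and analyzes the key issues that affect survivability within this model, laying the foundation for the later chapters. To realize the offline prevention mechanism of the survivability model, the dissertation proposes a survivability scheduling objective, implements a local cost function that considers both survivability and makespan, and designs scheduling algorithms that consider both objectives for grid independent-task applications and for grid workflow applications. These algorithms avoid scheduling grid applications onto computing resources with high failure rates and so improve application survivability to some extent. To strengthen state detection in the online response mechanism, reduce the false-alarm and miss rates, and shorten detection time, the dissertation studies failure detection in grid environments. Existing failure detection algorithms use adaptive prediction to follow variations in heartbeat transmission delay and thereby avoid the false alarms those variations cause, but none of them considers heartbeat loss, which drives their false-alarm rates very high. The dissertation therefore proposes a failure detection algorithm based on PUSH and PULL. Built on an unreliable, semi-synchronous distributed system model, it resolves the excessive false-alarm rate caused by heartbeat loss and shortens detection time. Finally, failure response for grid applications is implemented through replication. To address the poor application transparency and generality of current replication mechanisms, the dissertation studies a transparent, general-purpose replication mechanism. It proposes a message proxy mechanism at the network data-flow level and a flexible configuration mechanism, presents an asynchronous active replication protocol and a failure response protocol, and implements a transparent, general-purpose replication proxy that synchronizes state among the replicas in a replication group and performs response and recovery after the primary replica fails. Building on this work, and based on the survivability model that combines offline prevention with online reconfiguration, a grid application scheduling and management system is designed and implemented that makes effective use of the supporting techniques proposed above. Finally, a running example of a real grid application demonstrates the effectiveness of the proposed survivability enhancing techniques.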

【Abstract】 The emergence and development of the Grid provides large numbers of computing resources for large-scale applications. However, the dynamic and complex nature of the Grid system leads to a higher failure rate of Grid resources than in traditional distributed systems. This poses great challenges for the execution of Grid applications. Tasks allocated to Grid resources may be halted when those resources fail. Large-scale applications, which require many resources and run for a long time, are especially affected: resource failures may prevent them from completing at all. This dissertation therefore focuses on how to keep applications executing normally in a complex Grid system. It applies survivability theory to the Grid system and proposes the concept of Grid application survivability. This research on survivable Grid applications is significant for the development and practical application of Grid technologies. The main research topics are as follows.

The first part of the dissertation introduces the research background of Grid systems and Grid applications, analyzes the challenges faced by the execution of Grid applications, and clarifies the significance of research on Grid application survivability. It then reviews the state of research on Grid security and system survivability. Current research on Grid security adopts traditional security theory, and current research on system survivability focuses on traditional distributed information systems. There is no systematic research on the survivability of Grid applications.

On this basis, the dissertation introduces the system model, including the Grid model, the failure model, and the application model, and defines the survivability of Grid applications. It then proposes a survivability analysis method and a survivability life-cycle model for Grid applications, and introduces the key technologies that support survivable Grid applications.

To enable Grid applications to guard against the failure of Grid resources, the dissertation proposes a survivability scheduling objective and a cost function that considers survivability and makespan at the same time. Scheduling algorithms that consider both objectives are then proposed for Grid independent-task applications and Grid workflow applications respectively. These scheduling algorithms prevent Grid tasks from being scheduled onto Grid resources with higher failure rates.

To improve failure detection, and to reduce both its error rate and the detection time, the failure detection mechanism in the Grid environment is studied. Current failure detection algorithms adapt to variations in transmission delay through an adaptive mechanism and so reduce the detection errors caused by those variations. However, these algorithms do not consider the loss of heartbeat packets, which causes a high error rate. To solve this problem, a PUSH-and-PULL based failure detection algorithm is proposed. This algorithm is based on the semi-synchronous distributed system model and effectively reduces the high error rate of failure detection.

Finally, the failure response capability is implemented by a transparent replication mechanism. A message agent mechanism at the level of the network message flow and a flexible configuration mechanism are proposed, together with an asynchronous active replication protocol and a failure response protocol, and a transparent, general-purpose replication agent is implemented. This agent synchronizes the state of the replicas in a replica group and provides failure recovery after the primary replica fails.

Based on the above studies, a Grid application scheduling and management system is designed and implemented using the off-line defense and on-line reconfiguration techniques. The system makes effective use of the survivability enhancing techniques proposed in the earlier chapters. Finally, the effectiveness of these Grid survivability enhancing techniques is demonstrated by the execution of a real application.
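
The idea of a cost function that weighs survivability against makespan can be illustrated with a minimal sketch for independent tasks. The abstract does not give the dissertation's actual cost function or algorithm, so everything below is an assumption made for illustration: the exponential failure model, the formula `finish_time / survival_probability`, the largest-task-first greedy order, and all names (`Resource`, `local_cost`, `schedule`) are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    speed: float          # work units per second (hypothetical unit)
    failure_rate: float   # failures per second; exponential model is an assumption
    ready_time: float = 0.0  # time at which the resource becomes free


def survival_probability(resource: Resource, exec_time: float) -> float:
    """Probability that the resource stays up for exec_time seconds,
    under the assumed exponential failure model."""
    return math.exp(-resource.failure_rate * exec_time)


def local_cost(resource: Resource, work: float) -> tuple[float, float]:
    """Return (cost, finish_time) for running `work` on `resource`.

    The cost is the finish time inflated by the inverse survival
    probability, so unreliable resources look more expensive.
    This formula is an illustrative assumption, not the thesis's function.
    """
    exec_time = work / resource.speed
    finish = resource.ready_time + exec_time
    return finish / survival_probability(resource, exec_time), finish


def schedule(tasks: dict[str, float], resources: list[Resource]) -> dict[str, str]:
    """Greedily map each task (name -> work) to its lowest-cost resource,
    handling the largest tasks first."""
    plan = {}
    for name, work in sorted(tasks.items(), key=lambda kv: -kv[1]):
        best = min(resources, key=lambda r: local_cost(r, work)[0])
        _, finish = local_cost(best, work)
        best.ready_time = finish  # the resource is busy until this task ends
        plan[name] = best.name
    return plan


if __name__ == "__main__":
    pool = [Resource("flaky", speed=2.0, failure_rate=0.5),
            Resource("stable", speed=1.0, failure_rate=0.001)]
    print(schedule({"t1": 10.0}, pool))
```

In this sketch a fast resource with a high failure rate loses to a slower but reliable one, while with equal failure rates the choice reduces to ordinary makespan-driven placement. A workflow version would additionally have to respect task dependencies, which this sketch does not attempt.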


