Research on Autonomic Recovery Mechanisms for Distributed Mission-Critical Systems

【Author】 Ye Haizhi (叶海智)

【Advisor】 Wang Huiqiang (王慧强)

【Author Information】 Harbin Engineering University, Computer Application Technology, 2010, Ph.D.

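【Summary】 As technology advances and application requirements change, the availability demanded of distributed mission-critical systems keeps rising: such systems are expected not only to preserve the integrity of critical business data, but also to run without interruption or, when a failure does occur, to recover automatically in the shortest possible time. However, as the variety of system functions and the structural complexity keep growing, and given malicious attacks, software defects and similar factors, failure events occur frequently and failure scenarios are diverse and unpredictable. This makes tracing, analyzing and recovering from the root causes of failures exceptionally difficult, and creates an urgent need for systems that can detect failures themselves, make intelligent recovery decisions for different failure scenarios, and carry out self-recovery. Against this background, research on autonomic recovery mechanisms for distributed mission-critical systems aims to combine the recently proposed autonomic computing technology with detection techniques, recovery techniques and decision-making methods so that, through sound design, a system can recover by itself with little human intervention, ensuring the availability and continuity of its application services. At present, autonomic computing is still in its infancy, and work applying it to failure recovery in distributed mission-critical systems is scarce; many questions remain open, such as how to build an autonomic recovery framework and how to perform failure detection, recovery decision-making and recovery execution. Accordingly, this thesis takes improving the system's self-recovery capability as its goal, focuses on the recovery of application services, and follows the thread of failure detection, decision-making and recovery methods for application components and the runtime environment in an in-depth study of the system's autonomic recovery mechanism. First, in line with the research goal and the characteristic requirements of the system, and starting from the basic ideas of autonomic computing, a framework model for system autonomic recovery, DARA (DMCS Autonomic Recovery Architecture), is constructed. The framework model is divided into a knowledge layer, a management layer and a target layer, and as a whole forms an autonomic recovery control loop of "failure detection, recovery decision, recovery execution". Supported by recovery management knowledge consisting of a system entity model, a state model and recovery strategies, it can effectively reduce the complexity of system recovery management. In addition, π-calculus is introduced to formally describe and verify the model, demonstrating its soundness. Second, failure detection for system autonomic recovery is studied from two aspects: the detection method and the message passing mechanism. For the detection method, to meet the requirements for accurate failure detection in the runtime environment and for locating the root cause of a failure, a hybrid-mode detection method, A-Hybrid, is proposed. The method uses information such as the server model and the host model to detect and locate failed objects. For message passing, to meet the requirement for loosely coupled message exchange between detectors and detected objects, a detection message passing mechanism based on publish/subscribe is given. Experimental results show that A-Hybrid not only detects failed objects with high accuracy but also locates the root cause of a failure, providing a reliable basis for subsequent recovery decisions for the runtime environment. Third, decision-making methods for system autonomic recovery are studied for both application components and the runtime environment. For application components, to address the inefficient decision-making caused by strongly correlated failures, a recovery decision-making method based on reboot tree optimization is given. The method first computes the failure relevancy degree (FRD) between components and merges components with a high relevancy degree into a single reboot group, thereby optimizing the reboot tree; it then produces a recovery plan for the suspected failed components from the reboot tree and the detection results. Application results show that the method makes decisions efficiently and supports rapid recovery of application components. For the runtime environment, in view of the diversity of its failure scenarios, a decision-making method based on intelligent planning is proposed. The dependencies among entities in the environment are used to describe the domain, the initial state and the goal state are determined from the detection results and the target policy, and a planner then generates the recovery plan. Experimental results show that the method can intelligently produce a corresponding recovery plan for different failure scenarios, laying the foundation for recovering the environment. Finally, failure recovery methods are studied for both application components and the runtime environment. For application components, with transient failure recovery as the focus and the high availability of application services as the goal, a multi-granularity microreboot recovery method is proposed. By partitioning reboot objects and wrapping them as rebootable elements of different granularity, the method makes reboot-based recovery more effective. Experimental results show that, compared with ordinary microreboot methods, the method can reduce reboot recovery time by 48%, significantly improving the availability of application services. For the runtime environment, a script-based recovery method is given, with emphasis on the correspondence between recovery plans and scripts; the time needed to generate recovery plans and their scripts under different degrees of runtime environment failure is studied experimentally, so that flexible environment recovery schemes can be provided for the differing requirements of specific mission-critical systems.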

【Abstract】 With the development of information technology and changing application requirements, the availability demands placed on distributed mission-critical systems keep rising: such systems must not only guarantee the integrity of critical mission data, but also run without interruption or recover automatically within a short time when failures happen. However, with increasing system scale and complexity, and with unavoidable faults such as malicious attacks and software bugs, system failures occur frequently, and tracking, analyzing and recovering from them has become extremely difficult. Systems therefore urgently need the abilities of self-monitoring, self-diagnosis, intelligent decision-making and self-recovery across different failure scenarios. Autonomic computing provides a new research approach to this problem. By combining autonomic computing with detection technologies, recovery technologies and decision-making methods, a distributed mission-critical system can recover from failures automatically, which safeguards its high availability. However, autonomic computing is still in its infancy, and its application to failure recovery in distributed mission-critical systems has been little studied. Many basic issues, such as how to build the autonomic recovery architecture of a system and how to implement autonomic failure detection, decision-making and recovery, still need careful study. This thesis therefore studies the autonomic recovery mechanism in depth in order to improve the system's self-recovery ability. Firstly, to meet the system's specific requirements, the architecture DARA (DMCS Autonomic Recovery Architecture) is proposed based on the autonomic computing concept. In this architecture, autonomic recovery is divided into a knowledge level, a management level and a target level, and the functionality of each level is analyzed. 
From the architecture perspective, a "failure detection, decision-making, recovery" control loop is formed, which lowers the complexity of autonomic recovery. A system recovery management knowledge base, comprising a system entity dependency model, a state model and management strategies, is then built to support failure detection and recovery. The architecture is formalized and validated using π-calculus to demonstrate its soundness. Secondly, failure detection for distributed mission-critical systems is studied from two aspects: the detection method and the message transfer mechanism. For detection, to meet the accuracy requirements of runtime environment failure detection, the A-Hybrid detection method is proposed. This method detects and locates failed objects by applying an application configuration model, a server model and a host model. For message transfer, to meet the requirement for loosely coupled messaging, a detection message transfer mechanism based on publish/subscribe is proposed. Experimental results show that, compared with other detection methods, A-Hybrid detects failures accurately and identifies the specific failed objects. Thirdly, decision-making methods for autonomic recovery are studied for both application components and the runtime environment. For application components, to address the low decision-making efficiency caused by strongly correlated failures, a recovery decision-making method based on reboot tree optimization is proposed. To optimize the reboot tree, the failure relevancy degree (FRD) between components is computed and components with high failure correlation are merged into a single reboot group; a recovery plan is then made from the reboot tree and the detection results. Examples show that, compared with the method without reboot tree optimization, this method achieves higher efficiency and shorter recovery time. 
At the same time, considering the diversity of runtime environment failure scenarios, a decision-making method based on AI planning is put forward. The domain is described using the dependencies among objects in the runtime environment, the initial state and goal state are determined from the detection results and the target policy, and a planner then generates the recovery plan. Experimental results show that this method can effectively generate a suitable recovery plan for different failure scenarios. Finally, the implementation of autonomic recovery for mission-critical systems is studied from two aspects: application components and the runtime environment. For application components, a multi-granularity microreboot method is proposed for transient failure recovery in order to achieve high availability; it groups and wraps reboot objects into microreboot elements of different granularity. Experimental results show that this method can reduce reboot recovery time by 48% compared with a conventional microreboot method, which helps achieve high availability. For the runtime environment, a script-based recovery method is put forward, focusing on the correspondence between recovery plans and scripts. In addition, the time needed to generate recovery plans and their scripts under different degrees of runtime environment failure is studied experimentally, so that flexible recovery schemes can be provided for the differing requirements of specific mission-critical systems.
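
The reboot-group idea in the abstract (merge components whose failures are strongly correlated, then restart the group as a unit) can be illustrated with a small sketch. The abstract does not say how the failure relevancy degree is computed, so the code below assumes a simple co-failure ratio; the function names `failure_relevancy` and `reboot_groups`, the log format and the threshold are all hypothetical and are not taken from the thesis.

```python
from collections import defaultdict
from itertools import combinations


def failure_relevancy(failure_log):
    """Return {(a, b): frd} for every component pair that failed together.

    ASSUMPTION: FRD is approximated here by the Jaccard ratio
    (episodes where both failed) / (episodes where either failed).
    The thesis may define FRD differently.
    """
    fail_count = defaultdict(int)
    co_fail = defaultdict(int)
    for episode in failure_log:
        failed = sorted(set(episode))
        for c in failed:
            fail_count[c] += 1
        for a, b in combinations(failed, 2):
            co_fail[(a, b)] += 1
    return {
        (a, b): n / (fail_count[a] + fail_count[b] - n)
        for (a, b), n in co_fail.items()
    }


def reboot_groups(components, frd, threshold):
    """Merge components whose pairwise FRD >= threshold, using union-find."""
    parent = {c: c for c in components}

    def find(c):
        # Walk to the root, halving the path as we go.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for (a, b), score in frd.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for c in components:
        groups[find(c)].add(c)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical failure log: each entry lists components that failed together.
log = [["A", "B"], ["A", "B"], ["A", "B", "C"], ["C"], ["D"]]
frd = failure_relevancy(log)
groups = reboot_groups(["A", "B", "C", "D"], frd, threshold=0.5)
```

With this made-up log, A and B always fail together and end up in one reboot group, while C and D remain separate. Union-find is used only because it makes transitive merging simple; it is a design choice of this sketch, not something the abstract specifies.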
