面向分布式关键任务系统的自愈调控技术研究

Research on Self-Healing Regulation Technology for Distributed Mission-Critical Systems

【Author】 Lu Xu (卢旭)

【Supervisor】 Wang Huiqiang (王慧强)

【Author information】 Harbin Engineering University, Computer Application Technology, 2011, PhD

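【Abstract (translated from Chinese)】 The heterogeneity and complexity of distributed mission-critical systems, together with their dynamically changing operating environments, inevitably lead to system failures, task deviation or interruption, crashes, and hangs, causing major economic losses and even casualties. These factors also make it increasingly difficult to manage and recover such systems manually and to keep tasks running without interruption. Against this background, autonomic computing, whose core research goal is self-management capability, has attracted wide attention and has been studied and applied in depth in many fields. Self-healing regulation is one of the basic key technologies of autonomic computing. For distributed mission-critical systems it provides basic system design functions such as failure monitoring and prediction, self-healing regulation policy generation, and critical task scheduling, and it plays an important role in ensuring the reliability and sustainability of critical task execution.

Addressing the mission continuity requirement of mission-critical systems, this dissertation studies the key technologies of self-healing regulation for distributed mission-critical systems and their application. Starting from a discussion of overall design principles, it first identifies the basic principles to be considered in the overall design of self-healing regulation and gives a comprehensive evaluation metric system for the design process. On this basis it proposes an overall self-healing regulation architecture and describes its design concepts and key implementation techniques in detail. For the formal modeling of critical task execution, it uses the stateful π-calculus to describe the semantics of critical task execution and switching and verifies the critical task execution logic, which provides a theoretical guarantee of feasibility and soundness for the subsequent research on key self-healing regulation technologies.

Dynamic generation of self-healing regulation policies is the core of self-healing regulation research for distributed mission-critical systems. A policy-based self-healing regulation pattern is proposed, the basic representation of self-healing regulation policies is described, and the policy classification and simplification steps used in dynamic policy management are given. Because failure detection mechanisms have low accuracy and faults are hard to locate, a self-healing regulation policy update algorithm based on partially observable Markov decision processes (POMDP) is proposed; the POMDP policy is solved with an approximate iterative method, and a theoretical analysis of the convergence of the iteration is given. The simulation experiments first compile statistics on the effect of recovery policies in the LANL (Los Alamos National Lab) failure data, then compute the iteration and convergence speed of policy solving and compare the recovery effect of several types of self-healing policy. The results show that, compared with fixed policies, the POMDP policy iterates faster and achieves shorter recovery times under inaccurate failure detection.

Data analysis and prediction for self-healing regulation is a necessary condition for failure self-healing in distributed mission-critical systems. Because nonlinearly correlated failure data are high-dimensional and sparse, a co-clustering algorithm for nonlinearly correlated failure events is first proposed; it uses the difference in mutual information loss as its metric, and its convergence within a finite number of iterations is analyzed theoretically. Then, for numerical failure data, the supervised locally linear embedding algorithm is used for dimensionality reduction, and failures are anticipated through failure pattern recognition. The experiments first compare the clustering quality and convergence speed of different algorithms on failure data sets, then collect system state metrics in faulty and normal states and analyze prediction performance. The results show that the proposed analysis method for nonlinearly correlated failure data clusters failure data objects effectively and that the failure predictions based on locally linear embedding can support decisions on proactive recovery operations.

The self-healing scheduling mechanism for critical tasks is an important guarantee for the design and implementation of self-healing regulation in distributed mission-critical systems. Given the randomness of failure occurrence and the continuity required of critical tasks, and following the guiding principle of "schedule first, optimize later", a critical task scheduling scheme based on DAG task reconstruction and migration is proposed. First, the directed acyclic graph (DAG) of correlated tasks is regenerated, and a DAG dynamic reconstruction algorithm is proposed that transforms correlated tasks into layered DAG tasks. Then the migration paths of critical tasks are computed, a theoretical analysis of deadlock avoidance for migratable tasks is given, and migrated tasks are scheduled ahead of time onto currently idle resources in order to shorten task execution time. The simulation experiments measure the speedup of the task migration scheme and the wait-for-recovery scheme under three types of fault injection. The results show that the task migration scheme achieves good scheduling quality under elastic load and unknown faults, providing a reasonable and feasible technical solution for the uninterrupted operation of mission-critical systems.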

【Abstract】 The complexity and heterogeneity of Distributed Mission-Critical Systems (DMCS), together with their dynamic application environments, inevitably lead to system failures, mission suspension, interrupted execution, and even system crashes, causing heavy economic losses, loss of life, and other serious consequences. DMCS failures also make manual management and recovery more difficult. Therefore, autonomic computing, whose core goal is self-management, has been studied in various fields. Self-Healing Regulation Technology (SHRT) is one of the critical technologies of autonomic computing. DMCS-oriented self-healing regulation provides fundamental functions such as failure prediction, self-healing policy generation, and critical task scheduling, which have a significant influence on the dependability and sustainability of critical mission execution. Aiming at the dependability and sustainability requirements of critical mission execution, this dissertation systematically studies self-healing regulation technology and its application.

The research starts from a discussion of overall design principles. First, the critical problems in the overall design of SHRT are analyzed and a comprehensive evaluation metric system is proposed. Second, the SHRT architecture is proposed and its critical implementation techniques are analyzed. For the formal validation of the critical task execution flow, the stateful π-calculus is applied to describe the semantics of task execution and switching. Moreover, the critical task execution logic is validated, which provides assurance of theoretical feasibility and soundness.

Dynamic generation of self-healing regulation policies is the central research topic of DMCS-oriented self-healing regulation technology. Based on the self-healing regulation architecture, a policy-based self-healing regulation pattern is proposed, and the basic expression format and logic syntax of policies are discussed. In addition, a policy simplification and classification approach is proposed for dynamic policy management. To address inaccurate failure detection and diagnosis, a self-healing policy regeneration algorithm based on Partially Observable Markov Decision Processes (POMDP) is proposed, and the policy convergence is analyzed theoretically. In the experiments, Los Alamos National Laboratory (LANL) failure data are used to measure the real effect of recovery policies, which shows the necessity of self-healing regulation technology. The simulation experiments then compute the iteration and convergence speed of policy solving and compare the performance of different types of self-healing policy. The results indicate directions for generating and optimizing self-healing policies.

Data analysis and prediction for self-healing regulation is a necessary condition for DMCS self-healing. Because the nonlinearly correlated failure data of high-performance computer systems are high-dimensional and sparse, an information-theoretic co-clustering algorithm for nonlinearly correlated failure data is proposed. The algorithm uses mutual information entropy as its measure, and its convergence and local optimality are proved theoretically. Second, a manifold learning algorithm, supervised locally linear embedding (SLLE), is applied for feature extraction. The experiments first compare the clustering quality of different methods on LANL data, then collect system performance metrics under fault injection and in the normal state, and compare failure prediction performance. The experimental results on labeled failure data show that the co-clustering algorithm outperforms the other clustering algorithms and is sound and effective for discovering nonlinearly correlated failure patterns. The failure analysis and SLLE-based prediction results demonstrate that the method can help to predict underlying failures.

Critical task scheduling for self-healing regulation in DMCS is a significant assurance for SHRT design and implementation. Taking the randomness of failures and the continuity of critical task execution into consideration, and in order to schedule failed tasks rationally, a critical task scheduling method based on Directed Acyclic Graph (DAG) task reconstruction and migration is proposed, following the principle of scheduling first and optimizing afterwards. First, the DAG of correlated tasks is regenerated with the proposed DAG dynamic reconstruction algorithm, which transforms the correlated tasks into layered DAG tasks. Then the critical task migration route is computed and a deadlock avoidance analysis for migratable tasks is provided. Migrating critical tasks to currently idle resources reduces task execution time markedly. The simulation experiments test the speedup of the task migration method and the waiting-recovery method with three kinds of faults injected. The results show that the task migration method achieves better scheduling quality under flexible load and unknown fault injection.
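
The idea of turning correlated tasks into a layered DAG can be illustrated with a generic sketch. The code below is not the dissertation's DAG dynamic reconstruction algorithm, which the abstract does not specify; it assumes standard longest-path levelling over a topological order (Kahn's algorithm), and the function name `layer_dag` and the task names are hypothetical. Tasks in the same layer have no dependencies on each other, so they are candidates for running on different idle resources.

```python
from collections import defaultdict, deque


def layer_dag(tasks, deps):
    """Group tasks into layers by longest dependency chain (illustrative only).

    tasks: iterable of task names (hypothetical identifiers).
    deps:  iterable of (before, after) pairs; both ends must appear in tasks.
    Returns a list of layers, each a sorted list of task names.
    Raises ValueError if the dependencies contain a cycle.
    """
    successors = defaultdict(list)
    indegree = {t: 0 for t in tasks}
    for before, after in deps:
        successors[before].append(after)
        indegree[after] += 1

    level = {t: 0 for t in indegree}
    # Kahn's algorithm: start from tasks that have no predecessors.
    ready = deque(t for t, d in indegree.items() if d == 0)
    visited = 0
    while ready:
        task = ready.popleft()
        visited += 1
        for succ in successors[task]:
            # A task sits one layer below its deepest predecessor.
            level[succ] = max(level[succ], level[task] + 1)
            indegree[succ] -= 1
            if indegree[succ] == 0:
                ready.append(succ)

    if visited != len(indegree):
        raise ValueError("dependency graph contains a cycle")

    layers = defaultdict(list)
    for task, lvl in level.items():
        layers[lvl].append(task)
    return [sorted(layers[lvl]) for lvl in sorted(layers)]


# Diamond-shaped example: a precedes b and c, which both precede d.
example = layer_dag(["a", "b", "c", "d"],
                    [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")])
print(example)
```

In this sketch, tasks `b` and `c` land in the same layer, which is the property a migration scheme could exploit when one of their host resources fails.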
