分布式系统自愈调控关键技术研究

Research on Key Technologies for Self-healing Scheduling in Distributed Systems

【Author】 Lu Xu

【Advisor】 Wang Huiqiang

【Author information】 Harbin Engineering University, Computer System Architecture, 2009, Master's thesis

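【Abstract (Chinese)】 Self-healing scheduling is a necessary means of building trustworthy computer systems and an important guarantee of high system availability. Traditional failure recovery techniques for distributed systems rely mainly on costly redundancy and manual administration. As the difficulty and cost of manual repair after a failure grow, how to repair failures autonomously without human intervention and keep the system highly available has become an active research topic. This thesis addresses the main problems in self-healing scheduling for distributed systems, and its contributions are as follows. First, it proposes a partially observable Markov decision process (POMDP) model of failure recovery in distributed systems and solves it with fast informed bound (FIB) value iteration, which addresses the generation of recovery policies under inaccurate failure detection. To measure how close the recovery policy is to optimal, the FIB iteration is first proved to converge and to be a contraction mapping; bounds on the optimal solution of the POMDP model are then derived, and the maximum bound difference of the optimal solution is used to estimate the error of the near-optimal solution. In simulation experiments on a network security situation awareness system, the POMDP policy performs better than the other recovery policies for failure recovery under inaccurate detection. Second, it proposes a microreboot-based model for fast recovery from task failures in distributed systems. Unlike existing recovery models, this model accounts not only for recovery time but also for the reliability cost of recovery, and is therefore more realistic and more precise. Building on this model, a real-time recovery algorithm based on extended Bayesian analysis is proposed. The algorithm aims to improve system availability and takes reliability cost into account when time priorities are equal. The recoverability of the algorithm is then proved, which provides a theoretical basis for autonomous failure recovery. Third, to address failure prediction, the thesis proposes a failure prediction method based on manifold learning. It applies nonlinear dimensionality reduction to failure feature extraction and proposes a feature extraction method based on supervised Hessian locally linear embedding, enabling automatic extraction of intrinsic failure features and failure prediction without human intervention. A failure prediction testbed was built around a small LAN instant messaging system, and preliminary experimental results indicate that the method is feasible.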

【Abstract】 Self-healing scheduling is critical to the dependability of computer systems and an important guarantee of high availability. Traditional failure recovery techniques depend heavily on redundancy and on administrators’ domain knowledge. Because manual failure recovery is costly and difficult, self-healing has become an important research topic in dependable computing. This thesis addresses the main problems in this field, and its contributions are summarized as follows. First, to address the generation of recovery policies in the presence of inaccurate failure detection, a failure recovery model for microrebootable distributed systems based on discounted partially observable Markov decision processes (POMDPs) is presented, and reasonable recovery policies are generated by solving the POMDP model. Because an exact solution is computationally expensive, an approximate value-function method, the fast informed bound (FIB), is used to obtain near-optimal policies. In addition, lower and upper bounds on the optimal value function are derived, and the maximum bound difference is used to estimate the error of the near-optimal value function. Simulation experiments on a realistic network security situation awareness system show that the proposed model can be solved effectively and that the resulting policies outperform the other recovery policies compared. Second, a microreboot-based model of task failure recovery in distributed systems is presented. Compared with other models, it takes into account not only recovery time but also the reliability cost of recovery, and is therefore more realistic and more precise. A corresponding real-time task failure recovery algorithm based on extended Bayesian analysis is presented, which takes reliability cost into account when recovery time priorities are equal. To provide a theoretical foundation for failure recovery, we prove the recoverability of the algorithm. Finally, we present a failure prediction method based on manifold learning. To extract failure features for prediction, we apply a nonlinear dimensionality reduction algorithm, supervised Hessian locally linear embedding, and then classify the extracted features with a k-nearest-neighbor classifier. The experimental results show that the manifold learning approach can effectively find the intrinsic features of failures, which makes failure prediction based on manifold learning feasible.
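
The FIB value iteration named in the abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the two-state model (healthy, failed), the two actions (wait, microreboot), the noisy detector probabilities, the costs, and the function names fib_backup, solve_fib, and greedy_action are hypothetical and are not taken from the thesis. The backup is the standard fast informed bound update, which for a reward-maximizing discounted POMDP yields an upper bound on the optimal value function.

```python
# Toy recovery POMDP (all numbers are illustrative assumptions).
# States: 0 = healthy, 1 = failed. Actions: 0 = wait, 1 = microreboot.
# Observations: 0 = "ok", 1 = "alarm", produced by an inaccurate detector.
GAMMA = 0.95

# T[a][s][s2]: probability of moving from state s to s2 under action a.
T = [
    [[0.95, 0.05], [0.00, 1.00]],  # wait: may fail; a failure persists
    [[1.00, 0.00], [0.90, 0.10]],  # microreboot: usually repairs a failure
]
# O[a][s2][o]: probability of observing o after landing in state s2.
O = [
    [[0.85, 0.15], [0.20, 0.80]],
    [[0.85, 0.15], [0.20, 0.80]],
]
# R[s][a]: immediate reward (negative cost) of action a in state s.
R = [
    [0.0, -1.0],    # healthy: waiting is free, a reboot costs a little
    [-10.0, -2.0],  # failed: waiting is expensive, a reboot is cheaper
]


def fib_backup(alpha, T, O, R, gamma):
    """One FIB backup; alpha[a][s] holds one alpha-vector per action."""
    n_actions, n_states = len(T), len(T[0])
    n_obs = len(O[0][0])
    new_alpha = []
    for a in range(n_actions):
        row = []
        for s in range(n_states):
            future = 0.0
            for o in range(n_obs):
                # The max is taken per (state, observation) pair, which is
                # what distinguishes FIB from the looser QMDP bound.
                future += max(
                    sum(T[a][s][s2] * O[a][s2][o] * alpha[a2][s2]
                        for s2 in range(n_states))
                    for a2 in range(n_actions)
                )
            row.append(R[s][a] + gamma * future)
        new_alpha.append(row)
    return new_alpha


def solve_fib(T, O, R, gamma, tol=1e-8, max_iter=10000):
    """Iterate the FIB backup until the largest change falls below tol."""
    n_actions, n_states = len(T), len(T[0])
    r_max = max(max(row) for row in R)
    # Start from the optimistic constant r_max / (1 - gamma).
    alpha = [[r_max / (1.0 - gamma)] * n_states for _ in range(n_actions)]
    for _ in range(max_iter):
        new_alpha = fib_backup(alpha, T, O, R, gamma)
        diff = max(abs(new_alpha[a][s] - alpha[a][s])
                   for a in range(n_actions) for s in range(n_states))
        alpha = new_alpha
        if diff < tol:
            break
    return alpha


def greedy_action(belief, alpha):
    """Return (action, value) maximizing the belief-weighted alpha-vector."""
    values = [sum(b * v for b, v in zip(belief, vec)) for vec in alpha]
    best = max(range(len(values)), key=values.__getitem__)
    return best, values[best]


if __name__ == "__main__":
    alpha = solve_fib(T, O, R, GAMMA)
    for belief in ([1.0, 0.0], [0.5, 0.5], [0.0, 1.0]):
        print(belief, greedy_action(belief, alpha))
```

Because the backup is a contraction with modulus GAMMA, the loop converges from any starting point; the optimistic start is only a convenient choice. A belief here is a probability vector over the two states, so greedy_action([0.0, 1.0], alpha) asks what to do when the system is known to have failed.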

  • 【Classification number】TP316.4
  • 【Citation count】4
  • 【Download count】251