
Software Fault Tolerance for Hardware Faults

Software Implemented Hardware Fault Tolerance

【Author】 Gao Long

【Advisor】 Yang Xuejun

【Author information】 National University of Defense Technology, Computer Science and Technology, 2006, Ph.D.

【Subtitle】 Models, Algorithms, and Experiments

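【Abstract (translated from the Chinese)】 Space computers are the basic platform for space information processing and are of great strategic importance. In the space environment, transient hardware faults pose a prominent reliability problem for space computers. Radiation-hardened components can improve reliability, but they offer very low performance at a very high price and with high power consumption, which makes them unsuitable for building high-performance space computers for scientific computing. COTS components offer high performance at low price and low power consumption. Tolerating transient hardware faults through software techniques on COTS components can therefore provide a high-reliability, high-performance, low-cost, and low-power solution for space computers. However, no model yet describes how software affects the propagation of transient hardware faults, how well software can tolerate such faults, or what effect this capability has on the system. Software redundancy tolerates hardware faults but also brings considerable overhead, and reducing the impact of that overhead is a further problem to be solved.

This thesis first establishes a computational data flow model and, on top of it, an error flow model. By distinguishing two different types of errors and introducing six error propagation rules and two laws of error independence, we compute the probability that any datum in the error flow model is erroneous at any time. On this basis, following the essential meaning of the concept of fault tolerance, we give a probabilistic definition of a program's fault tolerance capability and analyze how it affects the fault tolerance capability and performance of a software-implemented dual-redundancy fault-tolerant system. Taking the program's fault tolerance capability as the optimization target, we propose the concept and method of improving it through equivalent transformations based on error flow analysis. Also on the basis of error flow analysis, we propose two optimizations of fault tolerance algorithms that markedly increase performance and reduce power consumption.

The main innovations of this thesis are as follows:

1. By introducing the concepts of atomic data and computational relations, we establish a computational data flow model that describes the temporal and spatial links that computation creates between storage units. By introducing an error probability function for atomic data and an error propagation probability function for computational relations, we build an error flow model on top of the computational data flow model. The model gives a probabilistic description of how computational relations propagate hardware errors and yields the probability that any storage unit is erroneous at any time. Together these form a theoretical framework for error flow analysis.

2. Based on error flow analysis, we propose the concept of a program's fault tolerance capability, give a method for computing it, and put forward the view that tolerating errors is an intrinsic property of a program. Taking this capability as the optimization target, we propose a method that improves it purely through equivalent transformations based on error flow analysis, without any explicit redundancy. We also apply error flow analysis to describe how to build a dual-redundancy fault-tolerant system and analyze how improving the fault tolerance capability of a single software replica affects such a system.

3. We propose the concept of the key subgraph of an error flow graph, which has a critical influence on a program's fault tolerance capability, and give methods based on error flow analysis for generating the key subgraph from key nodes and from key paths. We further propose a partial-redundancy fault tolerance algorithm that replicates only the key subgraph. Compared with the EDDI algorithm, it improves IPC by 10%, reduces execution time by 15%, and reduces energy consumption by 10%, at the cost of a small loss of error coverage.

4. By analyzing the performance and power losses that the EDDI algorithm incurs from its inserted branch instructions, we propose an error flow compression algorithm that reduces the number of branch instructions through additional computation. Compared with the EDDI algorithm, it improves performance by 12%, reduces execution time by 10%, and reduces energy consumption by 5%, at the cost of a small increase in error latency.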
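The idea of propagating error probabilities through a data flow graph can be illustrated with a small sketch. The abstract does not state the six propagation rules or the two independence laws, so the code below substitutes a simple noisy-OR model as an assumption: each datum has an intrinsic error probability, each computational relation passes an input error on with some probability, and all error events are treated as independent. The function name `error_probabilities` and its data layout are hypothetical and are not taken from the thesis.

```python
def error_probabilities(nodes, edges, intrinsic):
    """Estimate the error probability of each datum in a data flow DAG.

    nodes     -- node names in topological order (inputs before outputs)
    edges     -- dict: node -> list of (predecessor, propagation_probability)
    intrinsic -- dict: node -> probability that a fault strikes the node itself

    Assumption (not from the thesis): noisy-OR combination with all
    error events independent.
    """
    prob = {}
    for node in nodes:
        # Probability that the node is NOT hit by a fault of its own.
        correct = 1.0 - intrinsic.get(node, 0.0)
        for pred, q in edges.get(node, []):
            # An input error reaches the node with probability q * prob[pred];
            # the node stays correct only if no input error reaches it.
            correct *= 1.0 - q * prob[pred]
        prob[node] = 1.0 - correct
    return prob


# c = a + b: an addition passes on any input error, so q = 1.0 on both edges.
p = error_probabilities(
    nodes=["a", "b", "c"],
    edges={"c": [("a", 1.0), ("b", 1.0)]},
    intrinsic={"a": 0.1, "b": 0.2},
)
print(round(p["c"], 4))  # 0.28, i.e. 1 - 0.9 * 0.8
```

A propagation probability below 1.0 would model an operation that can mask an input error, such as a logical AND with zero. The independence assumption breaks down when two paths from the same faulty datum reconverge, which is presumably where the thesis's independence laws become relevant.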

【Abstract】 Onboard computers are very important to information processing in space. In space environments, transient hardware faults have a great impact on onboard computers. Radiation-hardened components can improve system reliability, but their performance lags several generations behind that of COTS components. Radiation-hardened components are very expensive because of their limited availability, and they often consume more power, take up more space, and weigh more. They are not suitable for building high-performance space computers. Compared with radiation-hardened components, COTS components offer very high performance, lower price, and lower power dissipation. Software-implemented hardware fault tolerance on COTS components can provide space computers with high reliability, high performance, low cost, and low power dissipation.

Several problems remain, however: how hardware faults propagate within software, how the fault tolerance capability of software can be measured, and what effect that capability has on system reliability. Using software to tolerate hardware faults also incurs great overhead, and how to minimize this overhead is still an open problem.

In this thesis, we first set up a computational data flow model and, based on it, an error flow model. By categorizing errors into two kinds and introducing six rules of error propagation and two rules of error independence, we obtain the error probability of any datum at any time. According to the concept of fault tolerance, we define the fault tolerance capability of a program, and we analyze the consequences that a program's fault tolerance capability has for the fault tolerance and performance of a system. Taking fault tolerance capability as the target, we suggest that equivalent transformations based on error flow analysis can improve the fault tolerance capability of a program at compile time. Finally, we give two optimized fault tolerance algorithms that improve performance and reduce power dissipation at the same time.

Our major contributions are as follows:

1. We define the concepts of atomic data and computational relations to describe the relations between registers or storage units that are created by the computations in a program, and we set up the computational data flow model. We define the error probability function of atomic data and the error propagation probability function of computational relations, with which we set up the error flow model on top of the computational data flow model. The error flow model describes probabilistically how errors propagate through computational relations. By analyzing the error flow model, we can compute the error probability of any register or other storage unit at any time. Together these form a theoretical framework for error flow analysis.

2. To measure a program's fault tolerance, we define the concept of fault tolerance capability based on error flow analysis and give a method of error flow analysis for calculating the fault tolerance capability of any program. We suggest a method to improve a program's fault tolerance capability through error flow analysis and equivalent transformation, without any explicit redundancy. Finally, we apply error flow analysis to describe how to build a dual-redundancy fault-tolerant system and to describe the effects on such a system of improving the fault tolerance capability of a single program replica.

3. We propose the concept of the key subgraph of an error flow graph, which has a critical effect on a program's fault tolerance capability, and give methods to generate the key subgraph from key nodes or key paths. We also propose a partial-redundancy fault tolerance algorithm that replicates only the key subgraph instead of the whole error flow graph. Compared with EDDI, partial redundancy improves IPC by 10%, reduces execution time by 15%, and reduces power dissipation by 10%, at the cost of a very small loss of error coverage.

4. Based on error flow analysis, we propose an error flow compression algorithm to reduce the branch instructions inserted by the EDDI algorithm, which have a great impact on performance and power dissipation. Compared with EDDI, the error flow compression algorithm improves IPC by 12%, reduces execution time by 10%, and reduces power dissipation by 5%, at the cost of a very small increase in error latency.
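The trade-off in the fourth contribution, fewer checking branches in exchange for longer error latency, can be made concrete with a toy model. The abstract gives no details of the error flow compression algorithm, so the sketch below is only one plausible reading and is an assumption throughout: a duplicated computation is checked either after every step (EDDI-style) or by folding all mismatches into an accumulator with extra arithmetic and branching once at the end. The functions `run_checked` and `run_compressed` and the fault injection scheme are hypothetical, and the branch counters model only the comparison branches, not the loop itself.

```python
def run_checked(values, fault_at=None):
    """Duplicate a running sum and compare the copies after every step.

    Returns (result, check_branches, detection_step).
    """
    main = shadow = 0
    branches = 0
    for i, v in enumerate(values):
        main += v
        shadow += v
        if i == fault_at:
            main ^= 1 << 3          # injected transient bit flip in one copy
        branches += 1               # one comparison branch per step
        if main != shadow:
            return None, branches, i
    return main, branches, None


def run_compressed(values, fault_at=None):
    """Same duplicated sum, but mismatches are accumulated arithmetically
    and tested with a single branch at the end."""
    main = shadow = 0
    diff = 0
    for i, v in enumerate(values):
        main += v
        shadow += v
        if i == fault_at:
            main ^= 1 << 3
        diff |= main ^ shadow       # extra computation replaces a branch
    if diff != 0:                   # the only comparison branch
        return None, 1, len(values) - 1
    return main, 1, None


data = [1, 2, 3, 4]
print(run_checked(data, fault_at=1))     # (None, 2, 1): detected at step 1
print(run_compressed(data, fault_at=1))  # (None, 1, 3): detected at the end
```

In this model the compressed variant executes one check branch instead of one per step, but a fault at step 1 is only reported after step 3, which mirrors the latency cost named in the abstract.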
