
Research on Fault-Tolerant Compilation Techniques for Register Soft Errors (面向寄存器软错误的容错编译技术研究)

Research on Compile Techniques of Fault Tolerance for Soft Errors

【Author】 Xu Jianjun (徐建军)

【Advisor】 Tan Qingping (谭庆平)

【Author information】 National University of Defense Technology, Computer Science and Technology, 2010, Ph.D.

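【Abstract (translated)】 Since the birth of the computer, reliability has been one of the issues most in need of attention in computing research. Although the reliability of modern computers has improved greatly, increasingly complex manufacturing processes and continually expanding application domains still pose many new challenges. A soft error is a transient fault in a semiconductor circuit, usually induced by high-energy particle radiation from the external environment or by electromagnetic noise such as voltage disturbances and electromagnetic interference. Soft errors such as single event upsets caused by cosmic ray radiation have long been a major factor affecting the reliability of aerospace computers. As integrated circuit manufacturing processes continue to advance, modern processors have gained greatly in performance but have also become increasingly sensitive to soft errors. After performance and power consumption, the trustworthiness of computation under soft errors has become an increasingly serious issue. In particular, because registers are accessed frequently yet are not well protected, soft errors in registers have become one of the most critical factors affecting system reliability. Compared with hardware fault tolerance, software fault tolerance techniques for soft errors have attracted much attention because of their advantages in implementation cost and flexibility. Working from program assembly code, this dissertation studies program analysis, error detection, and compiler optimization techniques for register soft errors from the perspective of program reliability. The main work falls into the following five parts:

1. Quantitatively analyzing, from the perspective of the running program, how register soft errors affect reliability is the basis for designing and implementing efficient fault tolerance algorithms. Based on program assembly code, this dissertation proposes ASER, a static analysis method for program reliability under register soft errors. First, building on an existing static analysis method, an inter-procedural analysis framework based on summary functions improves the precision of the analysis results. Then, on the basis of register liveness analysis, a graph reachability traversal extracts the register live ranges that actually affect program execution. The ASER results show the relationship between a program's reliability under register soft errors and its own structure, together with a quantitative analysis of the relevant register live ranges. These results help in understanding the critical vulnerable points of a program and provide a theoretical basis for designing and implementing efficient error detection and recovery techniques for register soft errors.

2. ECC is an effective means of addressing soft errors, but protecting all registers with ECC is difficult in terms of power overhead, chip area, and performance. This dissertation assumes that only some registers are ECC-protected and proposes a register reassignment method, RAPP. The method first builds the interference graph of register live ranges from the ASER analysis results, and then uses a hierarchical graph coloring heuristic to assign ECC-protected registers, as far as possible, to the more critical register live ranges. Compared with existing methods, RAPP yields the most significant improvement in program reliability while keeping power overhead in check.

3. Software fault tolerance techniques for data flow errors usually rely on program duplication, that is, executing the program multiple times and comparing the results to detect and recover from errors. Instruction-level duplication has become a research focus because of its strong detection capability, transparency to the user, and amenability to optimization. However, the consistency comparison instructions in instruction duplication are the most critical factor limiting the performance of the fault-tolerant program. To address this, the dissertation proposes COID, a checkpoint optimization method for instruction duplication. Based on data flow analysis of error propagation and using system call instructions as boundaries, the method gives a way to safely remove synchronization comparison instructions while preserving the error detection rate. Fault injection and performance analysis experiments show that COID improves the average performance of instruction-duplicated programs by 12.78% without affecting the soft error detection rate.

4. Soft errors can cause control flow errors, and existing control flow checking techniques fall short in performance overhead and detection rate. Starting from the program's control flow graph, the dissertation first classifies basic blocks with a graph coloring algorithm, then proposes an effective control flow checking method, ECCFS, based on formatted signatures of basic blocks, and gives extended solutions for control flow checking within basic blocks and across procedures. Detection capability analysis and fault injection experiments show that ECCFS detects the vast majority of control flow errors. Compared with existing control flow checking methods, ECCFS has certain advantages in error detection rate and performance overhead.

5. At present, soft error tolerance techniques implemented in hardware and in software all incur overhead, to varying degrees, in performance, power, and implementation cost. For register soft errors, this dissertation proposes SISER, a compiler optimization method for enhancing program reliability. Its basic idea is to use instruction scheduling to shorten the total live intervals of registers during program execution, that is, to reduce the effective region exposed to register soft errors, thereby improving runtime reliability. Based on the results of ASER and of instruction dependence analysis, SISER gives a concrete basic block scheduling algorithm in the form of dynamic programming. Instruction scheduling experiments show that the reliability of the target programs improves by 4.41% on average with no noticeable overhead. Compared with traditional fault tolerance techniques, the most distinctive feature of SISER is that it introduces no additional time or space overhead.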

【Abstract】 Since the birth of the computer, reliability has been one of the central concerns of computer science. Although the reliability of modern computers has improved significantly, increasingly complex manufacturing processes and continually expanding application areas keep raising new reliability challenges.

Soft errors are transient faults in semiconductor circuits, caused by external radiation or electrical noise such as high-energy neutrons from cosmic rays, power glitches, and electromagnetic interference. Soft errors induced by cosmic radiation, e.g. single event upsets, have always affected the reliability of space computers. Moreover, as the scaling of VLSI technologies continues to raise performance, modern microprocessors are becoming more susceptible to soft errors. Following the performance and power walls, the dependability of computing in the presence of soft errors has emerged as a growing concern. Because register files (RFs) are accessed very frequently and are not well protected, soft errors occurring in them are among the most critical factors affecting program reliability.

Compared with hardware-implemented fault tolerance for soft errors, software-implemented methods are attractive because of their advantages in cost and flexibility. To address soft errors occurring in RFs, this dissertation focuses on techniques for program analysis, error detection, and compiler optimization. The main work is divided into the following five parts:

1. Analyzing the impact of soft errors in RFs from the perspective of the program is valuable, and it is the foundation for implementing efficient fault tolerance technologies. Based on assembly code, this dissertation proposes a static approach, named ASER, that quantitatively analyzes the effect of soft errors on the reliability of a given program. Building on a previous static method, ASER calculates the living probability of registers using an inter-procedural analysis framework of summary functions, which improves the accuracy of the final results. The concrete live ranges of registers are then extracted via a graph reachability method. Analytical experiments show that the reliability of a program is related to its own structure. Moreover, the critical factors of all involved live ranges are presented, which identify the vulnerabilities of a program under soft errors in RFs. These contributions support the implementation of efficient algorithms for tolerating soft errors.

2. ECC coding is one of the most powerful and popular architectural error protection mechanisms for mitigating soft errors. However, it is difficult to fully protect RFs with ECC because of the significant penalty in power, area, and possibly performance. This dissertation assumes that the register file is only partially protected by ECC and presents a register reassignment method, named RAPP. First, the register interference graph is constructed from ASER's analysis of register live ranges. Then, through a hierarchical graph coloring algorithm, the ECC-protected registers are assigned to the most critical live ranges. Experimental results show that, compared with other available partial protection methods, RAPP improves program reliability significantly while taking power overhead into account.

3. To address data flow errors caused by soft errors, instruction-level duplication techniques have been widely used because of their flexible and general implementation and their strong error detection capability. However, the large number of consistency check instructions is the fundamental limitation on program performance. This dissertation presents a checkpoint optimization method for instruction duplication, named COID. Based on data flow analysis of error propagation, this method removes redundant comparison instructions within the boundaries set by system call instructions, without affecting the error detection rate. To illustrate the effectiveness of this method, we perform several fault injection experiments and performance evaluations on a set of simple benchmark programs. Experimental results indicate that COID improves the average performance of instruction duplication by 12.78% without degrading the error detection rate.

4. Control flow errors are a major effect of soft errors. Currently available control flow checking methods are deficient in performance overhead and checking capacity. Using the control flow graph of the program, basic blocks are first categorized by a graph coloring algorithm. Then an effective control flow checking method, named ECCFS, is presented based on formatted signatures of basic blocks. Moreover, extended solutions are proposed for intra-block and inter-procedural control flow checking, respectively. The analysis of checking capacity and the fault injection experiments indicate that ECCFS can detect most control flow errors. Compared with typical control flow checking methods, ECCFS has advantages in error detection rate and performance overhead.

5. A variety of methodologies have been proposed to address the effects of soft errors. Unfortunately, these techniques incur performance penalties, storage overhead, and economic costs to varying degrees. To enhance the runtime reliability of programs without extra cost, the dissertation presents a compiler optimization method, named SISER. Its basic idea is to reduce, by rescheduling instructions, the total susceptible intervals that may be affected by soft errors during execution. Based on the analytical results of ASER, a detailed basic block scheduling algorithm is presented in the style of dynamic programming. Experimental results indicate that the average reliability of programs is improved by about 4.41%. Unlike traditional fault tolerance methods, SISER introduces no appreciable extra overhead, which is its outstanding characteristic.
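
The scheduling idea in part 5 can be illustrated with a toy model. The sketch below is not the dissertation's SISER algorithm or its reliability model: it only measures, for a single basic block, the summed length of register live intervals (definition to last use) and compares two dependency-respecting instruction orders. The instruction encoding, the function name `total_live_length`, and the example block are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy IR: (destination register or None, source registers).
Instr = Tuple[Optional[str], Tuple[str, ...]]


def total_live_length(block: List[Instr]) -> int:
    """Sum, over all values defined in the block, of (last use - definition).

    A longer interval means the value sits in a register for more
    instruction slots, so it is exposed to register soft errors for longer.
    """
    total = 0
    open_def: Dict[str, int] = {}   # register -> index of its current definition
    last_use: Dict[str, int] = {}   # register -> index of the latest use of that definition
    for i, (dst, srcs) in enumerate(block):
        for reg in srcs:
            if reg in open_def:
                last_use[reg] = i
        if dst is not None:
            # A redefinition closes the previous live interval of dst.
            if dst in open_def and dst in last_use:
                total += last_use[dst] - open_def[dst]
            open_def[dst] = i
            last_use.pop(dst, None)
    # Close the intervals still open at the end of the block.
    for reg, def_index in open_def.items():
        if reg in last_use:
            total += last_use[reg] - def_index
    return total


# Original order: r1 is loaded first but only used at the very end.
ORIGINAL: List[Instr] = [
    ("r1", ()),            # r1 = load a
    ("r2", ()),            # r2 = load b
    ("r3", ()),            # r3 = load c
    ("r4", ("r2", "r3")),  # r4 = r2 + r3
    ("r5", ("r1", "r4")),  # r5 = r1 + r4
    (None, ("r5",)),       # store r5
]

# Same computation with the load of r1 moved next to its only use.
RESCHEDULED: List[Instr] = [
    ("r2", ()),
    ("r3", ()),
    ("r4", ("r2", "r3")),
    ("r1", ()),
    ("r5", ("r1", "r4")),
    (None, ("r5",)),
]

print(total_live_length(ORIGINAL))     # 9
print(total_live_length(RESCHEDULED))  # 7
```

Both orders compute the same result, but the second keeps `r1` live for one slot instead of four, which shortens the total exposed interval from 9 to 7 in this model. A real scheduler would also have to weigh each interval by how critical it is and check data dependences, which this sketch leaves out.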
