节点文献

面向软错误的故障恢复和验证技术研究

Research on Techniques of Error Recovery and Verification for Soft Errors

【作者】 谭兰芳

【导师】 谭庆平;

【作者基本信息】 国防科学技术大学 , 计算机科学与技术, 2013, 博士

【摘要】 随着集成电路制造工艺的发展,现代微处理器的性能在大幅度提高的同时,面临软错误的威胁也越来越严重。软错误是由外部环境中的高能粒子辐照或电压扰动、地磁干扰等因素诱发的一种硬件瞬态故障现象。它不破坏电路的内部结构,但是却可以通过改变处理器状态或者存储单元值等方式影响程序的正常运行,从而对系统可靠性造成严重影响。为提高系统可靠性,国内外纷纷开展了容错技术的研究。从实现方式来看,面向软错误的容错技术主要可以分为硬件实现的和软件实现的容错技术。与硬件实现的容错技术相比,软件实现的容错技术无需改造或重新设计硬件,具有实现成本低、开发周期短、可灵活配置等优势而备受关注。从软错误的处理过程来看,软件实现的容错技术主要包括这几个方面的研究内容:软错误的影响分析和评估、错误检测、错误恢复、容错优化配置和容错算法验证。由于目前的研究大都集中在软错误的影响分析和错误检测方面,本文主要针对错误恢复和容错算法验证展开研究。本文的主要贡献可分为以下四个方面:1.本文提出一种基于格式化标签分析的控制流恢复技术,使得错误检测后程序状态能恢复至故障发生前的一个正确状态,确保程序继续执行且输出正确结果。该方法首先在汇编语言级上将程序代码划分为无存基本块,并为每个无存基本块分配格式化的静态标签;然后基于分配的静态标签添加控制流检测和恢复指令,其中检测指令主要负责控制流检测而恢复指令主要负责恢复由控制流错误导致的程序数据流错误;最后定义分层故障处理例程,即为每个过程单独定义一个过程错误处理例程和为整个程序定义一个全局错误处理例程。该方法首次解决了过程间的控制流错误检测和恢复问题,能检测和恢复所有的基本块间的控制流错误,并能检测和恢复绝大部分的基本块内部的控制流错误。与纯控制流检测算法相比,该方法在控制流错误检测的基础上以相对较少的性能开销实现了错误恢复。2.本文提出一种源代码级的数据流错误容错处理机制,主要包括三个方面:(1)基于基本块的概念给出包含块的定义,该错误处理机制以包含块为基本单位对数据流错误进行检测和恢复,确保包含块内发生的数据流错误不会传播至块外。(2)提出一种基于差异转换和冗余复算的错误检测机制,其基本思想是基于一组差异转换规则,将原程序转换为功能完全一致的冗余程序,通过在特定位置插入比较检测语句来判断程序运行过程中是否发生软错误。(3)提出一种应用级检查点备份的数据流错误恢复机制,即通过求解数据流分析方程得出检查点包含的变量集合,以此为依据插入恢复代码。为自动生成容错程序,本文设计并实现一个源到源的转换工具。故障注入实验和性能开销实验结果表明:与其它源代码级数据流错误容错方法相比,该方法能以相对较少的性能开销达到较高的错误覆盖率。3.本文根据模型检验原理,提出一种通用的针对基于标签分析的控制流检测算法的形式化验证方法。该方法首先对待验证目标——基于标签分析的控制流检测算法进行概述;在此基础上将容错程序建模为控制流状态机,并给出其语法和语义的定义;然后对控制流状态机进行进一步具体化,通过定义一个状态转换系统来描述控制流检测过程状态的转移;并基于状态转换系统和模型检验工具的对应关系,将状态转换系统转换为模型检验工具的输入程序,以进行自动验证;最后以代表性的控制流检测算法CFCSS算法和DSM算法为例,说明该方法的实用性。验证结果表明:该验证方法首次发现了DSM算法的检测盲点,以及与CFCSS算法中标签设计相关的一些检测盲点。4.针对数据流容错算法的有效性验证,本文提出一种基于汇编语言类型系统的形式化验证方法。它的基本思想是给汇编语言加上静态类型属性,通过类型安全性来保证程序的容错属性。本文以典型的数据流恢复算法SWIFT_R为例,首先给出类型化的容错汇编语言TFAL的语法,通过将一条指令的执行建模为状态的一次转移,对TFAL的操作语义进行解释;在此基础上对TFAL的指令进行类型检查,得出了SWIFT_R算法的检测盲点;在假定排除这些检测盲点的前提下,首先证明了TFAL系统的类型安全性——前进和保持属性。然后在此基础上定义状态的相似关系,进一步证明了SWIFT_R算法的容错属性,即原程序在无错环境下运行的输出结果与容错程序在错误环境下运行的输出结果一致,且状态转移过程相似。

【Abstract】 With the development of the VLSI fabrication technologies, the performance of themodern microprocessors has been increasing exponentially, while their sensitivity to softerrors also increases dramatically. Soft errors are intermittent faults caused by externalevents, such as the radiation of energetic particles, voltage disturbance and electromag-netic interference. Soft errors do not cause permanent damage but may result in incorrectprogram execution by altering signal transfers or stored values, thus having in a seriousimpact on system reliability.In order to improve system reliability, fault-tolerant technologies have been pro-posed. Accordingtotheirimplementation,fault-toleranttechnologiescanbeclassifiedin-to hardware-implemented technology and software-implemented technology. Comparedwith the hardware-implemented technology, software-implemented technology does notneed to alter or re-design hardware architecture, with the advantages of low cost, short de-velopment life cycle, and flexible configuration. Therefore, software-implemented tech-nology has been an efficient solution to deal with soft errors. According to the fault pro-cessing, software-implemented technologies include: soft error analysis and assessmen-t, error detection, error recovery, fault-tolerant optimal configuration and fault-tolerantverification. The earlier researches mainly focus on soft error impact analysis and errordetection, so our study concentrates on error recovery and fault-tolerant verification. Themain contributions are as follows:1. We present a pure-software method based on encoded signatures to recover fromcontrol-flow errors(CFEs). After error detection, both the program control-flowand data-flow transfer back to a correct state before the error occurrence, ensuringthat the program continues executing and produces correct output. In this study, theassembly program code is firstly partitioned into storeless basic block and a staticencoded signature is assigned to each storeless basic block. Then, checking instruc-tions and recovery instructions are inserted in each storeless basic block based onthe assigned signatures. Checking instructions are designed to detect CFEs, whilerecovery instructions are designed to recover the data errors caused by the CFEpropagation. Finally, CFE-handlers are defined to handle CFEs. To the best of ourknowledge, this is the first time to solve the problem of inter-function CFEs’ detec- tion and recovery. Moreover, all the inter-block and most of the intra-block CFEscanbedetectedandrecovered. ComparedwiththecurrentCFEdetectiontechnique,our method achieves the goal of error recovery on the basis of error detection at arelatively lower performance overhead.2. Weproposeafault-toleranttechniqueatthesourceleveltodealwithdataflowerrorscaused by soft errors, which consists of three parts:(1)A definition of containmentblock based on the concept of basic block. The proposed technique deals with da-ta flow errors at the granularity of containment block, so that the data flow errorswithin a containment block don’t propagate to other blocks.(2) An error detectionmechanismbased ondata diversitytransformation andredundancy computing. Thebasic principle of error detection is that a redundant fault-tolerant program with thesame function as the original program is generated based on a set of diversity trans-formation rules, and some comparison statements are inserted at certain positionsto check whether an error occurs during the execution.(3) An error recovery mech-anism based on application-level checkpoint. Data flow analysis is introduced toobtain the variables of each checkpoint, and the statements for error recovery areadded as well. A source to source transformation tool is implemented to generatethe fault-tolerant program automatically. Fault injection and performance overheadexperimental results show that most control flow errors can be recovered with rel-atively low performance overhead.3. Wepresentageneralapproachtoevaluatetheeffectivenessofsignature-monitoringmechanism based on model checking principle. At first, we make an abstract sum-maryofsignature-monitoringmechanisms. Thenthefault-tolerantprogramismod-eled as a control-flow machine state, and its syntax and semantics are defined usinga step-operational semantics. The control-flow machine is refined into a state tran-sition system, which is translated into the input program of the model checker inorder to perform the verification automatically. At last, our approach is appliedto two representative techniques, DSM and CFCSS, to demonstrate the practica-bility. The verification results show that the undetected errors of DSM algorithmare revealed and some undetected errors due to the signature association of CFCSSalgorithm are revealed for the first time.4. We propose a formal verification technique based on typed assembly system to ver-ify the correctness of data flow error tolerance technique. The basic principle of typed assembly system is to add some static type property to the assembly languageso that the target property can be proved through the verification of the soundnessof the typed system. We take a representative data flow error recovery technique-SWIFT_R as an example to illustrate the verification process. At first, the syntaxof TFAL is defined and the operational semantics is given as a step operation bymodeling the execution of an instruction as a step transition. Based on the defini-tions of syntax and semantics, all the instructions of TFAL are type-checked, andthe undetected errors of SWIFT_R are obtained. Suppose that all the undetectederrors are excluded, the type safety of TFAL, including progress and preservation,is verified. Then, the similarity relation is defined. Based on the similarity relation,the property of fault tolerance is proved. A program is fault-tolerant when the out-put of the original program under normal environment is the same as the output ofthe fault-tolerant program under fault environment. Moreover, the state transitionsof the original program and the fault-tolerant program are similar.

节点文献中: 

本文链接的文献网络图示:

本文的引文网络