Research on the Key Techniques of Soft Error Tolerance Design on Multi Core Microprocessor
【Author】 Gong Rui (龚锐);
【Author information】 National University of Defense Technology, Computer Science and Technology, 2008, Ph.D.
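【Abstract (translated from the Chinese)】 Microprocessors exposed to harsh conditions such as high-energy particle strikes or noise interference suffer transient faults. These transient faults may cause soft errors, or even failures, which significantly affects microprocessor reliability. As integrated circuit manufacturing technology advances, the number of transistors that can be integrated on a single chip grows exponentially, so microprocessors face an increasingly serious soft error threat. Multi-core microprocessors have gradually become the market mainstream. Soft error tolerance techniques generally require some degree of redundancy, and the naturally redundant resources of a multi-core microprocessor offer a new approach to soft error tolerant design. How to use these redundant resources effectively to strengthen soft error tolerance, and thereby improve reliability, has become an urgent problem, and studying it in depth has both theoretical significance and practical value.
The work in this thesis addresses a series of key techniques in soft error tolerant design for multi-core microprocessors. First, it studies soft error tolerant execution models for multi-core microprocessors. The execution model determines how programs run efficiently, correctly and reliably on a multi-core microprocessor, and it is the key to exploiting multi-core redundancy for soft error tolerant design. Second, it studies specific soft error hardening techniques. Any soft error tolerant microprocessor must apply hardening techniques at different levels to mask, detect or recover from soft errors; this thesis focuses on gate-level redundancy techniques and architecture-level control flow checking. Finally, it studies reliability evaluation models for microprocessors, so that reliability can be evaluated quantitatively early in the design flow and design choices and optimizations can be guided effectively. The main innovations of this thesis are as follows.
(I) Two soft error tolerant execution models for multi-core microprocessors are proposed. (1) DCR, a dual-core redundant execution model based on context saving and recovery. In this model, two identical copies of a thread execute redundantly on two cores that support context saving and recovery. By enhancing the functionality of the cores, the model recovers from soft errors effectively while keeping both the bandwidth demand on the dedicated fault-tolerance inter-core queue and the implementation complexity low. (2) TCR, a reconfigurable triple-core redundant execution model. By adding core redundancy, the model executes three identical copies of a thread on three different cores and can be reconfigured dynamically once a soft error is found, so soft errors are masked effectively with low bandwidth demand on the dedicated fault-tolerance inter-core queue and high execution performance.
(II) Two gate-level redundancy structures based on asynchronous circuit techniques are proposed. (1) DMR, a dual modular redundancy structure based on the asynchronous C element. The structure uses an asynchronous C element to mask the outputs of the dual redundant units, which reduces hardware redundancy: it masks SEU (Single Event Upset) faults while lowering chip area overhead. (2) TSTMR, a temporal-spatial triple modular redundancy structure based on an asynchronous dual clock triggered register. Drawing on the explicitly separated master and slave latch structure of de-synchronized circuits in asynchronous design, the thesis proposes the dual clock triggered register (DCTREG). By using DCTREG, the TSTMR structure applies temporal redundancy at the gate level and thus masks both SEU and SET (Single Event Transient) faults.
(III) An enhanced control flow checking technique, ECFC, is proposed. It consists of a checking method and an implementation method. (1) A signature checking method based on nodes and edges. By assigning signatures to both the nodes and the edges of the control flow graph, the method checks control flow more strictly than the classic node-based signature checking methods, and it eliminates the misjudgment of illegal transfers and the adjusting-signature conflicts that can occur in those methods. (2) A combined hardware and software implementation method. The compiler inserts signature data into the program; during execution, a hardware checking operation is triggered automatically after each control flow transfer instruction completes. This implementation offers small binary code size, high performance and timely error detection.
(IV) A reliability evaluation model that jointly considers chip area and performance overhead is proposed. The model uses a new evaluation metric to evaluate microprocessor reliability quantitatively. With this model, the reliability of microprocessors that use different soft error tolerance techniques can be evaluated accurately and quantitatively during the design flow, which helps guide design choices and optimization. The thesis also uses this model to evaluate the execution models, gate-level redundancy structures and architecture-level control flow checking technique described above.
Through its study of soft error tolerant execution models, soft error hardening techniques and reliability evaluation models, this thesis explores the design and implementation of soft error tolerant multi-core microprocessors. The implementation, verification and evaluation results show that these techniques are effective and can be applied to the design and implementation of soft error tolerant multi-core microprocessors.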
【Abstract】 One of the most critical challenges in modern microprocessor design is the transient fault caused by high-energy particles or random noise. Transient faults may cause soft errors, or even failures, which degrade the reliability of microprocessors. As integrated circuit technology advances, the number of transistors per chip grows exponentially, and the transient fault rate of a single microprocessor grows with it. Multi-core microprocessors have become the mainstream in the last few years. Soft error tolerance design generally needs some kind of redundancy, so the redundant cores in a multi-core microprocessor provide a potential solution. How to use the redundant resources of a multi-core microprocessor efficiently to enhance reliability has therefore become a research focus in recent years.
This thesis details our research on several key techniques of soft error tolerance design for multi-core microprocessors. First, we study the soft error tolerant execution model, which is the key to exploiting the redundant resources of multi-core microprocessors for soft error tolerance design. Second, we study hardening techniques at the gate and architecture levels, which provide soft error masking, detection or recovery. Finally, we study the reliability evaluation model of microprocessors, so that evaluation results can guide the design of highly reliable microprocessors. The primary innovations of this thesis are listed as follows.
(I) Two soft error tolerant execution models for multi-core microprocessors are proposed. (1) The dual core redundancy execution model based on context saving and recovery (DCR) executes two copies of a given thread on different cores. By enhancing the cores with context saving and recovery, it recovers from soft errors with low inter-core FIFO bandwidth demand and low implementation complexity. (2) The reconfigurable triple core redundancy execution model (TCR) executes three copies of a thread on different cores. Once a soft error is detected, the execution model can be reconfigured to mask the failed core. The soft error is thus masked with low inter-core FIFO bandwidth demand and high execution performance.
(II) Two gate level redundancy structures based on asynchronous circuits are proposed. (1) The dual modular redundancy structure based on the C element (DMR) uses an asynchronous C element to mask corrupted values in the dual redundant device. It reduces die area overhead while still tolerating SEU (Single Event Upset) faults. (2) The temporal spatial triple modular redundancy structure based on the dual clock triggered register (TSTMR) masks both SEU and SET (Single Event Transient) faults. Using the same explicitly separated master and slave latch structure as a de-synchronized pipeline, the dual clock triggered register (DCTREG) uses one clock as sample enable and another as output enable, so temporal redundancy can be implemented at the gate level.
(III) An enhanced control flow checking technique (ECFC) is proposed. It comprises a checking method and an implementation method. (1) The checking method based on node and edge signatures assigns signatures to both the nodes and the edges of the control flow graph. It checks control flow more strictly than the typical node-based checking method, and it eliminates the misjudgment of illegal branches and the adjusting-signature conflicts that can occur in that method. (2) The implementation method combines compiler-inserted signatures with hardware checking: the compiler inserts signature data into the code, and a hardware checking operation is triggered by each control flow transfer instruction at run time. This implementation offers small binary code size, high performance and timely error detection.
(IV) A reliability evaluation model based on die area and performance overheads is proposed. The model uses a novel reliability metric to evaluate the reliability of microprocessors accurately and quantitatively. An evaluation framework is also proposed, so that the reliability of different soft error tolerant techniques can be evaluated during the design flow and the results can guide the designer in choosing among hardening methods. The aforementioned soft error tolerant execution models, gate level redundancy techniques and architecture level control flow checking technique have been evaluated using this model.
This thesis explores soft error tolerance design for multi-core microprocessors through research on the soft error tolerant execution model, hardening techniques at the gate and architecture levels, and the reliability evaluation model. The experimental results demonstrate that these models and techniques are effective and can be used in the design and implementation of soft error tolerant multi-core microprocessors.
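As an illustration of the masking idea behind item (II)(1), the following is a minimal behavioral sketch in Python of a stored bit kept in two redundant copies whose outputs pass through a Muller C element. A C element drives its output to the common input value when both inputs agree and otherwise holds its previous output. The class name `DmrBit` and its methods are hypothetical and chosen only for this sketch; the record does not describe the thesis's actual circuit, so this is not the author's design.

```python
class DmrBit:
    """Behavioral model (assumption, not the thesis's circuit): one bit stored
    in two redundant copies, with a Muller C element on the outputs."""

    def __init__(self, value=0):
        self.copy_a = value
        self.copy_b = value
        self.out = value  # state held by the C element

    def _settle(self):
        # C element rule: follow the inputs when they agree, otherwise hold.
        if self.copy_a == self.copy_b:
            self.out = self.copy_a
        return self.out

    def write(self, value):
        # A normal write updates both redundant copies.
        self.copy_a = value
        self.copy_b = value
        return self._settle()

    def upset(self, which):
        # Model a single event upset as a bit flip in exactly one copy.
        if which == "a":
            self.copy_a ^= 1
        else:
            self.copy_b ^= 1
        return self._settle()


bit = DmrBit(0)
bit.write(1)                # both copies hold 1, so the output becomes 1
masked = bit.upset("a")     # copies now disagree, so the output is held
print(masked)
```

In this model a flip in a single copy leaves the output unchanged because the two inputs of the C element disagree; the output only moves again once both copies agree, for example after the next write.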
【Key words】 Multi Core Microprocessor; Soft Error Tolerance; Execution Model; Gate Level Redundancy; Control Flow Checking; Reliability Evaluation;
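The node-and-edge signature idea in item (III)(1) of the abstract can be illustrated with a small software model. This is a sketch under stated assumptions: the signature assignment (consecutive integers), the names `assign_signatures` and `Checker`, and the rule that each transfer carries the signature of the edge it intends to follow are all hypothetical. The record does not give ECFC's actual encoding or hardware interface.

```python
from itertools import count


def assign_signatures(edges):
    """Give every node and every edge of a control flow graph a unique
    signature (assumption: plain consecutive integers)."""
    ids = count(1)
    nodes = sorted({n for edge in edges for n in edge})
    node_sig = {n: next(ids) for n in nodes}
    edge_sig = {e: next(ids) for e in sorted(edges)}
    return node_sig, edge_sig


class Checker:
    """Software stand-in for a run-time checker that tracks the current node."""

    def __init__(self, node_sig, edge_sig, entry):
        self.node_sig = node_sig
        self.edge_of = {sig: e for e, sig in edge_sig.items()}
        self.current = node_sig[entry]

    def transfer(self, carried_edge_sig, landed_node):
        # A transfer is legal only if the carried signature names a real edge,
        # that edge leaves the current node, and execution landed on its target.
        edge = self.edge_of.get(carried_edge_sig)
        ok = (edge is not None
              and self.node_sig[edge[0]] == self.current
              and edge[1] == landed_node)
        if ok:
            self.current = self.node_sig[landed_node]
        return ok


edges = {("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")}
node_sig, edge_sig = assign_signatures(edges)

good = Checker(node_sig, edge_sig, "A")
print(good.transfer(edge_sig[("A", "B")], "B"))   # legal edge A -> B

bad = Checker(node_sig, edge_sig, "A")
print(bad.transfer(edge_sig[("A", "B")], "C"))    # landed on the wrong node
```

Checking the edge as well as the node means that a jump between two individually valid nodes is still rejected when no edge connects them, which is the stricter property the abstract attributes to signing both nodes and edges.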