空间辐射环境下软件实现的硬件故障检测技术研究

The Research on Software Implemented Hardware Fault Detection Techniques in Radiation Environment

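【Author】 Li Jianli

【Supervisor】 Tan Qingping

【Author Information】 National University of Defense Technology, Computer Science and Technology, 2008, Master's thesis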

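【Abstract (translated from the Chinese)】 Space exploration is once again on the rise worldwide. As these activities intensify, the negative impact of the space radiation environment on spacecraft reliability has become increasingly prominent. The effects of space radiation on electronic devices fall into two classes, single event effects and total dose effects; of these, single event effects, and single event upsets in particular, are the main threat to the safety of on-board computers. Related research shows that, compared with hardware fault tolerance built on radiation-hardened devices, software fault tolerance, that is, using software methods on commercial devices to tolerate hardware faults, can deliver higher system performance while still ensuring system reliability. Software fault tolerance also offers low cost, low power consumption, and flexible configuration. Building on an analysis of existing software fault tolerance results, this thesis studies fault detection algorithms and fault tolerance optimization in depth. First, the thesis proposes a new fault detection technique: fault detection based on hierarchical decomposition. Unlike existing techniques, it divides the program structure into levels and applies a different fault detection algorithm at each level; these algorithms cooperate and check level by level, detecting different kinds of faults and different types of program errors and thereby improving the reliability of program execution. Second, observing that different regions of a program usually respond differently in reliability and performance once a fault detection algorithm is applied, the thesis proposes a configurable fault detection algorithm. The algorithm builds analysis models of the reliability response and performance response of the fault-tolerant program and, from the analysis results, derives the fault tolerance configuration with the best cost-performance ratio. Finally, following a compiler-based fault tolerance approach, the thesis implements both the hierarchical-decomposition technique and the configurable algorithm, and tests their fault detection capability and performance cost through fault injection experiments. The hierarchical-decomposition technique detects 97.9%-99.1% of hardware faults. Compared with it, the configurable algorithm reduces the performance cost of fault tolerance by 12%-20% at the price of a 0.5%-1.4% loss in fault detection rate.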

【Abstract】 Space exploration is booming again worldwide. As exploration activities become more frequent, the negative impact of space radiation on the reliability of spacecraft also becomes more severe. The effects of the space radiation environment on electronic devices can be divided into single event effects and total ionizing dose effects. Single event effects, and single event upsets in particular, have become a major threat to the safety of on-board computers. Related research has shown that, compared with hardware-implemented fault tolerance based on radiation-hardened devices, software-implemented hardware fault tolerance, which tolerates hardware faults on COTS components, can guarantee system reliability while also improving system performance. Software-implemented hardware fault tolerance also offers low cost, low power consumption, and flexible configuration. Based on an analysis of existing work on software-implemented hardware fault tolerance, this thesis studies fault detection algorithms and the optimization of fault tolerance techniques in depth. First, the thesis presents a novel fault detection technique, the Fault Detection Technique by Program Hierarchy (FDTPH). Unlike existing fault detection techniques, the FDTPH divides the program structure into layers and uses a different fault detection algorithm at each layer. Because these algorithms cooperate and check for errors layer by layer, the FDTPH can detect different kinds of faults and different kinds of program errors, which improves the reliability of programs.
Second, based on the observation that different regions of a program usually show different performance and reliability responses after fault detection algorithms are applied, the thesis proposes the Configurable Fault Detection Algorithm (CFDA). The CFDA establishes analysis models for the performance response and the reliability response, and uses the results of these models to obtain the most cost-effective fault tolerance configuration. Finally, the thesis implements the FDTPH and the CFDA through fault-tolerant compilation, in which the compiler outputs the fault-tolerant program, and tests the fault detection capability and performance cost of both techniques with fault injection experiments. The fault detection rate of the FDTPH reaches 97.9%-99.1%. Compared with the FDTPH, the CFDA lowers the performance cost of fault tolerance by 12%-20%, at the cost of reducing the fault detection rate by only 0.5%-1.4%.
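The trade-off behind a configurable scheme such as the CFDA can be illustrated with a small sketch. The abstract does not give the thesis's analysis models, so everything below is an assumption made for illustration: the region names, the per-region numbers, the overhead budget, and the greedy gain-per-overhead heuristic are all hypothetical and are not the thesis's algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A program region with hypothetical, made-up response figures."""
    name: str
    detection_gain: int  # assumed: percentage points of detection gained if protected
    overhead: int        # assumed: percent of extra run time if protected (must be > 0)


def choose_regions(regions, overhead_budget):
    """Greedily protect the regions with the best gain per unit of overhead.

    This is a simple stand-in heuristic, not the CFDA itself: regions are
    ranked by detection_gain / overhead and added while the budget allows.
    Returns the chosen region names and the total overhead spent.
    """
    ranked = sorted(
        regions,
        key=lambda r: r.detection_gain / r.overhead,
        reverse=True,
    )
    chosen = []
    spent = 0
    for region in ranked:
        if spent + region.overhead <= overhead_budget:
            chosen.append(region.name)
            spent += region.overhead
    return chosen, spent


# Hypothetical example data; none of these values come from the thesis.
REGIONS = [
    Region("loop_kernel", 40, 50),
    Region("control_flow", 35, 20),
    Region("init_code", 5, 10),
    Region("io_buffer", 12, 30),
]

if __name__ == "__main__":
    names, cost = choose_regions(REGIONS, overhead_budget=75)
    print(names, cost)
```

With these made-up figures and a budget of 75, the sketch protects `control_flow` and `loop_kernel` and skips the two regions whose protection buys little detection for its cost, which is the kind of reliability-for-performance exchange the abstract reports.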
