Research on Fault Tolerance for Transactional Memory Systems (面向事务存储系统的容错技术研究)

【Author】 Song Wei (宋伟)

【Advisor】 Yang Xuejun (杨学军)

【Author information】 National University of Defense Technology, Computer Science and Technology, 2011, PhD

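【Abstract (translated from the Chinese 摘要)】 With the development of multi-core processors, transactional memory has attracted growing attention as a promising concurrency control mechanism. Meanwhile, as large-scale integrated circuits move into the deep-submicron and even nanometer regime, processors become more susceptible to electromagnetic radiation, cosmic rays, and other sources of interference, which makes processor reliability an increasingly prominent problem. Fault tolerance under transactional memory is therefore becoming an issue that deserves attention. This thesis studies fault tolerance in transactional memory systems. Taking the error propagation behavior in transactional memory systems as its theoretical basis, it proposes theoretical methods, technical schemes, and implementation frameworks for the key problems of fault detection, fault recovery, and fault masking. The main contributions are as follows:
1. Starting from error propagation between statements in a program statement sequence, we analyze step by step how errors propagate in a transactional memory system. Fault-tolerance techniques care mainly about two pieces of information: the fault-tolerance position and the set of fault-tolerance objects. By analyzing the properties and characteristics of transactions, we identify two classes of natural fault-tolerance positions in transactional memory systems together with their corresponding object sets, and prove that they have different fault-tolerance capabilities. This reveals, in theory, the inherent fault-tolerance characteristics of transactional memory systems.
2. We propose EDRT, an error detection method based on redundant transactions. The method creates a redundant copy of each transaction, executes the transaction and its copy at the same time, and compares the write sets of the two transactions before commit, which yields redundant-transaction error detection with low overhead. In addition, based on the different characteristics of the data version management mechanisms used by transactional memory systems, we give system constraints and design guidelines for applying EDRT to systems with Eager and with Lazy data version management, covering two aspects: how the comparison data set for error detection is obtained and compared, and the conflict detection mechanism. A set of experiments verifies that, compared with traditional dual modular redundancy error detection, EDRT achieves good error detection capability at lower detection overhead.
3. We propose FRTR, a fault recovery method based on transaction rollback. The method uses the data version management mechanism of the transactional memory system as the "checkpoint" for fault recovery and completes recovery by rolling back only the single faulty transaction. By discussing the isolation property of a fault-tolerant transactional memory system that supports FRTR, we prove that single-transaction rollback is sufficient for fault recovery in transactional memory systems. A set of experiments verifies the low recovery overhead of FRTR. In addition, we introduce the idea of parallel recomputing into FRTR to further reduce its recovery overhead, and give programming guidelines for parallel recomputing on transactional memory systems for OpenTM programs. Experiments also verify that this optimization is effective for transactional memory systems with coarse-grained transactions.
4. We propose TriTM, a fault-tolerance method based on triple modular redundancy. The method introduces the idea of triple modular redundancy into transactional memory systems and uses the write sets of transactions as the data comparison set, which yields a fault masking method with low overhead. Exploiting the self-correcting capability of TriTM, we propose Opti_TriTM, a performance optimization of TriTM based on optimized placement of comparison points. In addition, based on the characteristics of transactional memory systems with Closed nesting, we propose an implementation of TriTM for Closed nested transactions. A set of experiments verifies that TriTM has lower fault-tolerance overhead than traditional triple modular redundancy, and that Opti_TriTM is effective in improving fault-tolerance performance.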
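
The detection and recovery ideas in contributions 2 and 3 can be illustrated with a small sketch. This is not the thesis's implementation: it assumes a Lazy (write-buffering) version manager modeled as a Python dict, runs the two replicas one after the other rather than concurrently, and ignores conflict detection between different transactions. The names `run_with_edrt`, `txn`, and `fault_injector` are hypothetical.

```python
def run_with_edrt(txn, memory, max_retries=3, fault_injector=None):
    """Execute txn twice on buffered write sets and commit only if they match.

    txn is called as txn(read, write). Writes go to a private buffer
    (assumed Lazy versioning), so shared memory is untouched until commit.
    Returns the number of rollbacks that were needed.
    """
    for attempt in range(max_retries + 1):
        write_sets = []
        for replica in range(2):
            write_set = {}

            def read(addr, ws=write_set):
                # A transaction sees its own buffered writes first.
                return ws.get(addr, memory.get(addr, 0))

            def write(addr, value, ws=write_set):
                ws[addr] = value

            txn(read, write)
            if fault_injector is not None:
                # Hypothetical hook that corrupts a write set to emulate a fault.
                fault_injector(attempt, replica, write_set)
            write_sets.append(write_set)

        if write_sets[0] == write_sets[1]:
            # Detection passed: publish the buffered writes.
            memory.update(write_sets[0])
            return attempt
        # Mismatch: an error was detected. Rolling back this single
        # transaction just means dropping both buffers and re-executing.
    raise RuntimeError("write sets still differ after retries")
```

Because the uncommitted writes live only in the buffer, the pre-transaction state of `memory` plays the role of the checkpoint, and recovery costs one re-execution of the faulty transaction.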

【Abstract】 With the development of multi-core processors, transactional memory has attracted more and more attention as a promising concurrency control mechanism. At the same time, as large-scale integrated circuits enter the deep-submicron and even nanometer level, processors become more and more susceptible to electromagnetic radiation, cosmic rays, and other sources of interference. This makes processor reliability an increasingly prominent problem, and as a result fault tolerance in transactional memory systems is becoming an issue of concern. In this thesis, we study fault tolerance in transactional memory systems. Taking error propagation behavior in transactional memory systems as the theoretical foundation, we propose theoretical methods, technical solutions, and implementation frameworks for fault detection, fault recovery, and fault masking. This thesis makes the following contributions:
1. Starting from error propagation between statements in a statement sequence, we progressively analyze error propagation behavior in transactional memory systems. We identify two kinds of fault-tolerance positions and the corresponding fault-tolerance objects, prove that they have different fault-tolerance abilities, and reveal the inherent fault-tolerance characteristics of transactional memory.
2. We propose EDRT, an error detection method based on redundant transactions. The method creates a redundant copy of every transaction, executes both the transaction and its copy, and detects errors by comparing the write sets of the two transactions before commit. In addition, we give system constraints and design guidelines for applying EDRT to transactional memory systems based on both eager and lazy data-versioning mechanisms, covering the acquisition and comparison of the error detection data sets and the conflict detection mechanism. A set of experiments shows that EDRT achieves good error detection ability at low cost.
3. We propose FRTR, a fault recovery method based on transaction rollback. The method uses the data-versioning mechanism as the checkpoint and accomplishes fault recovery by rolling back the single faulty transaction. By discussing the isolation of a transactional memory system that supports FRTR, we prove that this is sufficient for fault recovery in transactional memory systems. A set of experiments shows the low cost of FRTR. In addition, we introduce the idea of parallel recomputing into FRTR to reduce its cost further, and we provide programming guidelines for parallel recomputing in OpenTM. A set of experiments also shows the effectiveness of this optimization.
4. We propose TriTM, a fault-tolerance method based on triple redundancy. The method introduces the idea of triple redundancy into transactional memory systems, takes the write sets of the transactions as the data comparison set, and implements a low-cost fault masking method. Using the error correction ability of TriTM, we propose Opti_TriTM, an optimization based on the placement of comparison points in TriTM. In addition, we implement TriTM in a transactional memory system with closed nesting. A set of experiments shows the low cost of TriTM and the effectiveness of Opti_TriTM.
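
The fault masking idea in contribution 4 can be sketched as a majority vote over three write sets. This is an illustrative sketch under stated assumptions, not the thesis's TriTM design: write sets are modeled as Python dicts, the vote is taken over whole write sets at commit time, and the function names `majority_write_set` and `commit_with_tritm` are hypothetical.

```python
def majority_write_set(a, b, c):
    """Return a write set that at least two of the three replicas agree on.

    Returns None when all three differ, in which case the fault cannot be
    masked by voting alone.
    """
    if a == b or a == c:
        return a
    if b == c:
        return b
    return None


def commit_with_tritm(memory, write_sets):
    """Commit the majority write set of three replicas into memory.

    Returns True if a majority existed and was committed, False otherwise
    (memory is left unchanged in that case).
    """
    if len(write_sets) != 3:
        raise ValueError("expected exactly three replica write sets")
    winner = majority_write_set(*write_sets)
    if winner is None:
        return False
    memory.update(winner)
    return True
```

With a single faulty replica the other two still agree, so the correct result is committed without any re-execution; this is what distinguishes masking from the detect-and-roll-back approach.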
