节点文献

代价高效的容错片上网络关键技术研究

Research on Key Technologies of Cost-efficient Fault-tolerant Networks-on-chip

【作者】 陈延仓

【导师】 谢伦国; 张民选;

【作者基本信息】 国防科学技术大学 , 电子科学与技术, 2012, 博士

【摘要】 未来,单芯片集成的处理器数量将会达到数百甚至数千,处理器之间的通信量非常庞大。传统片上互连结构的可扩展性太差,无法满足多核芯片的通信需求。此外,随着CMOS特征尺寸的缩小,门延时明显降低,而线延时降低的幅度远低于门延时,导致线延时超过门延时。因此,必须精心设计全局互连线或者避免使用全局互连线。片上网络采用局部链路代替全局互连线,具有很好可扩展性,能够满足片上多核处理器通信需求。大于10%的晶体管可能因为工艺偏差等原因而发生硬故障。随着CMOS工艺特征尺寸的缩小,连线宽度变小,发生硬故障的概率增大。硬故障可能导致片上网络瘫痪。现有片上网络仅能容忍少量的硬故障,并且容错粒度较粗导致面积开销大。因此,设计代价高效的容错片上网络成为片上网络研究的重大问题之一。为提高片上网络的容错能力,本文从实现硬件级模拟开发平台着手,深入研究了路由器结构、容错路由算法、细粒度容错片上网络结构和容错任务映射算法,主要取得了如下研究成果:1.设计了一种面向片上网络的硬件级模拟开发平台HardSim。它集模拟和设计验证于一体,能够执行硬件级模拟,比微片级模拟描述更详细的硬件特征。它支持人工合成负载和真实应用程序踪迹模拟。它实现了两种故障注入模式,静态故障注入和动态故障注入。静态故障在模拟启动前产生并且载入网络。动态故障是在模拟过程动态产生并注入网络。2.提出了一种低延时共享输出缓存路由器SOBR。它具有5个重要特征:(1)虚通道(VC)位于输出端口,而不是输入端口;(2)虚通道交换器动态配置访问阵列,提高了可用缓存容量,提升了路由器性能;(3)支持跳步读操作的动态FIFO缓存结构,减少报文阻塞;(4)采用动态分层交换提升路由器性能;(5)所有类型的微片通过路由器的最小延时均为一个时钟周期。基于65nm标准单元库的综合结果表明,SOBR路由器的关键路径仅为24个逻辑门,延时约为0.64ns。由于流水线仅为1个周期,SOBR的平均延时显著低于其他路由器。在4×4mesh和均匀随机通信模式下,SOBR饱和吞吐率高达0.86微片/结点/周期。由于省略了VC分配器和交叉开关分配器等模块,SOBR的面积开销比相同缓存容量的经典输入虚通道路由器减少了9.4%。此外,定性分析结果表明,VC交换器有效提高了SOBR的容错能力。3.提出了一种高效的分布式容错路由算法PR-WF,并且将该算法用于SOBR路由器。PR-WF以西向优先转向模型为基础,采用动态伪接收(DPR)机制,动态启动或关闭向西转向,并且避免网络发生死锁。所谓DPR机制,指的是本地网络接口接收向西转向的报文并且将其转发到西向端口。本地网络接口需要FIFO缓存以存储向西转向报文。PR-WF采用特定优先权原则,为每个端口生成多条优先权队列,并且根据网络链路和邻居路由器端口状态,产生输出端口。PR-WF是一种基于逻辑的分布式容错路由算法,其面积开销远低于基于路由表的容错路由算法。PR-WF算法与网络尺寸无关,因此,具有更好的可扩展性。对于10%的链路故障率,PR-WF仅需废弃1.8%的完好链路就可以避免发生活锁。对于链路故障率为10%的9×9mesh,PR-WF平均跳步次数比最短路径仅增加了8.34%。综上,PR-WF路由算法是一种高效的分布式容错路由算法。4.提出并实现了一种细粒度容错片上网络结构SNoC。SNoC在通道切割的基础上,通过切片接口部件将切片之间相互耦合,使得网络能够以细粒度容忍硬故障。每个路由器包含4个切片,每条链路包含4个子链路和1个备份子链路。SNoC采用一种自适应切片接口部件,能够根据切片和子链路的状态为切片接口提供优化配置。切片均采用SOBR结构和PR-WF容错路由算法。模拟与分析结果表明,SNoC结构大幅减少了有效故障数量。即使在故障率较高的情况下,SNoC结构也取得很好的性能。基于65nm的综合结果表明,SNoC面积开销相比基于通道切割的片上网络增加了约1%。5.提出了一种低开销容错任务映射算法CMAP。现有任务映射算法大都是基于搜索的方法,速度慢,可扩展性差。构造算法根据最优解的特征,从无到有地构造近似最优解。优点是时间复杂度小,运行速度快。CMAP是一种面向任务映射的构造任务映射算法,能够感知拓扑结构,通过构造链表尽可能将权重较大的边映射到单跳步路由路径,或者优先将度数较大的结点映射到局部最优位置,解决了不规则拓扑的任务映射问题。通过两种真实应用和多种任务图对该算法进行了评估,证明了CMAP算法具有较高的准确性、效率、扩展性和容错能力。

【Abstract】 In future, the number of processors in a single chip will reach hundreds or eventhousands, and communication bandwidths between processors are very large.Traditional on-chip interconnection methods are unable to meet the communicationrequirements of multi-core chips, because of their poor scalabilities. As the feature sizeof COMS technology shrinking, gate delay are significantly reduced, and reductions ofwire delay are much lower than the gate delay, so wire delay are bigger than gate delay.Therefore, the global interconnection has to be designed carefully or discarded.Networks-on-Chip (NoCs) use local links instead of gloable interconnection wires, andhas good scalability to meet the needs of multi-core communication requirements.More than10%transistors may be faults because of technology variations or otherreasions. As the COMS technology feature size shrinks, wire widths become smaller,and the probability of hard faults is also increased. Hard faults may lead to NoCsparalysis. Existing NoCs are only able to tolerant small amount of faults, and theirfault-tolerant grains are coarse resulting in large area overheads. The design of alow-overhead fault-tolerant NoC has been a major challenge. In order to improve theability of toleranting hard faults of NoCs, this paper started from the implementation ofa hardware-level simulation and design platform, researched router architecture,fault-tolerant routing algorithm, fine-grain network architecture and task mappingalgorithm deeply, and achieved achievements as following:1. A hardware-level on-chip network simulation and design platform (i.e.HardSim) is designed. It is able to perform hardware-level simulation whichdescribes more hardware details than flit-level simulations. It combinessimulation and design verification in a unified flow. It supports simulations oftrace-based real applications. It implements two kinds of fault injectionpatterns, static fault injections and dynamic fault injections. Static faults areproduced and loaded into networks before the start of simulations. Dynamicfaults are dynamically produced and loaded into networks among simulations.2. A low-latency shared output buffered router, named SOBR, is presented.SOBR has5important features:(1) its virtual-channels locate in output ports,rather than input ports;(2) the dynamically configuration of access matrixes byvirtual-channel swapper is effective to improve the capacity of availablebuffers and the performance of networks;(3) its dynamic FIFO bufferarchitecture supports leap read operation to reduce packet blockings;(4) adynamic layered switching is taken to improve performances of networks;(5)all types of flits can pass through a router in one clock cycle ideally. Under65nm synthesized results show that its critical path is only24logic gates, and its worst delay is about0.64ns. Owing to its single-cycle pipeline, averagelatencies of SOBR are clearly lower than other routers. For4×4mesh and theuniform random traffic pattern, the maximum saturation throughput of SOBRis up to0.86flits per node per cycle. Owing to the elimination of VCallocation, switch allocation and switch modules, the area overhead of SOBRreduces up to9.4%when compared with the input virtual-channel router of thesame buffer. Qualitative analysis showed that the virtual-channel swapper iseffective to enhance the fault-tolerant ability of SOBR.3. A low-overhead distributed fault-tolerant routing algorithm, i.e. PR-WR, ispresented and integrated into SOBR router. It is based on the turn model of thewest-first routing, and takes a dynamic pseudo receiving (DPR) mechanism toenable or disenable west turns, and can ensure networks to be deadlock-free.The DPR mechanism refers to that local network interfaces receive west-turnpackets temporarily and then forward them to west ports as soon as possible.In order to store west-turn packets, each local network interface has a FIFObuffer queue for each west turns. DPR mechanism can turn off west turns toavoid deadlock. PR-WF chooses a suitable output port according to the state ofoutput links and neighbor routers by the specific principle of priorities. It is alogic-based distributed fault-tolerant routing, and its area overhead is muchlower than table-based fault-tolerant routings. It has nothing to do with thenetwork size, so it has good scalability. In order to avoid livelock, it onlyneeds to disable1.8%good links under10%link fault rate. For9×9mesh with10%faulty links, its average hop number only increases by8.34%than theshortest path. In summary, PR-WF is an efficient low-overhead distributedfault-tolerant routing.4. A low-overhead fault-tolerant NoC architecture, i.e SNoC, is presented. SNoCis base on channel slicing, and couples all slices by slice interfaces whichmake networks to tolerant faults in fine granularities. Each router has4slices,and each link has5sub-links. Its slice interface is self-reconfigurable toprovide optimal configurations according to states of slices and links. Its sliceuses SOBR architecture and PR-WF fault-tolerant routing. Simulation resultsshow that, SNoC is able to achieve good performance even under high faultyrates. Under65nm synthesized results show that its router critical pathsincrease only0.08ns than SOBR router. Compared with the network ofchannel slicing, its area overhead only increases about1%than the channelslicing network.5. A low-overhead fault-tolerant task mapping algorithm, i.e. CMP, is proposed.Existing task mapping algorithms are based on search method, which are time-consuming and have poor scalability. Construction algorithms graduallyconstruct sub-optimal solutions according to characteristics of optimalsolutions. It is of low complexities, and run faster than search algorithms.Therefore, a construction algorithm, named CMAP, is present for taskmapping problem. It is topology-awared to resolve task mapping problem forirregular topologies. It maps large weighted links of the task graph to singlehop routing as much as possible, or maps large degree node of the task graphto local optimal positions. Two real applications and a variety of task graphsare used to verify its accuracy, efficiency, scalability and fault-tolerant ability.

节点文献中: 

本文链接的文献网络图示:

本文的引文网络