
片上网络无缓冲路由器关键技术研究

Research on Key Techniques of Bufferless Router for Network-on-chip

【Author】 冯超超 (Feng Chaochao)

【Advisor】 张民选 (Zhang Minxuan)

【Author information】 National University of Defense Technology, Electronic Science and Technology, 2012, PhD

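【Abstract, translated from the Chinese】 The rapid development of microelectronics has pushed chip design into the multicore era. As the number of cores integrated on a chip keeps growing, on-chip inter-core communication has become the performance bottleneck of the multicore System-on-Chip (SoC). The Network-on-Chip (NoC) replaces the traditional bus and crossbar interconnects with a scalable, high-bandwidth communication architecture; it effectively solves the global communication problem in large-scale multicore SoCs and improves on-chip communication performance. However, as integration density rises, growing power consumption and area have become major constraints on multicore SoC development, and shrinking feature sizes, lower supply voltages and higher clock frequencies seriously degrade NoC reliability. Research on energy-efficient, low-overhead, highly reliable NoCs is therefore important for the design of large-scale multicore SoCs.

The bufferless router offers a low-overhead solution for NoCs. A bufferless router needs no buffers other than its pipeline registers, which greatly reduces router power and area and simplifies router design. In existing bufferless routers, however, the serialized switch allocator limits performance, and the lack of support for reliability design makes it hard to cope with faults in complex environments. This dissertation therefore studies performance optimization and reliability design of the bufferless router microarchitecture. The main work falls into the following four areas.

1. Performance analysis of deflection routing and a bufferless router based on a permutation network

The dissertation designs deflection routing algorithms for several NoC topologies and analyzes and evaluates their performance under several synthetic traffic patterns. The results show that topology and traffic pattern strongly affect deflection routing performance, so designers of a bufferless NoC can choose a topology suited to the target application. For the widely used 2D Mesh NoC, the dissertation proposes a single-cycle, high-performance bufferless router based on a permutation network (BLESS_PERM). A simple two-level permutation network replaces the serialized switch allocator and the crossbar of the original router, which reduces the number of logic levels on the critical path, lowers design complexity and raises the achievable clock frequency. Simulation results show that under synthetic traffic the average packet latency of the BLESS_PERM router is 70%, 65%, 56% and 41% lower than that of the VC, BLESS_BASE, BLESS_PL and CHIPPER routers, respectively; under real application traffic it is 80%, 72%, 66% and 38% lower, respectively.

2. Fault-tolerant architecture for the bufferless router

For reliability, the dissertation proposes a complete fault-tolerant architecture that detects and handles both transient and permanent link faults. The architecture comprises:

An online fault detection mechanism based on block-wise SECDED coding, which detects transient and permanent faults, distinguishes between them, and does not interfere with the transmission of normal packets.

A fault-tolerant flow-control scheme that combines automatic repeat request (ARQ) with forward error correction (FEC) and handles transient faults during packet transmission at the link level.

Two fault-tolerant deflection routing algorithms that route around permanently faulty links at the network layer. The Fault-on-Neighbor aware deflection routing algorithm (FoN) makes routing decisions based on a 2-hop fault information propagation model and the shape of the fault region, and handles regular convex and concave fault regions that have no consecutive concave points. The reconfigurable fault-tolerant deflection routing algorithm based on reinforcement learning (FTDR) targets irregular fault regions and uses reinforcement learning to reconfigure the routing table. To reduce the implementation cost of FTDR, a variant based on a hierarchical routing table, FTDR-H, is also proposed.

A fault-tolerant deflection router based on configurable bidirectional links (BiFTDR), which configures the direction of the bidirectional links between adjacent routers according to link fault status and incoming packet information, and handles unidirectional link faults without detour routing.

3. High-performance, fault-tolerant multicast based on deflection routing

The dissertation proposes three high-performance deflection-routing-based multicast (DRM) schemes. DRM_noPR is simple to implement: while routing a multicast packet it selects the best direction according to the best candidate destination, so the packet reaches every destination along one dynamically changing path. DRM_PR_src and DRM_PR_all replicate multicast packets at the source node or at intermediate nodes according to a region partition rule and the busy or idle state of the router ports; this diversifies multicast paths and effectively reduces multicast latency. To improve multicast reliability, fault-tolerant DRM schemes (FT_DRM) are built on the three DRM schemes. FT_DRM reconfigures the routing table with a reinforcement-learning-based method, routes multicast packets around permanently faulty links and loses no packets. Experimental results show that in a fault-free network the average packet latency of DRM_PR_src is 18% lower than that of DRM_noPR, and that of DRM_PR_all is 40% and 27% lower than that of DRM_noPR and DRM_PR_src, respectively. With 5% and 10% of links faulty, the average packet latency of DRM_PR_src is 17% lower than that of DRM_noPR, and that of DRM_PR_all is 38% lower than that of DRM_noPR.

4. Bufferless router for three-dimensional NoCs

Extending the bufferless router from 2D to 3D makes serialized output port allocation degrade router performance even more severely. To address this, the dissertation proposes a single-cycle, high-performance 3D bufferless router based on a three-level permutation network (3D_PERM). A three-level permutation network replaces the serialized switch allocator and the 7×7 crossbar, and simple permutation rules applied during packet switching avoid livelock; this improves performance while reducing hardware cost. Simulation results show that under synthetic traffic the average packet latency of the 3D_PERM router is 73% and 14% lower than that of the 3D_BASE and 3D_CHIPPER routers, respectively; under real application traffic it is 78% and 14% lower, respectively. To address the low yield of TSV manufacturing for three-dimensional integrated circuits, the dissertation proposes a low-overhead fault-tolerant deflection router (FTDR-3D_OPT) for 3D Mesh NoCs. FTDR-3D_OPT uses one layer routing table and two TSV state vectors instead of a global routing table to route around faulty horizontal and vertical links. Synthesis results show that, compared with a 3D fault-tolerant deflection router that uses a global routing table, FTDR-3D_OPT reduces area and power by 40% and 49%, respectively.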
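The reinforcement-learning reconfiguration of the routing table described for FTDR can be illustrated with a small Q-routing-style sketch. It is not the dissertation's algorithm: the 3×3 mesh, the names `learn_tables` and `best_port`, the update rule (one hop plus the neighbor's best estimate) and the sweep count are all assumptions chosen for illustration.

```python
from itertools import product

# Hypothetical 3x3 mesh; port names and offsets are illustrative assumptions.
W = H = 3
PORTS = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}
INF = 10**6  # stands for "no known route through this port"


def neighbors(node, faulty):
    """Yield (port, neighbor) pairs reachable over non-faulty links."""
    x, y = node
    for port, (dx, dy) in PORTS.items():
        nxt = (x + dx, y + dy)
        inside = 0 <= nxt[0] < W and 0 <= nxt[1] < H
        if inside and frozenset((node, nxt)) not in faulty:
            yield port, nxt


def learn_tables(faulty, sweeps=W * H):
    """Build per-node tables q[node][dest][port] = estimated hops to dest.

    Each entry is refreshed from the neighbor's best estimate, in the
    spirit of Q-routing (assumed rule: 1 + min over the neighbor's ports).
    """
    nodes = list(product(range(W), range(H)))
    q = {n: {d: {p: INF for p in PORTS} for d in nodes} for n in nodes}
    for _ in range(sweeps):
        for n in nodes:
            for port, nxt in neighbors(n, faulty):
                for d in nodes:
                    if d == n:
                        continue
                    est = 0 if nxt == d else min(q[nxt][d].values())
                    q[n][d][port] = min(INF, 1 + est)
    return q


def best_port(q, node, dest):
    """Port with the lowest estimated hop count toward dest."""
    return min(q[node][dest], key=q[node][dest].get)
```

With the link between (0, 0) and (1, 0) marked faulty, `best_port(learn_tables({frozenset({(0, 0), (1, 0)})}), (0, 0), (1, 0))` returns `"N"`, a three-hop detour. In a deflection router the table would only rank the ports; a flit that loses arbitration for its best port still leaves through some other port.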

【Abstract】 With the rapid development of microelectronic techniques, chip design has entered the multicore era. Due to the increasing number of cores on a single chip, communication between cores has become the performance bottleneck of the multicore System-on-Chip (SoC). Network-on-Chip (NoC), as an alternative to the classical bus or crossbar interconnection architecture, has become a scalable and high-bandwidth communication paradigm, which solves the global communication problem for large-scale multicore SoCs and improves the performance of on-chip communication effectively. However, as integration density increases, power consumption and area have become limiting constraints in the design of multicore SoCs. In addition, shrinking feature size, lower supply voltage and higher frequency have a negative impact on the reliability of NoC. Thus, an energy-efficient, low-overhead and highly reliable NoC is especially desirable for large-scale multicore SoCs.

Bufferless routers provide a low-overhead solution for NoC. In a bufferless router, no additional buffers are needed except the pipeline registers, which reduces power consumption and area overhead significantly and also simplifies the design. The serialized switch allocator in existing bufferless routers limits performance. Furthermore, the lack of reliability design in bufferless routers makes it difficult to handle faults in complicated situations. Thus, this dissertation investigates performance optimization and reliability design for the bufferless router microarchitecture. The main contributions of this dissertation are as follows:

1. Performance analysis for deflection routing and bufferless router based on a permutation network

The thesis designs deflection routing algorithms for various NoC topologies and conducts performance evaluations using different synthetic traffic patterns.
The evaluation results illustrate that the performance of deflection routing is sensitive to the network topology and traffic pattern. The NoC architect should choose a suitable NoC topology for the specific application when designing a bufferless NoC. For the widely used 2D Mesh NoC topology, the thesis proposes a 1-cycle high-performance bufferless router based on a permutation network (called BLESS_PERM). The BLESS_PERM router replaces the serialized switch allocator and crossbar with a simple 2-level permutation network, which reduces the number of logic levels on the critical path, simplifies the design and raises the clock frequency. Simulation results illustrate that the BLESS_PERM router achieves 70%, 65%, 56% and 41% less average packet latency than the VC, BLESS_BASE, BLESS_PL and CHIPPER routers respectively under synthetic traffic workloads, and achieves 80%, 72%, 66% and 38% less average packet latency than those four routers respectively under real application workloads.

2. Fault-tolerant architecture for bufferless router

For the reliability design of the bufferless router, the thesis proposes a complete fault-tolerant architecture, which can detect and handle both transient and permanent link faults. The fault-tolerant architecture includes:

An on-line fault detection mechanism using SECDED block coding, which can detect and distinguish transient faults from permanent faults without interfering with normal packet transmission.

A hybrid automatic repeat request (ARQ) and forward error correction (FEC) fault-tolerant flow-control scheme to handle transient faults occurring in packets at the link level.

Two fault-tolerant deflection routing algorithms to route packets around permanent link faults at the network layer.
The Fault-on-Neighbor (FoN) aware deflection routing algorithm, which can tolerate convex and concave fault regions without two concave points in sequence, makes routing decisions based on the 2-hop fault information transmission model and the fault region shape without deadlock and livelock. The reconfigurable fault-tolerant deflection routing algorithm (FTDR) based on reinforcement learning, which can handle irregular fault regions, utilizes a reinforcement learning method to reconfigure the routing table to achieve fault tolerance. A hierarchical-routing-table-based algorithm (FTDR-H) is also presented to reduce the area overhead of the FTDR router.

A fault-tolerant deflection router with reconfigurable bidirectional links (called BiFTDR). The BiFTDR router reconfigures the direction of the bidirectional links between neighboring routers according to the link status and incoming packet information, which can handle the unidirectional fault model without detour routing.

3. High-performance and fault-tolerant deflection-routing-based multicast schemes

The thesis proposes three high-performance deflection-routing-based multicast (DRM) schemes. The DRM_noPR scheme is a simple multicast scheme, which selects the productive direction based on the best candidate. The multicast packet will be routed to each destination along a dynamic path in the DRM_noPR scheme. The DRM_PR_src and DRM_PR_all schemes replicate multicast packets according to a region partition rule and the busy or free status of the output ports, which can increase the diversity of the multicast path and reduce the multicast latency. Furthermore, in order to improve the reliability of the multicast communication, the fault-tolerant DRM schemes (FT_DRM) are proposed based on the three DRM schemes. FT_DRM schemes reconfigure the routing table based on a reinforcement learning method and route multicast packets around permanent link faults without any packet loss.
Experimental results show that in the network without faulty links the DRM_PR_src scheme achieves 18% less average packet latency than the DRM_noPR scheme, and the DRM_PR_all scheme achieves 40% and 27% less average packet latency than the DRM_noPR and DRM_PR_src schemes respectively. In the network with 5% and 10% of all links faulty, the DRM_PR_src scheme achieves 17% less average packet latency than the DRM_noPR scheme, and the DRM_PR_all scheme achieves 38% less average packet latency than the DRM_noPR scheme.

4. Bufferless router for 3D NoC

As the bufferless router extends from 2D to 3D, serialized output port allocation degrades the performance of the router further. The thesis proposes a 1-cycle high-performance 3D bufferless router based on a 3-level permutation network (called 3D_PERM). The 3D_PERM router uses a 3-level permutation network to replace the serialized switch allocator and a 7×7 crossbar, which can improve the performance and reduce the hardware overhead. Simulation results demonstrate that the 3D_PERM router achieves 73% and 14% less average packet latency than the 3D_BASE and 3D_CHIPPER routers respectively under synthetic traffic workloads, and achieves 78% and 14% less average packet latency than the above two 3D bufferless routers respectively under real application workloads. To address the low yield of TSV manufacturing technology in 3D ICs, the thesis proposes a low-overhead fault-tolerant deflection router (called FTDR-3D_OPT) for 3D Mesh NoC. The FTDR-3D_OPT router uses a layer routing table and two TSV state vectors to make efficient routing decisions that avoid both horizontal and vertical link faults. Synthesis results demonstrate that the area and power consumption of the FTDR-3D_OPT router are 40% and 49% less than those of a 3D fault-tolerant deflection router with a global routing table.
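The idea of replacing a serialized switch allocator and crossbar with a 2-level permutation network can be sketched in software as two stages of 2×2 swap blocks. This is a minimal sketch under stated assumptions, not the BLESS_PERM design itself: the `Flit` fields, the age-based priority, the block wiring and the function names `swap_block` and `route` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Flit:
    ident: str
    want: int  # desired output port, 0..3 (assumed encoding)
    age: int   # larger means older; assumed to mean higher priority


def swap_block(a, b, upper_ports, lower_ports):
    """2x2 block: place up to two flits on (upper, lower) outputs.

    The older flit is served first; the other takes whatever is left,
    so no flit is ever dropped or stored.
    """
    flits = [f for f in (a, b) if f is not None]
    flits.sort(key=lambda f: -f.age)  # stable sort: ties keep input order
    upper = lower = None
    for f in flits:
        if f.want in upper_ports and upper is None:
            upper = f
        elif f.want in lower_ports and lower is None:
            lower = f
        elif upper is None:
            upper = f  # deflected: preferred side is taken or absent
        else:
            lower = f
    return upper, lower


def route(inputs: List[Optional[Flit]]) -> List[Optional[Flit]]:
    """Map 4 input slots to 4 output ports through two stages of blocks."""
    assert len(inputs) == 4
    # Stage 1 steers flits toward the half of the router holding their port.
    s1a = swap_block(inputs[0], inputs[1], {0, 1}, {2, 3})
    s1b = swap_block(inputs[2], inputs[3], {0, 1}, {2, 3})
    # Stage 2 picks the exact port within each half.
    out0, out1 = swap_block(s1a[0], s1b[0], {0}, {1})
    out2, out3 = swap_block(s1a[1], s1b[1], {2}, {3})
    return [out0, out1, out2, out3]
```

Every block decides locally and in parallel, so no port is allocated one flit after another. The trade-off is that two stages of 2×2 blocks cannot realize every permutation of four inputs, so a flit may be deflected even when a conflict-free assignment exists.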
