节点文献

Cache一致性片上网络路由算法和流控机制优化关键技术研究

Research on the Key Techniques of Routing Algorithm and Flow Control Optimizations for Cache-Coherent Networks-on-Chip

【作者】 马胜

【导师】 王志英;

【作者基本信息】 国防科学技术大学 , 计算机科学与技术, 2012, 博士

【摘要】 半导体技术的发展不断增加芯片中集成的核数,传统总线或点对点通信架构面临着带宽、延迟、功耗和可扩展性等方面的不足。针对这些局限,片上网络应运而生,它能在芯片内部提供一种简单、高效和可扩展的通信机制。另一方面,由于并行编程的高难度和兼容历史代码的需求,cache一致性协议在众核结构上将长期存在。在cache一致性众核结构上,片上网络传输的通信主要由采用的一致性协议所决定。为高效支持这些通信,需要在分析一致性协议结构和通信特征的基础上对片上网络进行优化设计。本文的主要研究成果及创新点如下:(1).提出了一种面向负载整合工作模式的路由算法Cache一致性协议的层次结构和应用程序有限的并行度导致多应用程序同时运行在一个众核平台上,这种负载整合工作模式要求路由算法提供良好的适应性和动态隔离性。本文提出了基于目标的自适应路由(Destination-Based AdaptiveRouting, DBAR)算法。通过使用一个低开销的拥塞信息传播网络,DBAR获得了本地和远端的网络状态信息,从而能有效避免网络拥塞。更重要的,通过将目标信息集成到输出端口选择中,DBAR能动态隔离多应用程序。在多种网络配置下,DBAR的性能都优于之前的设计。(2).提出了一种面向完全自适应路由算法的流控机制由于面积和功耗的限制,cache一致性片上网络一般配置有限的虚通道数目,它给完全自适应路由算法设计提出了新的挑战。之前的死锁避免理论要求使用保守虚通道分配策略,严重限制了路由算法的性能。本文提出了全报文发送(Whole Packet Forwarding, WPF)虚通道分配策略,并证明了WPF不会带来死锁。WPF能显著提升路由算法的性能,它对之前的死锁避免理论进行了重要扩展。在虚通道受限环境中,路由算法应该提供较高的路由灵活性,本文进一步给出了一种在较低硬件开销条件下提供较高路由灵活性的完全自适应路由算法。(3).提出了面向torus片上网络的切片气泡流控机制Cache一致性片上网络需要同时传输长报文和短报文,已有的torus网络死锁避免理论不能高效处理这种混合长度报文的传输,它们要么需要使用两条虚通道,要么需要将短报文视为长报文。本文提出了切片气泡流控(Flit Bubble FlowControl, FBFC)死锁避免理论。FBFC通过在虫孔交换ring网络上维持一个空闲缓存单元避免死锁。FBFC只使用一条虚通道,获得了较高的路由器频率;FBFC不需要将每个短报文视为长报文,提高了缓存利用率。基于FBFC理论,本文给出了两种实现,它们的性能都显著优于已有设计。(4).提出了一种高效支持cache一致性协议归约和多播通信的技术Cache一致性协议需要使用归约和多播通信,为了防止这些通信成为系统性能瓶颈,必须要对它们提供硬件支持。本文研究了对目录一致性协议的多播cache行作废消息和归约acknowledgement(ACK)消息的硬件支持。本文提出了一种消息组合框架支持归约ACK消息的组合操作,该框架不仅在低到中等负载下降低了报文平均延迟,同时也提升了网络吞吐率,它只需少量的额外硬件开销。此外,本文还提出了均衡自适应多播路由(Balanced, Adaptive Multicast, BAM)算法,该算法均衡了不同维度间的缓存资源,进一步提升了网络吞吐率。综上所述,本文紧紧围绕“面向cache一致性通信优化片上网络设计”这一目标,基于对一致性协议结构和通信特征的分析,优化设计了路由算法和流控机制。这些优化设计不仅取得了一定的系统性能提升,同时也扩展了已有的死锁避免理论。因此,本文既具备一定的工程价值,又具备一定的理论意义。

【Abstract】 The advancement of semiconduct technology increases the core count, and makesthe traditional bus or point-to-point communication mechanisms face several challenges,including low bandwidth, high latency, high power consumption, low scalability and etc..To address these challenges, Network-on-Chip (NoC) was proposed. NoC was regardedas a simple, efficient and scalable communication paradigm for future many-core plat-forms. On the other side, due to the difficulty of parallel programming and compatibilityrequirements of history codes, cache coherence protocols will exist in many-core plat-forms. In cache coherent many-core platforms, the traffic delivered by NoC is mostlydecided by the applied cache coherence protocol. To provide efficient communicationsupport for coherent traffic, it is necessary to analyze the traffic characteristics, and thenoptimize the NoC design. The main contributions of this thesis are as follows.1. Efficient routing algorithm to support workload consolidation scenarios.Duetothehierarchicalcachecoherenceprotocolandlimitedapplicationparallelism,it is quite possible that multiple applications will run concurrently in a many-core plat-form. These workload consolidation scenarios require the routing algorithm to provideboth sufficient adaptivity and dynamic isolation. This thesis proposes Destination-BasedAdaptive Routing (DBAR). By leveraging a low-cost congestion propagation network,DBAR utilizes both local and non-local network status to efficiently avoid congestion.More importantly, by integrating the destination information into the output port selectionprocedure, DBAR dynamically isolate multiple concurrent applications. DBAR offersbetter performance than the best baseline algorithm for many measured configurations.2. Efficient design of fully adaptive routing algorithm for cache coherent trafficDue to area and power consumption limitations, cache coherent NoC generally con-figure a small number of virtual channels (VCs). Limited VCs pose several challengesto the design of fully adaptive routing algorithms. Previous deadlock avoidance theoriesrequire a conservative VC re-allocation scheme, which strongly limits the performance ofrouting algorithms. This thesis proposes a novel VC re-allocation scheme, whole packetforwarding (WPF). We prove that WPF does not induce deadlock, thus WPF is an im-portant extension of previous deadlock-avoidance theories. WPF can greatly improvethe performance of fully adaptive routing algorithms. To efficiently utilize WPF in VC- limited networks, we design a novel fully adaptive routing algorithm which maintainspacket adaptivity without significant hardware cost.3. Flit bubble flow control for torus cache coherent NoCsShortandlongpacketscommonlyco-existincache-coherentnetworks-on-chip(NoC-s). Existing deadlock avoidance designs for torus networks do not efficiently handle thismixofpacketsizes. ThesepreviousdesignseitherleveragetwoVCsorregardeachpacketas a maximum-length packet. We propose a novel deadlock avoidance theory, flit bubbleflow control (FBFC). The insight of FBFC is that maintaining one free flit-size bufferslot inside a ring can avoid deadlock for wormhole torus networks. Only one VC is re-quired. FBFC does not treat short packets as long ones; this yields high buffer utilization.Based on this theory, we present two implementations, and both show large performanceimprovement than previous designs.4. Efficient support of collective communication in cache coherence protocolsCache coherence protocol utilizes reduction and multicast. Hardware support is nec-essary to prevent these collective communication from becoming a system bottleneck.This research explores support for reduction and multicast communication operationsin a directory cache coherence protocol. This paper makes two primary contributions:an efficient framework to support the reduction of ACK packets and a novel Balanced,Adaptive Multicast (BAM) routing algorithm. By combining ACK packets during trans-mission, this framework not only reduces packet latency, but also improves the networksaturation throughput with little overhead. The balanced buffer resource configuration ofBAM helps to get some additional saturation throughput improvements.In summary, this thesis aims to ‘optimizing the design of NoCs for cache coherenceprotocols’. Based on the analysis of the characteristics of coherent traffics, we optimizethe design of routing algorithms and flow control mechanisms for NoCs. The proposedoptimizations not only improve the performance, but also extend the deadlock avoidancetheories. Thus, this thesis has both engineering value and theoretical significance.

节点文献中: