
Interconnection Network Architecture for Large-scale Manycore Processors and Its Performance Analysis (大规模众核微处理器互连网络体系结构及性能分析研究)

【Author】 冯权友 (Feng Quanyou)

【Advisor】 窦文华 (Dou Wenhua)

【Author Information】 National University of Defense Technology, Computer Science and Technology, 2012, Ph.D.

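【Abstract (translated from the Chinese 摘要)】 High-performance processors based on multicore and even manycore designs are an enabling technology for future exascale high-performance computers. An interconnection network with high bandwidth, low latency, low power, and strong scalability is essential for unlocking the parallel computing capability of the processor cores and for improving manycore processor performance. Among the design challenges of manycore systems, interconnect communication is increasingly the bottleneck that limits system performance. Emerging 3D integration technology and silicon photonic devices offer distinct advantages in chip functionality, integration density, and power consumption, and their maturation brings new opportunities for resolving the manycore interconnect bottleneck. Taking this bottleneck as its starting point, this thesis explores novel interconnection network architectures for manycore microprocessors and uses network calculus to model and analyze them. The work covers four areas.

(1) On-chip inter-core interconnection network architecture for manycore systems

Inter-core traffic consists mainly of control packets, which have very strict real-time requirements. As the number of cores grows, transmission latency becomes the primary factor limiting the performance of the inter-core network in large-scale manycore processors. Simple low-dimensional on-chip topologies, typified by the mesh, are easy to route, but their hop count grows in proportion to system scale, so they can hardly meet the low-latency requirements of large-scale manycore chips. Using 3D integration technology, this thesis proposes a three-dimensional flattened butterfly topology for transmitting electrical inter-core packets in large-scale manycore processors. With an integer linear programming model, we overcome the layout challenges posed by the high-radix routers and long wires of the butterfly network and embed the flattened butterfly into the 3D stack. The flattened butterfly is a high-dimensional, highly scalable topology that is especially well suited to interconnecting large numbers of cores. The 3D butterfly network preserves the connectivity of the mesh, adds extra shortcut links, and exploits high-speed vertical interconnects to deliver inter-core packets quickly. Experimental results show that it effectively reduces inter-core latency and significantly improves manycore processor performance.

(2) Optical memory access network architecture for manycore microprocessors

The memory interconnect is critical to a manycore processor: if data cannot be fetched and stored quickly, the processor's parallel computing capability is hard to exploit. As more cores are integrated on a single chip, the demand for memory bandwidth grows sharply. The traditional processor-to-memory interconnect based on electrical I/O pins runs into difficulty in large-scale manycore chips, because electrical interconnects can hardly supply enough memory bandwidth within a strict power budget. Using emerging silicon photonic devices and 3D integration technology, we propose a high-bandwidth, low-power optical memory access network for communication between the manycore processor and DRAM. This network, based on an optical burst switching protocol, replaces electrical I/O pins with optical interfaces and provides a high-bandwidth, seamless connection between processor and memory. Beyond the bandwidth advantage, the new scheme uses wavelength resources far more efficiently than earlier optical memory access networks, which further improves the energy efficiency of memory traffic. Experimental results show that the energy efficiency of the optical burst-switched memory access network improves by nearly 2x over an optical circuit-switched memory access network and by 6x over the electrical interface scheme.

(3) Congestion avoidance for the electrical control plane in chip-scale optical networks

Because optical buffers and optical logic devices are not available, most hybrid opto-electronic networks use an electrical control plane for resource arbitration and link control. In studying chip-scale optical burst-switched networks, we found that the large number of fine-grained optical bursts, the strict latency limits, and the moderate network operating frequency together constrain the processing capacity of the electrical control plane and easily cause severe congestion. We therefore propose a traffic shaping scheme to address control-plane congestion. Before injection, all flows in the system are globally coordinated and shaped so that the aggregate rate of control packets at any intermediate node does not exceed that node's maximum processing capacity, which relieves control-plane congestion. We use an optimization algorithm to select the shaper parameters (for example, the flow rate and the burstiness parameter). To some extent this scheme reserves resources for the end-to-end transmission of each flow and provides a basic quality-of-service guarantee on bandwidth, effectively reducing the burst losses caused by control-plane congestion. Experiments with synthetic traffic and real application traces show that the method effectively avoids control-plane congestion, lowers the loss rate, and improves the system performance of chip-scale optical burst-switched networks.

(4) Performance analysis of chip-scale optical interconnection networks

Designing a chip-scale optical interconnection network requires balancing several factors, including network latency, throughput, energy consumption, and silicon area. The choice of these system-level interconnect parameters directly affects the performance of the whole chip, so performance analysis of the on-chip network matters for system design. We therefore developed analytical models of chip-scale optical networks. Using stochastic network calculus, we built a model of the buffer requirements of an optical burst-switched network and a model for estimating the wavelength requirements of the optical devices. Simulation and numerical analysis show that the bounds given by these models are fairly tight. With these models, system-level design parameters of the optical interconnect, such as buffer requirements, transmission latency, and optical device requirements, can be evaluated quickly. Modeling network performance early in the design also reduces design risk in advance. Overall, the analytical models capture the relationship between system performance, network load, and architecture, which helps to identify the key performance factors and design bottlenecks quickly and to narrow the design space.

In summary, this thesis studies the interconnect bottleneck of manycore systems, proposes new network architectures, and uses network calculus to model these architectures analytically and analyze their performance. The work ties theory closely to practice, offers new solutions to the manycore interconnect bottleneck, contributes to the advancement of high-performance processor technology, and further extends the range of applications of network calculus.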
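The traffic shaping described in area (3) is parameterized by a flow rate and a burstiness parameter. The abstract does not say which regulator the thesis uses, so the following is only a minimal sketch that assumes a classic token-bucket regulator; the names `TokenBucket`, `release_time`, and `overloaded_nodes`, and all numeric values, are hypothetical and chosen for illustration.

```python
from collections import defaultdict


class TokenBucket:
    """Assumed (burst, rate) regulator: at most burst + rate * t headers
    are released in any time window of length t. Requires burst >= 1."""

    def __init__(self, rate, burst):
        self.rate = rate      # sustained header rate (headers per time unit)
        self.burst = burst    # bucket depth, i.e. the burstiness parameter
        self.tokens = burst   # the bucket starts full
        self.last = 0.0       # time of the most recent release

    def release_time(self, arrival):
        """Earliest injection time for a header arriving at `arrival`.
        Arrivals must be passed in non-decreasing time order."""
        t = max(arrival, self.last)
        # Refill tokens for the elapsed time, capped at the bucket depth.
        self.tokens = min(self.burst, self.tokens + (t - self.last) * self.rate)
        if self.tokens < 1.0:
            # Not enough credit: wait until one full token has accumulated.
            t += (1.0 - self.tokens) / self.rate
            self.tokens = 1.0
        self.tokens -= 1.0
        self.last = t
        return t


def overloaded_nodes(flows, capacity):
    """Return the nodes whose aggregate regulated rate exceeds capacity.
    flows: list of (rate, path); capacity: dict node -> max header rate."""
    load = defaultdict(float)
    for rate, path in flows:
        for node in path:
            load[node] += rate
    return sorted(n for n, r in load.items() if r > capacity[n])


# Four headers arriving at once through a bucket of depth 2 and rate 1.
bucket = TokenBucket(rate=1.0, burst=2.0)
releases = [bucket.release_time(0.0) for _ in range(4)]

# Two flows sharing node "b", whose header processor handles 0.8 per unit.
flows = [(0.4, ["a", "b"]), (0.5, ["b", "c"])]
hot = overloaded_nodes(flows, {"a": 1.0, "b": 0.8, "c": 1.0})
```

In this sketch the first two headers pass immediately and the rest are spaced one time unit apart, and node `b` is flagged because the two regulated rates sum to more than its capacity. Choosing the rates so that no node is flagged is the kind of constraint the optimization step in the abstract would have to satisfy.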

【Abstract】 High-performance multicore, or even manycore, processors are the enabling technology for the coming exascale computing era. To exploit the unprecedented parallelism of these cores and further boost the throughput of manycore systems, it is important to provide a high-bandwidth, low-latency, low-power, and highly scalable chip-scale interconnection infrastructure. Recently, the challenge of manycore processors has gradually shifted from logic design to interconnects; on-chip inter-core communication and processor-to-memory interconnects have become the bottleneck for system improvement. Advances in 3D integration technology and silicon photonic devices provide new opportunities for manycore interconnect design.

In this thesis, aiming at the manycore interconnect design challenges, we propose new interconnection network solutions for both inter-core and processor-to-memory communication by exploiting the advantages of 3D integration and silicon photonics. We also develop analytical models to study the performance of these new architectures using network calculus. The main contributions are summarized as follows.

(1) A three-dimensional flattened butterfly network for on-chip inter-core communication

Recent studies show that inter-core messages place stringent demands on transmission delay, as most of them are small control packets, e.g. cache-coherence messages. Transmission delays get much worse as more cores are integrated, for example 1000 cores. Although low-radix topologies, e.g. the popular 2D mesh, are easy to place and route, they are unable to meet the latency budget of a large-scale manycore system, as the hop count of low-radix networks increases proportionally with the number of cores. Therefore, we propose a three-dimensional flattened butterfly network for inter-core communication in large-scale manycore systems by exploiting the advantages of 3D integration technology.
We overcome the routing challenges of area-hungry high-radix routers and long global wires in the flattened butterfly using 3D stacking, and successfully embed it into multiple stacked layers by formulating the problem as an integer linear programming model. A three-dimensional flattened butterfly is very efficient for fast inter-core message transfer, because it not only employs express one-hop vertical interconnects, but also provides additional links beyond the connectivity of a 2D mesh. Thus, as shown by our simulation results, the new scheme can greatly reduce inter-core message delays and boost the performance of manycore processors.

(2) A photonic burst-switched memory access network for large-scale manycore processors

Processor-to-memory schemes are vital for manycore systems, since slow memory access limits the performance of the parallel computing cores. Memory bandwidth demand increases proportionally with the number of integrated cores. As projected by ITRS, traditional electrical I/Os are unable to provide enough bandwidth for large-scale manycore systems under a stringent power budget. Therefore, we propose a high-bandwidth, low-power optical memory access scheme for manycore processor-to-DRAM communication by exploiting the advantages of 3D integration technology and silicon photonic devices. Our photonic burst-switched (PBS) scheme is an adaptation of optical burst switching for chip-scale networks using silicon photonic devices. The PBS network meets the enormous bandwidth demand and stringent energy constraints by using high-speed, low-power, CMOS-compatible photonic devices. Furthermore, it has higher bandwidth utilization than previous wavelength-routed schemes and optical circuit-switched memory access networks because of sub-wavelength optical switching. We examine the system feasibility and performance using a physically accurate network-level simulation environment. We evaluate the architecture using synthetic traffic patterns and real workload traces.
Simulation results show that our scheme achieves considerable energy savings compared to the optical circuit-switched memory access network and traditional electrical I/O schemes.

(3) A new method to reduce control-plane congestion in chip-scale optical burst switching (OBS) networks

In current OBS networks, many control-plane operations, such as shared resource arbitration and link management, are usually performed in the electrical domain because optical buffer devices and optical logic devices are not available. Due to the random nature of burst arrivals at core nodes, control-plane congestion can occur in an OBS network when the short-term arrival rate of headers at a core node exceeds the maximum rate at which they can be processed. The problem gets even worse in chip-scale OBS, since 1) a chip-scale OBS network is characterized by massive numbers of short bursts (fine-grained control messages, such as memory read/write requests) that have stringent requirements on communication delay; and 2) the operating frequency of a chip-scale OBS network is constrained by thermal limits and a limited power budget, and therefore cannot be very high. All these features intensify control-plane congestion. Thus, we propose a new approach that addresses the control-plane congestion problem in chip-scale OBS using traffic regulation. Before being injected, every concurrent control flow is globally regulated and coordinated so that the aggregated flows do not exceed the header processing capacity of any intermediate core node, which alleviates control-plane congestion. In other words, our regulation method provides some end-to-end bandwidth guarantees for each flow, resulting in a significant reduction of burst losses. To select optimal regulator parameters, we formulate the regulation method as an optimization problem.
Simulation results with both real application traces and synthetic flows show that our approach can effectively resolve control-plane congestion and achieve considerable performance improvements in terms of network delay and burst loss rate.

(4) Resource dimensioning and performance analysis of chip-scale optical networks using stochastic network calculus

The design of a chip-scale optical network is characterized by challenging trade-offs among latency, throughput, energy consumption, and silicon area requirements. These architectural parameters directly influence system performance. Thus, it is very useful to analyze them in the early stages of design so as to avoid bottlenecks and reduce design risks. We therefore develop analytical models to study chip-scale OBS networks. Using stochastic network calculus, we propose an analytical model of the ingress node to dimension the buffer size and calculate end-to-end latency; we also develop a “virtual wavelength buffer” model to estimate the required number of wavelengths with respect to a tolerable burst loss probability. Analytical performance bounds on buffer size and delay are computed and compared with simulations. The simulation results show that the bounds are reasonably tight. Using these stochastic network calculus models, we can quickly evaluate interconnect architecture parameters including buffer size, transmission delay, and wavelength requirement. Our analytical models capture the relationship between system performance and network architecture, so they are very useful for locating system bottlenecks, resulting in fast convergence of the complex design space.

In summary, we investigate the manycore interconnect bottleneck and propose new interconnection network architectures for large-scale manycore processors; we also build analytical performance models for the new interconnect schemes using network calculus.
We contribute new solutions to the manycore communication problem and further extend the application field of network calculus theory. Our work has both academic and practical value in promoting the advancement of high-performance processors.
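Contribution (4) dimensions buffers and delays with stochastic network calculus, whose models the abstract does not reproduce. As a rough illustration of the kind of bound involved, the sketch below uses the simpler textbook deterministic result instead: a flow with arrival curve sigma + rho * t served by a rate-latency server with rate R and latency T has delay at most T + sigma / R and backlog at most sigma + rho * T, provided rho <= R. The function name and the example numbers are assumptions, not values from the thesis.

```python
def rate_latency_bounds(sigma, rho, R, T):
    """Deterministic network calculus bounds (a simplification of the
    stochastic models mentioned in the abstract).

    sigma: burstiness of the arrival curve (data units)
    rho:   sustained arrival rate (data units per time unit)
    R:     service rate of the node (data units per time unit)
    T:     service latency of the node (time units)
    """
    if rho > R:
        raise ValueError("unstable: arrival rate exceeds service rate")
    delay_bound = T + sigma / R        # worst-case delay through the node
    backlog_bound = sigma + rho * T    # worst-case buffer occupancy
    return delay_bound, backlog_bound


# Hypothetical ingress node: burst of 4 units, rate 0.5, served at rate 1.0
# after a latency of 2 time units.
delay, backlog = rate_latency_bounds(sigma=4.0, rho=0.5, R=1.0, T=2.0)
```

With these made-up numbers the delay bound is 6 time units and the buffer bound is 5 data units. A stochastic analysis would replace these worst-case values with bounds that hold except with a small violation probability, which is what allows a tolerable loss probability to be traded against buffer size or wavelength count.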
