
Key Techniques Research on Terascale Embedded Computing (original title: 面向万亿次量级嵌入式计算的体系结构关键技术研究)

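【Author】 Yang Qianming (杨乾明)

【Advisor】 Zhang Chunyuan (张春元)

【Author Information】 National University of Defense Technology, Computer Science and Technology, 2012, Ph.D.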

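【Abstract (translated from the Chinese)】 As communication standards and coding algorithms continue to evolve, high-performance embedded applications place ever higher demands on processor performance and energy consumption, and terascale embedded applications are beginning to emerge. The rapid development of very-large-scale integration (VLSI) technology makes it possible to build energy-efficient embedded processors that meet these demands. However, turning the potential of VLSI into actual computing capability that satisfies terascale embedded applications remains highly challenging. Traditional embedded processors use simple processor structures and achieve very low power, but their performance falls far short of what future embedded applications require. High-performance microprocessors such as GPUs and MIC integrate billions of transistors on a single chip using many-core structures and deliver very high performance, but because they rely on conventional techniques such as superscalar execution and simultaneous multithreading, they consume a great deal of power and cannot meet the energy requirements of future embedded applications. Against this background, the author chose "Research on Key Architecture Techniques for Terascale Embedded Computing" as the thesis topic. The thesis studies a range of energy-efficient architecture techniques, covering novel data memory hierarchy design, functional-unit interconnect design for fully distributed VLIW, ultra-low-power processor core design, and reconfigurable computing based on stream templates. Its work and contributions are as follows:

1. A multi-level granularity-matched register hierarchy (MGR: Multi-level Granularity-matched Register Hierarchy) is proposed. MGR organizes the data access and processing of embedded applications into layers. The outermost layer is coarse-grained streaming data access, which is strongly sequential and predictable. The middle layer is block data access, which fetches one block at a time, is highly predictable, and has weak dependencies between blocks. The innermost layer is access to data within a block, which is more flexible and somewhat random. For these three layers, MGR uses a frame buffer memory, an advanced register file, and a tiny pixel register file, respectively, to capture the data locality of each layer. Each level of the hierarchy therefore only needs to implement its own function, which keeps the hardware at every level simple and efficient. Experimental results show that, compared with other typical memory hierarchies, MGR reduces energy consumption by 53% to 62% while performance stays the same or drops only slightly.

2. A partial interconnect design for functional units in fully distributed VLIW structures is proposed. It addresses the high delay, high power, and poor scalability of a full interconnect among functional units in such structures. The thesis first analyzes how embedded applications use the full interconnect and summarizes several typical communication patterns. It then proposes several partial interconnect structures for these patterns and builds a VLSI model of them. Finally, it analyzes in depth how each partial interconnect structure affects delay, area, power, and program performance. Experimental results show that, compared with a full interconnect, a partial interconnect greatly reduces hardware cost with only a slight loss in performance. As the VLIW scales up, the partial interconnect also shows better scalability.

3. An ultra-low-power embedded processor core is designed. Large-scale multicore parallelism built from many simple small cores and a few complex big cores has become the mainstream way to improve the energy efficiency of embedded processors. For the simple small core, the thesis proposes the Smart Core processor design. Following a design philosophy of explicit parallelism and accurate computing, Smart Core adopts a VLIW parallel execution mode, a multi-level data memory hierarchy (streaming memory + hierarchical register file + tiny register file), and asymmetric fully distributed instruction registers to reduce the energy of the instruction pipeline, data supply, and instruction supply, respectively. Preliminary experimental results show that Smart Core achieves 25x the energy efficiency of traditional embedded processors. In a 40 nm process, a many-core system built from Smart Cores can deliver more than 1 Tops on a single chip while keeping energy efficiency above 100 Gops/W.

4. A multi-granularity dynamically reconfigurable processor based on a stream template (MGR-SAT: A Multi-Granularity Reconfigurable DSP based on Stream Architecture Template) is proposed. MGR-SAT combines stream processing, dynamic reconfiguration, and platform-based design. In hardware it consists of a scalar core, a stream processing core, and the corresponding external interfaces. The stream processing core is a dynamically configurable unit, made up of a coarse-grained configurable unit and a fine-grained configurable unit, that is used for compute acceleration. MGR-SAT as a whole runs in a stream-processing manner: the scalar core configures the stream processing core and starts its execution and data transfers. Experimental results show that MGR-SAT has clear performance and power advantages over typical current processing platforms.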

【Abstract】 With the evolution of more sophisticated communication standards and algorithms, embedded applications exhibit higher performance and efficiency requirements. Some emerging applications demand terascale operations per second. Although the rapid development of VLSI technology makes it possible to build processors with tera-order computing capacity, turning billions of transistors into actual computing power is still a challenging task. Traditional embedded processors use a simple control structure and achieve very low power consumption, but they do not provide enough performance. High-performance microprocessors such as GPUs and MIC integrate billions of transistors using many-core technology and can provide performance exceeding one Tops, but they are far from meeting the power and energy-efficiency needs of future embedded applications, because they rely on techniques such as multithreading and shared coherent caches, which consume much energy. To address these problems, this thesis takes "Key Techniques Research on Terascale Embedded Computing" as its subject. It focuses on various energy-efficient architecture technologies, including a new data memory hierarchy design, the interconnection of functional units in fully distributed VLIW, ultra-low-power processor core design, and the organization of computing resources. The main contributions and innovations of this thesis are as follows:

1. We propose a multi-level granularity-matched register hierarchy named MGR. MGR divides the data access of embedded applications into three layers. The outermost layer deals with sequential and predictable streaming data; the middle layer deals with block data, where the dependencies between blocks are weak; the innermost layer deals with data within the same block, where the access pattern is flexible and random. Corresponding to the three layers, MGR uses a frame buffer register file, an enhanced register file, and a tiny-sized pixel register file to capture their respective data localities. Each memory layer is therefore concerned only with its own function, and its hardware implementation becomes simple and efficient. Compared with other typical memory hierarchies, the results show that MGR achieves a 53%-62% reduction in energy consumption while delivering almost the same performance.

2. We study the partially connected crossbar for fully distributed VLIW. A crossbar with full connectivity suffers from high delay, high power consumption, and weak scaling. We first analyze the usage of the full crossbar in embedded applications and summarize several typical communication patterns. Corresponding to these patterns, several kinds of crossbars with sparse connectivity are proposed. We model the delay, area, and power of the partially connected crossbar. The experimental results show that, compared with the full crossbar, the partially connected crossbar greatly reduces hardware cost while decreasing performance only slightly. Moreover, as the number of functional units in the VLIW scales up, the partially connected crossbar becomes more efficient.

3. We design an ultra-low-power embedded processor core. Future many-core processors may consist of a large number of small processor cores and a few big processor cores. For the role of the small core, an ultra-low-power embedded processor core named Smart Core is proposed. Based on the methodologies of explicit parallelism and accurate computing, Smart Core uses the VLIW execution mode, a multi-level data memory hierarchy (streaming memory + hierarchical register file + tiny-sized register file), and an asymmetrical fully distributed instruction register to reduce the energy of the instruction pipeline, data supply, and instruction supply, respectively. Preliminary results show that Smart Core achieves an energy efficiency that is 25x greater than that of a traditional embedded RISC processor. When scaled to a 40 nm CMOS technology, a single-chip multiprocessor consisting of many cores like Smart Core is capable of providing more than 1 Tops of performance while achieving an efficiency of 100 Gops/W or more.

4. We present a multi-granularity reconfigurable DSP based on a stream architecture template, named MGR-SAT. MGR-SAT merges stream processing technology, dynamic reconfiguration technology, and platform-based technology, and consists of a scalar core, a stream processing core, and the external interfaces. The stream processing core consists of a coarse-grained reconfigurable unit and a fine-grained reconfigurable unit and can be reconfigured dynamically at run time. The scalar core is responsible for configuring the stream processing core, initiating it, and enabling the transfers of block data. The experimental results show that, compared with other typical processing platforms, MGR-SAT delivers significantly higher performance and power efficiency.
