
An Area and Bandwidth Efficient Programmable Shader Architecture for Embedded Graphics Processing Units
(面积带宽优化的嵌入式GPU可编程着色器体系结构研究)

【Author】 常轶松 (Chang Yisong)

【Supervisor】 孙济洲 (Sun Jizhou)

【Author Information】 Tianjin University, Computer Application Technology, 2013, Ph.D.

【摘要 (Abstract)】 With the continuous advance of VLSI process technology and growing application demands, integrating embedded GPUs based on multiple unified shaders into systems-on-chip has become an important trend for high-end mobile devices. Under strict chip-area constraints, however, the number of programmable shader cores that an embedded GPU can accommodate is very limited, so the architecture design must effectively improve the computing performance of a single shader while keeping its area overhead small. On the other hand, an embedded GPU must frequently access off-chip graphics data storage during rendering, which causes very high bus access bandwidth and increases system power consumption. Optimizing the logic area and the data access bandwidth of the programmable shader has therefore become an important direction in embedded GPU architecture research. Addressing these problems, this dissertation studies system-level modeling methods for multi-core embedded GPUs, an area-optimized arithmetic datapath and architecture for a single shader, and a bandwidth-optimized vertex cache structure for multiple shaders, providing a theoretical and technical basis for the research and design of future multi-core embedded GPU architectures.

First, a high-level full-system simulation platform for embedded GPUs based on hybrid modeling techniques is proposed. To effectively speed up the simulation of complex system software, a microprocessor instruction-set simulator based on the QEMU virtual machine is introduced, and the on-chip interconnect of the SoC is modeled with SystemC transaction-level models, which improves simulation efficiency. A basic embedded GPU architecture with multiple unified shaders and on-chip data buffers is then proposed, and its microarchitectural details are described with cycle-level modeling. Finally, the cycle-level model is integrated with the SystemC transaction-level hardware model, forming the experimental platform used by the rest of this dissertation.

Second, an area-optimized floating-point datapath for the programmable shader is proposed. Targeting the characteristics of floating-point vector operations, a multi-functional unified floating-point vector arithmetic unit is designed: the key hardware blocks of an existing vector dot-product unit are vectorized and reused so that the unit also supports basic vector instructions, preserving performance while keeping the additional logic area as small as possible. On this basis, idle vector units inside the shader are reused to evaluate quadratic polynomial approximations of scalar transcendental functions, further reducing the logic cost of the scalar special function unit.

Third, taking the transport triggered architecture as the starting point, the single-shader architecture is optimized for both performance and area. The fine-grained data transports and architecturally visible data bypasses of the transport triggered architecture are used to remove redundant write-backs of result data during shader instruction execution, which effectively exploits instruction-level parallelism inside the shader and reduces the design complexity of the datapath interconnect. Taking the vertex shader as an example, a transport-triggered programmable shader microarchitecture is then designed in detail: a customized shader micro-instruction set is defined by combining the characteristics of transport triggering and vertex processing, and the area is further reduced by configuring the number of function units and by improving the register ports and write-back mechanism. Finally, the shader is implemented in hardware and on an FPGA prototype system, verifying that the proposed programmable shader architecture delivers high computing performance with reduced area cost and thus improved area efficiency.

Finally, a primitive-oriented vertex fetch strategy is proposed to eliminate the ordering dependencies among vertex tasks running on multiple shaders. On this basis, the vertex cache structure originally designed for a single vertex shader is improved to optimize vertex data access bandwidth in the multi-shader architecture. Before vertex shading, a pre-TnL vertex cache combined with the primitive-oriented vertex fetch strategy buffers recently fetched vertex data and lowers bus access frequency; afterwards, a post-TnL vertex cache whose tag part is separated from its data storage is designed to buffer the vertex results recently committed by the shaders. In-order submission control logic in the task scheduler of the multi-core embedded GPU guarantees the correctness of the results held in the separated cache. Simulation results show that the separated post-TnL vertex cache effectively reduces the number of redundantly processed vertices and further lowers vertex access bandwidth.

Simulation evaluation and hardware implementation results show that the embedded GPU programmable shader architecture design methods proposed in this dissertation optimize both area cost and vertex data access bandwidth, and constitute a useful exploration for the design and implementation of future embedded GPU architectures based on multiple unified shaders.
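
To make the reuse of the vector dot-product unit for scalar transcendental functions more concrete, the following C++ sketch shows how a quadratic polynomial approximation f(x) ~= c2*x^2 + c1*x + c0 can be evaluated as a single 4-element dot product. The dot4/quad_approx helpers and the coefficient values are illustrative assumptions, not the dissertation's actual hardware or coefficient tables.

    #include <array>
    #include <cmath>
    #include <cstdio>

    // Illustrative sketch only: a 4-wide dot-product datapath of the kind the
    // dissertation reuses for both vector instructions and scalar special functions.
    static float dot4(const std::array<float, 4>& a, const std::array<float, 4>& b) {
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
    }

    // Quadratic approximation of a transcendental function on one interval:
    // f(x) ~= c2*x^2 + c1*x + c0, evaluated as a single dot product so that an
    // otherwise idle vector unit can produce the scalar result.
    static float quad_approx(float x, float c2, float c1, float c0) {
        return dot4({c2, c1, c0, 0.0f}, {x * x, x, 1.0f, 0.0f});
    }

    int main() {
        // Hypothetical coefficients: a second-order Taylor expansion of sin(x)
        // around x = 0.5 (a real design would use per-interval table entries).
        const float c2 = -0.2397f, c1 = 1.1173f, c0 = -0.0193f;
        const float x = 0.6f;
        std::printf("approx sin(%.1f) = %.4f, libm sin = %.4f\n",
                    x, quad_approx(x, c2, c1, c0), std::sin(x));
        return 0;
    }

Because the polynomial evaluation maps onto the existing dot-product lanes, the scalar special function unit does not need its own multiply-add tree, which is the area saving the dissertation targets.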

【Abstract】 With the development of silicon technology and growing application requirements, embedded graphics processing units (GPUs) with multiple unified shaders have been widely integrated into systems-on-chip (SoCs) for high-end mobile devices. However, the number of programmable shader cores in an embedded GPU is restricted by silicon area cost, so shader architecture design must improve performance while maintaining area efficiency. Moreover, a large amount of graphics data located in external memory must be accessed during rendering, leading to high bus bandwidth and considerable power dissipation in embedded GPUs. It is therefore essential to optimize area cost and data bandwidth in the programmable shader architecture. This dissertation addresses both problems, covering a modeling method for multi-core embedded GPU architectures, an area-efficient arithmetic datapath and processor architecture for shaders, and a bandwidth-optimized vertex cache hierarchy for multi-shader architectures. The goal of the proposed work is to provide a theoretical and technological foundation for future research and design of multi-core embedded GPU architectures.

First, a high-level full-system simulation platform for embedded GPUs based on hybrid modeling methods is proposed. To avoid slow simulation of complex system software, an instruction-set simulator based on QEMU is used, and the interconnection network and device interfaces of the SoC are modeled in SystemC-TLM to improve simulation efficiency. A basic embedded GPU architecture based on multiple unified shaders and internal data buffers is then introduced; its microarchitecture is described by a detailed cycle-level model, which is combined with the SystemC-TLM hardware model to provide the experimental platform for the rest of this work.

Second, area-efficient floating-point (FP) function units for the shader are proposed. A unified, multi-functional FP vector arithmetic unit (VAU) is implemented first: the main hardware blocks of a conventional vector dot-product unit are vectorized and multiplexed to support basic vector operations, maintaining performance while avoiding large additional area cost. Based on the VAU, idle VAUs in the shader are used to calculate quadratic approximations, further reducing the area cost of the elementary transcendental function unit.

Third, a high-performance, area-efficient programmable shader architecture based on the transport triggered architecture (TTA) is proposed. With fine-grained data transports and bypasses that are visible at the microarchitecture level, redundant write-backs of instruction results are avoided, which benefits the exploitation of instruction-level parallelism. A detailed TTA-like vertex shader microarchitecture is then implemented. Combining the features of TTA and vertex processing, a customized shading instruction set is defined; by configuring the number of function units and optimizing the register ports and result write-back scheme, the area cost of the implemented vertex shader is further reduced.
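
As a rough illustration of the transport-triggered execution style referred to above, the toy C++ model below expresses a computation as explicit moves to function-unit operand and trigger ports, with the intermediate result bypassed directly from one unit's result port to the next unit's operand port instead of being written back to the register file. The function units, port names, and register file are hypothetical and only meant to show the concept, not the dissertation's shader datapath.

    #include <cstdio>

    // Toy model of transport-triggered execution. In a TTA, a program is a
    // sequence of data moves; writing to a function unit's trigger port is what
    // starts the operation.
    struct AddFU {
        float operand = 0.0f;   // plain operand port
        float result  = 0.0f;   // result port
        void trigger(float t) { result = operand + t; }  // trigger port starts the add
    };

    struct MulFU {
        float operand = 0.0f;
        float result  = 0.0f;
        void trigger(float t) { result = operand * t; }  // trigger port starts the multiply
    };

    int main() {
        float rf[4] = {1.5f, 2.5f, 4.0f, 0.0f};  // small register file
        AddFU add; MulFU mul;

        // A conventional ISA would write the intermediate sum back to a register
        // and read it again. TTA-style move code with a visible bypass instead:
        add.operand = rf[0];        // move r0 -> add.operand
        add.trigger(rf[1]);         // move r1 -> add.trigger (starts the addition)

        mul.operand = add.result;   // bypass: add.result -> mul.operand directly,
                                    // skipping the register-file write-back
        mul.trigger(rf[2]);         // move r2 -> mul.trigger (starts the multiplication)

        rf[3] = mul.result;         // only the final result is written back
        std::printf("(r0 + r1) * r2 = %.2f\n", rf[3]);
        return 0;
    }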
We finally implement the proposed vertex shader both as an ASIC design and on an FPGA prototype platform, showing that the proposed TTA-like shader architecture provides high performance with reduced area cost and thus significant area efficiency for embedded platforms.

Finally, we introduce a primitive-oriented vertex fetch (POVF) scheme to eliminate sequential dependencies among different vertex batches in the multi-shader architecture. Based on it, vertex data fetching bandwidth is reduced by optimizing the vertex cache hierarchy for the multi-shader architecture. To reduce bus access frequency for vertex data, a pre-TnL vertex cache combined with the POVF scheme holds recently fetched vertex data before shading. In addition, a post-TnL vertex cache with a separated tag SRAM is implemented to buffer recently shaded vertex results at different stages of vertex processing. To guarantee valid vertex cache results, hardware logic for in-order submission of vertex batches is implemented in the task scheduler of the multi-shader embedded GPU architecture. Simulation results show that the separated post-TnL vertex cache reduces both the number of redundantly processed vertices and the vertex bandwidth.

Simulation and implementation results show that area cost and vertex fetching bandwidth can be effectively optimized using the microarchitecture design methods proposed in this dissertation, which is a beneficial exploration for future research and design of embedded GPU architectures based on multiple unified shaders.
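
The C++ sketch below illustrates, under assumed sizes and a direct-mapped organization that are not taken from the dissertation, how a post-TnL vertex cache with a tag store separated from its data store can be probed at vertex-fetch time while the shaded data is filled in only when a shader commits its results in order.

    #include <array>
    #include <cstdint>
    #include <cstdio>

    // Illustrative post-TnL vertex cache with separated tag and data stores.
    // The tag side is probed when a vertex index is fetched; the data side is
    // written later, when a shader commits the shaded result in order.
    struct ShadedVertex { float pos[4]; };

    class PostTnlVertexCache {
        static constexpr int kEntries = 16;            // direct-mapped, hypothetical size
        struct Tag { uint32_t index = 0; bool valid = false; bool ready = false; };
        std::array<Tag, kEntries>          tags_;      // tag SRAM
        std::array<ShadedVertex, kEntries> data_;      // data SRAM, filled at commit time

    public:
        // Probe at fetch time: a hit means this vertex is already shaded or in
        // flight, so it does not need to be dispatched to a shader again.
        bool probe(uint32_t index) const {
            const Tag& t = tags_[index % kEntries];
            return t.valid && t.index == index;
        }
        // Allocate a tag entry when the vertex is dispatched for shading.
        void allocate(uint32_t index) {
            tags_[index % kEntries] = {index, true, false};  // data slot reserved
        }
        // In-order commit from the task scheduler fills the separated data store.
        void commit(uint32_t index, const ShadedVertex& v) {
            Tag& t = tags_[index % kEntries];
            if (t.valid && t.index == index) { data_[index % kEntries] = v; t.ready = true; }
        }
        // Read back a shaded vertex for primitive assembly once it is ready.
        bool read(uint32_t index, ShadedVertex& out) const {
            const Tag& t = tags_[index % kEntries];
            if (t.valid && t.ready && t.index == index) { out = data_[index % kEntries]; return true; }
            return false;
        }
    };

    int main() {
        PostTnlVertexCache cache;
        cache.allocate(42);                                       // vertex 42 dispatched
        std::printf("hit before commit: %d\n", cache.probe(42));  // tag hit avoids re-shading
        cache.commit(42, {{1.f, 2.f, 3.f, 1.f}});                 // shader commits in order
        ShadedVertex v{};
        if (cache.read(42, v)) std::printf("x = %.1f\n", v.pos[0]);
        return 0;
    }

Separating the tag lookup from the data storage lets the cache filter out repeated vertex indices as soon as they are fetched, before the shaded data even exists, which is the mechanism the dissertation relies on to cut redundant vertex processing and bus traffic.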

  • 【Online Publication Contributor】 Tianjin University
  • 【Online Publication Year/Issue】 2014, Issue 12