Research on the Hardware Acceleration for High-precision Algorithm Based on Very Long Instruction Word Framework

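【Author】 Lei Yuanwu (雷元武)

【Advisor】 Dou Yong (窦勇)

【Author information】 National University of Defense Technology, Computer Science and Technology, 2012, PhD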

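【Chinese abstract, translated】 Scientific computing has become the third pillar of modern scientific research, after theoretical study and physical experiment, and the precision of its results directly affects the outcome and success of that research. As computations grow in scale, rounding errors in floating-point operations accumulate more severely, making results imprecise, unreliable, or even incorrect. High-precision arithmetic is the most direct, effective, and reliable way to guarantee the precision of large-scale scientific computing; it also improves the reproducibility of algorithms, strengthens their stability, and speeds up their convergence. However, general-purpose platforms based on CPUs or GPUs have data paths of fixed width and arithmetic units of fixed precision, so they can implement the various kinds of high-precision floating-point arithmetic only through software emulation, which results in low performance and efficiency. In recent years, FPGA devices, being customizable, reconfigurable, high-performance, and low-power, have become an ideal platform for accelerated computing. This thesis combines FPGA reconfigurable technology, very long instruction word (VLIW) technology, and high-precision computing to address the key problems in designing FPGA-based high-precision algorithm accelerators, exploiting parallelism at different levels in high-precision applications and maximizing FPGA performance and resource utilization. The main contributions are as follows:

(1) A processor architecture suited to high-precision arithmetic, the custom VLIW template, is proposed. VLIW is an ideal way to exploit algorithmic parallelism, offering simple hardware, high performance, and good scalability. Targeting the characteristics of high-precision arithmetic, this thesis customizes a VLIW template on the FPGA platform that integrates multiple custom high-precision basic operation units and exploits the instruction-level parallelism in high-precision arithmetic through the explicit parallelism of VLIW instructions. On this template, a configurable multi-VLIW-core accelerator architecture for high-precision algorithms is built to exploit the thread-level parallelism in high-precision applications. Finally, to address code bloat, a key problem of VLIW, a multi-level index VLIW instruction compression technique suited to FPGA platforms is proposed. It uses flag bits and multiple memory banks to solve the variable-instruction-length problem of traditional code compression, and it avoids, as far as possible, the instruction space wasted on no-ops. In the designs of the quadruple-precision elementary function processor and the quadruple-precision algorithm accelerator based on the custom VLIW template, this strategy achieves compression ratios of 37.5% and 24.5%, respectively.

(2) An exact quadruple-precision vector inner product algorithm based on full expansion, and its implementation structure, are proposed. The vector inner product is the most common basic operation in scientific computing and strongly affects the stability of numerical algorithms and the precision of their results. For this operation, the thesis proposes an exact quadruple-precision inner product algorithm and implementation structure based on full expansion (Quad-HPMAC). It obtains exact inner product results with lossless fixed-point operations, and it raises the frequency and throughput of the Quad-HPMAC unit with optimizations such as a two-level storage structure for the running sum, partitioning of the running sum, and carry-save accumulation. Finally, a unified quadruple-precision matrix accelerator is built on the Quad-HPMAC module, implementing matrix multiplication, LU decomposition, and MGS-QR decomposition. Experimental results show that, compared with a parallel software implementation on a general-purpose Intel multi-core platform, the accelerator improves precision by 5 to 8 digits and performance by more than 40 times.

(3) A unified quadruple-precision elementary function computation model based on the VLIW template, and its implementation structure, are proposed. Elementary functions in scientific computing are numerous in kind, complex to implement, infrequently used, and slow to compute. For these characteristics, the thesis proposes a unified quadruple-precision elementary function model and implementation structure based on the VLIW template (QP_VELP). The structure offers high performance and good scalability: it uses the Estrin scheme to increase the parallelism of polynomial evaluation, and it improves performance through loop unrolling, pipeline parallelism, and the explicit parallelism of VLIW instructions. Compared with related work, the unified elementary function processor not only has advantages in resource consumption, latency, and precision, but can also compute multiple elementary functions with the same hardware resources, achieving high resource utilization in real scientific and engineering applications.

(4) A quadruple-precision algorithm accelerator structure based on the VLIW template is proposed. Targeting irregular, compute-intensive algorithms in scientific computing, and taking the SGP4/SDP4 algorithm for space object orbit prediction as an example, the thesis proposes a quadruple-precision algorithm accelerator structure based on the VLIW template. Integrating the QP_VELP module implements the many infrequently used elementary functions, which solves the problem of having many kinds of basic operations. Constraints in the custom VLIW instructions satisfy the complex data dependences between operations. Parallel execution of multiple quadruple-precision operation units exploits the instruction-level parallelism of the algorithm, and parallel execution of multiple VLIW cores exploits its thread-level parallelism. The thesis also proposes a greedy instruction scheduling algorithm that, combined with storage allocation and conflict detection, maps the data-flow graph of the algorithm onto custom VLIW instruction slots and minimizes the no-ops in the custom VLIW instructions. Experimental results show that the quadruple-precision algorithm accelerator achieves a speedup of 7.8 to 15 times over an Intel multi-core processor.

(5) For specific scientific application areas that demand even higher precision, the thesis extends the concepts, research, and implementation methods of the quadruple-precision algorithm accelerator to arbitrary-precision floating-point arithmetic. It proposes an exact arbitrary-precision vector inner product algorithm and implementation structure based on full expansion (VPMAC) and an arbitrary-precision elementary function processor based on the VLIW template (VP_VELP). VP_VELP integrates multiple arbitrary-precision basic operation units, improves performance through the explicit parallelism of VLIW instructions and by dynamically changing its internal computation precision, and uses the same hardware resources to implement multiple arbitrary-precision basic operations and elementary functions. Finally, arbitrary-precision matrix algorithms are implemented in two ways: the VPMAC coprocessor and the unified arbitrary-precision matrix accelerator (VPMATA). Experimental results show that, compared with the parallel MPFR library on an Intel quad-core processor, a VPMATA integrating 8 VPMAC modules and 1 VP_VELP module achieves a speedup of 13 to 63 times.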
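The exact inner product in the second contribution rests on one idea: convert every floating-point operand to a fixed-point integer, multiply and accumulate in a register wide enough that nothing is ever rounded, and round only once at the end. The following is a minimal software sketch of that idea, not the Quad-HPMAC design itself. It assumes binary64 inputs as a stand-in for quadruple precision, uses a Python `int` in place of the long hardware register, and the names `to_fixed`, `exact_dot`, and `naive_dot` are hypothetical.

```python
from fractions import Fraction

# Assumption: binary64 inputs. Every finite binary64 value is an integer
# multiple of 2**-1074, so it has an exact fixed-point image at this scale.
SCALE_BITS = 1074


def to_fixed(x: float) -> int:
    """Return the integer x * 2**SCALE_BITS exactly (x must be finite)."""
    num, den = x.as_integer_ratio()  # den is a power of two, at most 2**1074
    return num * ((1 << SCALE_BITS) // den)


def exact_dot(xs, ys) -> Fraction:
    """Inner product accumulated in one wide fixed-point register, no rounding."""
    acc = 0  # arbitrary-width integer standing in for the long accumulator
    for x, y in zip(xs, ys):
        acc += to_fixed(x) * to_fixed(y)  # exact integer multiply-accumulate
    return Fraction(acc, 1 << (2 * SCALE_BITS))


def naive_dot(xs, ys) -> float:
    """Ordinary floating-point inner product, rounding after every step."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += x * y
    return total
```

For `xs = [1e16, 1.0, -1e16]` and `ys = [1.0, 1.0, 1.0]`, `naive_dot` loses the middle term to rounding, while `float(exact_dot(xs, ys))` rounds the exact sum once and keeps it. A hardware unit would replace the unbounded integer with a fixed-width register sized for the target format.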

【Abstract】 Scientific computing has become the third mode of scientific discovery, alongside theory and experiment. Most scientific computations rely on floating-point arithmetic, in which rounding error is unavoidable, and the accumulation of rounding errors leads to inaccurate, unreliable, and even wrong results. Many scientific applications therefore rely on high-precision arithmetic. However, high-precision arithmetic performs very poorly on general-purpose processors, because it is mostly carried out by software emulation on top of fixed-precision operations such as 64-bit floating point. Field-Programmable Gate Arrays (FPGAs) have advantages over CPUs in customizability, reconfigurability, performance, and power consumption, so FPGA-based accelerators have become a promising approach to speeding up scientific applications. In this thesis, we implement high-precision floating-point arithmetic on FPGAs to explore the capability and flexibility of FPGA solutions for accelerating high-precision scientific applications. In summary, this thesis makes the following contributions:

(1) We propose a parameterizable Very Long Instruction Word (VLIW) framework on FPGAs, which features low hardware complexity, high performance, and high scalability. Based on this framework, a hardware accelerator with multiple VLIW kernels is presented to exploit instruction-level parallelism (ILP) and thread-level parallelism (TLP) in high-precision applications simultaneously. To solve the code density problem of VLIW implementations, we propose a multi-level index code compression scheme for the custom VLIW framework on FPGAs. For each unit, a flag indicates whether the unit is used, and a RAM stores the operations that are used. This scheme resolves the variable-length VLIW instruction problem of traditional code compression methods and avoids explicit no-ops entirely.

(2) We propose an exact vector inner product algorithm and structure (Quad-HPMAC) for IEEE 754-2008 quadruple-precision floating-point arithmetic. A very long fixed-point register stores the running sum without information loss, and exact fixed-point operations, rather than floating-point operations, are used to obtain exact results. Several schemes, such as a two-level RAM bank structure for the sum, partial summation, and carry-save accumulation, are introduced to improve the frequency and throughput of the Quad-HPMAC unit. Finally, a prototype of the unified matrix accelerator, equipped with 4 Quad-HPMAC units, is presented to implement typical quadruple-precision matrix algorithms such as matrix multiplication, LU decomposition, and MGS-QR decomposition. Experimental results show that our design outperforms general-purpose processors in precision, performance, and power consumption.

(3) We propose a special-purpose processor (QP_VELP) based on the custom VLIW framework, which uses unified hardware to efficiently evaluate various quadruple-precision elementary functions. This processor is well matched to the characteristics of elementary functions in scientific applications: high implementation complexity, low frequency of use, and high latency. A pipelined implementation of polynomial approximation with the Estrin scheme is used to enhance ILP. The performance of QP_VELP is further improved through loop unrolling and the explicit parallelism of VLIW instructions. Compared with related work, our design achieves higher precision and lower latency with less resource consumption. Moreover, our solution for elementary functions achieves high resource utilization.

(4) Taking the orbit prediction algorithm for space objects (SGP4/SDP4) as an example, we present a VLIW-based architecture for quadruple-precision scientific applications. The QP_VELP unit is integrated into this accelerator to implement the various elementary functions in SGP4/SDP4 with unified hardware. Multiple basic quadruple-precision operation units in this accelerator execute in parallel to exploit the ILP and TLP in SGP4/SDP4. We also propose a greedy algorithm that schedules the operations in the data flow graph of SGP4/SDP4 into custom VLIW instructions and generates a VLIW instruction sequence with few no-ops. Experimental results show that our VLIW-based accelerator offers speedup and power advantages over a general-purpose processor.

(5) We extend the concepts, research methods, and implementation schemes from the design of the quadruple-precision algorithm accelerator to arbitrary-precision arithmetic. First, we present an exact vector inner product structure (VPMAC) for arbitrary-precision floating-point arithmetic, which uses exact fixed-point operations to avoid introducing rounding errors. Then, we present a processor (VP_VELP) based on the custom VLIW framework for arbitrary-precision elementary functions. The performance of VP_VELP is improved through the explicit parallelism of VLIW instructions and by dynamically varying the precision of intermediate computations. Finally, two schemes, the VPMAC coprocessor and the unified matrix accelerator (VPMATA), are presented to accelerate typical arbitrary-precision matrix algorithms. Experimental results show that the VPMATA, equipped with 8 VPMAC units and 1 VP_VELP unit, achieves 13X to 63X better performance.
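The Estrin scheme named in the third contribution evaluates a polynomial by pairing coefficients level by level, so that all pairs within a level are independent and can be issued in parallel. That independence is what makes it attractive for a VLIW datapath. The following is a minimal sequential sketch of the scheme, not the QP_VELP pipeline; the function names `estrin` and `horner` are hypothetical, and it works on any numeric type that supports `+` and `*`.

```python
def estrin(coeffs, x):
    """Evaluate sum(coeffs[i] * x**i) with Estrin's scheme.

    coeffs is ordered from the constant term upward.
    """
    if not coeffs:
        return 0
    level = list(coeffs)
    power = x  # x, then x**2, x**4, ... one squaring per level
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # pad so every term has a partner
        # Each pair (a, b) -> a + b * power is independent of the others,
        # so hardware could compute a whole level in parallel.
        level = [level[i] + level[i + 1] * power
                 for i in range(0, len(level), 2)]
        power = power * power
    return level[0]


def horner(coeffs, x):
    """Reference evaluation: fully sequential Horner's rule."""
    result = 0
    for c in reversed(coeffs):
        result = result * x + c
    return result
```

With integer inputs both functions agree exactly, for example `estrin([1, 2, 3, 4, 5], 2)` and `horner([1, 2, 3, 4, 5], 2)` both return 129. Horner needs one dependent multiply-add per coefficient, whereas Estrin needs only about log2(n) dependent levels, at the cost of the extra squarings.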

  • 【Classification code】 TP332; TN791
  • 【Citations】 1
  • 【Downloads】 118