节点文献

面向可重构系统的几个常用算法及其实现技术研究

Research on Several Frequently-used Algorithms and Their Implementation for Reconfigurable System

【作者】 牟胜梅

【导师】 杨晓东;

【作者基本信息】 国防科学技术大学 , 计算机科学与技术, 2008, 博士

【摘要】 用硬件实现算法一般可比软件实现快数个数量级,以FPGA为代表的可重构硬件以其速度快、功耗低、编程灵活等优点成为算法硬件加速的首选。随着集成电路技术的发展,FPGA芯片的容量和性能不断提高,其在数字信号处理、计算机技术和科学计算等领域的应用日益广泛。本文以减少计算延迟、优化资源使用、提高吞吐率为目标,基于可重构平台对数字信号处理和科学计算领域使用频率较高的初等函数及其复合函数求值、向量旋转和快速傅利叶变换等算法及其实现方法进行了深入研究,综合了近似算法、表驱动的求值算法、流水线组织与访存调度等优化策略,提出相关创新方法。本文的主要工作包括:1)研究了指数/对数函数的计算方法:对常规CORDIC算法进行压缩,将、通路合而为一,实现了专用求值器,减少了芯片面积开销;提出一个统一的指数/对数函数迭代求值算法,通过设置不同的初值和功能选择信号,仅用、通路便可实现指数或对数函数求值,与CORDIC算法相比可节省1/3以上的面积开销;针对低精度、高速度应用,设计并实现了一个基于近似算法的高速对数变换器,并利用简洁、高效的校正逻辑提高了计算精度。2)研究了基于查找表的函数求值方法,利用二阶minimax多项式对函数逐段逼近,实现了一个高效的多函数查表求值系统。通过合理选择分段策略,兼顾了段地址译码逻辑的复杂性与查找表的存储开销;系数逐一圆整(rounding)、三次逼近减少了系数有限精度引入的方法误差,从而减少了存储开销;采用定/浮点双通路分别计算不同特性的函数,保证了计算精度;预先进行精度控制和中间结果截断,减少了多项式的计算开销;最后合理分配各函数查找表的存储空间,实现了系统集成。3)面向旋转角取值范围广且不可预知以及旋转角数目有限且可预知两类应用,以减少迭代次数和面积开销为目标,提出2S-PCS和FFT CORDIC两种向量旋转算法。充分利用了FPGA片上存储资源,借助查找表辅助计算,在减少了迭代次数的同时保持了常规CORDIC算法扩展因子易于计算和补偿的优点,使其相对乘/加方法更具优势。采用28位数据通路时,与常规CORDIC算法相比,2S-PCS算法的流水线级数约减少38%,面积约减少27.9%,精度提高3位(2进制)左右,显示了算法的优良性能。最后面向两类特殊FFT应用对CORDIC算法进行了优化。4)以基2时分FFT算法为基础,针对浮点FFT处理器中的写后读数据相关,提出几种减少相关的方法,并通过改进运算蝶结构、合理调度FPGA片上RAM访问,设计并实现了一个高吞吐率FFT处理器,每周期可读取两个复操作数,输出两个复计算结果,吞吐率为传统FFT处理器的2倍。此外,还面向点数不固定的应用,设计并实现了一个运算蝶级联的可变长FFT处理器。本文所研究的算法实现方法具有通用性,不仅适用于可重构平台,略作调整也适用于ASIC设计。由于算法粒度较小,更适合与其他操作结合实现更大规模的应用。若对这些算法单独进行硬件加速,则要考虑数据输入/输出等额外开销对性能加速的影响。

【Abstract】 Hardware-based implementations of algorithms are desirable, since they can be several orders of magnitudes faster than software-based methods. Reconfigurable devices such as Field Programmable Gate Arrays are ideal candidates for this purpose, because of their high speed, low power and flexibility. With the development of integrated circuits, the density and performance of FPGA grow steadily, which makes it more widely used in applications such as digital signal processing and scientific computing.Aiming at reducing latency, optimizing resource utilization and increasing throughput, in this paper, we make research on such frequently used algorithms and their implementation methods as evaluation of elementary and compound functions, vector rotation and fast Fourier transform on reconfigurable devices. Some effective methods and optimization techniques are proposed based on approximation algorithms, table-based evaluation methods, pipelining and memory access scheduling. Our work includes:1) Research on the evaluation methods of exponential and logarithm function: First, we design a simplified circuit based on CORDIC algorithm to compute exponential function, which fuses the datapath of x and y to reduce the area cost. Second, we propose a unified exponential and logarithm function evaluation algorithm called LnE; after setting initial values and function selection signal properly, the wanted function are computed. LnE algorithm spares z datapath of conventional CORDIC algorithm and saves more than 1/3 of the area estate. Thirdly, we design and implement a fast binary logarithm calculator based on an approximation algorithm to cater the applications with requirement of low accuracy but high speed, with simple and effective correcting circuits to improve accuracy.2) Propose an effective table-based multi-function evaluation algorithm, which uses minimax quadratic approximation on each segment. It chooses different segmentation techniques to tradeoff segment address decoding complexity and memory cost of lookup tables, and uses fixed-point and floating-point datapath to compute nearly linear and highly non-linear functions respectively. Through a three-pass iterative approximation process, the algorithm takes into account the effect of rounding the polynomial coefficients to a finite size, allowing for a further reduction in the size of lookup tables to be used. Through accuracy controlling and intermediate results truncation, the area and delay of the system are reduced. Finally a multi-function system is realized by skillfully allocating memory space to different lookup tables.3)There are two types of vector rotation applications, one with random and unknown rotation angles, and the other with pre-known and limited rotation angles. Based on such two kinds of applications, we propose 2S-PCS and FFT CORDIC algorithms to reduce iteration numbers and area occupation. Making use of lookup tables to support computing, these two algorithms reduce the iteration number and computing delay greatly while not compromising the ease of scale factor computing and compensation as that in conventional CORDIC algorithm. When using 28-bit datapath, 2S-PCS algorithm requires 38% less pipeline stages and 27.9% less area consumption compared to those of conventional CORDIC algorithm, while achieving 3 more binary bits accuracy, which shows great performance superiority.We still make some improvements on conventional CORDIC algorithm based on two special FFT applications.4) Based on radix-2 time decimation FFT algorithms, we design a variable-length fixed-point FFT processor based on cascaded butterflies. Furthermore, we analyze the read-after-write dependency in floating-point FFT processors, give some proposals to reduce it, and realize a 32-bit IEEE 754 single-precision FFT processor with modified structure of butterfly units and optimized RAM access, which makes it possible to read and write two complex operands simultaneously per cycle, and double the throughput of traditional FFT processors.It should be mentioned that the implementation methods of algorithms discussed in this paper are general and not limited to reconfigurable hardware; they can also be used to guide ASIC design by making some modifications. It’s desirable to integrate these algorithms into larger applications to achieve higher acceleration ratios. If just implementing these algorithms alone in hardware, we should also consider the effect of communication overhead between hardware and software.

节点文献中: 

本文链接的文献网络图示:

本文的引文网络