SoC中应用类IP核高级综合技术研究

Research on High Level Synthesis of IP Core for Specific Applications

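【Author】 Dong Yazhuo (董亚卓)

【Advisor】 Dou Yong (窦勇)

【Author information】 National University of Defense Technology, Computer Science and Technology, 2008, Ph.D.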

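【Summary】 In recent years, with the rapid advance of integrated circuit design and process technology, system-on-chip (SoC) design has been applied ever more widely and now reaches into many areas of electronic design. SoC design has become the prevailing trend in very-large-scale integration (VLSI). IP cores are the foundation and core of SoC design: a SoC design should reuse existing IP wherever possible and complete most of the design by assembling it like building blocks. Among these, application-specific IP cores embody the innovation in a SoC and are also the key constraint on building a SoC quickly. High-level synthesis (HLS) of IP cores transforms a behavioral-level hardware description into a structural description, or even a layout description. By raising the level of abstraction, it frees designers from intricate low-level details so that they can concentrate on the system as a whole, which improves design efficiency and correctness and lowers design cost. Since it was first proposed, HLS of IP cores has attracted great attention from academia and industry, and it will play an even more important role in future designs.

This thesis focuses on one class of applications: sliding-window applications. Sliding windows are widely used in image processing, pattern recognition, and digital signal processing, and are characterized by large data volumes and intensive computation. Because of their distinctive memory-access pattern, sliding-window applications have been the starting point for much research on HLS tools. Unfortunately, existing HLS systems still fall short in handling them: some have no explicit architecture model, some do not fully exploit data reuse, some spend too many hardware resources to achieve data reuse, and some perform no design space exploration. Building on existing work, this thesis systematically studies HLS of IP cores for sliding-window applications and addresses the following problems.

To address the shortcomings of existing architecture models, the thesis first proposes a parameterized three-level memory structure model for IP cores. Its design goal is to fully exploit the data reuse present in sliding-window applications, reducing the number of memory accesses and speeding up execution. The model uses a three-level memory hierarchy and a register-rotation strategy to exploit data reuse both within and across loop levels. Its concrete structure is fixed by a set of parameters whose values the compiler determines at compile time from the characteristics of the specific sliding-window application. Parameter-extraction algorithms are proposed for the different types of data reuse. Experimental results show that, compared with related work, the proposed memory structure model uses relatively few storage elements, reduces execution cycles by a factor of 2.13 to 3.8, and raises the clock frequency from 69 MHz to more than 200 MHz.

Based on the parameterized three-level memory structure model, the thesis studies automatic generation of the RTL hardware description of an IP core, with the goal of automatically generating synthesizable Verilog code. The process has three parts: automatic generation of the control state machine, automatic generation of the computation pipeline, and generation of the top-level wrapper module. First, the compiler partitions the source program of the sliding-window application into a control part and a computation part. The control part is analyzed on the compiler platform to obtain loop information (the initial value, final value, and step of each loop) and data-reuse information, from which the proposed algorithm generates the control state machine automatically. The computation part goes through data-structure definition, dependence analysis, and related steps on the compiler platform, which outputs a data-flow-graph description file. After partitioning into pipeline stages, a new intermediate representation (IR) of the program is produced, and finally the corresponding arithmetic-unit IP functions are invoked to generate the computation pipeline automatically. The wrapper module integrates the control unit, the computation pipeline, the buffer units, and other modules to produce the RTL hardware description file of the IP core. This approach avoids the complexity and inefficiency of manual mapping, performs the mapping automatically, and yields comparatively well-optimized results.

On this basis, the thesis further studies design space exploration for two cases: sufficient and insufficient on-chip resources. For the case of sufficient on-chip resources, it designs a design space exploration method based on a hardware pipeline structure, with the goal of fully using on-chip resources, increasing the parallelism of the algorithm, and reducing execution cycles. The basic idea is that, before the program is loaded onto the target development board, the resources offered by the chip are considered together (mainly chip area, memory bandwidth, and storage resources; on-chip area is measured by the number of on-chip logic computation units) in order to generate an underlying hardware structure that makes full use of them. If on-chip resources are to spare, loops are unrolled as far as possible to increase parallelism. If area is to spare but storage is insufficient, the input array is partitioned into blocks along the horizontal direction and the data within each block is scheduled in a pipelined fashion, so as to minimize repeated accesses to off-chip memory. Experiments show that the proposed method raises on-chip resource utilization to more than 85%, and that the array-blocking method reduces the number of memory accesses by 2% to 20% compared with related work.

Some large-scale applications contain many programs with multiple loop basic blocks, and limited on-chip resources make it impossible to map all of these loop basic blocks onto the target chip at once. In that situation, designing a dedicated IP core for every loop basic block is clearly impractical. For multi-loop programs under limited on-chip resources, the thesis designs a parameterized pipeline template whose structure is common to all loop basic blocks of a given target application and onto which all of them can be mapped in sequence. The template determines the configuration of the underlying arithmetic units from the requirements of the target application and the amount of on-chip resources. Based on the iterative modulo scheduling idea of software pipelining and the ShiftQ architecture model, it schedules the instructions of each loop basic block and automatically generates the intermediate buffer registers. Experiments show that, for each loop basic block, the pipeline template achieves execution cycles comparable to those of dedicated hardware, while the common template removes the need to design a dedicated IP core for every loop, reducing design complexity and shortening the design cycle.

In summary, this thesis studies HLS of IP cores for sliding-window applications and proposes effective solutions for the memory structure model, the automatic generation of RTL hardware description files, and design space exploration in the two cases above. The work has theoretical significance and practical value for advancing research on, and practical adoption of, HLS for application-specific IP cores.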
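The data reuse that the memory model targets can be illustrated with a minimal Python sketch. It is not the thesis's architecture or its parameter-extraction algorithm: it assumes a generic arrangement of external memory, a buffer holding the most recent k rows, and a k-by-k register window that shifts one column per step. The function names `sliding_sum_naive` and `sliding_sum_reuse`, and the choice of a window sum as the operation, are hypothetical and chosen only for illustration.

```python
from collections import deque


def sliding_sum_naive(img, k):
    """Baseline: re-read all k*k pixels from memory for every window position."""
    h, w = len(img), len(img[0])
    reads = 0
    out = []
    for r in range(h - k + 1):
        out_row = []
        for c in range(w - k + 1):
            total = 0
            for i in range(k):
                for j in range(k):
                    total += img[r + i][c + j]
                    reads += 1  # one external-memory read per pixel touched
            out_row.append(total)
        out.append(out_row)
    return out, reads


def sliding_sum_reuse(img, k):
    """Assumed reuse scheme: each pixel is fetched from external memory once."""
    h, w = len(img), len(img[0])
    reads = 0
    line_buf = deque(maxlen=k)  # on-chip buffer: the k most recent rows
    out = []
    for r in range(h):
        line_buf.append(list(img[r]))  # stream one row in from external memory
        reads += w
        if len(line_buf) < k:
            continue  # not enough rows buffered yet
        window = deque(maxlen=k)  # register window: k columns of k pixels each
        out_row = []
        for c in range(w):
            # Shift in one new column; the oldest column drops out automatically.
            window.append([line_buf[i][c] for i in range(k)])
            if len(window) == k:
                out_row.append(sum(sum(col) for col in window))
        out.append(out_row)
    return out, reads
```

Under these assumptions both functions produce the same result, but the reuse version performs one external read per pixel, whereas the baseline performs k*k reads per window position.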

【Abstract】 In recent years, with the rapid development of integrated circuit (IC) design, system-on-chip (SoC) technology has been widely adopted and is increasingly involved in many fields of electronic technology. SoC has in fact become a trend in current very-large-scale integration (VLSI) design.

The intellectual property (IP) core is the basis and kernel of SoC design. SoC designers try to reuse existing IP cores as much as possible and finish most of a project simply by assembling them. IP cores aimed at specific applications embody the innovation of a SoC and are also a key factor in design speed. High-level synthesis (HLS) of IP cores raises the level of design by transforming a behavior-level description into a structure-level or even layout description. HLS releases designers from complicated hardware details so that they can focus on high-level system design, which increases the efficiency and correctness of SoC design and reduces its cost. As a result, the technology has received much recognition from academia and industry since it was introduced, and it is promising for the future.

Of particular interest to this thesis are sliding-window applications, which are widely used in signal, image, and video processing and require much computation and data manipulation. Many HLS systems start with this kind of application because of its particular memory-access pattern. Unfortunately, current work still has various limitations: some systems do not define the memory architecture clearly, some do not realize data reuse adequately, some use large numbers of memory elements and registers, and some do not address design space exploration. This thesis studies in depth several key problems in HLS of IP cores for sliding-window operations, outlined as follows.

Considering the inherent characteristics of sliding-window operations and the limitations of current work, we propose a parameterized memory architecture from which the hardware framework for any sliding-window application can be generated automatically. The objective is to realize data reuse as fully as possible, so as to reduce the number of memory accesses and speed up execution. A three-level memory structure realizes inner-loop and outer-loop data reuse, and shift registers keep the hardware design simple. The architecture is determined by a set of parameters whose values are obtained from the compiler. We propose parameter-generation algorithms for the different kinds of data reuse. Compared with related work, our approach uses only a small number of memory elements and registers, reduces execution clock cycles by a factor of 2.13 to 3.8, and raises the clock frequency from 69 MHz to more than 200 MHz.

Based on the parameterized memory architecture, we study the generation of the RTL hardware description, with the aim of generating the Verilog code of an IP core automatically. The work has three parts: automatic generation of the controllers, automatic generation of the pipelined operations, and generation of the overall encapsulation module. First, the compiler partitions the source code into two parts, a control part and an operation part. The control part is analyzed in the compiler to obtain a set of parameters, including loop information (the initial value, end value, and step of each loop) and data-reuse information. We present a controller-generation algorithm that generates the controllers automatically from these parameters. The operation part is processed in the compiler in a series of steps: data structures are defined, dependences are analyzed, and a description of the data-dependence flow is created. Based on this description, we partition the datapath into pipeline stages and express the source program in a new intermediate representation (IR), from which the pipelined operations are generated. Finally, the encapsulation module integrates the controller module, the operation module, the RAM module, and other modules to produce the RTL hardware description. Our approach avoids the complexity and inefficiency of manual mapping, and its results are comparatively well optimized.

The thesis then studies design space exploration according to whether on-chip resources are sufficient. For the case of abundant on-chip resources, we present a design space exploration approach that aims to use the resources fully, increase parallelism, and reduce the number of execution clock cycles. Three upper bounds are derived from the area constraint (measured by the number of logic operation units), the memory-bandwidth constraint, and the on-chip memory constraint, and these determine a block structure for the design that fully utilizes the resources available on the board. Loops are unrolled as much as possible when on-chip area is abundant. When memory elements are insufficient, the input data array is partitioned horizontally into several pieces, and the data within each piece is processed in a pipelined manner to reduce the number of memory accesses as far as possible. Experiments show that on-chip resource utilization can be raised to more than 85% and that, compared with current work, the number of memory accesses is reduced by 2% to 20%.

Some large applications consist of many loop nests. Mapping all the loop nests of such an application onto a target chip at once may be impractical because of the on-chip area limitation, and the traditional method of designing a dedicated IP core for every loop nest is unwieldy. This thesis presents a pipelined template that is common to all loop nests in an application, so that the loop nests can be executed on the template one after another. The number of function units (FUs) is decided according to the on-chip resources and the characteristics of the specific application. Based on iterative modulo scheduling from software pipelining and on the ShiftQ architecture, we schedule the instructions of each loop nest and automatically generate the registers that hold intermediate results. Experiments show that, for each loop, the pipelined template achieves an execution cycle count comparable to that of dedicated hardware, while saving the time needed to design a specific IP core for every loop nest.

In summary, this work studies HLS of IP cores for sliding-window operations and presents solutions to several key problems: the memory architecture, the generation of hardware description code, and design space exploration in the two situations described. The work has academic and practical value for advancing the theory and practicality of HLS of IP cores for specific applications.
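The three upper bounds mentioned for the abundant-resource case can be sketched in Python under strong simplifying assumptions. The sketch assumes that each unrolled copy of the loop body has a fixed cost in logic units, memory bandwidth, and on-chip storage, and that the usable unroll factor is the smallest of the three resulting bounds. The class names, field names, and cost model are hypothetical; the abstract does not give the thesis's actual formulas.

```python
from dataclasses import dataclass


@dataclass
class ChipResources:
    """Hypothetical description of what the target chip offers."""
    logic_units: int      # area, measured in logic operation units
    words_per_cycle: int  # memory bandwidth
    onchip_words: int     # on-chip memory capacity


@dataclass
class CopyCost:
    """Hypothetical cost of one unrolled copy of the loop body (all fields > 0)."""
    logic_units: int
    words_per_cycle: int
    onchip_words: int


def unroll_bounds(chip, cost):
    """Return the (area, bandwidth, memory) upper bounds on the unroll factor."""
    return (
        chip.logic_units // cost.logic_units,
        chip.words_per_cycle // cost.words_per_cycle,
        chip.onchip_words // cost.onchip_words,
    )


def max_unroll(chip, cost):
    """The tightest bound wins; 0 means a single copy does not fit as modeled."""
    return min(unroll_bounds(chip, cost))
```

For example, with `ChipResources(64, 8, 4096)` and `CopyCost(9, 1, 1024)`, on-chip memory is the binding constraint in this model. That is the kind of situation in which the abstract describes partitioning the input array rather than unrolling further.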
