
异构多核处理芯片设计及优化

Design and Optimization of Heterogeneous Multi-Core Processing Chip

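【Author】 Zhou Shuai (周帅)

【Advisor】 Li Li (李丽)

【Author information】 Nanjing University, Microelectronics and Solid-State Electronics, 2012, Master's thesis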

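【Summary】 Heterogeneous multi-core design is the mainstream trend in multi-core processors today. The core idea is that only one (or a few) general-purpose cores in the processor handle task scheduling, while the main computing tasks (such as floating-point operations, signal processing, and image processing) are carried out by dedicated high-performance computing cores, which greatly improves the processor's execution efficiency and performance. Many factors affect the performance of a heterogeneous multi-core processor; the most important are the architecture and the performance of the computing cores. This paper describes a heterogeneous multi-core processing chip in detail. The chip uses a NoC (Network on Chip) as its top-level architecture and integrates 52 heterogeneous cores, including ARM processors, coprocessors, FFT/IFFT acceleration units, and transpose acceleration units. Results from implementing the chip on FPGAs show that it meets the real-time requirements of a real-time imaging algorithm and produces good imaging results. Building on the original design of this chip, the paper optimizes three key technical points. For the NI (Network Interface), it proposes a design method based on a microcode controller and implements a network interface that supports three link communication protocols at the same time. The programmable design gives this network interface strong flexibility and adaptability. Compared with a traditional network interface design based on an FSM (Finite State Machine), the new design consumes about 10% less hardware resources. For the Sin/Cos computing module, the paper analyzes the error of the original design theoretically and proposes a method that improves phase precision through a compensated remainder operation. Based on this method, it designs a high-precision Sin/Cos computing module that greatly improves the accuracy of the computed sine and cosine values. To save hardware resources, the improved design also optimizes the representation format of intermediate data. Logic synthesis results show that hardware resource consumption is reduced by about 32%. For the transpose acceleration unit, the paper discusses how to transpose large matrices in a distributed memory system and improves the original transpose cluster (which contains the transpose acceleration unit). Theoretical analysis and experimental results show that the new design greatly increases transposition speed while reducing hardware resource consumption by about 15%. Many factors affect transposition efficiency, such as the size of the matrix, the shape of the matrix, the way the matrix is partitioned, and the buffer size. Grouped tests were run during the experiments to measure the influence of each factor, providing a reference for using the transpose cluster efficiently.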

【Abstract】 Heterogeneous multi-core design is the prevailing trend in today's multi-core processors. Its key idea is that one (or several) general-purpose cores in the processor handle task scheduling, while dedicated computing cores handle the main computing tasks (such as floating-point operations, signal processing, and image processing), improving the efficiency and performance of the processor. Many factors affect the performance of heterogeneous multi-core processors; the architecture and the performance of the computing cores are the most important. In this paper, a heterogeneous multi-core processing chip is introduced. Using a NoC (Network on Chip) as its top-level architecture, this chip integrates 52 heterogeneous cores, including ARM processors, coprocessors, FFT/IFFT accelerators, and matrix transpose accelerators. Experimental results from implementing this chip on FPGAs show that it meets the real-time requirements of the imaging algorithm. Based on the original design of this chip, three key points are optimized in this paper. For the NI (Network Interface), this paper presents a design method based on a microcode controller and implements a new NI that supports three link communication protocols. Because it can be programmed using microcode, this NI is highly flexible and adaptable. Compared to the original design based on an FSM (Finite State Machine), the overall hardware resource consumption of the new NI is reduced by about 10%. For the Sin/Cos computing unit, this paper theoretically analyzes the computing error of the original design and proposes a new algorithm that improves the precision of the phase by compensation. Based on this algorithm, a high-precision Sin/Cos computing module is proposed, which significantly improves the accuracy of the computed sine and cosine values. The representation format of intermediate data is also optimized to save hardware resources. Logic synthesis results show that the new design reduces hardware resource consumption by about 32%. For the matrix transpose accelerator, this paper discusses a method of transposing large matrices in a distributed memory system and presents an improved design of the Transpose Cluster (which includes the matrix transpose accelerator). Theoretical analysis and experimental results both indicate that the new design greatly increases the speed of transposing matrices while reducing hardware resource consumption by about 15%. Because many factors affect the performance of the Transpose Cluster (such as the size and shape of the matrix, the method chosen to divide a large matrix into smaller ones, and the buffer depth), grouped experiments were run to measure the influence of each factor, providing a reference for using the cluster efficiently.
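
The abstract notes that the way a large matrix is divided into smaller ones, together with the buffer depth, affects transposition performance. As a rough illustration of that general idea only, the Python sketch below transposes a matrix tile by tile. It is not the thesis's hardware design: the function name `transpose_blocked` and the parameters `block_rows` and `block_cols` are hypothetical, and the tile size merely stands in for the amount of data a local buffer could hold.

```python
def transpose_blocked(matrix, block_rows, block_cols):
    """Transpose a rows x cols matrix (list of lists) one tile at a time.

    Illustrative sketch only: block_rows and block_cols are assumed
    parameters standing in for the capacity of a local buffer.
    """
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    # The result has the swapped shape: cols x rows.
    result = [[None] * rows for _ in range(cols)]

    # Walk over the matrix in tiles of at most block_rows x block_cols.
    for r0 in range(0, rows, block_rows):
        for c0 in range(0, cols, block_cols):
            # Transpose one tile: element (r, c) moves to (c, r).
            for r in range(r0, min(r0 + block_rows, rows)):
                for c in range(c0, min(c0 + block_cols, cols)):
                    result[c][r] = matrix[r][c]
    return result


if __name__ == "__main__":
    sample = [[1, 2, 3],
              [4, 5, 6]]
    print(transpose_blocked(sample, 2, 2))
```

In this sketch, changing `block_rows` and `block_cols` changes only the order in which elements are visited, not the result, which is why tile shape can be tuned for a given memory layout without affecting correctness.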

  • 【Online publication contributor】 Nanjing University
  • 【Online publication year and issue】 2012, Issue 10