同步数据触发多核处理器体系结构关键技术研究

Research on the Design Techniques of Synchronous Data Triggered Multi-core Architecture

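【Author】 Lai Mingche (赖明澈)

【Advisors】 Wang Zhiying (王志英); Dai Kui (戴葵)

【Author information】 National University of Defense Technology, Computer Science and Technology, 2008, PhD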

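【Abstract (translated from the Chinese)】 With the rapid development of VLSI technology and ever-rising application demands, raising the clock frequency alone can hardly improve processor performance any further, and advanced architectures typified by multi-core processors have gradually become the main route to higher performance. Driven by current integrated circuit process technology, on-chip multi-core processors have begun to emerge, but a series of scientific and technical problems remain to be solved, chiefly the multi-core parallel architecture, multi-core interconnect and communication, and the multi-core multi-level memory system. Research on the core theoretical and design problems facing multi-core architecture can provide a solid theoretical and technical foundation for the design and implementation of future ultra-high-performance multi-core processor chips, and has significant theoretical and practical value.

Targeting ultra-high-performance multi-core processors, this dissertation studies in depth a Synchronous Data Triggered Architecture (SDTA) multi-core architecture. It comprises a large number of high-performance SDTA computing cores, each with a simple structure, high utilization of computing resources, strong computing capability, and good scalability. Building on the characteristics of the synchronous data triggered multi-core processor, the dissertation focuses on key techniques for SDTA processing element design, using resource optimization to raise execution performance and reduce cost, and instruction compression to address code size. It then models the SDTA on-chip interconnect and communication structure, and studies and implements a multi-core interconnect system with high bandwidth, low latency, and low cost. The main results are as follows:

1. A synchronous data triggered multi-core architecture is proposed, comprising the SDTA computing cores, the SDTA memory system, the on-chip communication interconnect, and the multi-core synchronization mechanism. A single processing element is simple in structure, flexible in design, and highly scalable; it supports both SIMD and MIMD effectively and allows parallelism to be exploited at several levels. In addition, a multi-core memory system is designed that includes an instruction cache, local memory, a DMA unit, and a second-level cache; a network on chip serves as the basic communication framework, and the synchronization mechanism is compatible with the SPARC architecture.

2. An analytical cost model is proposed for evaluating the area and power of a processing element; it meets the accuracy requirement while remaining flexible and efficient. A hardware resource optimization method suited to SDTA processing elements is also proposed. On top of a software and hardware design tool chain, it performs local optimization of the computing core guided by a heuristic search algorithm and analytical global optimization of the processing element, and it is both efficient and effective.

3. A template-based vertical dictionary compression technique is proposed to address code sparsity in the SDTA architecture. It emphasizes three factors: code compression ratio, real-time decompression, and resource overhead. A split-stream parallel decompression hardware model is further proposed, and the software tool chain is modified accordingly. At the cost of a small number of execution cycles, the technique greatly reduces code size and lowers chip area and power overhead.

4. An analytical performance analysis method for on-chip interconnection networks is proposed. A mathematical model of the network on chip is built on the M/G/1/N queuing system; it is accurate and efficient, and it helps with network-on-chip structure design and with optimizing the mapping of applications onto the topology. To address the performance bottleneck exposed by the single-channel structure, two improved mathematical models of multi-channel structures are also proposed. Guided by the resulting performance metrics, the micro-architecture of the SDTA multi-core on-chip interconnection network is designed and implemented.

5. A dynamic virtual channel structure based on congestion relief is proposed to address the low buffer utilization and frequent blocking of on-chip routers. The design of a typical router is improved, and the VLSI implementation of a dynamic multi-channel router is completed. Experiments show that it adapts the organization of its virtual channels to network traffic characteristics and improves network performance. The shared virtual channel buffer is organized as linked lists, which keeps overhead small, and the higher buffer utilization saves a large amount of chip area and power.

Experimental results show that, for multimedia signal processing, the SDTA processing element after hardware resource optimization has low hardware cost and high execution performance; its core performance is comparable to that of the TI-C64 DSP, and the processing element as a whole significantly accelerates multimedia applications. In addition, the SDTA on-chip interconnection network offers high bandwidth and low latency, and the proposed dynamic virtual channel technique in particular reduces cost effectively while further improving network performance. These results provide a sound solution and a basis for theoretical analysis for SDTA multi-core processors, and can be applied directly to the design and implementation of future multi-core processor chips.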

【Abstract】 With the rapid development of very large scale integration technology and the steady growth of application requirements, advanced multi-core architectures have replaced higher clock frequency as the prevalent approach to improving processor performance. Driven by advances in integrated circuit process technology, the on-chip multi-core processor has begun to emerge. However, many problems remain to be solved, including the multi-core parallel architecture, on-chip communication, and a bandwidth-balanced multi-level memory system. An in-depth study of these theoretical and design problems provides a foundation for implementing future high-performance multi-core processors, and has great theoretical and practical significance.

Targeting high-performance processors, this dissertation presents a synchronous data triggered architecture (SDTA) for multi-core processors, in which each scalable processor element delivers high performance with a simple structure and high utilization of transistor resources. Based on this architecture, key design techniques for the SDTA processor element are studied. A novel resource optimization approach is used to improve performance and reduce hardware cost, and a code compression method is studied in depth to solve the code density problem. An accurate analytical performance analysis approach for the network on chip is then developed, and an on-chip communication structure with high bandwidth, low latency, and low cost is implemented. The main contributions are as follows.

1. We propose a synchronous data triggered multi-core architecture, which is composed of SDTA computing cores, the SDTA memory system, the on-chip communication structure, and the multi-core synchronization mechanism. Each processor element has a simple and flexible structure, supports both SIMD and MIMD, and achieves high performance by exploiting parallelism at different levels. The memory system includes an instruction cache, local memory, a DMA engine, and a secondary eDRAM-based cache. A network on chip is introduced as the on-chip communication structure, and an effective synchronization mechanism is adopted that is compatible with the SPARC architecture.

2. We develop software and hardware tool suites for the synchronous data triggered processor element and introduce an analytical approach to cost estimation, which meets the precision requirement while being flexible and efficient. We also propose a novel automated approach to exploring and designing a high-efficiency processor element. The design space is explored with a divide-and-conquer approach: a heuristic-based search finds the optimal computing cores, and an analytical method using trace-driven simulation optimizes the overall processor element.

3. We put forward a template-based vertical dictionary program compression scheme to solve the poor code density of the synchronous data triggered architecture. The scheme emphasizes three aspects: a low compression ratio, limited hardware cost, and run-time decompression. Furthermore, we develop a multi-stream parallel decompression engine and update the software tool suites. The scheme achieves a very low compression ratio at the expense of a small execution overhead, while saving area and power consumption.

4. We propose a novel performance analysis approach for the network on chip based on analytical router modeling. Following a generalized router architecture, we establish an analytical router model that uses the M/G/1/N queuing system; it can be used to explore the communication architecture and guide application mapping. To eliminate the bottleneck revealed by the performance analysis of the single-channel structure, analytical models for improved multi-channel structures are described, which can further guide the design of on-chip routers. Guided by the analytical results, the on-chip network micro-architecture for the multi-core processor is designed and implemented.

5. We further present a novel dynamic virtual channel architecture with a congestion awareness scheme to solve low buffer utilization and reduce blocking. By modifying an existing high-speed router, we complete the VLSI implementation of a router with dynamic channels. The modified router regulates its channel organization according to traffic conditions, and it increases throughput and decreases latency while saving considerable silicon area and power consumption.

Extensive experiments were carried out. For the multimedia and signal processing domains, the optimized processor element shows high performance and low cost. The computing core is comparable in performance to the TI TMS320-C64 series DSP, and the overall processor element clearly accelerates multimedia applications. In addition, a communication structure with low latency and high throughput is presented, together with a measure for reducing its hardware cost. These key techniques, which rest on a sound theoretical basis, can be applied directly to the design and implementation of future multi-core processors.
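
The abstract notes that the dynamic virtual channel router lets its virtual channels share one buffer organized as linked lists. The Python sketch below illustrates that general idea only; it is not the dissertation's hardware design. The class name `SharedVCBuffer`, its methods, and the slot layout are all hypothetical, and congestion awareness is not modeled. Each virtual channel (VC) owns a linked list of slots drawn from a common pool, so a busy channel can use slots that an idle channel would otherwise leave empty.

```python
class SharedVCBuffer:
    """Illustrative flit buffer pool shared by several virtual channels.

    Hypothetical sketch: every slot belongs either to the free list or to
    exactly one virtual channel's FIFO list.
    """

    def __init__(self, num_slots, num_vcs):
        if num_slots < 1 or num_vcs < 1:
            raise ValueError("need at least one slot and one virtual channel")
        self.data = [None] * num_slots
        # next_slot[i] is the slot following slot i in its list, or -1 at the end.
        self.next_slot = list(range(1, num_slots)) + [-1]
        self.free_head = 0                 # all slots start on the free list
        self.head = [-1] * num_vcs         # oldest flit of each VC
        self.tail = [-1] * num_vcs         # newest flit of each VC

    def enqueue(self, vc, flit):
        """Append a flit to a VC; return False if the shared pool is full."""
        if self.free_head == -1:
            return False
        slot = self.free_head
        self.free_head = self.next_slot[slot]
        self.data[slot] = flit
        self.next_slot[slot] = -1
        if self.head[vc] == -1:
            self.head[vc] = slot
        else:
            self.next_slot[self.tail[vc]] = slot
        self.tail[vc] = slot
        return True

    def dequeue(self, vc):
        """Remove and return the oldest flit of a VC, or None if it is empty."""
        slot = self.head[vc]
        if slot == -1:
            return None
        flit = self.data[slot]
        self.head[vc] = self.next_slot[slot]
        if self.head[vc] == -1:
            self.tail[vc] = -1
        # Return the slot to the front of the free list.
        self.data[slot] = None
        self.next_slot[slot] = self.free_head
        self.free_head = slot
        return flit
```

With four slots and two channels, one channel can hold three flits while the other holds one, which a static split of two slots per channel would not allow.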
