Node Literature

面向科学计算的PIM体系结构技术研究

Studies on the PIM Architectures and Techniques for Scientific Applications

【Author】 Wen Pu

【Supervisor】 Yang Xuejun

【Author Information】 National University of Defense Technology, Computer Science and Technology, 2007, Doctoral dissertation

【摘要】 Modern high-performance computer systems typically adopt a decoupled design in which processor and memory are separated: the design is processor-centric, with the processor connected to separate memories through a hierarchy of caches and a complex memory interconnect. In such a decoupled design the processor is optimized for speed while the memory is optimized for density; these divergent goals lead to divergent development trends, and as process technology and processor architecture advance, the speed gap between processor and memory keeps widening, producing the "memory wall". In decoupled designs, a large share of chip resources is spent merely on mitigating this ever-growing gap. PIM technology integrates processor and memory tightly on a single chip, reunifying the decoupled structure; it offers high memory bandwidth, low memory latency, and low power, and has become an effective way to attack the memory-wall problem. Current PIM-based research mainly covers PIM microarchitecture, PIM parallel architecture, PIM programming models, and PIM-oriented compiler optimization, all starting from the goal of exploiting PIM's high-bandwidth, low-latency characteristics to the fullest. Our research concentrates on two aspects: first, PIM microarchitecture, i.e., which processor organization inside a PIM best exploits its high-bandwidth, low-latency characteristics; second, issues related to PIM-based parallel systems.

Through the study of PIM architectures for scientific computing, this dissertation proposes, designs, and implements a high-performance PIM architecture with vector processing capability, the V-PIM architecture, and explores V-PIM-based parallel systems and software optimization techniques for the V-PIM architecture. The main contributions are the following:

1. A V-PIM architecture with vector processing capability. V-PIM is a PIM architecture built around vector processing logic. Vector architectures have a mature programming model and strong expressiveness for data parallelism, while PIM provides a high-performance memory system, so combining the two is natural. We analyze register-based and memory-based vector organizations from the standpoint of area-normalized utility (performance/area) and conclude that a memory-memory organization combined with PIM is advantageous in on-chip resource utilization and power reduction; we therefore adopt a memory-memory vector organization. The dissertation describes the V-PIM design, presents the extended vector instruction set, and validates the design on an FPGA.

2. The V-Parcels communication mechanism for V-PIM parallel systems. The communication subsystem strongly affects the computing efficiency, scalability, and applicability of a parallel system. To reduce the data-communication overhead of executing vector instructions in a parallel V-PIM system, we propose the V-Parcels communication mechanism. Its basic idea is to generate, dynamically and according to the distribution of vector elements, an appropriate computation-communication pattern that minimizes the communication overhead of vector execution while preserving performance. V-Parcels can move computation to the node where the data resides instead of frequently moving data between nodes.

3. A thread-classification model for V-PIM parallel systems. V-PIM processors typically run threads with poor temporal locality but good spatial locality, which we call light-weight threads; threads better suited to the host processor are called heavy-weight threads. To improve overall system performance, software must distinguish the two classes. We design a compile-time classification algorithm that decides by comparing a thread's execution performance on the V-PIM and host processors; once the compiler identifies a thread's type, the thread is scheduled to the corresponding processor, accelerating overall performance. The algorithm is simple to implement, requires few compiler changes, and tracks actual behavior closely, making it practical.

4. COPE, an architecture combining PIM main memory with clusters of execution cores. We propose COPE (Composite Organization for Push Execution), whose defining feature is that main memory "pushes" data to execution units for execution. COPE is memory-centric: the PIM, acting as memory, pushes data to the execution clusters, which perform the computation and contain only simple control logic; the parts are connected by an on-chip interconnect. Programs reside in PIM memory; the PIM processor statically schedules program blocks onto the execution clusters, the PIM memory supplies the required data, and the clusters execute in a data-driven fashion. Intermediate results are passed directly to the next cluster through register communication, with no need to transfer temporary values into registers, thereby avoiding large hardware mechanisms that both hurt performance and do not scale.
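The V-Parcels idea in contribution 2, moving the operation to the nodes that own the data rather than gathering the data to one node, can be illustrated with a small sketch. All names, the block distribution, and the parcel format below are hypothetical illustrations, not the dissertation's actual mechanism:

```python
# Hypothetical sketch of the V-Parcels idea: for a vector operation over
# elements distributed across V-PIM nodes, ship a small operation
# descriptor ("parcel") to each owning node instead of moving the data.

from collections import defaultdict

def partition_by_owner(indices, elems_per_node):
    """Group vector-element indices by the node that stores them,
    assuming a simple block distribution of the vector."""
    by_node = defaultdict(list)
    for i in indices:
        by_node[i // elems_per_node].append(i)
    return by_node

def plan_parcels(indices, elems_per_node):
    """Emit one parcel per owning node. Each parcel tells a node which
    of its local elements to process, so only small descriptors -
    not the vector data - cross the interconnect."""
    owners = partition_by_owner(indices, elems_per_node)
    return [{"node": node, "local_indices": idx}
            for node, idx in sorted(owners.items())]

# A vector slice spanning 3 nodes, with 4 elements stored per node:
for p in plan_parcels(range(6, 14), elems_per_node=4):
    print(p["node"], p["local_indices"])
# → 1 [6, 7]
# → 2 [8, 9, 10, 11]
# → 3 [12, 13]
```

Each node then applies the vector operation to its local elements, which matches the abstract's point that computation is carried to the data's node rather than data being exchanged between nodes.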

【Abstract】 Current high-performance computer systems usually adopt decoupled architectures in which processor and memory are separated and connected by a hierarchy of caches and a complex interconnect. In these processor-centric designs, processor and memory use high-volume semiconductor processes, each highly optimized for the particular characteristics its product demands, e.g. high switching rates for logic processes and long retention times for DRAM processes. With advances in semiconductor manufacturing and the rapid development of processor architectures, processor speed has far outpaced memory speed; the processor-memory gap grows ever larger, giving rise to the "memory wall", and processor-centric designs invest a great deal of power and chip area in bridging it. PIM (Processor-In-Memory), which merges processor and memory on a single chip, reunites the two and has the well-known benefits of high-bandwidth, low-latency communication between processor and memory and reduced energy consumption. As a result, many systems based on PIM architectures have been proposed. Research has mainly focused on PIM microarchitecture, PIM parallel systems, PIM programming models, and PIM compiler optimization; common to all of it is the goal of fully exploiting PIM's high bandwidth and low latency. Our research on PIM architecture techniques stresses two aspects: one is PIM microarchitecture, that is, finding a processor architecture for PIM that makes the best of the benefits the PIM organization supplies; the other concerns the PIM parallel system.
Building on this study of PIM architectures for scientific computation, a Vector-based Processor-In-Memory (V-PIM) architecture, which couples vector processing with the PIM organization, is proposed; a parallel system based on V-PIM is presented; and software optimization techniques for it are discussed. The primary research and innovative work in this dissertation can be summarized as follows:

1. The V-PIM architecture, a Vector-based Processor-In-Memory. Vector architectures have a mature programming model and a powerful ability to express data parallelism explicitly, and PIM provides a high-performance memory system, so uniting the two is natural. Comparing register-register and memory-memory vector organizations on area-normalized utility (performance/area), we find that combining a memory-memory vector organization with PIM is superior to combining a register-register organization with PIM, since it consumes less power and uses on-chip resources better; we therefore adopt a memory-memory vector organization in the V-PIM design. This work describes the design of the V-PIM architecture, presents the extended vector instruction set, and verifies the architecture on an FPGA-based platform.

2. The V-Parcels communication mechanism for the V-PIM parallel system. The communication subsystem is critical to the computing efficiency, scalability, and applicability of a parallel system. To reduce communication traffic and improve the performance of vector execution, the V-Parcels communication mechanism for the V-PIM parallel system is proposed. Its main characteristic is support for transferring vector operations between V-PIM nodes.
Based on an analysis of the distribution of vector elements, V-Parcels dynamically generates communication packages that carry either data or operations, so as to localize computation, minimize communication, and maximize computing performance.

3. A compile-time thread-classification algorithm for the V-PIM-based architecture. On this architecture, a thread with low temporal locality that runs on the V-PIM processor is called a Light-Weight Thread (LWT), while a thread with a low cache-miss rate that runs on the host processor is called a Heavy-Weight Thread (HWT). How threads are classified directly affects system performance, so a suitable classification algorithm is needed. Based on a thread's execution performance on the V-PIM and the host, we present a compile-time method to distinguish LWTs from HWTs. Once the compiler identifies a thread's type, it can schedule the thread onto the appropriate processor and accelerate the system as a whole. The algorithm is simple to implement, and its results approach actual behavior.

4. COPE, a composite organization of PIM and multiple execution clusters for push execution. We present COPE (Composite Organization for Push Execution), a new PIM architecture that combines PIM memory and multiple execution clusters on a chip to meet the power, wire-latency, and memory-wall challenges facing future teraflops chips. In the memory-centric COPE architecture, the PIMs play the role of smart memories and the execution clusters play the role of processing units: data is pushed to the clusters, which execute it, and the clusters are interconnected by an on-chip operation network. As smart memory, the PIM holds both code and data and steers instruction execution in the clusters, which follow a data-driven execution model.
Temporary results can be passed directly to the next processing unit through register communication, with no need to write them back to registers, avoiding massive hardware mechanisms that hurt performance and do not scale.
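The compile-time LWT/HWT decision described in contribution 3 amounts to comparing a thread's predicted cost on each processor and scheduling it to the cheaper one. The sketch below illustrates that comparison; the cost model, cycle counts, and profile fields are invented placeholders, not the dissertation's actual algorithm:

```python
# Illustrative sketch of compile-time thread classification: estimate a
# thread's execution time on the V-PIM processor and on the host, then
# label it LWT (run on V-PIM) or HWT (run on the host) accordingly.
# All constants here are made-up assumptions for illustration only.

from dataclasses import dataclass

@dataclass
class ThreadProfile:
    instructions: int      # estimated dynamic instruction count
    mem_accesses: int      # estimated memory accesses
    cache_hit_rate: float  # predicted host cache hit rate (0..1)

def host_cycles(t, hit_cost=2, miss_cost=200):
    """Host: fast core, but every cache miss pays full DRAM latency."""
    misses = t.mem_accesses * (1.0 - t.cache_hit_rate)
    hits = t.mem_accesses - misses
    return t.instructions + hits * hit_cost + misses * miss_cost

def vpim_cycles(t, mem_cost=10, cpi_penalty=2.0):
    """V-PIM: slower core (higher CPI), but uniform low-latency memory."""
    return t.instructions * cpi_penalty + t.mem_accesses * mem_cost

def classify(t):
    """Light-weight threads go to V-PIM, heavy-weight threads to the host."""
    return "LWT" if vpim_cycles(t) < host_cycles(t) else "HWT"

# A streaming, cache-unfriendly thread vs. a cache-friendly compute thread:
streaming = ThreadProfile(instructions=10_000, mem_accesses=5_000, cache_hit_rate=0.30)
compute   = ThreadProfile(instructions=10_000, mem_accesses=1_000, cache_hit_rate=0.98)
print(classify(streaming))  # → LWT
print(classify(compute))    # → HWT
```

This mirrors the abstract's criterion: threads with poor temporal locality (low host cache hit rates) win on the V-PIM's low-latency memory, while cache-friendly threads stay on the host.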
