
面向众核GPU的编程模型及编译优化关键技术研究

Research on the Key Techniques of Programming Model and Compiler Optimization for Many-core GPU

【Author】 Gan Xinbiao (甘新标)

【Advisor】 Wang Zhiying (王志英)

【Author information】 National University of Defense Technology, Computer Science and Technology, 2012, PhD

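【Summary】 GPGPU (General Purpose computing on Graphics Processing Units) is widely used in high-performance computing. However, GPU architectures and programming models differ from traditional CPU architectures and programming styles, so developing efficient GPU applications remains highly challenging. This thesis studies key techniques in programming models and compiler optimization for many-core GPUs and addresses several key theoretical and technical problems in this area. The main results and innovations are as follows:

1. A many-thread parallel programming model is proposed. The arrival of the multi-core and many-core era has made parallel programming models an active research area, yet no multi-core or many-core parallel programming model has been generally accepted so far. Building on the stream parallel programming idea and weighing the strengths and weaknesses of typical parallel programming models, this thesis proposes, for the first time, a many-thread programming model called ab-Stream. ab-Stream hides the differences between many-core architectures and gives programmers a parallel programming model that is easy to parallelize, program, extend, and tune.

2. Parallelization methods with multiple levels of computing granularity are proposed for mapping GPGPU applications. A GPU has hundreds to thousands of compute cores, and partitioning parallel tasks and choosing the computing granularity so as to exploit that parallel capability as fully as possible is difficult. Guided by the characteristics of the input sets of GPGPU applications, the thesis proposes a fragment-level relaxed parallelization method for input sets with chain dependence structures and a pixel-level mapping parallelization method for input sets with 2D data structures. Experimental results show that the two methods, which work at different computing granularities, fully exploit the potential parallelism of GPGPU applications and are direct and simple to implement.

3. Storage and transfer optimization techniques based on data classification are proposed. The GPGPU architecture is a memory-bound high-performance processor architecture. To use its diverse memory resources effectively, the thesis first proposes a data layout optimization technique based on classified storage, which explicitly assigns each class of data to the memory space that best exploits its characteristics in order to maximize memory access efficiency. It then proposes a transfer optimization technique for strided data based on pre-transformation. Experimental results show that these techniques significantly improve the performance of GPGPU applications.

4. A load-balanced computing collaboration framework for compute-intensive applications is proposed. CPU+GPU heterogeneous systems often stay overloaded or lightly loaded for long periods. To make full use of the computing resources of such systems, the framework lets the CPU and GPU execute in parallel in pipeline mode, promotes the GPU to a data consumer or a producer of part of the data, and integrates optimizations such as zero-loading and cache loading to improve the performance of the whole framework. Experimental results show that the framework significantly improves the utilization of computing resources in GPU+CPU heterogeneous systems.

To verify the feasibility and effectiveness of the ab-Stream programming model and its key supporting techniques, the thesis designs and implements a prototype system, ab-Stream4G, based on the ab-Stream programming framework. The prototype includes the application mapping methods for many-thread architectures, the memory optimization techniques for many-thread architectures, and the load-balancing strategies for many-thread heterogeneous systems. Experimental results show that ab-Stream4G runs correctly and efficiently.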

【Abstract】 GPGPU (General Purpose computing on Graphics Processing Units) has been widely applied to high-performance computing. However, the GPU architecture and programming model differ from those of the traditional CPU, so developing efficient GPU applications remains challenging. This thesis focuses on the key techniques of programming models and compiler optimization for many-core GPUs and addresses a number of key theoretical and technical issues. The primary contributions and innovations are summarized as follows.

1. We propose a many-threaded programming model. There is no generally accepted parallel programming model for multi-core and many-core processors. Accordingly, after studying stream-based and classical parallel programming models, we propose a many-threaded programming model, ab-Stream, which hides architectural differences and provides a programming model that is easy to parallelize, easy to program, easy to extend, and easy to tune.

2. We propose parallelizing approaches with hierarchical computing granularities to map GPGPU applications. GPUs contain hundreds of computing cores, but it is difficult to identify an appropriate computing granularity for mapping GPGPU applications so as to maximize GPU productivity. Guided by application inputs, we first propose a parallelizing approach with relaxation for GPU applications whose inputs have chain dependences. Second, we propose a pixel-level parallelizing approach to map GPU applications with 2D inputs. Experimental results show that the proposed approaches are easy to implement and exploit the potential parallelism in GPGPU applications efficiently.

3. We propose memory optimization and data transfer transformation based on data classification. The GPGPU architecture is a memory-bound, high-performance architecture. In order to utilize the diverse GPU storage resources effectively, we first propose a data layout optimization based on classified storage, and then we propose TaT (Transfer after Transformed) for transferring strided data between the CPU and GPU. Experimental results demonstrate that the proposed techniques significantly improve the performance of GPGPU applications.

4. We propose a load-balanced collaborative framework for compute-intensive applications. Heterogeneous systems composed of a CPU and a GPU are often not load-balanced. In order to take full advantage of GPU+CPU heterogeneous systems, the proposed collaborative framework overlaps data transfer and computation in pipeline mode. Additionally, optimization techniques including zero-loading and cache loading are integrated into the collaborative framework to maximize the performance of heterogeneous systems. Experimental results demonstrate that the proposed collaborative framework improves the utilization of heterogeneous systems.

In order to validate the correctness and high productivity of the ab-Stream programming model, we design a prototype, ab-Stream4G, for CUDA-enabled GPUs based on the proposed techniques. Experimental results show that ab-Stream4G works correctly and efficiently.
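The abstract names TaT (Transfer after Transformed) for strided data but does not describe how it works. The sketch below shows one plausible reading, offered as an assumption rather than as the thesis's method: gather the strided elements into a contiguous buffer on the host, then move them in a single transfer instead of one small transfer per element. `FakeLink`, `pack_strided`, `transfer_naive`, and `transfer_after_transform` are hypothetical names. `FakeLink` only counts copy calls and stands in for a real CPU-to-GPU copy.

```python
def pack_strided(buffer, offset, stride, count):
    """Gather `count` elements spaced `stride` apart into one contiguous list."""
    return [buffer[offset + i * stride] for i in range(count)]


class FakeLink:
    """Hypothetical stand-in for a CPU-to-GPU copy; it only records traffic."""

    def __init__(self):
        self.calls = 0      # number of separate transfers issued
        self.device = []    # data that has arrived on the "device" side

    def copy(self, chunk):
        self.calls += 1
        self.device.extend(chunk)


def transfer_naive(link, buffer, offset, stride, count):
    """Baseline: one small transfer per strided element."""
    for i in range(count):
        link.copy([buffer[offset + i * stride]])


def transfer_after_transform(link, buffer, offset, stride, count):
    """Assumed TaT-style path: pack on the host first, then transfer once."""
    link.copy(pack_strided(buffer, offset, stride, count))
```

Both paths deliver the same elements. The packed path issues a single transfer, which is where any saving would come from when per-transfer overhead dominates. Whether that holds on real hardware depends on the transfer size and the cost of packing.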
