Research on Key Parallel Optimization Techniques for GPU Computing Platforms (面向GPU计算平台的若干并行优化关键技术研究)

【Author】 Jia Haipeng (贾海鹏)

【Advisor】 Xu Jianliang (徐建良)

【Author Information】 Ocean University of China, Computer Application Technology, 2012, Ph.D.

【Abstract】 As the computing power and programmability of GPUs continue to grow, more and more application developers are adopting them as accelerators to improve program performance. However, it is hard to reach the desired performance on a GPU without careful optimization, because the burden of performance optimization has shifted from hardware designers to application developers. Optimizing GPU programs is very difficult: in essence, it means mapping the characteristics of an algorithm efficiently onto the characteristics of the underlying hardware. On the one hand, this process requires deep knowledge of the underlying hardware, and the growing diversity of modern GPU architectures makes an already difficult task harder. On the other hand, the applications ported to GPUs are also becoming increasingly diverse; overall, they can be divided into two categories, regular applications and irregular applications. Different program characteristics call for different optimization methods and strategies on different hardware architectures.

To simplify the optimization of GPU programs and enable application developers to write high-performance GPU programs more easily, we divide our work into two parts according to the characteristics of the applications.

For regular applications, we propose the concept of a performance optimization chain and, according to the characteristics of GPU computation and memory access, divide it into two categories: the threshold optimization chain and the tradeoff optimization chain. By introducing the Roofline model, we visualize the optimization chain and establish GPURoofline, a visual performance model that guides the optimization of GPU programs on a specific hardware platform. The model provides performance information that identifies the performance bottleneck of a GPU program on that platform and the optimization strategies and methods that should be chosen. It helps application developers, especially those unfamiliar with the underlying GPU architecture, implement high-performance GPU programs more easily. We demonstrate the usability and correctness of GPURoofline by optimizing three representative applications with different compute intensities and program characteristics.

For irregular applications, we take the Viola-Jones face detection algorithm as an example and introduce five key techniques for implementing and optimizing irregular applications on GPUs: coarse-grained parallelism, Uberkernel, Persistent Thread, local queue, and global queue. By defining and extracting performance parameters, we complete a preliminary implementation of a tunable GPU kernel and use it to achieve performance portability of the Viola-Jones face detection algorithm across different GPU platforms. Experimental results show that, compared with the well-optimized CPU version in the OpenCV library, the optimized implementation achieves speedups of 5.19 to 27.724, 6.468 to 35.080, and 5.850 to 28.768 on the AMD HD5850, AMD HD7970, and NVIDIA C2050 GPUs, respectively.

The key contributions of this thesis are as follows:

1. We analyze and compare the similarities and differences among the current mainstream GPU architectures, and propose three effective ways to improve the performance of GPU programs: improving the utilization of off-chip memory bandwidth, improving the utilization of computing resources, and data locality.
2. We define algorithm compute intensity and hardware compute intensity, and by comparing the two we classify GPU kernels as memory-bound or computation-bound. We propose and build performance optimization chains for specific hardware platforms and, according to the characteristics of memory-access and computation optimizations, divide them into threshold optimization chains and tradeoff optimization chains.
3. We build GPURoofline, an empirical, visual performance model for guiding performance optimization. Introducing the Roofline model makes the optimization chain visual, so that it guides the optimization of GPU programs in a more intuitive way.
4. We introduce five key techniques for implementing and optimizing irregular applications on GPUs: coarse-grained parallelism, Uberkernel, Persistent Thread, local queue, and global queue. We show how the five techniques are applied by implementing and optimizing the Viola-Jones face detection algorithm on GPUs. Finally, by defining and extracting performance parameters, we complete a preliminary tunable kernel, which verifies that performance portability across different GPU hardware platforms is possible.
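The abstract does not give the thesis's exact definitions of algorithm compute intensity and hardware compute intensity. The following is a minimal sketch that assumes the standard Roofline formulation: intensity is measured in floating-point operations per byte of off-chip traffic, and the hardware compute intensity is the ratio of peak compute throughput to peak memory bandwidth. The function names and all numbers are hypothetical illustrations, not values or code from the thesis.

```python
def hardware_compute_intensity(peak_gflops, peak_bandwidth_gbs):
    """Assumed definition: FLOPs per byte at which the platform stops being bandwidth-limited."""
    return peak_gflops / peak_bandwidth_gbs


def algorithm_compute_intensity(flops, bytes_moved):
    """Assumed definition: FLOPs the kernel performs per byte of off-chip memory traffic."""
    return flops / bytes_moved


def attainable_gflops(intensity, peak_gflops, peak_bandwidth_gbs):
    """Roofline bound: the lower of the compute roof and the bandwidth roof."""
    return min(peak_gflops, intensity * peak_bandwidth_gbs)


def classify_kernel(intensity, peak_gflops, peak_bandwidth_gbs):
    """Compare the algorithm intensity with the hardware intensity."""
    ridge = hardware_compute_intensity(peak_gflops, peak_bandwidth_gbs)
    return "memory-bound" if intensity < ridge else "compute-bound"


# Hypothetical platform: 1000 GFLOP/s peak compute, 100 GB/s peak bandwidth.
PEAK_GFLOPS = 1000.0
PEAK_BANDWIDTH_GBS = 100.0

# Hypothetical kernel: 2 FLOPs for every 8 bytes moved.
kernel_intensity = algorithm_compute_intensity(flops=2.0, bytes_moved=8.0)
kernel_class = classify_kernel(kernel_intensity, PEAK_GFLOPS, PEAK_BANDWIDTH_GBS)
kernel_bound = attainable_gflops(kernel_intensity, PEAK_GFLOPS, PEAK_BANDWIDTH_GBS)
```

Under these assumptions, a kernel whose intensity lies below the hardware intensity is limited by memory bandwidth, so memory-access optimizations would be tried first; a kernel above it is limited by compute throughput.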
