节点文献

面向共享存储结构的并行编译优化技术研究

Research on Automatic Parallelization and Optimization Technologies for Shared Memory Architecture

【作者】 刘晓娴

【导师】 赵荣彩;

【作者基本信息】 解放军信息工程大学 , 计算机科学与技术, 2013, 博士

【摘要】 在计算机体系结构的发展过程中,并行结构的出现与不断发展将高性能计算机系统的峰值速度一次又一次推向新的高峰。但与硬件的峰值性能相比,用户程序所能获得的持续性能相去甚远,其中一个主要原因是并行程序设计带来的挑战。程序的自动并行化是实现并行程序设计的一条有效途径,编译器通过对串行程序中蕴含并行性的分析与发掘,自动生成适合并行体系结构运行的并行程序。自动并行化编译技术对于继承现有的软件财富,促进高性能计算机的应用具有重要作用。共享存储结构在高性能计算机体系结构中占据着重要地位,面向共享存储结构的并行化编译技术经过几十年的发展,已较为成熟。但是,要实现共享存储平台上高效并行代码的自动生成,仍面临若干技术挑战,如:存在跨迭代依赖循环的有效并行;自动并行化过程中程序并行收益的精确评估;异构平台上多层次存储系统的有效使用。本文以并行编译器SW-VEC的研发为背景,探讨了面向共享存储结构的并行编译优化技术,主要贡献和创新包括:1、提出了一种基于OpenMP的规则DOACROSS循环流水并行代码自动生成和流水粒度优化算法,设计实现了计算划分层和循环分块层的启发式选择算法,有效提高了规则DOACROSS循环的自动并行性能。2、提出了一种基于OpenMP的PS-DSWP自动并行改进算法,以基本块而非指令作为构建程序依赖图的基本单位,增大了并行的粒度;使用OpenMP应用编程接口实现并行时线程之间的任务分配和数据共享,有效实现了PS-DSWP算法的应用扩展和目标代码的性能提升。3、建立了一种新型的OpenMP代价分析模型,采用模块化和层次化的策略,将模型分为循环执行模型和硬件模型两个层次,既能灵活地实现模型扩展,又便于移植和运用于不同的目标体系结构。4、提出了一种基于多面体模型和精确数组区域表示的数据传输优化方法,设计了一组实现异构平台上数据传输控制的OpenMP扩展子句,定义了分块规则数组区域及其合并操作实现数组区域的精确表示,提升了异构平台中多层次存储系统的使用效率。本文提出的算法和模型已在并行编译器SW-VEC中得到了实现和应用,验证了算法的正确性和高效性。

【Abstract】 The peak speeds of high performance computer are pushed to new pinnacle with the use and innovations of parallel architectures again and again. However, the difficulty of programming parallel architectures is a huge challenge for programmers. Among several approaches to address the parallel programming problem, one that is very promising but simultaneously very challenging is automatic parallelization. Automatic parallelization is the process of automatically converting a sequential program to a version that can directly run on multiple processing elements without altering the semantics of the program. This process requires little effort from programmers and is therefore very attractive.Shared memory architectures always play an important role during the course of high performance architecture development. The automatic parallelization techniques for shared memory architecture have been widely studied in the domain of scientific and numeric applications, but there are still many challenges in automatic generation of high performance parallel programs for shared memory architectures, such as parallelization of loops with inter-iteration dependences, cost model used for profit estimation and data transfer optimization for heterogeneous platform.This dissertation, based on the development and research of an automatic parallelizing compiler SW-VEC, focuses on the issues of the automatic parallelization and optimization technologies for shared memory architecture; the main contribution and innovation of this dissertation are as follows:1. A pipelining granularity optimizing algorithm based on DOACROSS cost model is proposed to obtain the optimal pipelining granularity, and an automatic pipelining parallelization algorithm for regular DOACROSS loops based on OpenMP is proposed for shared memory parallel platforms. By using above algorithms, automatic generation of effective pipelining code is implemented.2. An improved PS-DSWP algorithm based on OpenMP is proposed, which is implemented without relying on CPU architectures by using a high level intermediate representation. Moreover, the Program Dependence Graph (PDG) used in the algorithm is built based on the basic blocks, which exploits coarser-grained parallelism than the original PS-DSWP transformation with PDG based on instructions. OpenMP is employed in our algorithm to assign task and implement synchronization among threads while avoiding platform limitation.3. A compile-time cost model for automatic parallelization profit estimation based on a modularized and hierarchized strategy is built, which is partitioned into two hierarchies. The major benefit of the modularized and hierarchized strategy is that the design and implementation could be easily achieved and the resulting model should be flexible and extensible.4. An approach of managing data storage and data transfer between the main memory and local memory is proposed by designing a potential extension to OpenMP. Meanwhile, blocking regular array region and its union operation are defined to describe the set of transferred array data, and develop a method to compute the array region by applying the polyhedral model.The algorithms and cost model presented in the dissertation have been carried out and applied in the SW-VEC system, and the validity has been proved.

节点文献中: 

本文链接的文献网络图示:

本文的引文网络