
Research on Key System-Level Techniques for Optimizing MPI Communication on Multicore Systems

【Author】 Liu Zhiqiang

【Advisor】 Song Junqiang

【Author Information】 National University of Defense Technology, Computer Science and Technology, 2011, Ph.D.

【Abstract】 Since the 1990s, the Message Passing Interface (MPI) has been the de facto standard for developing parallel programs in high-performance computing (HPC). In MPI-based programs, communication performance usually plays a key role in overall performance, so optimizing MPI communication is of great importance. With the rapid development of multicore technology in recent years, MPI communication urgently needs optimizations tailored to the characteristics of multicore systems. Existing work, however, largely remains within process-based MPI communication techniques, which generally suffer from high processing overhead and heavy memory-access demands that limit further performance gains. Addressing the characteristics of multicore systems and the shortcomings of existing methods, this dissertation starts from thread-based (threaded) MPI communication techniques, systematically studies the key techniques for optimizing MPI communication on multicore systems, and explores a more efficient message-passing interface for shared-memory systems. The main contributions are as follows:

1. An efficient threaded-MPI runtime technique for multicore systems, the MPI Communication Accelerator (MPIActor), is proposed. Through a purpose-built interface-aggregation technique, MPIActor establishes a threaded-MPI environment on top of a conventional process-based MPI runtime. Compared with developing a threaded MPI implementation in the traditional way, building threaded-MPI support with MPIActor takes far less development effort; it is also more flexible, working across any MPI-2-compliant process-based MPI implementation and thereby inheriting its inter-node communication performance. OSU_LATENCY benchmark results on a dual-socket Nehalem-EP system show that, for messages of 8 KB to 2 MB, MVAPICH2 1.4 with MPIActor improves intra-socket communication performance by at least 37% (up to 114%) and inter-socket performance by at least 30% (up to 144%); Open MPI 1.5 with MPIActor improves intra-socket performance by at least 48% (up to 106%) and inter-socket performance by at least 46% (up to 98%).

2. For collective communication on multicore systems, a new hierarchical collective algorithm framework (MPIActor Hierarchical Collective Algorithm Framework, MAHCAF) and a set of efficient threaded-MPI intra-node collective algorithms are proposed on top of MPIActor. MAHCAF builds hierarchical collective algorithms with the template-method design pattern: the intra-node and inter-node collective phases are the template's extensible steps, organized in a pipelined fashion so that the concurrency between collective sub-operations is fully exploited. The intra-node algorithms, designed on threaded MPI, take full advantage of shared memory and incur lower processing cost and memory-access demand than conventional process-based algorithms. IMB experiments on a Nehalem cluster show that, compared with MVAPICH2 1.6, MAHCAF with the general intra-node algorithms brings significant gains for MPI_Bcast, MPI_Allgather, MPI_Reduce, and MPI_Allreduce under most conditions, and that adding the hierarchical segmented reduce algorithm (HSRA), designed specifically for the Nehalem architecture, improves MPI_Reduce and MPI_Allreduce further.

3. To reduce the impact of unbalanced process arrival on broadcast performance, a competitive and pipelined (CP) method is proposed based on MPIActor's particular structure. Exploiting the multiple processes that run within each node of a multicore/multiprocessor system, the method makes the earliest-arriving process in a node the leader that carries out inter-node communication, so the inter-node collective phase starts as early as possible and the average waiting time of the broadcast falls. Micro-benchmark results show that CP-optimized broadcast significantly outperforms conventional algorithms, and experiments with two real applications confirm that the CP method markedly improves broadcast performance in practice.

4. For intra-node MPI communication on multicore/multiprocessor systems, an efficient shared-memory message passing interface (SMPI) is proposed on top of MPIActor. Unlike conventional MPI, the interface lets MPI processes on the same node read a posted message in place by receiving the message's address, instead of copying the data into the receiving process, which greatly reduces memory-access overhead. Multiplying 4000×4000 matrices with 64 MPI processes on 8 nodes, a Cannon matrix-multiplication algorithm built on SMPI achieved a speedup of about 1.14 over the MPI-based version.
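As an illustration of the hierarchical scheme in contribution 2, the sketch below shows a two-level broadcast in plain C/MPI. It is a minimal sketch under assumed simplifications, not the dissertation's MPIActor code: it uses the MPI-3 call MPI_Comm_split_type to group ranks by node, takes intra-node rank 0 as each node's leader, and assumes the broadcast root is world rank 0 (and hence a leader). Leaders first broadcast among themselves (the inter-node step), then each leader re-broadcasts inside its node (the intra-node step).

    /* Hypothetical sketch, not the dissertation's MPIActor code: a two-level
     * broadcast of the kind MAHCAF organizes, written with standard MPI-3.
     * Node leaders broadcast among themselves first, then inside their nodes. */
    #include <mpi.h>
    #include <stdio.h>

    static void hierarchical_bcast(void *buf, int count, MPI_Datatype type,
                                   MPI_Comm comm)
    {
        MPI_Comm node_comm, leader_comm;
        int node_rank;

        /* Group the ranks that share a node (MPI-3). Keys of 0 preserve the
         * parent ordering, so the world root ends up as a node leader. */
        MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                            &node_comm);
        MPI_Comm_rank(node_comm, &node_rank);

        /* Intra-node rank 0 of every node joins the leader communicator. */
        MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, 0, &leader_comm);

        if (node_rank == 0) {                      /* inter-node step */
            MPI_Bcast(buf, count, type, 0, leader_comm);
            MPI_Comm_free(&leader_comm);
        }
        MPI_Bcast(buf, count, type, 0, node_comm); /* intra-node step */
        MPI_Comm_free(&node_comm);
    }

    int main(int argc, char **argv)
    {
        int value = 0, world_rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        if (world_rank == 0)
            value = 42;                            /* the root's payload */
        hierarchical_bcast(&value, 1, MPI_INT, MPI_COMM_WORLD);
        printf("rank %d received %d\n", world_rank, value);
        MPI_Finalize();
        return 0;
    }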

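The zero-copy idea behind SMPI (contribution 4) can likewise be sketched with standard MPI-3 shared-memory windows rather than the dissertation's actual interface. A minimal sketch, assuming at least two ranks share a node: the receiver obtains the address of the sender's buffer with MPI_Win_shared_query and reads the data in place, so what moves between processes is a pointer, not a copy of the payload. Fences stand in here for whatever synchronization the real interface would use to signal when a posted buffer is valid.

    /* Hypothetical sketch of the address-passing idea behind SMPI, expressed
     * with standard MPI-3 shared-memory windows instead of the dissertation's
     * interface: a rank reads a peer's buffer in place rather than receiving
     * a copy of it. Assumes at least two ranks run on the same node. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Comm node_comm;
        MPI_Win  win;
        double  *mine;
        int      rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node_comm);
        MPI_Comm_rank(node_comm, &rank);

        /* Every rank contributes one double to a node-wide shared segment. */
        MPI_Win_allocate_shared(sizeof(double), sizeof(double), MPI_INFO_NULL,
                                node_comm, &mine, &win);
        *mine = 100.0 + rank;
        MPI_Win_fence(0, win);      /* make the writes visible node-wide */

        if (rank == 1) {
            /* Query the address of rank 0's buffer and read it directly:
             * the "message" that travels is a pointer, not the payload. */
            MPI_Aint size;
            int disp_unit;
            double *peer;
            MPI_Win_shared_query(win, 0, &size, &disp_unit, &peer);
            printf("rank 1 read %.1f straight from rank 0's memory\n", *peer);
        }

        MPI_Win_fence(0, win);
        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }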
