节点文献

大规模并行计算通信可扩展性—分析、优化与模拟

Communication Scalability of Large-scale Parallel Computing-Analysis, Optimization and Simulation

【作者】 林宇斐

【导师】 杨学军;

【作者基本信息】 国防科学技术大学 , 计算机科学与技术, 2013, 博士

【摘要】 随着系统规模的扩大和结点计算能力的提高,通信已经成为制约并行计算可扩展性的重要瓶颈。通信可扩展性问题,即分析通信受何种因素影响并且该影响增大到何种程度会限制系统的可扩展性,是并行计算领域最具挑战性的理论问题之一。本文针对通信可扩展性问题,首次从性能加速比的角度量化了并行计算的通信墙,并建立了通信可扩展性模型。基于通信可扩展性模型的分析结论,本文分别针对程序优化和任务分配优化,提出了消息独立性指导下的程序优化技术和面向多作业的分配优化技术。最后,设计和实现了一款针对大规模并行计算的性能预测模拟器,该模拟器可用于验证通信可扩展性模型的正确性以及并行系统的各种相关优化技术的可扩展性。具体而言,本文的主要工作和创新点体现在:1.建立了通信可扩展性模型(第二章)目前,国际上对于通信可扩展性问题大多是感性上的认识,并未对其进行系统的定量研究。本文首次提出了通信墙的定量化描述,给出了通信墙存在性定理。由此,本文建立了通信可扩展性模型,提出了系统度量方法及基于通信可扩展性模型的并行系统分类方法,量化了系统的通信可扩展性强弱和广义通信可扩展性强弱。最后结合具体案例,分析了程序、并行机拓扑以及常见优化方法对通信可扩展性的影响,比较了常见的巨型机拓扑的广义通信可扩展性强弱,指出优化系统通信可扩展性和广义通信可扩展性的方向。2.提出了消息独立性指导下的程序优化技术(第三章)基于指令重排的通信隐藏技术是优化程序性能的主要手段之一,然而除去该技术自身面临的问题,它还会导致消息间产生严重的网络资源竞争。本文通过分析网络资源竞争的产生原因,首次提出了消息独立性的概念并研究了其具体涵义;然后针对MPI(Message Passing Interface)程序,建立了基于指令重排的消息独立性指导下的程序优化模型;基于上述优化模型,设计并实现了基于指令重排的消息独立性指导下的程序优化方法,该方法可以在保证通信隐藏最大化的前提下减少消息间的网络资源竞争;针对并行CFD(Computational Fluid Dynamics)应用的实验表明,该方法能够很好的减少程序的通信开销并提升程序的性能。3.提出了面向多作业的分配优化技术(第四章)合理地为多个作业分配计算资源以满足作业的性能需求,对于那些使用大规模并行计算系统的用户来说十分重要。本文首次提出将多作业分配优化问题分解为多作业分布优化和单作业任务映射优化两个子问题。针对多作业分布优化问题,本文首次提出闭合最小图划分模型,将多作业分布优化问题转化为闭合最小图划分问题;针对单作业任务映射优化问题,本文分析了通信协议对通信开销的影响,首次为MPI程序提出了协议感知的进程映射模型—PaPP。基于上述两个模型,本文设计并实现了面向多作业的分配优化方法。实验表明,对于NPB(NAS Parallel Benchmarks)测试集,面向多作业的分配优化方法有很好的性能优化效果。4.设计并实现虚实结合的执行驱动模拟器—VACED-SIM(第五章)离散事件模拟是大规模并行计算常用的性能预测方法之一。本文基于对离散事件模拟方法的深入分析,提出了虚模拟和实模拟的概念;通过对虚模拟和实模拟以及轨迹驱动和执行驱动方法的对比,首次从两个正交的角度(模拟机制和事件驱动方法)将基于离散事件模拟的性能预测方法分为四类;针对大规模并行计算可扩展性预测的特点,首次提出了第四类模拟方法—虚实结合执行驱动(VACED)模拟方法的模型。基于该模型,本文设计和实现了一款轻量级的虚实结合执行驱动模拟器—VACED-SIM。在该模拟器中,本文首次提出并采用了细粒度的活动和事件定义方法,从而提高模拟的精度。在Tianhe-1A子系统上的实验结果表明,VACED-SIM具有很高的准确性与效率。

【Abstract】 With the increasing of system scale and the rising of processing node performance,communication has become the key bottleneck that limits the scalability of parallel com-puting. The communication scalability problem, which analyzes the factors affecting thecommunication and that to what extend the influence of these factors increases will limitthe system scalability, is one of the most challenging academic problems in the field ofparallel computing.Aiming at the communication scalability problem, this paper quantifies for the firsttime the communication wall for parallel computing from the perspective of the perfor-mance speedup, and builds the model of communication scalability. Based on the analyt-ical results of the model, this paper proposes the program optimization technique underthe guidance of the message independency and the optimization technique for multi-joballocation, for the program optimization problem and for the task allocation optimizationproblemrespectively. Atlast,thispaperdesignsandimplementsaperformancepredictionsimulatorforlarge-scaleparallelcomputing, whichcanbeusedtovalidatethecorrectnessof the model and the scalability of various system optimization techniques.Specifically, the main work and contributions of this paper are as follows:1. Building the model of communication scalability (Chapter2)Currently, there is only perceptual knowledge to rather than qualitative research onthe communication scalability problem in the world. This paper proposes for thefirst time the quantitative description of the communication wall and the commu-nication wall existing theorem. After that, this paper builds the model of commu-nication scalability, proposes the system measurement method and the classifica-tion of parallel systems based on the model, and then quantifies the extend of thecommunication scalability and the extend of the general communication scalability.Combined with some specific cases, we analyze the effects of program, topologyand optimization methods on the communication scalability, compare the extendof general communication scalability among the common topologies, and point outthe way to optimize the communication scalability and the general communicationscalability for the parallel systems. 2. Proposing the program optimization technique under the guidance of the messageindependency (Chapter3)Thecommunicationhidingtechniqueusinginstructionreorderingisoneofthemainprogram performance optimization methods. Besides the problems of its own, it isalso incurs serious network resource contentions among messages. By analyzingthe reasons for the network resource contentions, we unprecedentedly propose theconcept of the message independency and study its specific meaning. As for MPI(Message Passing Interface) programs, we build the program optimization modelunder the guidance of message independency based on instruction reordering. Byusingthemodel, wedesignandimplementaprogramoptimizationapproach, whichcan reduce the network resource contentions among messages in the prerequisiteof the maximal communication hiding. The experimental results show that, forthe CFD (Computational Fluid Dynamics) applications, this approach can greatlyreduce the communication overhead and improve the program performance.3. Proposing the optimization technique for the multi-job allocation (Chapter4)Allocating the computing resources for multiple jobs to satisfy their performanceneeds is greatly desired by the users of large-scale parallel computing systems.In this paper, we unprecedentedly propose the idea that decomposes the multi-job allocation optimization problem to two sub-problems: multi-job assignmentoptimization and single-job task mapping optimization. For the multi-job assign-mentoptimizationproblem,weunprecedentedlyproposetheclosed-minimalgraph-partitioning model, which transforms the multi-job assignment optimization prob-lem to a closed-minimal graph-partitioning problem; for the single-job task map-ping optimization, we analyze the impacts of the communication protocols on thecommunication overheads, and unprecedentedly propose the protocol-aware pro-cess mapping model for MPI programs—PaPP. Based on the above two models,we design and implement the multi-job allocation optimization approach. The ex-perimental results show that, for the NPB (NAS Parallel Benchmarks) applications,this approach has a high performance optimization efficiency.4. Designing and implementing a virtual-actual combined execution-driven simulator—VACED-SIM (Chapter5) Discrete event simulation is one of the most common performance prediction ap-proaches in the field of large-scale parallel computing. By deeply analyzing thediscrete event simulation approaches, we propose the concepts of virtual simula-tion and actual simulation. Based on the comparison between actual and virtualsimulations and that between trace-driven and execution-driven methods, our pa-per categorizes for the first time the discrete event simulation approaches into fourkindsfromtwoorthogonalaspects(simulationmechanismandevent-drivenmecha-nism). Based on the characteristics of scalability prediction for large-scale parallelcomputing, we propose for the first time the model of the fourth kind of discretesimulation approach—the virtual-actual combined execution-driven simulationapproach. With the model, we design and implement a light-weight virtual-actualcombined execution-driven simulator—VACED-SIM. In this simulator, we un-precedentedly propose and use the fine-grained activity and event defining meth-ods to increase the precision of the simulation. The experiments on the Tianhe-1Asub-system shows that, VACED-SIM has high accuracy and efficiency.

节点文献中: 

本文链接的文献网络图示:

本文的引文网络