Research on Efficient Parallel Algorithms for the GRAPES Limited-Area Tangent-Linear/Adjoint Models

Studies on High Performance Parallel Computing of GRAPES’ Tangent & Adjoint Model

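【Author】 Ren Disheng

【Advisor】 Zhao Wentao

【Author information】 National University of Defense Technology, Computer Science and Technology, 2010, Master's thesis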

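【Abstract (translated from the Chinese)】 Four-dimensional variational (4D-Var) data assimilation is one of the key technologies of numerical weather prediction. It incorporates the time evolution of observations from different regions and of different types into the initial field, which improves forecast quality, and it is currently regarded internationally as the most effective data assimilation scheme. Its computation, however, is very complex: the program occupies a large amount of memory and takes a long time to run. GRAPES-4DVAR, the four-dimensional variational assimilation system of GRAPES (Global/Regional Assimilation and Prediction System), China's independently developed new-generation numerical weather prediction system, shares these characteristics: heavy computation, high memory use, and long run times. The key question and focus of this thesis is how to improve the algorithms or code of the GRAPES limited-area model so as to raise its run-time efficiency and parallel scalability. The thesis improves efficiency and scalability mainly by optimizing the program code, improving the adjoint algorithm, and introducing hybrid parallelism, and it studies and implements effective methods for reducing run time. The main contents are as follows: (1) The code of the GRAPES limited-area model is tuned and optimized. Methods for raising the utilization of memory-system resources and the efficiency of the processor's functional units are studied, and bottlenecks in the code that significantly affect performance are removed. With an effective implementation, the nonlinear model runs about 25% more efficiently. (2) A new computation method for the adjoint model is proposed, the limit checkpoint storage technique. It trades about 30% more memory for a 100% improvement in program performance. (3) A memory data management technique that supports both first-in-first-out and first-in-last-out access to data blocks is proposed, and the corresponding structure, the nested multi-chained stack, is implemented. (4) To address the limited scalability of parallel reads and writes to external storage in the GRAPES adjoint model, a performance-enhancing improvement is proposed. A method of managing a large volume of intermediate data in limited memory replaces the external-storage reads and writes that hurt performance; when scaling beyond 128 processors, this can reduce wall-clock time by 70%. (5) Hybrid parallel computing is implemented for GRAPES. Based on the cluster architectures prevalent today, the pure-MPI GRAPES computation is replaced by a hybrid scheme that uses OpenMP thread-level parallelism within a node and MPI process-level parallelism between nodes. The conclusion is that when pure-MPI parallel efficiency falls below 90%, the hybrid approach yields an improvement of roughly 5% to 10%.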
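Item (2) concerns the trade-off between memory and recomputation in an adjoint run. The abstract does not describe the thesis's own technique in detail, so the following is only a generic sketch of the two textbook strategies it is positioned between: store-all (keep every forward state) and checkpointing (keep every `interval`-th state and recompute the rest). The toy model `step`, its adjoint `step_adjoint`, and all function names are assumptions made for illustration and are not taken from GRAPES.

```python
import math

def step(x):
    """One step of a toy nonlinear model (a stand-in for a forecast model)."""
    return x + 0.1 * math.sin(x)

def step_adjoint(x, lam):
    """Adjoint of `step` linearised about state x: multiply by d(step)/dx."""
    return lam * (1.0 + 0.1 * math.cos(x))

def adjoint_store_all(x0, n):
    """Store-all: keep every forward state, so the reverse sweep recomputes nothing."""
    states = []
    x = x0
    for _ in range(n):
        states.append(x)          # state x_k is needed by the adjoint of step k
        x = step(x)
    lam = 1.0                     # seed: d(x_n)/d(x_n)
    for x_k in reversed(states):
        lam = step_adjoint(x_k, lam)
    return lam, n, len(states)    # gradient, forward steps, peak stored states

def adjoint_checkpointed(x0, n, interval):
    """Checkpointing: keep every `interval`-th state, recompute each segment on demand."""
    checkpoints = {}
    x = x0
    forward_steps = 0
    for k in range(n):
        if k % interval == 0:
            checkpoints[k] = x
        x = step(x)
        forward_steps += 1
    lam = 1.0
    peak = len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        end = min(start + interval, n)
        segment = [checkpoints[start]]            # states x_start .. x_(end-1)
        for _ in range(end - start - 1):
            segment.append(step(segment[-1]))     # recomputation cost
            forward_steps += 1
        peak = max(peak, len(checkpoints) + len(segment))
        for x_k in reversed(segment):
            lam = step_adjoint(x_k, lam)
    return lam, forward_steps, peak

g_all, f_all, m_all = adjoint_store_all(0.5, 100)
g_chk, f_chk, m_chk = adjoint_checkpointed(0.5, 100, 10)
print(g_all, f_all, m_all)
print(g_chk, f_chk, m_chk)
```

For 100 steps and an interval of 10, both variants return the same gradient; the checkpointed run holds at most 20 states instead of 100 but performs 190 forward steps instead of 100. Any scheme between the two extremes chooses a different point on this memory-versus-recomputation curve.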

【Abstract】 Four-dimensional variational assimilation, one of the key technologies of numerical weather prediction, takes the time evolution of observational data into account to improve the quality of the initial data, which determines the quality of the forecast. Because it can assimilate observations from different times, different regions, and of different types, it is currently considered internationally to be the most effective data assimilation scheme. Its calculation, however, is very complicated and demands a large amount of computation and run time. GRAPES-4DVAR, the four-dimensional variational assimilation system of GRAPES (Global/Regional Assimilation and Prediction System), a new-generation numerical weather prediction system developed independently in China, has similar characteristics: a large amount of computation, high memory use, and long run times. How to reduce the elapsed time by improving code efficiency, changing the algorithm, and enhancing parallel scalability is the key question and focus of this thesis. The thesis concentrates on how to gain performance from optimized code, how to analyze quantitatively the impact of alternative approaches on program performance, and how to use a hybrid parallel mode to increase the scalability of parallel computing. The main work is summarized as follows: (1) The code of the GRAPES regional model was adjusted and optimized. The work focuses on improving the performance of the memory system and of the basic components of the processor; it analyzes the causes of pipeline stalls and removes the bottlenecks in the code that significantly affect run-time performance. Through these changes, the nonlinear model gained about 25% in performance. (2) A limit solution between the checkpointing strategy and the store-all strategy is put forward. It trades an increase of about 30% in memory cost for a 100% performance gain. (3) A technique is put forward for managing data blocks in memory that supports both first-in-first-out and first-in-last-out access. The nested multi-chained stack was implemented and meets the needs of the improved adjoint algorithm well. (4) The input/output bottleneck in parallel performance was addressed. By comparing the maximum number of iterations the adjoint model can run with the number actually required, the thesis determines which method gives the best performance while satisfying actual needs for a fixed problem size and a fixed number of processors. It also shows that using a limited amount of memory in place of reads and writes to external storage reduces wall-clock time by up to 70% when more than 128 processors are used. (5) A hybrid mode of parallel computation was implemented. On the cluster architectures common today, using OpenMP thread-level parallelism within a node and MPI message passing between nodes shows excellent parallel performance and scalability. The conclusion is that when the parallel efficiency of the pure MPI mode drops below 90%, the hybrid parallel mode can raise it by 5% to 10%. Finally, the advantages and disadvantages of statically dividing data among threads are analyzed.
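Item (3) names a block store that serves both first-in-first-out and first-in-last-out access, but the abstract does not say how the nested multi-chained stack is laid out. The sketch below is therefore a guess at the general idea, not the thesis's structure: an outer stack with one entry per time step, each entry holding a chain of data blocks that can be read back in either order. The class `NestedBlockStack` and its methods are hypothetical names introduced for illustration.

```python
from collections import deque

class NestedBlockStack:
    """Hypothetical two-level block store (an assumption, not the thesis design).

    Outer level: a stack of time steps, popped newest-first as an adjoint
    sweep walks backwards in time. Inner level: a chain of blocks per step.
    """

    def __init__(self):
        self._steps = []                  # outer stack, newest step last

    def open_step(self):
        """Begin a new time step with an empty chain of blocks."""
        self._steps.append(deque())

    def put(self, block):
        """Append a block to the chain of the current (newest) step."""
        self._steps[-1].append(block)

    def pop_step(self, fifo=True):
        """Remove the newest step and return its blocks.

        fifo=True returns blocks in the order written (first in, first out);
        fifo=False returns them reversed (first in, last out).
        """
        chain = self._steps.pop()
        return list(chain) if fifo else list(reversed(chain))

    def __len__(self):
        return len(self._steps)

store = NestedBlockStack()
for t in range(2):                        # forward sweep writes blocks per step
    store.open_step()
    store.put(f"u{t}")
    store.put(f"v{t}")

print(store.pop_step())                   # newest step, blocks in write order
print(store.pop_step(fifo=False))         # older step, blocks in reverse order
```

Keeping the intermediate blocks in a memory-resident structure of this kind is one way to avoid round trips to external storage during the backward sweep, which is the bottleneck item (4) describes.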
