GRAPESæœ‰é™åŒºåŸŸåˆ‡çº¿/ä¼´éšæ¨¡å¼é«˜æ•ˆå¹¶è¡Œç®—æ³•ç ”ç©¶

Studies on High Performance Parallel Computing of GRAPESâ€™ Tangent & Adjoint Model

【Author】 Ren Disheng (任迪生)

【Author information】 National University of Defense Technology, Computer Science and Technology, 2010, Master's degree

【Abstract】 Four-dimensional variational (4D-Var) data assimilation is one of the key technologies of numerical weather prediction. It blends the time-varying information carried by observations from different times, regions, and types into the initial field, which improves forecast quality, and it is currently regarded internationally as the most effective data assimilation scheme. Its computation, however, is very complex: the program occupies a large amount of memory and takes a long time to run. GRAPES-4DVAR, the 4D-Var system of GRAPES (Global/Regional Assimilation and Prediction System), the new-generation numerical weather prediction system developed independently in China, shares these traits: heavy computation, high memory use, and long run times. The focus of this thesis is how to improve the algorithms and code of the GRAPES regional model so as to raise its run-time efficiency and parallel scalability. The work reduces elapsed time along three routes: optimizing the program code, improving the adjoint algorithm (with the performance impact of each alternative analyzed quantitatively), and introducing hybrid parallelism. The main contributions are as follows:
(1) The code of the GRAPES regional model was adjusted and optimized. The work studied how to raise the utilization of the memory system and the efficiency of the processor's arithmetic units, analyzed the causes of pipeline stalls, and removed the bottlenecks in the code that significantly affected performance. As a result, the nonlinear model runs about 25% more efficiently.
(2) A new method for computing the adjoint model was proposed: the limit checkpointing storage technique, a limiting case between the checkpointing strategy and the store-all strategy. It trades about 30% more memory for a 100% improvement in run-time performance.
(3) An in-memory data management technique was proposed that supports both first-in-first-out and first-in-last-out access to data blocks. The corresponding structure, the nested multi-chained stack, was implemented to meet the needs of the improved adjoint algorithm.
(4) To address the limited scalability of parallel reads and writes to external storage in the GRAPES adjoint model, an improved scheme was proposed. By comparing the maximum number of iterations the adjoint model can run with the number actually required, the work determined which method delivers the best performance while meeting actual needs for a fixed problem size and a fixed number of processors. Managing the large volume of intermediate data in a limited amount of memory replaces the performance-limiting external storage I/O; when the number of processors exceeds 128, wall-clock time falls by up to 70%.
(5) Hybrid parallel computation was implemented for GRAPES. Targeting the cluster architectures in common use, thread-level parallelism with OpenMP within a node was combined with process-level parallelism with MPI between nodes, replacing the pure-MPI approach. When the parallel efficiency of pure MPI falls below 90%, the hybrid mode improves it by roughly 5% to 10%. Finally, the advantages and disadvantages of static data partitioning among threads were analyzed.
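
The trade-off in item (2) between the checkpointing and store-all strategies can be illustrated with a small sketch. This is not the thesis's limit checkpointing technique, whose details the abstract does not give; it is a generic comparison on a toy scalar model. The model (`step`), the function names, and the checkpoint `interval` are all assumptions made for illustration. The adjoint sweep runs backward in time and needs the forward state at every step, so the states must either all be kept in memory (store-all) or be recomputed from sparse checkpoints.

```python
import math

DT = 0.1  # assumed time step of the toy model


def step(x):
    """One nonlinear forward step (toy stand-in for a model time step)."""
    return x + DT * math.sin(x)


def step_adjoint(x, adj):
    """Adjoint of `step`, linearized about the forward state x."""
    return (1.0 + DT * math.cos(x)) * adj


def gradient_store_all(x0, n):
    """Store-all: keep every forward state, no recomputation."""
    states = []
    x = x0
    for _ in range(n):
        states.append(x)          # states x_0 .. x_{n-1}
        x = step(x)
    adj = 1.0                     # seed: d x_n / d x_n
    for xk in reversed(states):   # backward (adjoint) sweep
        adj = step_adjoint(xk, adj)
    return adj, {"forward_steps": n, "stored_states": len(states)}


def gradient_checkpointed(x0, n, interval):
    """Checkpointing: keep every `interval`-th state, recompute the rest."""
    checkpoints = {}
    x = x0
    forward_steps = 0
    for k in range(n):
        if k % interval == 0:
            checkpoints[k] = x
        x = step(x)
        forward_steps += 1

    adj = 1.0
    peak = len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        end = min(start + interval, n)
        # Recompute the states of this segment from its checkpoint.
        segment = [checkpoints[start]]
        for _ in range(end - start - 1):
            segment.append(step(segment[-1]))
            forward_steps += 1
        # The first entry of `segment` is the checkpoint itself.
        peak = max(peak, len(checkpoints) + len(segment) - 1)
        for xk in reversed(segment):
            adj = step_adjoint(xk, adj)
    return adj, {"forward_steps": forward_steps, "stored_states": peak}


if __name__ == "__main__":
    print(gradient_store_all(0.7, 12))
    print(gradient_checkpointed(0.7, 12, 4))
```

For 12 steps and an interval of 4, both functions return the same gradient; store-all holds 12 states and takes 12 forward steps, while the checkpointed version holds at most 6 states but takes 21 forward steps. These counts describe only this toy setup and say nothing about the 30% memory and 100% performance figures reported in the abstract.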

【Key words】 GRAPES; Regional; Tangent model; Adjoint model; Parallel computing

【Online publication submitted by】 National University of Defense Technology

【Classification number】 TP301.6
【Download count】 55