
Research on Key System-Level Techniques for Optimizing MPI Communication on Multicore Systems

【Author】 Liu Zhiqiang

【Advisor】 Song Junqiang

【Author Information】 National University of Defense Technology, Computer Science and Technology, 2011, Ph.D.

【Abstract】 Since the 1990s, the Message Passing Interface (MPI) has been the de facto standard for developing parallel programs in high-performance computing (HPC). In MPI-based programs, communication performance usually plays a key role in overall performance, so optimizing MPI communication is of great importance. With the rapid development of multicore technology in recent years, MPI communication urgently needs optimizations tailored to the characteristics of multicore systems. Existing work, however, largely remains within process-based MPI communication techniques, which generally suffer from high processing overhead and heavy memory-access demands that limit further performance gains. Addressing the characteristics of multicore systems and the shortcomings of existing methods, this dissertation starts from thread-based (threaded) MPI communication techniques, systematically studies the key techniques for optimizing MPI communication on multicore systems, and explores a more efficient message-passing interface for shared-memory systems. The main contributions are as follows:

1. An efficient threaded-MPI runtime technique for multicore systems, the MPI Communication Accelerator (MPIActor), is proposed. Through a purpose-built interface-aggregation technique, MPIActor establishes a threaded-MPI environment on top of a conventional process-based MPI runtime. Compared with developing a threaded MPI implementation in the traditional way, building threaded-MPI support with MPIActor takes far less development effort; it is also more flexible, working across any MPI-2-compliant process-based MPI implementation and thereby inheriting its inter-node communication performance. OSU_LATENCY benchmark results on a dual-socket Nehalem-EP system show that, for messages of 8 KB to 2 MB, MVAPICH2 1.4 with MPIActor improves intra-socket communication performance by at least 37% (up to 114%) and inter-socket performance by at least 30% (up to 144%); Open MPI 1.5 with MPIActor improves intra-socket performance by at least 48% (up to 106%) and inter-socket performance by at least 46% (up to 98%).

2. For collective communication on multicore systems, a new hierarchical collective algorithm framework (MPIActor Hierarchical Collective Algorithm Framework, MAHCAF) and a set of efficient threaded-MPI intra-node collective algorithms are proposed on top of MPIActor. MAHCAF builds hierarchical collective algorithms with the template-method design pattern: the intra-node and inter-node collective phases are the template's extensible steps, organized in a pipelined fashion so that the concurrency between collective sub-operations is fully exploited. The intra-node algorithms, designed on threaded MPI, take full advantage of shared memory and incur lower processing cost and memory-access demand than conventional process-based algorithms. IMB experiments on a Nehalem cluster show that, compared with MVAPICH2 1.6, MAHCAF with the general intra-node algorithms brings significant gains for MPI_Bcast, MPI_Allgather, MPI_Reduce, and MPI_Allreduce under most conditions, and that adding the hierarchical segmented reduce algorithm (HSRA), designed specifically for the Nehalem architecture, improves MPI_Reduce and MPI_Allreduce further.

3. To reduce the impact of unbalanced process arrival on broadcast performance, a competitive and pipelined (CP) method is proposed based on MPIActor's particular structure. Exploiting the multiple processes that run within each node of a multicore/multiprocessor system, the method makes the earliest-arriving process in a node the leader that carries out inter-node communication, so the inter-node collective phase starts as early as possible and the average waiting time of the broadcast falls. Micro-benchmark results show that CP-optimized broadcast significantly outperforms conventional algorithms, and experiments with two real applications confirm that the CP method markedly improves broadcast performance in practice.

4. For intra-node MPI communication on multicore/multiprocessor systems, an efficient shared-memory message passing interface (SMPI) is proposed on top of MPIActor. Unlike conventional MPI, the interface lets MPI processes on the same node read a posted message in place by receiving the message's address, instead of copying the data into the receiving process, which greatly reduces memory-access overhead. Multiplying 4000×4000 matrices with 64 MPI processes on 8 nodes, a Cannon matrix-multiplication algorithm built on SMPI achieved a speedup of about 1.14 over the MPI-based version.
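As an illustration of the hierarchical scheme in contribution 2, the sketch below shows a two-level broadcast in plain C/MPI. It is a minimal sketch under assumed simplifications, not the dissertation's MPIActor code: it uses the MPI-3 call MPI_Comm_split_type to group ranks by node, takes intra-node rank 0 as each node's leader, and assumes the broadcast root is world rank 0 (and hence a leader). Leaders first broadcast among themselves (the inter-node step), then each leader re-broadcasts inside its node (the intra-node step).

    /* Hypothetical sketch, not the dissertation's MPIActor code: a two-level
     * broadcast of the kind MAHCAF organizes, written with standard MPI-3.
     * Node leaders broadcast among themselves first, then inside their nodes. */
    #include <mpi.h>
    #include <stdio.h>

    static void hierarchical_bcast(void *buf, int count, MPI_Datatype type,
                                   MPI_Comm comm)
    {
        MPI_Comm node_comm, leader_comm;
        int node_rank;

        /* Group the ranks that share a node (MPI-3). Keys of 0 preserve the
         * parent ordering, so the world root ends up as a node leader. */
        MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                            &node_comm);
        MPI_Comm_rank(node_comm, &node_rank);

        /* Intra-node rank 0 of every node joins the leader communicator. */
        MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, 0, &leader_comm);

        if (node_rank == 0) {                      /* inter-node step */
            MPI_Bcast(buf, count, type, 0, leader_comm);
            MPI_Comm_free(&leader_comm);
        }
        MPI_Bcast(buf, count, type, 0, node_comm); /* intra-node step */
        MPI_Comm_free(&node_comm);
    }

    int main(int argc, char **argv)
    {
        int value = 0, world_rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        if (world_rank == 0)
            value = 42;                            /* the root's payload */
        hierarchical_bcast(&value, 1, MPI_INT, MPI_COMM_WORLD);
        printf("rank %d received %d\n", world_rank, value);
        MPI_Finalize();
        return 0;
    }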

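The zero-copy idea behind SMPI (contribution 4) can likewise be sketched with standard MPI-3 shared-memory windows rather than the dissertation's actual interface. A minimal sketch, assuming at least two ranks share a node: the receiver obtains the address of the sender's buffer with MPI_Win_shared_query and reads the data in place, so what moves between processes is a pointer, not a copy of the payload. Fences stand in here for whatever synchronization the real interface would use to signal when a posted buffer is valid.

    /* Hypothetical sketch of the address-passing idea behind SMPI, expressed
     * with standard MPI-3 shared-memory windows instead of the dissertation's
     * interface: a rank reads a peer's buffer in place rather than receiving
     * a copy of it. Assumes at least two ranks run on the same node. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Comm node_comm;
        MPI_Win  win;
        double  *mine;
        int      rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node_comm);
        MPI_Comm_rank(node_comm, &rank);

        /* Every rank contributes one double to a node-wide shared segment. */
        MPI_Win_allocate_shared(sizeof(double), sizeof(double), MPI_INFO_NULL,
                                node_comm, &mine, &win);
        *mine = 100.0 + rank;
        MPI_Win_fence(0, win);      /* make the writes visible node-wide */

        if (rank == 1) {
            /* Query the address of rank 0's buffer and read it directly:
             * the "message" that travels is a pointer, not the payload. */
            MPI_Aint size;
            int disp_unit;
            double *peer;
            MPI_Win_shared_query(win, 0, &size, &disp_unit, &peer);
            printf("rank 1 read %.1f straight from rank 0's memory\n", *peer);
        }

        MPI_Win_fence(0, win);
        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }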
