基于多核集群的并行离散事件仿真性能优化技术研究

Research on Performance Optimization for Parallel Discrete Event Simulation on Multi-core Cluster

【Author】 陈莉丽

【Supervisor】 姚益平

【Author Information】 国防科学技术大学 (National University of Defense Technology), Computer Science and Technology, 2011, PhD

【Abstract】 As simulation applications continue to deepen, system simulation scales grow ever larger and individual model complexity keeps increasing, so the demand of simulation systems for computing resources rises steadily. How to shorten the running time of large-scale simulation systems and improve the efficiency of simulation applications is currently a hot research topic in the field of parallel discrete event simulation. With the rise of the multi-core computing revolution, general-purpose multi-core CPU parallelism is becoming the mainstream trend. The low communication latency among cores on the same multi-core CPU chip offers considerable performance potential. However, most simulation systems currently running on multi-core clusters still employ parallel simulation techniques inherited from traditional clusters; although they achieve a certain parallel speedup, they fail to fully exploit the performance potential of multi-core processors. Research on performance optimization techniques for parallel discrete event simulation on multi-core clusters is therefore of great theoretical and practical significance for fully exploiting the performance potential of multi-core processors, improving the running efficiency of simulation applications, and promoting the development of large-scale simulation applications in China.

Aiming at the problem that current parallel discrete event simulation systems cannot effectively utilize multi-core computing resources, and with the fundamental goal of further improving the running efficiency of parallel discrete event simulation applications, this dissertation studies in depth the key issues that affect the performance of parallel discrete event simulation, including time management algorithms, load balancing, shared state attribute access mechanisms, and communication optimization. The main work and contributions are as follows:

(1) A time management mechanism for parallel simulation on multi-core clusters that supports the hybrid parallel mode combining multi-core multi-threading with MPI is proposed. Simulation time synchronization is the core factor determining the running performance of parallel discrete event simulation. Multi-threaded parallel scheduling can fully exploit the low communication latency of the shared memory address space of multi-core processors, but current simulation engines lack efficient synchronization support for the parallel mode combining multi-core multi-threading with MPI. Based on an in-depth analysis of multi-threaded parallel scheduling mechanisms and distributed synchronization algorithms for parallel discrete event simulation, the dissertation proposes such a time management mechanism: it adapts the Mattern algorithm to the heterogeneous multi-threading/MPI parallel mode, manages the life cycle and state transitions of each message event with a finite state machine, and avoids lock overhead through a lock-free event state modification scheme. Experimental results show that, under reasonable run configurations, the proposed time management mechanism exhibits good parallel speedup on multi-core clusters as the number of compute nodes increases; when the inter-core interaction probability of simulation entities exceeds 40%, the speedup relative to the multi-process MPI parallel mode reaches about 1.8.

(2) An event scheduling mechanism with automatic load balancing on multi-core processors for parallel discrete event simulation systems is proposed. In parallel discrete event simulation systems, load balance affects synchronization and communication overhead and thus the running performance of the whole system. Current dynamic load balancing mechanisms for parallel simulation can hardly achieve both low event scheduling overhead and strong load balancing capability. To address this, the dissertation proposes a global scheduling mechanism based on distributed queues, which achieves dynamic load balance through global scheduling; to reduce the global scheduling overhead, it designs a distributed event queue structure and a lock-free event data structure. Experimental results show that in cases where traditional algorithms suffer from heavy rollback or fail to achieve dynamic load balance, the proposed mechanism not only incurs low event scheduling overhead but also reduces the rollback rate by 10%~80%, demonstrating good load balancing capability.

(3) A transactional-memory-based shared attribute access mechanism for parallel discrete event simulation systems in multi-core environments is proposed. Parallel discrete event simulation systems often involve a large amount of state data communication, which easily degrades simulation performance. Current shared attribute communication mechanisms mostly adopt object proxy techniques and cannot fully exploit the low communication latency of the shared memory address space of multi-core processors. To address this, the dissertation designs PDES-STM, a shared attribute access mechanism based on transactional memory. According to the characteristics of parallel discrete event simulation, the mechanism maps each simulation event to a transaction; the correctness of concurrent access to shared state attributes is guaranteed by the conflict detection and resolution mechanism of the transactional memory system. Case analysis shows that PDES-STM can effectively reduce memory overhead and the number of messages; experiments on the test platform show that the performance advantage of PDES-STM over the message-based "pull" access mechanism becomes increasingly obvious as the probability of external attribute access grows.

(4) A communication latency hiding algorithm for parallel simulation systems on multi-core clusters is proposed. Communication latency is one of the main factors restricting the scalability and performance of simulation in distributed computing environments. Current communication optimization techniques for simulating large-scale fine-grained models in parallel and distributed environments cannot hide enough communication latency. To address this, the dissertation proposes O(B+2R), an optimized B+2R latency hiding algorithm that trades computation for communication. The algorithm selects a reasonable R value according to the network transmission time and hides communication latency by appending R additional time steps of computation to the stale entity state values received, thereby obtaining more concurrency. Theoretical analysis shows that under a reasonable configuration the algorithm can achieve complete communication latency hiding. In addition, considering the respective characteristics of CPUs and GPUs, the O(B+2R) algorithm is extended to the GPU platform as the B+2(R×r) algorithm. Experimental results show that the proposed latency hiding mechanism hides more communication latency, improving performance by more than 40% under reasonable run configurations, and can be used to balance computation and communication overheads among computing resources at the various levels of a multi-core cluster.

Based on the above research results, the dissertation designs and implements a hierarchical parallel discrete event simulation framework for multi-core clusters on top of the Musik simulation engine, and comprehensively tests it on a multi-core cluster with a public opinion trend model under public emergency conditions. The results show that the adopted optimization techniques improve the overall simulation system performance by about 45% and exhibit good scalability.

【Abstract】 The current trend in processor architecture design is the integration of multiple cores on a single processor. Clusters made of such microprocessors are widely adopted by Parallel Discrete Event Simulation (PDES) developers for large-scale simulation applications. The tightly integrated processing cores in one chip, with communication latencies substantially lower than those present in conventional clusters, provide potential performance improvement, especially for fine-grained PDES. Thus, in the PDES domain, one of the research focuses is on modifying software platforms to efficiently utilize the computation resources of multi-core processors. Considering the characteristics of multi-core clusters and parallel discrete event simulation systems, this dissertation investigates solutions to improve the performance of large-scale or complex simulation programs from the various factors that may affect the performance of parallel discrete event simulation, including event scheduling, shared-attribute access, and communication optimization. The contributions of this dissertation are as follows:

Firstly, a synchronization algorithm that combines the multi-threaded parallel mode with MPI is proposed. Experimental results show that the multi-threaded parallel scheme outperforms the multi-process parallel mode in most cases, but currently available simulation engines lack support for combining the multi-threaded parallel mode with MPI in a distributed computing environment, or the relevant technologies are not mature. In this dissertation, the compatibility of the multi-threaded parallel mode with cluster computing platforms is considered, and a time management mechanism is designed that combines multi-threaded parallel scheduling on each machine with MPI communication among all the machines in the cluster. A group of tests has been performed, and the results show that this hybrid mechanism runs very well on multi-core clusters.

Secondly, a global scheduling mechanism based on a distributed event queue is proposed to improve the performance of Time Warp systems on multi-core platforms. Current dynamic load balancing technologies cannot reach the twin goals of good balance and low event scheduling overhead. Taking advantage of the shared memory address space and low communication latency of the multi-core architecture, this dissertation proposes a global scheduling mechanism based on a distributed event queue; its specially designed data structures and algorithms greatly reduce the cost of lock operations. Compared with the local scheduling mechanism over distributed event queues, the experimental results show that the distributed-queue global scheduling mechanism can effectively decrease the rollback rate and balance the workload at a low event scheduling cost for Time Warp systems on multi-core platforms.

Thirdly, a shared attribute/state access mechanism based on transactional memory is proposed to make it easier for users to model their systems and to improve the performance of Time Warp systems on multi-core platforms. This mechanism implements transparent access to shared attributes with a simple API and provides more powerful modeling capability for agent-based simulation applications. A case study is given to demonstrate how to use this mechanism and what merits it brings. Theoretical analysis shows that this access mechanism not only eases the attribute publishing/subscribing burden on simulation model developers but also reduces the number of messages. The experimental results show that the STM-based shared attribute access mechanism prominently outperforms the conventional "pull" mechanism on multi-core platforms.

Fourthly, a more effective latency hiding mechanism for the parallelization of agent-based model simulations (ABMS) with millions of agents is proposed. The existing B+2R latency hiding algorithm only hides part of the communication latency. This dissertation proposes a new latency hiding algorithm whose principle is to trade redundant computation for communication. An analytical model for the algorithm is given, and theoretical analysis shows that it can hide all the communication latency when a proper R is selected. In addition, a B+2(R×r) algorithm, which combines the new and the old B+2R algorithms, is designed to make the new B+2R algorithm effective on GPU platforms. The experimental results indicate the benefits of the new B+2R latency hiding scheme, delivering more than 40% improvement in runtime for certain benchmark ABMS application scenarios with several billion agents.

Finally, considerable performance optimization work has been done on a simulation application that forecasts the trend of public opinion under critical conditions, in order to reduce memory overhead and achieve scalability. The experimental results demonstrate that the supported system scale increases by one order of magnitude on a single multi-core machine, and good scalability is shown on the multi-core cluster.
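The first contribution's lock-free event life-cycle management can be pictured with a small sketch. The abstract only states that a finite state machine tracks each message event's states and that state changes avoid locks; the concrete states and C++ types below are assumptions made purely for illustration, not the dissertation's actual implementation.

```cpp
// Illustrative sketch only: a finite state machine over an event's life cycle
// with lock-free state transitions; the state names are hypothetical.
#include <atomic>
#include <cstdio>

enum EventState { UNSENT, IN_TRANSIT, UNPROCESSED, PROCESSED, COMMITTED, CANCELLED };

struct SimEvent {
    double timestamp;
    std::atomic<EventState> state{UNSENT};
};

// Lock-free transition: succeeds only if the event is still in `expected`,
// otherwise another scheduler thread (or an MPI receive handler) won the race.
bool try_transition(SimEvent& ev, EventState expected, EventState next) {
    return ev.state.compare_exchange_strong(expected, next,
                                            std::memory_order_acq_rel,
                                            std::memory_order_acquire);
}

int main() {
    SimEvent ev{10.0};
    try_transition(ev, UNSENT, IN_TRANSIT);      // sender thread hands event to MPI
    try_transition(ev, IN_TRANSIT, UNPROCESSED); // receiver thread enqueues it
    if (try_transition(ev, UNPROCESSED, CANCELLED)) {
        std::printf("event at t=%.1f cancelled before processing\n", ev.timestamp);
    }
    return 0;
}
```

Because the transition is a single compare-and-swap, several threads can race to process or cancel the same event and exactly one of them wins, which is the property the abstract attributes to its lock-free event state modification scheme.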
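For the second contribution, the abstract describes a distributed (per-core) event queue that is scheduled globally, so an idle core can pick up the earliest pending event regardless of which core produced it, which is what balances the load. A minimal sketch of that idea follows; the dissertation uses lock-free event structures, whereas this simplified version guards each local queue with a mutex, and all names are hypothetical.

```cpp
// Illustrative sketch only: per-core event queues with a global scheduling step.
#include <mutex>
#include <optional>
#include <queue>
#include <vector>

struct Event { double timestamp; int target_lp; };
struct ByTime { bool operator()(const Event& a, const Event& b) const { return a.timestamp > b.timestamp; } };

class DistributedGlobalQueue {
    struct LocalQueue {
        std::mutex m;
        std::priority_queue<Event, std::vector<Event>, ByTime> pq;  // min-heap on timestamp
    };
    std::vector<LocalQueue> locals_;
public:
    explicit DistributedGlobalQueue(int cores) : locals_(cores) {}

    void push(int core, const Event& e) {
        std::lock_guard<std::mutex> g(locals_[core].m);
        locals_[core].pq.push(e);
    }

    // Global scheduling step: an idle worker scans every local queue and
    // claims the globally earliest event, regardless of which core produced it.
    std::optional<Event> pop_global_min() {
        int best = -1; double best_ts = 0.0;
        for (int i = 0; i < (int)locals_.size(); ++i) {
            std::lock_guard<std::mutex> g(locals_[i].m);
            if (!locals_[i].pq.empty() &&
                (best < 0 || locals_[i].pq.top().timestamp < best_ts)) {
                best = i; best_ts = locals_[i].pq.top().timestamp;
            }
        }
        if (best < 0) return std::nullopt;
        std::lock_guard<std::mutex> g(locals_[best].m);
        if (locals_[best].pq.empty()) return std::nullopt;  // raced with another worker
        Event e = locals_[best].pq.top();
        locals_[best].pq.pop();
        return e;
    }
};
```

Processing events in globally increasing timestamp order is what keeps the optimistic (Time Warp) cores from drifting apart, which is why the abstract reports a lower rollback rate under this scheme.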
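The third contribution, PDES-STM, maps each simulation event to a transaction whose reads and writes of shared attributes are checked by conflict detection. The dissertation does not publish its API, so the minimal Transaction/SharedAttr interface below is invented only to illustrate the event-as-transaction pattern.

```cpp
// Illustrative sketch only: each simulation event runs as a transaction with
// version-based conflict detection; the API is hypothetical.
#include <atomic>
#include <functional>
#include <unordered_map>

struct SharedAttr {
    std::atomic<long> version{0};
    std::atomic<double> value{0.0};
};

class Transaction {
    std::unordered_map<SharedAttr*, long> read_set_;     // attribute -> version seen
    std::unordered_map<SharedAttr*, double> write_set_;  // deferred writes
public:
    double read(SharedAttr& a) {
        auto w = write_set_.find(&a);
        if (w != write_set_.end()) return w->second;      // read our own pending write
        read_set_.emplace(&a, a.version.load());
        return a.value.load();
    }
    void write(SharedAttr& a, double v) { write_set_[&a] = v; }

    // Conflict detection: commit only if nothing we read changed underneath us.
    // NOTE: a real STM would make validation and write-back atomic (e.g. by
    // locking the write set first); that is omitted here for brevity.
    bool commit() {
        for (auto& [attr, ver] : read_set_)
            if (attr->version.load() != ver) return false;  // conflict -> abort
        for (auto& [attr, v] : write_set_) {
            attr->value.store(v);
            attr->version.fetch_add(1);
        }
        return true;
    }
};

// Each simulation event handler runs as a transaction and is retried on abort.
void execute_event_as_transaction(const std::function<void(Transaction&)>& handler) {
    for (;;) {
        Transaction tx;
        handler(tx);
        if (tx.commit()) return;
    }
}
```

This is the sense in which the abstract says the transactional memory system guarantees correctness: concurrent events touching disjoint attributes commit in parallel, and only genuinely conflicting events pay the cost of re-execution, rather than every attribute access being turned into a message.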
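The fourth contribution trades computation for communication: boundary state is exchanged only every R steps (with R chosen from the network transmission time), and the neighbor state that arrives R steps stale is caught up with R redundant local update steps while the transfer is in flight. The skeleton below sketches that schedule under those assumptions; the exchange and update functions are stubs, since the actual model and MPI code are not given in the abstract.

```cpp
// Illustrative sketch only: overlap R steps of useful work with the halo
// exchange, then spend R redundant steps catching up the stale neighbor state.
#include <cstdio>
#include <vector>

using Halo = std::vector<double>;

void start_async_halo_exchange(const Halo&) {}               // e.g. MPI_Isend/MPI_Irecv
Halo finish_async_halo_exchange() { return Halo(16, 0.0); }  // arrives R steps stale
void update_interior_one_step(std::vector<double>&) {}        // local agent/cell update
void update_halo_one_step(Halo&) {}                           // redundant catch-up update

void simulate(int total_steps, int R) {
    std::vector<double> interior(1024, 0.0);
    Halo halo(16, 0.0);
    for (int step = 0; step < total_steps; step += R) {
        start_async_halo_exchange(halo);        // send our boundary, non-blocking
        for (int r = 0; r < R; ++r)
            update_interior_one_step(interior); // useful work hides the latency
        Halo stale = finish_async_halo_exchange();
        for (int r = 0; r < R; ++r)
            update_halo_one_step(stale);        // R redundant steps bring the
                                                // stale neighbor state up to date
        halo = stale;
    }
    std::printf("finished %d steps with R=%d\n", total_steps, R);
}

int main() { simulate(100, 5); return 0; }
```

The "complete latency hiding" claim in the abstract corresponds to choosing R large enough that the R interior steps take at least as long as the message transfer, at the price of the extra catch-up computation, which is the "computation for communication" trade-off it names.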
