面向嵌入式多核系统的并行程序优化技术研究

Research on Optimization of Parallel Programs for Embedded Multicore System

【Author】 Wang Qing (王庆)

【Advisor】 Ji Zhenzhou (季振洲)

【Author information】 Harbin Institute of Technology, Computer System Architecture, 2013, Ph.D.

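【Abstract (translated from the Chinese)】 Embedded system design has traditionally made low power consumption its first goal. As compute-intensive embedded applications keep expanding and the demands on both performance and power keep rising, embedded systems have recently shifted toward high-performance embedded computing. For increasingly complex embedded applications, the chip multiprocessor (CMP) has become an effective solution for high-performance embedded computing. A CMP uses several cores of moderate performance to improve energy efficiency, and relies on high task-level or thread-level parallelism to raise the performance of the processor as a whole. In the embedded field, making full use of the high performance and low power that CMPs offer poses a major challenge for running parallel programs on embedded multicore platforms. Low power consumption and high performance are core characteristics of embedded multicore systems. If on-chip multicore technology cannot be exploited effectively to parallelize applications, the performance of the applications built on it will suffer and resources and energy will be wasted, which is unacceptable in a field where resources and energy are tightly constrained. Designing and implementing high-performance, low-power parallel computing methods for embedded applications is therefore one of the core problems to be solved before embedded multicore systems can be widely adopted.

For these reasons, this thesis analyzes in depth the performance and power optimization methods used in current high-performance embedded computing, and concentrates on parallel compiler design and parallel program optimization for embedded multicore platforms. The main work and technical innovations are summarized below.

First, an OpenMP parallel compilation method for embedded multicore platforms is proposed, and on this basis the OpenMP parallel directives are extended to implement OpenMP parallel optimization. Taking the embedded operating system eCos as an example, a source-to-source parallel compiler based on the shared-memory parallel programming model OpenMP is designed and implemented for the embedded multicore platform. An OpenMP parallel loop optimization algorithm based on the hierarchical memory structure of embedded multicore processors is proposed, and the OpenMP loop directives are extended with a tiling directive, improving parallel programming efficiency and parallel performance on embedded multicore platforms. Experiments verify the effectiveness and application performance of the extended directive on the embedded multicore platform.

Second, a runtime dynamic optimization method for parallel applications on embedded multicore systems is proposed. In multithreaded parallel programs limited by bandwidth, data contention, or improper data synchronization, increasing the number of threads can markedly reduce performance. To address this, the thesis proposes a performance analysis model based on the structure of the parallel program. The model divides the parallel region of a program into a fully parallel part and a critical-section part, so that the thread count giving the best performance can be determined dynamically at runtime. To reduce the performance and energy wasted by load imbalance among threads, the thesis also proposes a dynamic scheduling method built on this runtime framework. The method dynamically selects a scheduling scheme for each parallel loop and adjusts the scheduling chunk size according to the load of the threads to balance performance. The runtime dynamic optimization framework is validated and evaluated on an embedded multicore platform. The experiments show that the framework and its runtime optimization methods are well suited to embedded multicore systems and improve the performance of parallel applications.

Third, a low-power execution model oriented to the load of parallel threads is proposed. To avoid the energy wasted by load imbalance when parallel applications run on embedded multicore platforms, the thesis first analyzes the execution load of parallel threads and, in combination with dynamic voltage and frequency scaling (DVFS), proposes and implements a low-power execution model. It then proposes and implements a thread execution frequency control algorithm based on this model, so that the runtime system can dynamically adjust the operating frequency according to the load imbalance of the parallel threads and lower the energy consumed by the program without affecting its performance. The model is validated on a simulated embedded multicore platform. The experiments show that the low-power execution model saves an average of 13% of the energy consumed by parallel applications on the embedded multicore platform at a performance loss of 2.2%.

Fourth, a feedback-based DVFS method driven by energy efficiency is proposed. In line with the characteristics of parallel applications, the method considers the performance and the energy consumption of a parallel program together and uses the energy-delay product (EDP) as the measure of energy efficiency. Through a feedback-based voltage and frequency control framework, it finds the best DVFS level for each core early in the execution of the parallel program, reducing energy consumption and improving energy efficiency without affecting program performance. The feedback DVFS method is validated and evaluated by experiments.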

【Abstract】 Traditional embedded system design treats low power consumption as its primary objective. However, as computation-intensive embedded applications continue to expand and their performance and power requirements keep increasing, embedded systems have recently turned to high-performance embedded computing (HPEC). For increasingly complex embedded applications, the chip multiprocessor (CMP) has become an effective solution for high-performance embedded computing. A CMP combines several moderate-performance processing cores to improve energy efficiency, and exploits high task-level or thread-level parallelism to improve overall performance. In the embedded computing field, taking full advantage of the high performance and low power that CMPs offer poses a great challenge for parallel applications on embedded multicore platforms.

Low power consumption and high performance are core characteristics of embedded multicore systems. If the on-chip multicore technology cannot be used effectively to parallelize applications, the performance of the applications built on it will suffer and resources and energy will be wasted. This situation is intolerable in the embedded field, where resources and energy consumption are critical. Therefore, designing and implementing high-performance, low-power parallel computing methods for embedded applications is one of the core issues that determine whether embedded multicore systems can be widely used.

For these reasons, this thesis analyzes current high-performance embedded computing in depth and focuses on parallel compiler design and parallel optimization methods for embedded multicore platforms. The main contributions and technical innovations are as follows.

First, an OpenMP parallel compiler framework for embedded multicore platforms is proposed, and on this basis the OpenMP parallel directives are extended to optimize OpenMP parallel programs. The compiler is a source-to-source compiler for embedded multicore platforms based on the shared-memory parallel programming model OpenMP, and it is designed and implemented on the eCos embedded operating system. An optimization algorithm for OpenMP parallel loops based on the hierarchical memory structure of embedded multicore processors is proposed, and the OpenMP loop directives are extended with a tiling directive for the embedded multicore platform. The availability and performance of the extended directive are verified by experiments.

Second, a runtime dynamic optimization framework for parallel applications on embedded multicore systems is proposed. For multithreaded parallel programs limited by bandwidth, data contention, or improper data synchronization, continuing to increase the number of running threads can degrade performance. This thesis presents a performance analysis model based on the structure of the parallel program. The model divides the parallel regions of a program into fully parallel sections and critical sections, so that the number of threads giving the best performance can be determined by dynamic analysis at runtime. To reduce the performance and energy wasted by load imbalance among threads, this thesis also proposes a dynamic scheduling method based on the runtime framework. The method dynamically selects a suitable scheduling scheme for parallel loops and adjusts the scheduling chunk size according to the load of the threads to balance performance. The runtime optimization framework is validated and evaluated on an embedded multicore platform. The experiments show that the framework is well suited to embedded multicore systems and improves the performance of parallel applications.

Third, a low-power execution model based on the load imbalance of parallel threads is proposed. To avoid the energy wasted on embedded multicore platforms by load imbalance among parallel threads, this thesis first analyzes the execution load of parallel threads and, combining this analysis with dynamic voltage and frequency scaling (DVFS), proposes a low-power model for multithreaded execution. It then proposes an algorithm, based on this model, for controlling the frequency at which threads execute, so that the runtime system can dynamically adjust the operating frequency according to the load imbalance among the parallel threads. The algorithm reduces energy consumption without affecting the performance of parallel programs. Finally, the model is validated on a simulated embedded multicore platform. The experiments show that the proposed low-power execution model saves an average of 13% of the energy consumed by parallel applications on the embedded multicore platform at a performance loss of 2.2%.

Fourth, this thesis proposes a feedback DVFS method based on energy efficiency. According to the characteristics of parallel applications, a feedback framework that guides DVFS by energy efficiency is implemented. The method considers performance and energy consumption together, takes the energy-delay product (EDP) as its main metric, and determines the per-core DVFS level early in the execution of a parallel program. Without affecting application performance, the method reduces energy consumption and improves energy efficiency. Finally, the feedback DVFS method is validated and evaluated by experiments.
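
As a rough illustration of the EDP criterion named in the fourth contribution, the following Python sketch picks, from a set of trial measurements, the DVFS level with the lowest energy-delay product. It is not the thesis's implementation: the `Sample` type, the function names, and the sample numbers are all hypothetical, and a real feedback framework would gather such measurements per core at runtime.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One trial measurement at a given DVFS level (hypothetical layout)."""
    level: int        # index of the DVFS level that was tried
    energy_j: float   # energy consumed during the trial window, in joules
    delay_s: float    # execution time of the trial window, in seconds


def edp(sample: Sample) -> float:
    """Energy-delay product: lower means better energy efficiency."""
    return sample.energy_j * sample.delay_s


def pick_level(samples: list[Sample]) -> int:
    """Return the DVFS level whose trial had the lowest EDP."""
    if not samples:
        raise ValueError("at least one sample is required")
    return min(samples, key=edp).level


# Made-up measurements: lower levels save energy but run longer.
samples = [
    Sample(level=0, energy_j=4.0, delay_s=1.00),
    Sample(level=1, energy_j=3.0, delay_s=1.10),
    Sample(level=2, energy_j=2.4, delay_s=1.50),
]
best = pick_level(samples)
print(f"selected DVFS level: {best}")
```

With these invented numbers the middle level is selected, because its energy saving outweighs its slowdown while the lowest-energy level is slowed down too much. Using the product rather than energy alone is what keeps the choice from drifting to the slowest setting.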
