
多核系统中的程序性能优化研究

Study on Program Performance Optimization in Multi-Core Systems

【Author】 Zhang Qi (张琦)

【Advisor】 Xu Yinlong (许胤龙)

【Author Information】 University of Science and Technology of China, Computer Software and Theory, 2010, PhD

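【Abstract (translated from the Chinese)】 A multi-core processor integrates several processor cores on one chip and can execute multiple threads at the same time. The number of transistors on a processor chip has grown steadily and processor designs have become ever more complex, but constraints such as power consumption and manufacturing process mean that clock frequency can no longer keep rising. As processor vendors have released their own multi-core products, multi-core systems have spread quickly through work and daily life, and the number of cores per processor continues to grow. This spread poses a major challenge for application software: the computing power of each individual core has not increased, and a multi-core processor delivers high performance only by combining many cores. A traditional serial application cannot simply benefit from a larger core count. The computing power of a multi-core system is fully used only when programs are parallelized or when several programs run at the same time. This thesis studies program performance optimization methods for multi-core systems from two angles, application performance and overall system performance, and verifies their effectiveness. The main work and contributions are as follows:

1. For application performance optimization on multi-core systems, the thesis studies serial program optimization methods, parallel program design methods, and parallel program optimization methods. Designing and implementing a parallel algorithm lets a program use the computing power of several cores at once. Optimizing the parallel program lets it use those cores more fully. The methods include increasing the number of tasks to improve load balance, choosing the best affinity policy between threads and cores, designing lock-free mechanisms to reduce synchronization overhead, and eliminating cache false sharing between threads.

2. The thesis optimizes several image feature extraction programs and Markov decision process solvers. Their performance on multi-core systems improves considerably, which verifies that the optimization methods used are effective for applications on multi-core systems.

3. For overall system performance optimization, the thesis studies contention among threads for shared cache space, which harms both the system as a whole and each individual program. It proposes a method based on the working set model for analyzing and predicting thread contention on the shared cache. It finds that threads suffer severe interference and serious performance loss when the sum of the working set sizes of the co-running threads exceeds the shared cache capacity, or when the co-running threads differ widely in temporal locality strength.

4. The thesis proposes a thread scheduling method based on the working set model. A group of monitoring units obtains each thread's working set size and temporal locality strength at low cost, and a set of scheduling policies selects suitable threads to run together so that their working set data can stay in the cache. Experimental results show that this scheduling method relieves contention among threads on the shared cache and effectively improves the performance of the whole system and of each program.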

【Abstract】 A multi-core processor integrates multiple processor cores on a single chip and can run multiple threads simultaneously. Over the years, the number of transistors on a processor chip has grown constantly and processor designs have become more and more complex, but power consumption, manufacturing process, and other constraints prevent the clock frequency from increasing further. As processor manufacturers have introduced their own multi-core processors, multi-core systems have become common in work and daily life, and the number of cores per chip keeps increasing. The popularity of multi-core processors brings enormous challenges to application programs. In a multi-core processor, the computing power of each core is not enhanced. Instead, the processor combines multiple cores to provide high computing power. Traditional serial applications cannot directly improve their performance as the number of cores increases. To fully use the computing power of a multi-core processor, we must run parallel programs, or multiple programs concurrently, on the system. This thesis studies program performance optimization on multi-core systems from two perspectives: application program performance optimization and overall system performance optimization. The major work and contributions of this thesis are as follows:

1. For application program performance optimization in multi-core systems, this thesis studies serial program performance optimization methods, parallel program design methods, and parallel program performance optimization methods. To use the computing power of multiple cores, we need to design and implement parallel algorithms for programs. To use that computing power more fully, we need to optimize parallel program performance. The parallel program performance optimization methods studied in this thesis include improving load balance by increasing the number of tasks and reducing task size, choosing the best affinity policy between threads and processor cores, designing lock-free structures to reduce synchronization overhead, and eliminating false sharing of cache lines between threads.

2. This thesis applies these performance optimization methods to several image feature extraction programs and Markov decision process solvers. Experimental results show that the performance of these programs improves considerably, which verifies that the optimization methods studied in this thesis can effectively improve program performance in multi-core systems.

3. For overall system performance optimization, this thesis studies thread contention on the shared last-level cache of multi-core systems, which may reduce the performance of each thread and of the overall system. We propose a method based on the working set model to analyze and predict thread contention on the shared cache. We find that when the total working set size of the co-running threads exceeds the shared cache capacity, or when the threads differ greatly in temporal locality, the threads suffer severe interference and serious performance degradation.

4. This thesis proposes a thread scheduling method based on the working set model. In this method, we design a set of monitoring units on the shared cache, which collect the working set size and temporal locality of each thread at low overhead. We also propose an operating system thread scheduling policy that selects appropriate threads to run simultaneously so that their working sets can be kept in the shared cache. The experimental results show that the thread scheduling method based on the working set model mitigates thread contention on the shared cache and effectively improves the performance of the overall system and of each program.
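
The contention finding and the scheduling idea in items 3 and 4 can be illustrated with a small sketch. It is not the thesis's actual policy, which the abstract does not spell out. The names `ThreadProfile`, `contention_risk`, and `pick_co_runners`, the 0-to-1 locality scale, and the `max_locality_gap` threshold are all assumptions made for illustration, and the per-thread numbers stand in for what the monitoring units would report.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ThreadProfile:
    """Per-thread attributes a monitoring unit might report (hypothetical)."""
    name: str
    working_set_kb: int  # estimated working set size
    locality: float      # temporal locality strength on an assumed 0..1 scale


def contention_risk(group: List[ThreadProfile], cache_kb: int,
                    max_locality_gap: float = 0.3) -> bool:
    """Predict heavy interference for threads sharing one cache.

    Mirrors the two conditions in the abstract: combined working sets
    exceed the cache, or temporal locality differs widely. The gap
    threshold is an assumed tuning parameter.
    """
    if not group:
        return False
    total = sum(t.working_set_kb for t in group)
    gap = max(t.locality for t in group) - min(t.locality for t in group)
    return total > cache_kb or gap > max_locality_gap


def pick_co_runners(ready: List[ThreadProfile], cores: int, cache_kb: int,
                    max_locality_gap: float = 0.3) -> List[ThreadProfile]:
    """Greedily choose up to `cores` threads predicted to co-run safely."""
    chosen: List[ThreadProfile] = []
    # Consider large working sets first so smaller threads fill the rest.
    for t in sorted(ready, key=lambda t: t.working_set_kb, reverse=True):
        if len(chosen) == cores:
            break
        if not contention_risk(chosen + [t], cache_kb, max_locality_gap):
            chosen.append(t)
    if not chosen and ready:
        # Nothing fits the cache: run the smallest thread alone.
        chosen.append(min(ready, key=lambda t: t.working_set_kb))
    return chosen


if __name__ == "__main__":
    ready = [
        ThreadProfile("A", 3000, 0.80),
        ThreadProfile("B", 2500, 0.70),
        ThreadProfile("C", 1000, 0.75),
        ThreadProfile("D", 500, 0.20),
    ]
    picked = pick_co_runners(ready, cores=2, cache_kb=4096)
    print([t.name for t in picked])
```

With the sample data, A and B together would overflow the 4096 KB cache, so the sketch pairs A with C instead. A real scheduler would also need fairness and time-slice handling, which this sketch leaves out.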
