面向高效能计算的大规模资源管理技术研究与实现
Research & Implementation of Large-Scale Resource Management Technology for High Productivity Computing
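【Author】 Lu Yutong (卢宇彤);
【Advisor】 Yang Xuejun (杨学军);
【Author information】 National University of Defense Technology (国防科学技术大学), Computer Science and Technology, 2009, PhD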
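【Abstract (translated from Chinese)】 High-performance computing has moved from the single-minded pursuit of high performance toward high productivity: improving a system's delivered performance, programmability, portability, and robustness while reducing its development, operation, and maintenance costs. However, because high-performance computer systems at the hundred-teraflops and petaflops scale and beyond are very large, structurally complex, and heterogeneous in composition, achieving high productivity requires solving several challenging problems: the sustained performance of real applications is hard to raise, management efficiency is low, reliability is poor, and energy consumption is enormous. These problems place heavy demands on the performance, functionality, and scalability of the large-scale resource management system of a high-productivity computer system, which makes large-scale resource management a major technical challenge in building such systems. This thesis builds on the implementation of a large-scale resource management system for our self-developed high-performance computer system with the Scalable Shared Memory Processing (S2MP) architecture. Its main subject is efficient resource management for high-productivity large-scale parallel computer systems, and it studies resource management models, scalability techniques for resource management systems, comprehensively optimized scheduling mechanisms, fault-tolerant management with automatic recovery of user jobs, and system energy management. The main work and contributions are as follows:
1. A deep resource information model (DRIM) for large-scale parallel computer systems is proposed. It overcomes the overly coarse resource granularity and limited descriptive power of traditional resource management systems. Tailored to the characteristics of high-productivity systems, it establishes an entity model, a function model, and an application model that describe computing resources, communication resources, storage resources, and multi-mode applications more completely and accurately, and it models the relationships among resource objects. This makes management policies more effective and management functions more extensible, and it provides strong support for efficient job scheduling and resource allocation on large-scale parallel systems.
2. A dynamic hierarchical cascaded resource management structure is designed, together with a self-organizing method for creating cascade services dynamically. The communication protocol of the resource management system is optimized: a lightweight transport protocol reduces large-scale management overhead, hardware communication mechanisms deliver control messages efficiently, and global operations with combined optimization enable fast loading of large jobs. Together these address the scalability of the resource management system with system size. A component-based implementation structure supports functional extension. The system was implemented and tested on an S2MP system built from 2048 multi-core processors, and the results show good scalability.
3. A scheduling policy based on integrated priority is proposed. It weighs multiple factors among job attributes, resource attributes, and service attributes, which improves the flexibility and effectiveness of the scheduling mechanism. A variable-depth backfill policy, MC-Backfill, is designed; it dynamically adjusts the depth and frequency of backfilling according to the actual state of the queue and balances the goals of fairness and high throughput. System tests show that MC-Backfill reduces the average job waiting time and raises system throughput even when users estimate job execution times inaccurately.
4. A failure distribution model for high-performance computing systems is established, and a model of fault-tolerant job execution time based on Checkpoint/Restart is proposed. A reliability-oriented algorithm for choosing the checkpoint period and a method for selecting the optimal node set are designed, which strengthen the reliability of running jobs. Automatic job fault tolerance based on the checkpoint mechanism is implemented; it avoids manual intervention during operation, lowers the mean time to recovery, and raises system availability.
5. Combining system-level and application-level energy management techniques, whole-system energy management is studied from the viewpoint of the resource management system. A resource allocation method under energy constraints is designed for system-level node energy management. A two-level energy management model based on negative feedback is proposed for application-level management: based on the utilization of memory-access bandwidth and I/O bandwidth, it combines linear control and fuzzy control to adjust the number of threads and processes of a parallel application dynamically and shuts down idle processor cores when appropriate to save energy. Tests and analysis of the effectiveness of the energy control are also presented.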
【Abstract】 The trend in supercomputing is shifting from the pure pursuit of peak performance to the comprehensive pursuit of high productivity. A high productivity computing system (HPCS) aims to improve the programmability, portability, and robustness of the system and to reduce its development, operation, and maintenance costs. However, because next-generation teraflops and petaflops systems are very large, complex, and heterogeneous, they face several vital challenges in reaching the high productivity target: how to improve sustained performance, reliability, scalability, and flexibility, and how to significantly reduce power consumption across the overall design. These challenges have become critical research issues for the large-scale resource management system (RMS) of an HPCS. Our research is based on the implementation of the large-scale resource management system for our own high performance computer system, which has the Scalable Shared Memory Processing (S2MP) architecture. Focusing on the development of a high productivity resource management system for large-scale parallel systems, this thesis systematically investigates key techniques in efficient resource modeling, scalable RMS architecture, optimized scheduling policy, fault-tolerant job management, power management, and related areas. The main contributions of this thesis are as follows:
1. A deep resource information model (DRIM) for large-scale parallel computing systems is proposed. DRIM addresses the coarse-grained resource definitions of traditional resource management systems and provides more comprehensive and realistic resource objects. Specifically, DRIM establishes an entity model, a function model, and an application model, which accurately characterize computing resources, communication resources, storage resources, and different types of applications. DRIM also abstracts the relationships between resources, which makes the management policy more effective and the management capability more extensible. In short, DRIM provides strong support for job scheduling and resource allocation in the RMS.
2. A dynamic cascade resource management architecture is proposed, in which cascade services are created dynamically in a self-organizing manner. A lightweight, optimized transport protocol is designed to reduce management overhead and to improve the communication performance of control messages. A fast job-launching mechanism is presented that uses low-level hardware communication mechanisms and collective operations. Together these improve the scalability of the RMS. A component-based system architecture supports the functional extensibility of the RMS. MCRM, the Multiple Case Resource Management system, has been realized for the system with the S2MP architecture. Experiments on an S2MP system with 2048 processors show that MCRM scales well.
3. An integrated-priority scheduling policy is proposed. It considers various factors among the job attributes, resource attributes, and service attributes of the system, which improves the flexibility and efficiency of the scheduling mechanism. The MC-backfill scheduling policy is designed, which adjusts the backfill depth and frequency according to the status of the job queue. MC-backfill improves system throughput while still taking system fairness into account. The experimental results show that with the MC-backfill policy, even when users estimate job running times inaccurately, the average job waiting time decreases and system throughput improves.
4. A model of fault-tolerant job running time under checkpoint/restart is proposed, based on a Weibull failure distribution model for high performance computing systems. Algorithms for calculating the best checkpoint interval and for selecting the best set of processors are designed to increase the reliability of the system. An automatic job recovery mechanism has been implemented for the S2MP system. With checkpoints, jobs can recover automatically when a system failure occurs. This method avoids manual intervention, reduces the average fault recovery time, and increases the availability of the system.
5. Two approaches to power management are proposed for the large-scale RMS. As the system-level approach, an algorithm is presented for scheduling jobs and allocating resources under constraints on system energy consumption. As the application-level approach, a model of Feedback based Two-Level Power Management (FTLPM) is presented, which reduces redundant parallelism in applications to decrease energy consumption. FTLPM combines a linear control model and a fuzzy control model to control the concurrency of threads and processes according to the memory bandwidth of the multi-core processor and the I/O bandwidth of the file system. The experimental results show the effectiveness of our approaches.
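Contribution 3 above describes a backfill policy whose depth adapts to the state of the queue. The abstract does not give the MC-backfill rules, so the sketch below is not that algorithm: it is the widely used EASY-style backfill (one reservation for the head job) with a hypothetical `depth` parameter that limits how many waiting jobs behind the head are examined. The `Job` fields and the `backfill` function are names assumed here for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int        # nodes requested
    est_runtime: int  # user-supplied runtime estimate


def backfill(queue, free_nodes, running, now, depth):
    """Return the names of jobs to start at time `now`.

    queue   : waiting jobs in priority order (head first)
    running : list of (end_time, nodes) for jobs already running
    depth   : hypothetical knob, number of jobs behind the head to examine
    """
    queue = list(queue)
    running = list(running)
    started = []

    # Start jobs in priority order for as long as the head job fits.
    while queue and queue[0].nodes <= free_nodes:
        job = queue.pop(0)
        free_nodes -= job.nodes
        running.append((now + job.est_runtime, job.nodes))
        started.append(job.name)
    if not queue:
        return started

    # Reserve for the blocked head job: find the earliest time
    # ("shadow time") at which enough nodes will have been released.
    head = queue[0]
    avail = free_nodes
    shadow_time = None
    for end_time, nodes in sorted(running):
        avail += nodes
        if avail >= head.nodes:
            shadow_time = end_time
            break
    if shadow_time is None:  # head job can never fit; nothing to protect
        return started
    extra = avail - head.nodes  # nodes still spare once the head job starts

    # Examine at most `depth` jobs behind the head. A job may jump ahead
    # only if it cannot delay the head job's reserved start.
    for job in queue[1:1 + depth]:
        if job.nodes > free_nodes:
            continue
        if now + job.est_runtime <= shadow_time:
            free_nodes -= job.nodes          # done before the reservation
            started.append(job.name)
        elif job.nodes <= extra:
            free_nodes -= job.nodes          # runs on nodes the head will not need
            extra -= job.nodes
            started.append(job.name)
    return started
```

A small `depth` keeps scheduling close to strict priority order, while a large `depth` favors throughput; a policy that adapts the depth at run time would change this argument between scheduling passes.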
【Key words】 High Productivity Computing; Resource Management System (RMS); Scalability; Reliability; Power Management;
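
Contribution 4 of the abstract concerns choosing a checkpoint interval under a Weibull failure model. The thesis algorithm is not given in the abstract, so the sketch below substitutes a well-known stand-in: Young's first-order approximation, interval = sqrt(2 × checkpoint cost × MTBF), which strictly assumes exponentially distributed failures. Feeding it the mean of a Weibull distribution, as done here, is an assumption made for illustration, and the function names are hypothetical.

```python
import math


def weibull_mtbf(scale, shape):
    """Mean time between failures of a Weibull(scale, shape) distribution."""
    return scale * math.gamma(1.0 + 1.0 / shape)


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order estimate of the checkpoint interval.

    checkpoint_cost : time needed to write one checkpoint
    mtbf            : mean time between failures, in the same time unit
    Assumes exponential failures; used here only as a rough stand-in.
    """
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def suggested_interval(checkpoint_cost, scale, shape):
    """Combine the two helpers: Weibull mean plugged into Young's formula."""
    return young_interval(checkpoint_cost, weibull_mtbf(scale, shape))
```

With shape equal to 1 the Weibull distribution reduces to the exponential case, so `suggested_interval` then matches Young's formula exactly; for other shapes the result is only a rough starting point.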