
The Design and Implementation of a Job Management and Scheduling System in a Special Computing Cluster Group

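【Author】 Zhang Yi

【Advisor】 Gong Zhenghu

【Author information】 National University of Defense Technology, Computer Science and Technology, 2005, Master's thesis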

【Abstract】 High-performance computing (HPC) cluster systems offer strong parallel computing and large-scale processing capability and can serve a wide range of applications. Large-scale parallel computation is the main usage model for cluster systems, but submitting and processing large numbers of batch jobs is also a widely used model. Under this batch-job model, targeted resource allocation and scheduling policies are needed to improve cluster resource utilization, and this is the focus of the thesis.

The thesis studies job scheduling and run-time efficiency optimization in a cluster group environment at the author's institution that is dedicated to batch computational fluid dynamics (CFD) jobs. It covers four topics: job scheduling, job migration, fast system backup and restoration, and job submission management. Because the institution's computer system consists of several clusters dedicated to CFD, it is called a "special computing cluster group".

To address resource reservation planning and the efficiency of job memory use, the author designed and implemented a dedicated job scheduling system, formulated targeted scheduling algorithms, and proposed an algorithm for estimating the memory usage of running jobs whose actual memory usage changes continuously.

To address load balancing between clusters, the author and the research team built an inter-cluster job migration management system that works with the dedicated job scheduling system on each cluster. It implements a resource reservation plan across the cluster group and a mechanism that lets jobs use idle resources anywhere in the group through migration, and it includes an algorithm for comparing migration destinations.

The system uses the PVFS parallel file system to improve I/O performance on large-scale clusters. To address the key problems that reduce the availability of PVFS, the author proposed and implemented a technique for rapidly backing up and restoring the file system. The thesis also discusses the design of a web-based job submission and management system.

These results have been deployed in practice with good effect. During busy periods, system utilization rose from about 80% when the clusters were first built to over 95%, and jobs no longer wait in the queue whenever a CPU in the cluster system is idle.

【Key words】 cluster; scheduling; migration; web; file system
  • 【Classification number】 TP311.52
  • 【Citation count】 5
  • 【Download count】 162