
Research on Trace-Based Workload Modeling and Scheduling Optimization

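【Author】 Cao Zhibo (曹志波)

【Advisor】 Dong Shoubin (董守斌)

【Author information】 South China University of Technology, Computer Application Technology, 2014, PhD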

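【Abstract (translated from the Chinese)】 With the rapid development of high-performance computing and cloud computing, the workloads running on high-performance computing systems are also growing rapidly. Enlarging the system can cope with larger workloads, but large computing systems require expensive IT infrastructure and consume a great deal of energy, so this is not an efficient solution. The efficient solution is to raise the utilization of computing resources through efficient job scheduling. In high-performance computing and cloud computing, scheduling is a mapping from jobs to resources, and the jobs' use of resources produces workload traces (logs). Analyzing and modeling these traces can therefore reveal the performance characteristics of scheduling, from which optimized scheduling policies can be derived to improve scheduling performance. On this basis, this thesis studies trace-based workload modeling and scheduling optimization. Its main contributions are as follows.

(1) Drawing on related work on trace-based workload modeling, the thesis proposes a general framework for trace-based workload modeling. The framework takes raw log files as input and converts them into a standard job format. Using the standard format, traces are divided into moldable-job traces and rigid-job traces. The relevant job attributes in the trace are then analyzed according to the goal at hand, a suitable probability distribution is fitted to each attribute, and the parameters of each fit are computed from the trace. Finally, the fits for the individual attributes are combined into a complete workload model, which can generate workloads whose distribution is consistent with the real environment. The framework is applied to two real-world traces: a virtual machine (VM) CPU utilization trace and a biological gene sequencing job trace. The evaluation shows that both workload models generate workloads whose distribution is consistent with the real environment, so the framework generalizes well across different types of jobs.

(2) For moldable jobs, which use resources non-preemptively (jobs can share resources), the thesis uses the one-dimensional bin-packing problem to model the relationship between VMs and resources in VM migration, with energy consumption as a constraint. It then proposes two optimized scheduling strategies that improve on existing strategies for this model. The first strategy addresses shortcomings of existing host overload decision and VM selection algorithms. Exploiting the characteristics captured by the VM CPU utilization workload model, it proposes a new host overload decision algorithm and an improved VM selection algorithm. The new algorithms use the expected value and standard deviation of VM CPU utilization to decide whether a host is overloaded and then, in the VM selection phase, choose the VM to migrate by the minimum positive correlation coefficient. Experiments show that the best energy-performance tradeoff (EPT) value obtained by existing overload decision and selection algorithms is 3.84, whereas the proposed algorithms obtain 1.28. The second strategy addresses problems in the design of the existing VM consolidation framework by proposing a redesigned dynamic VM consolidation framework. Within the redesigned framework, the thesis again exploits the VM CPU utilization workload model to propose an SLA violation decision algorithm. This algorithm decides whether the host on which a VM runs violates its SLA, and the proposed minimum-power, maximum-utilization algorithm then migrates VMs away from hosts that do. The redesigned framework is evaluated with both the VM CPU utilization trace and the workload model built from it by the general framework. The results show that the redesigned framework is far better than the existing VM consolidation framework in energy consumption, SLA violations, and EPT.

(3) For rigid jobs, which use resources preemptively (jobs cannot share resources), the thesis uses queuing theory to model the relationship between jobs and resources. Existing scheduling policies for this model are FCFS and backfilling, and the thesis proposes an optimized strategy that improves on them. First, exploiting the job characteristics captured by the biological gene sequencing workload model, it proposes a weighted-average method for computing a user's trust value. The trust value and the runtime the user requested at submission are then used to predict the job's actual runtime; this predictor is called Trust. Trust is compared with existing methods in a simulator using trace-driven simulation. The results show that Trust performs well in prediction accuracy, average wait time, and bounded slowdown. Finally, Trust is evaluated with the biological gene sequencing trace and the workload model built from it by the general framework, and the results show that Trust outperforms existing methods in average wait time and bounded slowdown.

The thesis uses the workload characteristics obtained from workload modeling to optimize the relevant scheduling policies, and the experiments show that the optimized policies achieve significant performance gains. It also uses the workload models to evaluate the optimized policies and obtains results whose trends agree with those from the real traces, which demonstrates the feasibility of the workload models for performance evaluation and the robustness of the optimized policies. Future work could port the workload modeling approach to more realistic environments, analyze and model online traces, and verify the reliability and robustness of the optimized policies in real computing environments.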
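
The overload decision and VM selection steps in item (2) can be illustrated with a minimal Python sketch. The abstract gives no formulas, so every concrete detail here is an assumption made for illustration: the function names `is_overloaded` and `select_vm`, the threshold rule "mean plus `k` standard deviations exceeds capacity", and the reading of "minimum positive correlation coefficient" as the smallest positive Pearson correlation between a VM's CPU usage and the combined usage of the other VMs on the same host. It is not the thesis's actual algorithm.

```python
from math import sqrt
from statistics import mean, pstdev


def host_series(histories):
    """Sum the per-VM CPU usage samples into one host-level series."""
    return [sum(samples) for samples in zip(*histories.values())]


def is_overloaded(histories, capacity=1.0, k=1.0):
    """Assumed rule: overloaded if mean + k * std of host usage exceeds capacity."""
    series = host_series(histories)
    return mean(series) + k * pstdev(series) > capacity


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def select_vm(histories):
    """Assumed rule: pick the VM with the smallest positive correlation
    against the combined usage of the other VMs on the host; if no
    coefficient is positive, fall back to the smallest coefficient."""
    total = host_series(histories)
    coeffs = {}
    for name, usage in histories.items():
        rest = [t - u for t, u in zip(total, usage)]
        coeffs[name] = pearson(usage, rest)
    positive = {n: c for n, c in coeffs.items() if c > 0}
    pool = positive or coeffs
    return min(pool, key=pool.get)


# Hypothetical CPU usage samples, as fractions of host capacity.
histories = {
    "vm_a": [0.2, 0.4, 0.6, 0.8],
    "vm_b": [0.1, 0.2, 0.3, 0.4],
    "vm_c": [0.3, 0.1, 0.3, 0.1],
}
if is_overloaded(histories):
    print("migrate:", select_vm(histories))
```

In this toy data the host's usage averages 0.95 with a standard deviation of about 0.30, so the assumed rule flags it as overloaded at capacity 1.0 with `k = 1`.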

【Abstract】 With the rapid growth of high-performance computing and cloud computing, their workloads are growing rapidly too. Although a much larger computing system can handle a much larger workload, it is not an efficient solution. The efficient approach is to make better use of resources through efficient scheduling. In high-performance computing and cloud computing, scheduling is the mapping between workload and resources. Trace analysis and modeling can therefore identify performance deficiencies in scheduling, so that we can improve the scheduling and evaluate the performance of the improved scheduling. To this end, our contributions include the following.

(1) First, building on related work, we propose and construct a universal framework for workload modeling. The framework takes a raw trace as input and standardizes it, classifies the workload into rigid jobs and moldable jobs, analyzes the workload characteristics of the trace according to the requirements at hand, finds suitable distributions for these characteristics, calculates the parameters of the distributions from the trace, and finally constructs a workload model that can generate a workload with the same distribution as the trace. Second, we model a VM CPU usage rate trace and a Biological Gene Sequencing (BGS) trace with the universal framework. The assessment shows that both models generate workloads with the same distribution as the traces, which demonstrates the universality of the framework.

(2) For the non-preemptive nature of moldable jobs, we use one-dimensional bin-packing to describe the relationship between VMs and resources, constrained by energy consumption. We then make two proposals that improve on existing ones. 1) To address the disadvantages of current Overload Decision Algorithms (ODA) and VM Selection Algorithms (VMSA), we propose a novel ODA based on the VM CPU usage rate model, together with an improved VMSA. The novel ODA uses the mean and standard deviation of the VM CPU usage rate to decide whether a host is overloaded. If the host is overloaded, the improved VMSA selects a suitable VM on the host to migrate by the minimum positive correlation coefficient. The experiments show that the novel ODA and improved VMSA outperform the current ODAs and VMSA: the current ones achieve 3.84 for Energy-Performance Tradeoff (EPT), while ours achieve 1.28. 2) To address the disadvantages of the current heuristic framework for VM consolidation, we propose a redesigned framework. In the redesigned framework, we propose an SLA Violation Decision Algorithm (SLAVDA) based on the VM CPU usage rate model, and an improved Minimum Power Maximum Utilization policy (MPMU). SLAVDA decides whether a host violates its SLA. If it does, selected VMs (chosen by the VMSA) are migrated to suitable hosts according to MPMU. We use the VM CPU usage rate trace and the VM CPU usage rate model to evaluate the redesigned framework. Experimental results show that the redesigned framework outperforms the current one on energy consumption, SLA violation, and EPT.

(3) For the preemptive nature of rigid jobs, we use queuing theory to describe the relationship between jobs and resources. First, we propose a weighted-average method to calculate the User-Trust based on the BGS model. Backfilling can then predict a job's runtime from the User-Trust and the job's requested runtime; we call this predictor Trust. We then evaluate the performance of Trust against an existing proposal (Tsafrir) through trace-driven simulation. Experimental results show that Trust outperforms Tsafrir on accuracy, average wait time, and bounded slowdown. Finally, the performance evaluation with the BGI trace and BGIModel also shows that Trust outperforms Tsafrir on average wait time and bounded slowdown.

We have improved scheduling by using the workload characteristics obtained from workload modeling, and evaluated the performance of the improved scheduling. The experimental results show that the improved scheduling outperforms the previous designs. We have also evaluated the improved scheduling with the workload models and obtained results consistent with those from the traces, which verifies the feasibility of the models and the robustness of the improved scheduling. In future work, the workload modeling approach can be ported to real environments and used to analyze online traces, and the reliability and robustness of the improved scheduling can be verified there.
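
The runtime prediction in item (3) can be sketched in a few lines of Python. The abstract states only that a weighted average yields a user trust value, which is combined with the requested runtime to predict the actual runtime. The specifics below are assumptions for illustration and not the thesis's definition: trust is taken as a weighted average of past actual-to-requested runtime ratios with geometrically decaying weights (the hypothetical `decay` parameter), and the prediction is trust multiplied by the requested runtime.

```python
def user_trust(history, decay=0.5):
    """Weighted average of actual/requested ratios for one user.

    history: list of (requested, actual) runtimes, oldest first.
    More recent jobs get larger weights (assumed geometric decay).
    A user with no history is fully trusted (trust = 1.0).
    """
    if not history:
        return 1.0
    n = len(history)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    # Cap each ratio at 1.0: jobs are assumed not to outlive their request.
    ratios = [min(actual / requested, 1.0) for requested, actual in history]
    return sum(w * r for w, r in zip(weights, ratios)) / sum(weights)


def predict_runtime(requested, history, decay=0.5):
    """Assumed predictor: scale the requested runtime by the user's trust."""
    return user_trust(history, decay) * requested


# Hypothetical user: asked for 100 s twice, used 50 s, then the full 100 s.
history = [(100, 50), (100, 100)]
print(predict_runtime(60, history))
```

A backfilling scheduler could use such a prediction in place of the user's requested runtime when deciding whether a job fits into a gap in the schedule.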
