Research on Job Scheduling Algorithms Based on the Hadoop Platform
【Author】 Yu Zhengxiang (余正祥);
【Advisor】 Xie Ge (谢戈);
【Author information】 Yunnan University, Computer Application Technology, 2011, Master's thesis
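【Abstract (translated from the Chinese original)】 With the rapid development of Internet technology, Internet data is growing explosively, which raises the problem of processing massive data. Cloud computing was proposed as a new model and has developed very quickly. Hadoop, an open-source cloud computing system, imitates and implements the main technologies of Google's cloud computing and is widely used. Hadoop is a platform that is still being developed and improved, and job scheduling is one of the hot topics of Hadoop research in both academia and industry. Improving job scheduling increases the capacity to process massive data and is of practical significance for the performance and resource utilization of the Hadoop platform. This thesis first introduces the technical background of Hadoop, then introduces the core of the Hadoop platform, namely the Hadoop Distributed File System (HDFS) and the MapReduce computing framework, and analyzes Hadoop's job scheduling process in detail. It then studies the existing scheduling algorithms on the Hadoop platform, namely the FIFO algorithm, the capacity scheduling algorithm, and the fair scheduling algorithm, with the fair scheduling algorithm studied in detail. Building on this understanding of the platform and its scheduling algorithms, improvements to job scheduling are proposed. First, the data locality problem of the fair scheduling algorithm and its delay-based improvement are analyzed; on this basis, a delay algorithm that guarantees a response time T is proposed to meet the service level agreement (SLA) requirements of special users (for example, paying users), aimed mainly at short jobs. Second, in order to keep improving job scheduling by using the history records of nodes and learning job attributes, a feature-weighted naive Bayes classifier is applied to task assignment in job scheduling; the design of the algorithm is analyzed in detail, and a prototype is designed and implemented. Finally, an experimental environment is set up to test the improved algorithms. The delay algorithm that guarantees a specific response time T is tested first; the experiments show that it meets the response time T requirement, at the cost of some data locality. The feature-weighted naive Bayes classification scheduling algorithm is then tested: its learning ability, the effect of feature weighting on performance, its decision accuracy, and its performance relative to the existing scheduling algorithms are compared and analyzed experimentally.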
【Abstract】 With the rapid development of Internet technology, Internet data is growing explosively, and processing massive data has become a pressing problem. Cloud computing was proposed as a new model and has developed very quickly. Hadoop, an open-source cloud computing system, imitates and implements the main technologies of Google's cloud computing and is widely used. Hadoop is a platform under continuous development and improvement, and job scheduling in Hadoop is a hot topic in both academic research and industry. Improving job scheduling increases the capacity for massive data processing and has practical significance for the performance and resource utilization of the Hadoop platform. This thesis first describes the technical background of Hadoop, then introduces the core of the Hadoop platform, namely the Hadoop Distributed File System (HDFS) and the MapReduce computation framework, and analyzes the Hadoop job scheduling process in detail. I then study the existing scheduling algorithms on the Hadoop platform, namely the FIFO algorithm, the capacity algorithm, and the fair scheduling algorithm, and analyze the fair scheduling algorithm in detail. Based on this study of the Hadoop platform and its job scheduling algorithms, I propose improvements to job scheduling. First, I analyze the data locality problem of the fair scheduling algorithm and the delay algorithm that addresses it, and on that basis propose an improved delay algorithm that guarantees a response time T, so as to meet the Service Level Agreement (SLA) requirements of specific users (such as paying customers); it is aimed mainly at short jobs.
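The idea of a delay algorithm bounded by a response time T can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the `Job` fields, the `est_runtime` estimate, the `max_delay` threshold, and the rule "stop waiting for a local node once the remaining slack before T is used up" are all assumptions made for illustration, and waiting time is measured from job submission for simplicity.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Job:
    name: str
    submit_time: float                     # seconds
    est_runtime: float                     # assumed run-time estimate (hypothetical field)
    local_nodes: Set[str] = field(default_factory=set)  # nodes holding the job's input data
    response_bound: Optional[float] = None  # T for SLA users; None means no guarantee


def may_launch(job: Job, node: str, now: float, max_delay: float) -> Tuple[bool, bool]:
    """Return (launch, is_local) for a free slot on `node` at time `now`."""
    if node in job.local_nodes:
        return True, True                  # data-local launch, always allowed
    waited = now - job.submit_time
    if job.response_bound is not None:
        # Slack left before the job can no longer finish within T.
        slack = job.response_bound - job.est_runtime - waited
        if slack <= 0:
            return True, False             # give up locality to protect T
    if waited >= max_delay:
        return True, False                 # ordinary delay threshold reached
    return False, False                    # skip this job and keep waiting


def assign(jobs: List[Job], node: str, now: float,
           max_delay: float = 5.0) -> Tuple[Optional[Job], bool]:
    """Scan jobs in scheduler order and pick the first one allowed to launch."""
    for job in jobs:
        launch, is_local = may_launch(job, node, now, max_delay)
        if launch:
            return job, is_local
    return None, False
```

In this sketch a job with a bound T launches on a non-local node as soon as further waiting would make it miss T, which mirrors the trade-off between response time and data locality that the abstract reports.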
Second, to keep improving job scheduling by using the history records of nodes and learning job properties, I propose applying a feature-weighted naive Bayes classification algorithm to task allocation in job scheduling, analyze the design of the algorithm in detail, and complete a prototype design and implementation. Finally, I built an experimental environment in our lab to test the improved algorithms. The first test covers the delay algorithm that guarantees a specific response time T; the experiments show that it meets the response time T requirement but loses part of the data locality. The second test covers the feature-weighted naive Bayes classification scheduling algorithm: its learning ability, the effect of feature weighting on performance, its decision accuracy, and its performance compared with the existing scheduling algorithms.
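As a rough illustration of feature-weighted naive Bayes classification applied to task allocation, the following Python sketch scores a candidate assignment as log P(c) + Σ w_i · log P(x_i | c) with Laplace smoothing and learns from recorded outcomes. The class labels ("good"/"bad"), the example features (job type, node load), and the way weights are supplied are assumptions for illustration; the abstract does not specify the thesis's features or weighting scheme.

```python
import math
from collections import defaultdict


class WeightedNaiveBayes:
    """Naive Bayes over categorical features with one weight per feature."""

    def __init__(self, weights, classes=("good", "bad")):
        self.weights = list(weights)       # assumed to be given, one per feature
        self.classes = tuple(classes)
        self.class_counts = {c: 0 for c in self.classes}
        # feature_counts[c][i][value] = times value was seen for feature i in class c
        self.feature_counts = {
            c: [defaultdict(int) for _ in self.weights] for c in self.classes
        }
        self.values = [set() for _ in self.weights]  # distinct values per feature

    def update(self, features, label):
        """Learn from one observed outcome (for example, a node's history record)."""
        self.class_counts[label] += 1
        for i, value in enumerate(features):
            self.feature_counts[label][i][value] += 1
            self.values[i].add(value)

    def log_score(self, features, label):
        total = sum(self.class_counts.values())
        # Laplace-smoothed class prior.
        score = math.log((self.class_counts[label] + 1) / (total + len(self.classes)))
        for i, value in enumerate(features):
            count = self.feature_counts[label][i].get(value, 0)
            n_values = len(self.values[i] | {value})
            p = (count + 1) / (self.class_counts[label] + n_values)
            score += self.weights[i] * math.log(p)   # weight acts as an exponent on P(x_i | c)
        return score

    def classify(self, features):
        return max(self.classes, key=lambda c: self.log_score(features, c))


# Hypothetical history: (job type, node load) -> outcome of the assignment.
model = WeightedNaiveBayes(weights=[1.0, 2.0])
for _ in range(3):
    model.update(("cpu", "low"), "good")
    model.update(("cpu", "high"), "bad")
```

A scheduler built along these lines would call `classify` for each candidate task and node and prefer assignments labeled "good", then call `update` once the task's outcome is known, so that decisions improve as history accumulates.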
【Key words】 Cloud Computing; MapReduce; Job Scheduling; Feature Weighted Naive Bayes;