
Research on the Performance and Scalability of Data-Parallel Programming Model on Multicore

【Author】 Chen Rong (陈榕)

【Advisor】 Zang Binyu (臧斌宇)

【Author Information】 Fudan University, Computer Architecture, 2011, Ph.D.

【Summary】 Over the past decade, the demand for large-scale data processing has become increasingly urgent, with the volume of data awaiting processing growing like an avalanche. According to 2007 statistics from the consulting firm IDC, the amount of data stored in digital form had reached a staggering 161 exabytes by 2006 and was projected to grow to 988 exabytes by 2010. Data-intensive applications have undoubtedly become one of the most important classes of computer applications today. Meanwhile, with the growing prevalence of multicore technology and the rapid increase in the number of on-chip cores, multicore platforms show great promise for large-scale data processing. However, the computing power delivered in multicore form can be fully exploited only through parallel programs whose performance keeps pace with the growing number of cores. Writing efficient parallel programs has always been a difficult problem for programmers: besides the business logic itself, they must confront a host of parallelism-related issues, including data distribution, scalability, load balancing, and fault tolerance. In 2008, the research firm Gartner listed seven grand challenges facing the IT market over the next 25 years, and parallel programming in the multicore era ranked second. Data-parallel programming models are arguably the best answer to this challenge: through suitable abstractions they hide parallelism-related issues from application programmers, leaving business-logic development to application programmers while leaving the challenge of implementing parallelism to parallel-computing experts. However, most existing data-parallel programming models and their runtime support are designed and implemented for cluster platforms and do not take into account the characteristics of multicore platforms, such as fast inter-core communication, contention for shared caches, and a whole-machine failure model; as a result, they cannot effectively exploit the computing power that multicore technology provides. In addition, existing parallel programming models emphasize generality over specialization, which limits their effectiveness for certain application domains and computational requirements.

Based on an in-depth analysis of the performance and scalability problems of the existing MapReduce programming model on multicore platforms, this dissertation proposes a systematic solution. It first extends the MapReduce model with a divide-and-conquer ("tiling") strategy tailored to the characteristics of multicore platforms, yielding the Tiled-MapReduce model; it then proposes, on top of Tiled-MapReduce, optimizations targeting memory footprint, cache locality, and task parallelism; finally, taking online aggregation and incremental computation as examples, it analyzes and validates the model's efficient support for applications from different domains and with different requirements. Compared with prior work, this research aims to design and implement a data-parallel programming model specifically for multicore platforms, fully utilizing the available resources to obtain commensurate performance and scalability while efficiently supporting a broader range of domains and applications. Specifically, the main contributions of this dissertation are as follows:

1. From the perspective of data-parallel programming models, it analyzes the main differences between multicore and cluster platforms, and on this basis reveals the main problems of the cluster-oriented MapReduce programming model on multicore platforms. It proposes extending MapReduce with a divide-and-conquer strategy that decomposes a large job into multiple sub-jobs executed iteratively, and improves the original fault-tolerance mechanism, so that the model fully adapts to the characteristics of multicore platforms.

2. It proposes several runtime optimizations on top of Tiled-MapReduce, covering memory, cache, and processor usage: dynamic data loading and buffer reuse reduce and shorten memory occupancy, a NUCA/NUMA-aware scheduling policy improves cache locality, and software pipelining and work stealing eliminate processor idle time.

3. Based on Tiled-MapReduce and the associated runtime optimizations, a prototype system named Ostrich is designed and implemented for multicore platforms. A thorough evaluation shows that the model's interface extensions impose no extra burden on programmers compared with other MapReduce implementations. Furthermore, on a test platform built around a 16-core Intel processor, the Ostrich runtime not only exhibits better scalability on all benchmarks, but in the performance tests also saves up to 85% of memory, reduces the cache miss rate by 3.1x to 7.1x, and improves overall performance by 1.2x to 3.3x.

4. Leveraging the support provided by Tiled-MapReduce, two case-study applications targeting different domains and computational requirements are designed and implemented. The Oops system supports online aggregation: it can report an approximate result for the current progress to the user during execution, and it efficiently supports multi-level online computation. The OstrichInc system performs computation reuse at the sub-job level, efficiently supporting both fully incremental and partially incremental computation. The evaluation shows that Tiled-MapReduce, while retaining its original generality, provides efficient support for applications from a variety of domains with a variety of requirements.
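The "tiling" (divide-and-conquer) extension summarized above splits a large MapReduce job into sub-jobs that are processed one at a time, so that only each sub-job's compact partial result has to be kept for a final merge. The following minimal C++ sketch illustrates that control flow for a word-count job; the names (run_subjob, tiled_wordcount), the fixed tile size, and the sequential loop are illustrative assumptions rather than the Ostrich API, and a real runtime would execute each sub-job's map tasks in parallel across the cores.

    // Illustrative sketch of the "tiling" idea (hypothetical names, not the
    // Ostrich API): a large job is split into sub-jobs; each sub-job runs a
    // full map + combine pass over its tile, and only the compact partial
    // results survive to the final reduce.
    #include <algorithm>
    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <vector>

    using Partial = std::map<std::string, long>;  // combined output of one sub-job

    // Map + combine over a single tile; intermediate data lives only inside
    // this call, which is what keeps the overall memory footprint small.
    static Partial run_subjob(const std::vector<std::string>& tile) {
        Partial local;
        for (const auto& line : tile) {
            std::istringstream words(line);
            std::string w;
            while (words >> w) ++local[w];  // map + in-tile reduce (combine)
        }
        return local;
    }

    // Iterate over tiles, then merge the partial results in a final reduce.
    // A real runtime would run each sub-job's map tasks in parallel on the
    // available cores; this sketch stays sequential for clarity.
    static Partial tiled_wordcount(const std::vector<std::string>& input,
                                   std::size_t tile_size) {
        Partial global;
        for (std::size_t start = 0; start < input.size(); start += tile_size) {
            std::size_t end = std::min(start + tile_size, input.size());
            Partial part = run_subjob({input.begin() + start, input.begin() + end});
            for (const auto& [word, count] : part) global[word] += count;  // final reduce
        }
        return global;
    }

    int main() {
        const std::vector<std::string> input = {"a b a", "b c", "a c c"};
        for (const auto& [word, count] : tiled_wordcount(input, 2))
            std::cout << word << ": " << count << '\n';
    }

In the sketch, a sub-job's intermediate data never outlives run_subjob, which mirrors how the tiled structure shortens the life cycle of intermediate data and limits the memory footprint.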

【Abstract】 The demand for processing large-scale data has emerged and grown rapidly over the past 10 years; the dramatic growth of data is an "information avalanche". According to statistics from IDC Inc., the amount of information stored in digital form was 161 exabytes in 2006 and was projected to reach 988 exabytes by 2010. There is no doubt that data-intensive applications have become one of the most important application domains.

Meanwhile, with the prevalence of multicore and the continuously increasing number of cores on a single machine, large-scale data processing has become a promising way to harness the abundant resources of multicore platforms. The most common way to harness multicore resources is through parallel applications. Unfortunately, writing an efficient parallel application is a nightmare for average programmers: in addition to functionality, they also have to consider many issues, including data distribution, scalability, load balancing, and fault tolerance. Accordingly, Gartner Inc. in 2008 identified seven grand challenges in IT for the next 25 years, and parallel programming was ranked second.

Data-parallel programming models are an appealing answer to this challenge, providing easy-to-use abstractions that hide parallelism issues from users. However, current data-parallel programming models designed for programming large clusters do not seriously consider several unique features of multicore platforms, such as low-latency inter-core communication, contention for shared caches, and pervasive global failures. Furthermore, current data-parallel programming models emphasize generality while ignoring specialization in some cases, which limits their usage in some application domains.

Based on a detailed analysis of the performance and scalability problems of running the MapReduce programming model on multicore platforms, this dissertation proposes a practical and systematic solution. First, we extend the general MapReduce programming model with a "tiling" strategy, yielding Tiled-MapReduce (TMR). Second, we propose three optimizations that aim at improving memory efficiency, data locality, and task parallelism. Finally, we demonstrate the powerful support Tiled-MapReduce provides for novel applications such as online aggregation and incremental computation.

The main contents and contributions of this dissertation include the following aspects:

  • We analyze the performance issues resulting from the differences between multicore and cluster platforms, and propose Tiled-MapReduce, which applies the "tiling" strategy to increase performance and scalability. The runtime divides a large MapReduce job into a number of independent small sub-jobs and iteratively processes one sub-job at a time with efficient use of resources. We also enhance the fault-tolerance strategy for multicore platforms.
  • Tiled-MapReduce also enables three optimizations otherwise impossible. First, we use dynamic loading and input-data buffer reuse to shorten the life cycle and limit the footprint of input and intermediate data. Second, we use a NUCA/NUMA-aware scheduler to fully exploit the memory hierarchy of a multicore system, resulting in better memory and cache locality. Finally, we use software pipelining and work stealing to eliminate idle time and fully harness the CPU cores.
  • We implement a prototype of Tiled-MapReduce, namely Ostrich, running on an Intel machine with 16 cores. Experiments on four different types of benchmarks show that Ostrich has much better scalability than Phoenix for all benchmarks. Meanwhile, it also saves up to 85% memory, causes fewer cache misses, and makes more efficient use of CPU cores, resulting in a speedup ranging from 1.2X to 3.3X.
  • We design and implement two prototypes to demonstrate the powerful support Tiled-MapReduce offers for different application domains. First, the online aggregation system, namely Oops, allows users to observe the progress and approximate results of their tasks. Second, the incremental computation system, namely OstrichInc, reuses task results at the sub-job level, which supports both fully incremental and partially incremental computation.
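The sub-job-level computation reuse behind the incremental-computation case study (OstrichInc) can be sketched in the same style: if each sub-job's partial result is memoized under a fingerprint of its input tile, re-running the job after data has been appended recomputes only the sub-jobs whose tiles have not been seen before. This is a hypothetical illustration, not the OstrichInc implementation; the in-memory cache, the FNV-style fingerprint, and the function names are assumptions made for the sketch.

    // Illustrative sketch of sub-job-level computation reuse (hypothetical,
    // not the OstrichInc implementation): partial results are memoized per
    // input tile, so re-running after new data is appended only recomputes
    // sub-jobs whose tiles were not seen before.
    #include <algorithm>
    #include <functional>
    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <unordered_map>
    #include <vector>

    using Partial = std::map<std::string, long>;               // result of one sub-job
    using Cache   = std::unordered_map<std::size_t, Partial>;  // fingerprint -> partial

    static Partial run_subjob(const std::vector<std::string>& tile) {
        Partial local;
        for (const auto& line : tile) {
            std::istringstream words(line);
            std::string w;
            while (words >> w) ++local[w];                     // map + combine per tile
        }
        return local;
    }

    // Fold line hashes so identical tiles map to the same cache entry
    // (an FNV-1a-style fold; any stable fingerprint would do).
    static std::size_t fingerprint(const std::vector<std::string>& tile) {
        std::size_t h = 14695981039346656037ull;
        for (const auto& line : tile)
            h = (h ^ std::hash<std::string>{}(line)) * 1099511628211ull;
        return h;
    }

    static Partial incremental_wordcount(const std::vector<std::string>& input,
                                         std::size_t tile_size, Cache& cache) {
        Partial global;
        for (std::size_t start = 0; start < input.size(); start += tile_size) {
            std::size_t end = std::min(start + tile_size, input.size());
            std::vector<std::string> tile(input.begin() + start, input.begin() + end);
            auto it = cache.find(fingerprint(tile));
            if (it == cache.end())                             // unseen tile: compute and memoize
                it = cache.emplace(fingerprint(tile), run_subjob(tile)).first;
            for (const auto& [word, count] : it->second) global[word] += count;
        }
        return global;
    }

    int main() {
        Cache cache;
        incremental_wordcount({"a b", "b c"}, 1, cache);           // initial run fills the cache
        auto r = incremental_wordcount({"a b", "b c", "c d"}, 1,   // appended data: only the
                                       cache);                     // new tile is recomputed
        for (const auto& [word, count] : r) std::cout << word << ": " << count << '\n';
    }

Under this tile-memoization scheme, appending data (the fully incremental case) computes only the new tiles, while modifying a few existing tiles (the partially incremental case) recomputes only the changed ones.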

  • 【Online Publication Contributor】 Fudan University
  • 【Online Publication Year/Issue】 2011, Issue 12