逻辑核动态可重构的众核处理器体系结构

Manycore Processor Architecture with Dynamically Reconfigurable Logic Core

【Author】 Ren Yongqing (任永青)

【Advisors】 An Hong (安虹); Li Guojie (李国杰)

【Author Information】 University of Science and Technology of China, Computer Architecture, 2010, Ph.D.

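【Abstract (translated from Chinese)】 As semiconductor technology advances and Moore's Law continues to hold, the number of processor cores integrated on a single chip will keep increasing. At the same time, the pursuit of higher performance per watt and performance per unit area makes the manycore architecture the inevitable choice for chip design. The abundant computing resources and efficient on-chip communication of manycore processors give throughput-oriented applications a natural performance advantage, but because the individual cores become finer grained, the performance of a serial application running on a single core cannot be guaranteed. To address this problem, manycore processor architectures capable of constructing logic cores have attracted much attention in recent years. The basic idea is to build a coarse-grained logic core from several fine-grained processor cores (called physical cores), in the hope of using the abundant computing resources of the manycore architecture to turn the growing number of cores into performance gains for single-threaded serial applications. Existing work still lacks in-depth study of how such architectures handle communication overhead, how flexibly logic core granularity can be configured, and how applications are mapped. This dissertation addresses the efficient execution of serial programs on fine-grained manycore architectures and examines the execution model, the microarchitecture design, and dynamic resource control in depth. It has academic and practical value for the exploration of manycore processor architectures with dynamically reconfigurable logic cores. The main research content and results are as follows.

(1) The reconfiguration overhead of manycore processors capable of constructing logic cores is studied, and FTPA (Flexible Tiled Processor Architecture), a manycore architecture with dynamically reconfigurable logic cores, is proposed. FTPA adopts a dataflow-like instruction set architecture. Without changing the serial programming model, it uses an execution model that combines dataflow-driven execution with thread-level speculation to exploit both instruction-level and thread-level parallelism in single-threaded programs. To address the excessive overhead of logic core reconfiguration, FTPA uses the on-chip routing network to divide the resources of a physical core into computing resources that are easy to reconfigure and shared resources that are hard to reconfigure. Logic core granularity can therefore be adjusted asynchronously at two levels and at two frequencies, which gives a high degree of flexibility.

(2) A mechanism is studied for evaluating, at run time, the speculative execution capability of an application when a serial program runs under a fine-grained thread-level speculative execution model. The parallelism characteristics of a serial application differ markedly between execution phases. To exploit temporal locality and guide the dynamic reconfiguration of logic core granularity, this dissertation proposes a quantitative method for evaluating thread-level speculative execution capability based on the concepts of "speculative execution phase" and "speculation depth". On this basis it proposes three estimator designs that use, respectively, the local history of speculation depth, its global history, and a tournament scheme. With only tens of bits of storage, the estimators can effectively predict the trend in a serial program's parallelism and estimate the speculation depth.

(3) The effectiveness of using the speculative execution capability estimator to guide dynamic logic core reconfiguration in FTPA is studied. To handle the communication overhead caused by distributed execution, the computing resources, centered on instruction windows and functional units, can form logic cores under two mapping modes, flat and deep, and thus adapt to applications with different parallelism characteristics. The estimator is used to guide dynamic logic core reconfiguration in FTPA, and performance and resource utilization are evaluated experimentally in detail for both flat and deep mapping. The results show that, compared with FTPA configurations that use fixed-granularity logic cores, dynamic reconfiguration needs only half of the physical-core computing resources to support fine-grained thread-level speculative execution effectively, with a performance loss of less than 13% and markedly higher resource utilization.

The work leads to the following conclusions. (1) Logic cores are an effective means of accelerating serial applications on manycore processors, but coupling fine-grained physical core resources requires efficient architectural support, such as the separation of computing and shared resources and the flat and deep mapping modes proposed here. (2) Accelerating serial programs on a manycore processor with a fine-grained thread-level speculative execution model requires a trade-off between performance and resource utilization. Sound logic core reconfiguration must rest on an accurate understanding of the application's execution characteristics, and the thread-level speculative execution capability estimator is an effective attempt in that direction. The separation of computing and shared resources, the flat and deep logic core reconfiguration, and the estimator design used in FTPA can all be generalized as methodology and applied to other manycore architectures.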

【Abstract】 As semiconductor technology evolves and Moore's Law continues to hold, the number of processor cores integrated on a single chip keeps increasing. For power and area efficiency, the manycore processor architecture is an inescapable choice. With abundant computing resources and a highly efficient on-chip network, a manycore processor suits applications with throughput requirements. Because the cores integrated on a manycore chip become finer grained, however, the performance of a single-threaded application executing on a single core may diminish. Manycore processors capable of constructing reconfigurable logic cores have recently emerged as a notable solution to this problem: several cores (called physical cores) are combined into a coarse-grained logic core, with the aim of translating transistor resources efficiently into performance gains for sequential programs. Little research effort has been devoted to communication overhead, logic core flexibility, and application mapping for these manycore architectures.

Aiming at the efficient execution of sequential applications on manycore processors with fine-grained cores, this dissertation studies the execution model, the micro-architecture, and resource tuning in depth, and contributes to the study of manycore architectures with dynamically reconfigurable logic cores. The main content and achievements are as follows.

(1) The manycore processor FTPA (Flexible Tiled Processor Architecture) with dynamically reconfigurable logic cores is proposed. FTPA builds on the dataflow-like EDGE (Explicit Data Graph Execution) instruction set architecture. ILP (instruction-level parallelism) and TLP (thread-level parallelism) are exploited through dataflow execution and fine-grained thread-level speculative execution without affecting the serial programming model. To reduce the overhead of logic core reconfiguration, FTPA separates computing resources from shared resources through the on-chip network, so that resources can be tuned at two levels and at two frequencies, which gives much more flexibility.

(2) An estimator of speculative execution capability is designed to direct dynamic logic core reconfiguration. To achieve reasonable logic core reconfiguration, three estimators of fine-grained thread-level speculative execution capability are proposed, based on temporal locality and on observations about execution phases. They build on the concepts of speculative execution phase and speculation depth and are named the local history, global history, and tournament estimators. Experimental results show that the estimators predict the trend of concurrency changes across execution phases accurately while consuming only tens of bits of hardware storage.

(3) The efficiency of applying the estimator of speculative execution capability to dynamic logic core reconfiguration is explored. To suit different design constraints and application styles, the computing resources, including instruction windows and functional units, can form a logic core in two ways, flat and deep. The estimator is used to tune the logic core granularity of FTPA in both ways. Experimental results demonstrate that, under the direction of the estimator and compared with a fixed logic core granularity, nearly half of the resources are enough to exploit the concurrency of sequential applications, with less than 13% performance loss, which means a much higher resource utilization ratio.

Several conclusions are drawn from the work. (1) Constructing logic cores is an efficient way to execute sequential programs on a manycore processor, but it requires reasonable micro-architecture support, such as the separation of computing and shared resources and the flat and deep models of application mapping. (2) A reasonable trade-off between performance and resource utilization must be struck when sequential programs execute on a manycore processor with fine-grained thread-level speculation. This requires an accurate understanding of how concurrency varies during application execution, and the estimator of thread-level speculative execution capability is an attractive attempt. The schemes in this dissertation, such as the separation of computing and shared resources, logic core construction in flat and deep manners, and the estimator of speculative execution capability, can be extended as general techniques.
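
The abstract names three estimators of speculation depth (local history, global history, tournament) but does not give their internals. The Python sketch below is one plausible reading, modeled loosely on tournament branch predictors, and is not the dissertation's design. Everything in it is an assumption made for illustration: the class names, the table sizes, the 3-bit depth cap, the use of a rounded mean for the global estimate, and the 2-bit chooser. With these hypothetical sizes the state adds up to 26 bits (4 × 3 for the local table, 4 × 3 for the global history, 2 for the chooser), in the same range as the "tens of bits" the abstract reports.

```python
from collections import deque


class LocalEstimator:
    """Hypothetical local-history estimator: one last-seen depth per phase slot."""

    def __init__(self, entries=4, max_depth=7):
        self.entries = entries
        self.max_depth = max_depth          # 7 fits in 3 bits (assumed width)
        self.table = [0] * entries          # indexed by phase_id modulo entries

    def predict(self, phase_id):
        return self.table[phase_id % self.entries]

    def update(self, phase_id, observed):
        self.table[phase_id % self.entries] = min(observed, self.max_depth)


class GlobalEstimator:
    """Hypothetical global-history estimator: rounded mean of recent depths."""

    def __init__(self, history=4, max_depth=7):
        self.max_depth = max_depth
        self.recent = deque([0] * history, maxlen=history)

    def predict(self, phase_id):
        # phase_id is ignored: the global estimate is shared by all phases.
        n = len(self.recent)
        return (sum(self.recent) + n // 2) // n

    def update(self, phase_id, observed):
        self.recent.append(min(observed, self.max_depth))


class TournamentEstimator:
    """Picks whichever component estimator has been closer recently."""

    def __init__(self, max_depth=7):
        self.max_depth = max_depth
        self.local = LocalEstimator(max_depth=max_depth)
        self.glob = GlobalEstimator(max_depth=max_depth)
        self.chooser = 2                    # 2-bit counter; >= 2 favors local

    def predict(self, phase_id):
        if self.chooser >= 2:
            return self.local.predict(phase_id)
        return self.glob.predict(phase_id)

    def update(self, phase_id, observed):
        observed = min(observed, self.max_depth)
        local_err = abs(self.local.predict(phase_id) - observed)
        global_err = abs(self.glob.predict(phase_id) - observed)
        # Move the saturating chooser toward the more accurate component.
        if local_err < global_err:
            self.chooser = min(3, self.chooser + 1)
        elif global_err < local_err:
            self.chooser = max(0, self.chooser - 1)
        self.local.update(phase_id, observed)
        self.glob.update(phase_id, observed)


if __name__ == "__main__":
    est = TournamentEstimator()
    # Made-up trace of (phase_id, observed speculation depth) pairs.
    for phase_id, depth in [(0, 6), (1, 1)] * 4:
        print(phase_id, "predicted", est.predict(phase_id), "observed", depth)
        est.update(phase_id, depth)
```

In a design like the one the abstract describes, the predicted depth would presumably decide how many physical cores' computing resources a logic core claims for the next phase. That mapping is left out here because the abstract does not specify it.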
