Research on Architecture-Level Cache Power Optimization Techniques

Research on Power Optimization of Cache Architecture Design

【Author】 项晓燕 (Xiang Xiaoyan)

【Supervisor】 严晓浪 (Yan Xiaolang)

【Author Information】 浙江大学 (Zhejiang University), Circuits and Systems, 2013, Ph.D.

【Abstract】 With advances in integrated-circuit manufacturing and rising microprocessor performance, power consumption has become an increasingly severe problem and is now the main bottleneck constraining microprocessor development. On-chip cache power is an important component of total microprocessor power, so reducing cache power is a primary target of microprocessor power control. Because lower-level power optimizations, constrained by process technology and the physical characteristics of materials, can no longer satisfy cache power budgets, the cache must be optimized at a higher level. Starting from the composition of cache power, cache access characteristics, and the power/performance trade-off, this thesis proposes several architecture-level cache power optimization techniques. The main work and contributions include:

1. Low-power instruction cache. Exploiting the pronounced locality in the offsets between consecutively accessed instruction-cache lines, a low-power access scheme is proposed that links the currently accessed cache line to several of its immediately adjacent lines. On a short relative branch, the links between adjacent lines yield the exact way of the target line, reducing accesses to the cache tag and data arrays and thereby lowering the dynamic power of the instruction cache. When a line is replaced, only the links between its neighbours and the evicted line need to be detected and cleared, so link correctness is maintained at very small hardware cost.

2. Low-power data cache. To address the power cost of accessing the data cache and the load-store queue in parallel and the performance cost of accessing them serially, a low-power scheme is proposed that filters out unnecessary data-cache accesses through load-store-queue access prediction. Exploiting the predictability of memory dependences, the scheme records, for each load, the set of instructions in the load-store queue with which it has a memory dependence, predicts the subsequent loads that need only the load-store queue, obtains their results directly from the load-store-queue forwarding data path, and gates off the data-cache access.

3. Cache reconfiguration algorithm. To address the search overhead of reconfigurable caches, a reconfiguration-prediction algorithm triggered by subroutine transfers is proposed. Since a subroutine transfer marks the start of a new program phase, the cache miss rate is monitored at subroutine granularity, and a subroutine's historically optimal cache configuration is used to predict the configuration for its later invocations, reducing the design-space search during reconfiguration. Furthermore, by distinguishing cache lines filled before and after reconfiguration, the reconfigured cache can keep using pre-reconfiguration data, lowering the latency and power of cache re-initialization.

4. Unnecessary cache accesses. To address the unnecessary instruction-cache accesses caused by branch mispredictions, a low-power instruction-cache scheme based on zero-delay branch prediction is proposed. Branch-prediction outcomes are fed into subsequent branch predictions, eliminating the branch-history aliasing caused by the high branch penalty of deep-pipeline, superscalar processors, improving prediction accuracy, and reducing the power of unnecessary instruction-cache accesses.

The architecture-level techniques proposed in this thesis effectively reduce cache power without affecting performance and improve the microprocessor's performance-to-power ratio.

【Abstract】 With the development of IC manufacturing technology and the growing performance of microprocessors, power consumption has become an increasingly serious problem and is now the main obstacle to further improvement. High power consumption not only raises manufacturing cost but also degrades a microprocessor's stability and reliability. The on-chip cache consumes a significant share of a microprocessor's energy, so an energy-efficient on-chip cache is a primary design objective as feature sizes shrink and cache capacity and associativity grow. Since circuit-level and logic-level low-power techniques are strongly constrained by process technology and material physics, they can no longer meet the energy constraints of on-chip caches, and architecture-level techniques are required. This thesis proposes multiple power optimizations for on-chip cache architecture, based on the components of cache power, the access characteristics of different caches, and the balance between power and performance. The main contributions are as follows:

1. Low-power instruction cache design. Set-associative instruction caches consume a large portion of power in modern microprocessors. An analysis of cache access behavior shows that most accesses are sequential fetches or short-distance branches whose targets fall in adjacent cache lines. A new low-power instruction cache architecture is therefore proposed that records links between the current cache line and its adjacent lines. When a cache access occurs, the adjacent-line links supply the way of the target line, so only one way of the data array is accessed and the tag lookup is avoided, reducing power consumption.
When a cache line is evicted, only its adjacent-line links need to be checked and invalidated to keep the links correct.

2. Low-power data cache design. Accessing the data cache and the load-store queue in parallel wastes energy, while accessing them serially increases load-to-use latency. A low-power data cache based on load-store-queue access prediction is proposed to filter out unnecessary data-cache accesses. A memory-dependence set records, for each load, the dependent loads and stores residing in the load-store queue. When a load is fetched, its memory-dependence set is checked; a load that needs only the load-store queue obtains its result from the load-store-queue forwarding data path, and the data-cache access is eliminated. As a result, the scheme reduces power consumption without performance loss.

3. Cache configuration algorithm. Configurable caches suffer from tuning intervals that do not closely match an application's phase changes, and from high configuration overhead. A subroutine-call-based configuration-prediction algorithm is proposed to improve the tuning interval and reduce the overhead. Since cache requirements may vary greatly across subroutines, the miss rate is checked when a subroutine is called; if it exceeds a threshold, the cache is retuned using that subroutine's historically optimal parameters. Furthermore, a cache-line reuse mechanism across retuning is proposed: cache lines are identified as belonging to the pre- or post-reconfiguration organization, reducing the initial performance loss and power after retuning.

4. Reducing unnecessary instruction-cache accesses. Because branch mispredictions result in unnecessary accesses to the instruction cache, a low-power instruction cache based on a zero-delay branch prediction mechanism is proposed.
Branch-prediction outcomes are used to predict subsequent branches, eliminating branch-history aliasing in deep-pipeline, superscalar microprocessors. Branch-prediction accuracy is improved and the power of unnecessary cache accesses is reduced.

The techniques proposed in this thesis achieve substantial power savings without performance loss and improve the energy efficiency of the microprocessor.
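The adjacent-line linking idea of contribution 1 can be illustrated with a minimal behavioral sketch. This is not the thesis hardware: the class, the counters, and the `LINK_RANGE` parameter are all assumptions, and a valid link is approximated by a residency check rather than by stored per-line way bits.

```python
# Behavioral sketch of adjacent-line way linking: a sequential fetch or
# a short branch to a neighbouring line is served via a link and skips
# the full tag lookup; only non-neighbouring fetches pay for one.

class LinkedICache:
    LINK_RANGE = 2  # hypothetical: links cover lines within +/-2

    def __init__(self, num_sets=8, ways=4):
        self.num_sets, self.ways = num_sets, ways
        # Resident line address per (set, way); None = empty.
        self.line = [[None] * ways for _ in range(num_sets)]
        self.tag_lookups = 0   # accesses that read all tag ways
        self.link_hits = 0     # accesses served through a line link
        self.prev = None       # line address of the previous fetch

    def _locate(self, addr):
        s = addr % self.num_sets
        for w in range(self.ways):
            if self.line[s][w] == addr:
                return s, w
        return s, None

    def fetch(self, addr):
        if self.prev is not None and abs(addr - self.prev) <= self.LINK_RANGE:
            # Neighbouring fetch: in hardware the link table stored with
            # the previous line supplies the target way directly; the
            # sketch approximates a valid link by checking residency.
            _, w = self._locate(addr)
            if w is not None:
                self.link_hits += 1
                self.prev = addr
                return "link-hit"
        # Normal path: full tag lookup across all ways.
        self.tag_lookups += 1
        s, w = self._locate(addr)
        if w is None:
            # Miss: trivial fill into way 0 for the sketch. Real hardware
            # would also clear the neighbours' links to the victim line.
            self.line[s][0] = addr
        self.prev = addr
        return "tag-lookup"
```

Fetching a straight-line sequence twice shows the effect: the cold first pass pays one tag lookup per miss, while the warm second pass needs a single tag lookup at the long backward jump and serves the remaining fetches through links.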
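Contribution 2's filtering of data-cache accesses by load-store-queue prediction can be sketched as follows. The predictor here is a simple PC-indexed set and every name is hypothetical; the sketch shows only the control flow, not the thesis microarchitecture.

```python
# Behavioral sketch: a load predicted to hit in the store queue takes the
# forwarding path and leaves the data cache idle. (Illustrative only.)

class LSQFilter:
    def __init__(self):
        self.store_queue = {}      # in-flight stores: addr -> value
        self.predict_fwd = set()   # PCs of loads predicted to forward
        self.dcache = {}           # backing data-cache contents
        self.dcache_accesses = 0   # energy proxy: data-cache reads

    def store(self, addr, value):
        self.store_queue[addr] = value

    def load(self, pc, addr):
        if pc in self.predict_fwd and addr in self.store_queue:
            # Predicted memory dependence: the value comes from the
            # store-queue forwarding path; the data cache is gated off.
            return self.store_queue[addr]
        self.dcache_accesses += 1
        if addr in self.store_queue:
            # The load forwarded anyway: train the predictor so the
            # next instance of this load skips the data cache.
            self.predict_fwd.add(pc)
            return self.store_queue[addr]
        return self.dcache.get(addr, 0)

    def commit(self, addr):
        # A store leaves the queue and updates the data cache.
        self.dcache[addr] = self.store_queue.pop(addr)
```

Note that a misprediction is safe: if a load is predicted to forward but its address is no longer in the store queue, it simply falls through to the normal data-cache path.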
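The subroutine-triggered retuning policy of contribution 3 reduces to a small decision rule: sample the miss rate at each call, retune only when it exceeds a threshold, and prefer the subroutine's remembered best configuration over a fresh design-space search. A sketch under assumed names and an assumed 10% threshold:

```python
# Illustrative policy model; the threshold, the configuration encoding
# and the search function are placeholders, not thesis parameters.

MISS_THRESHOLD = 0.10

class SubroutineTuner:
    def __init__(self, default_cfg):
        self.cfg = default_cfg
        self.best_cfg = {}          # subroutine name -> best known config
        self.accesses = self.misses = 0
        self.searches = 0           # full design-space searches performed

    def record(self, hit):
        self.accesses += 1
        if not hit:
            self.misses += 1

    def on_call(self, subroutine, search_fn):
        # Sample and reset the miss rate at every subroutine call.
        miss_rate = self.misses / self.accesses if self.accesses else 0.0
        self.accesses = self.misses = 0
        if miss_rate <= MISS_THRESHOLD:
            return self.cfg                       # no retuning needed
        if subroutine in self.best_cfg:
            self.cfg = self.best_cfg[subroutine]  # reuse history: no search
        else:
            self.cfg = search_fn()                # one-time exhaustive tune
            self.best_cfg[subroutine] = self.cfg
            self.searches += 1
        return self.cfg
```

On a subroutine's second call with a high miss rate, the tuner restores the remembered configuration without invoking the search at all, which is exactly the overhead reduction the abstract describes.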
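Contribution 4's zero-delay idea, updating the global branch history with the predicted outcome at prediction time instead of waiting for resolution, can be sketched with a toy gshare-style predictor. The history width, table organization, and all names are assumptions for illustration.

```python
# Toy gshare-style sketch: the predicted outcome is shifted into the
# global history immediately, so closely spaced branches index the
# predictor with fresh history instead of a stale, aliased one.

class SpeculativeHistoryPredictor:
    HIST_BITS = 4                      # illustrative history width

    def __init__(self):
        self.history = 0
        self.table = {}                # index -> 2-bit saturating counter

    def predict(self, pc):
        idx = (pc ^ self.history) & ((1 << self.HIST_BITS) - 1)
        taken = self.table.get(idx, 1) >= 2
        # Zero-delay update: push the *prediction* into the history now,
        # rather than the outcome many cycles later at resolution.
        self.history = ((self.history << 1) | int(taken)) \
                       & ((1 << self.HIST_BITS) - 1)
        return idx, taken

    def resolve(self, idx, taken):
        # Train the counter when the branch actually resolves; a real
        # design would also repair the history after a misprediction.
        c = self.table.get(idx, 1)
        self.table[idx] = min(3, c + 1) if taken else max(0, c - 1)
```

In a deep pipeline, many branches are predicted between a branch's fetch and its resolution; updating the history speculatively keeps consecutive predictions from all seeing the same outdated history value.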

  • 【Online Publication Submitter】 浙江大学 (Zhejiang University)
  • 【Online Publication Year/Issue】 2014, Issue 06