
片上多核处理器二级Cache结构及资源管理技术研究

L2 Cache Organization and Management for Chip Multiprocessors

【Author】 Yan Peixiang (晏沛湘)

【Supervisor】 Zhang Minxuan (张民选)

【Author Information】 National University of Defense Technology, Electronic Science and Technology, 2012, Ph.D.

【摘要】 The ever-widening speed gap between processors and memory makes it essential to organize and utilize on-chip cache resources effectively, so as to reduce off-chip memory accesses and improve processor performance. With the spread of multi-core processors and advances in semiconductor technology, chips will integrate ever more cores, placing greater pressure on L2 cache design. Mainstream multi-core processors currently adopt shared or private L2 cache organizations based on the LRU replacement policy. However, a purely shared or purely private design cannot balance capacity against access latency: a shared L2 cache uses resources efficiently, but global wire delays slow its accesses, while a private L2 cache achieves fast access through data replication, but its limited capacity causes more misses. Moreover, influenced by set associativity, application behavior, and other factors, the performance gap between LRU and the theoretically optimal replacement policy keeps growing. To address these problems, this thesis studies the organization and management of L2 cache resources in chip multiprocessors, proposes a variable-associativity hybrid cache model based on a global replacement policy, investigates dynamic capacity partitioning and set-balancing mechanisms driven by changing memory-access demands, and provides low-power and scalability optimizations. The contributions are as follows:

1. A CMP-oriented variable-associativity hybrid cache, CMP-VH. CMP-VH organizes the L2 cache as an optimized private/shared structure: the tag arrays are private, while the data array is partly private and partly shared. CMP-VH performs global replacement based on block reuse information and supports inter-core capacity partitioning to adapt to the varying memory-access demands of different applications. On an 8-core CMP platform built with the Simics simulator, simulations of the SPLASH parallel workloads show that, for the same total capacity, the average L2 miss rate under CMP-VH is close to that of a conventional shared cache and about 23.37% lower than that of a conventional private cache.

2. A capacity-partitioning technique based on dynamic allocation of data entries, VH-PAD. VH-PAD allocates resources according to each core's capacity demand in three stages: initialization, repartitioning, and rollback. The initialization stage grants every core the same amount of resources; the repartitioning stage estimates capacity demand from the saturation of the currently allocated capacity and uses it to guide partitioning; the rollback stage decides, based on currently occupied capacity, whether to undo the repartitioning. VH-PAD adjusts inter-core capacity by controlling the dynamic allocation of shared data entries. Experiments with the PARSEC benchmarks on a Simics-based platform show that, for the same total capacity, VH-PAD lowers the average L2 miss rate by about 41.33% compared with a conventional private cache.

3. A probability-controlled capacity-partitioning technique, VH-PS. VH-PS allocates resources according to each core's resource utilization, using probabilities to control how strongly each core competes for shared resources and thereby partitioning capacity among cores. VH-PS provides a performance-monitoring mechanism that estimates the miss-rate gain each core would obtain from additional capacity and, on that basis, assigns each core a probability level for using shared resources. Raising the level of cores with large gains and lowering the level of cores with small gains reduces the overall miss rate. The probability control in VH-PS can be implemented with pseudo-random numbers or with PSR ratios. Experiments with the PARSEC benchmarks on a Simics-based platform show that, for the same total capacity and relative to a conventional private cache, VH-PS implemented with pseudo-random numbers lowers the average L2 miss rate by about 46.78%, and the PSR-based implementation lowers it by about 43.05%.

4. Set-balancing techniques based on tag-set saturation. Because the private tag arrays in CMP-VH limit the maximum set associativity and the maximum usable capacity, this thesis proposes intra-core and inter-core tag set-balancing mechanisms. Replacements in CMP-VH are divided into tag-entry-induced and data-entry-induced replacements; the number of tag-entry-induced replacements measures each set's saturation, and highly saturated sets are allowed to use resources from less saturated sets within the same core or in other cores. Experiments with the PARSEC benchmarks on a Simics-based platform show that, for the same total capacity and relative to the baseline CMP-VH, intra-core set balancing lowers the average L2 miss rate by about 11.04% and inter-core set balancing lowers it by about 18.94%.

5. A heterogeneous variable-associativity cache, the HV-Way Cache, and a heterogeneous variable-associativity hybrid cache model, CMP-VHR. The HV-Way Cache optimizes the V-Way Cache structure with a heterogeneous tag array to reduce area and power overheads. To meet the low-power and scalability requirements of future many-core processors, the heterogeneous variable-associativity hybrid cache model is built from heterogeneous tag arrays and a reconfigurable data array, supporting power optimization according to application demands. Experimental results show that the HV-Way Cache achieves large reductions in area and power overhead at little performance cost.
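To make the replacement policy concrete, here is a minimal sketch of reuse-count-based global replacement of the kind CMP-VH uses in place of per-set LRU: on a miss with a full data array, the victim is the block with the fewest reuses across the whole array, not the LRU block of one set. All class and method names are illustrative, not the thesis's actual implementation, and the sketch ignores tags-vs-data separation and coherence.

```python
# Sketch: global replacement driven by per-block reuse counts.
# Names are hypothetical; this is not the thesis's implementation.

class Block:
    def __init__(self, tag):
        self.tag = tag
        self.reuse = 0          # incremented on every hit

class GlobalDataArray:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = {}        # tag -> Block

    def access(self, tag):
        """Return True on a hit. On a miss, insert the block,
        evicting the globally least-reused block when full."""
        if tag in self.blocks:
            self.blocks[tag].reuse += 1
            return True
        if len(self.blocks) >= self.capacity:
            victim = min(self.blocks.values(), key=lambda b: b.reuse)
            del self.blocks[victim.tag]
        self.blocks[tag] = Block(tag)
        return False
```

Unlike LRU, a block that has been hit many times survives even when it is the oldest in its set; this is one way to narrow the gap to the optimal policy that the abstract describes.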

【Abstract】 With the ever-widening processor-memory speed gap, it is essential to efficiently organize and utilize on-chip cache resources, as system performance can be improved by reducing memory accesses. Chip multiprocessors (CMPs) are very popular nowadays. The number of cores integrated in a single chip increases with the advance of semiconductor technology, posing increasing pressure on L2 cache design. Most mainstream CMPs adopt a shared or private L2 cache based on the LRU replacement strategy. However, neither a shared nor a private L2 cache can provide both large capacity and fast access. A shared L2 cache can maximize on-chip cache capacity, but its average access latency is heavily influenced by wire delays. A private L2 cache has the advantage of low access latency, but incurs more off-chip accesses than a shared L2 cache. Besides, due to set associativity and the diversity of applications, the performance gap between the LRU replacement strategy and the optimal replacement strategy is widening. To address these problems, this thesis studies the organization and management of L2 cache resources for CMPs, proposes a CMP-oriented variable-way hybrid cache based on a global replacement strategy, exploits dynamic capacity-partitioning and set-balancing mechanisms based on run-time access demands, and provides schemes for low-power and scalable design. The innovations of this thesis are as follows:

Firstly, we propose a CMP-oriented Variable-way Hybrid cache (CMP-VH). CMP-VH turns the L2 cache into an optimized private/shared organization: the tag array is private, while the data array is organized as partly private and partly shared. Adopting a global replacement strategy based on the reuse counts of cache blocks, CMP-VH provides capacity-partitioning mechanisms among cores that adapt to variable cache-access demands. Using the Simics simulator to build an 8-core CMP platform, simulations of the SPLASH parallel workloads show that, for the same total capacity, CMP-VH achieves an average L2 miss rate comparable to a conventional shared cache organization and reduces the average L2 miss rate by 23.37% compared with a conventional private cache organization.

Secondly, we propose a capacity-partitioning mechanism based on dynamic allocation of data entries (VH-PAD). VH-PAD assigns each core a certain amount of resources according to its capacity demand, and consists of three stages: initialization, repartitioning, and rollback. In the initialization stage, cache resources are allocated equally among cores. In the repartitioning stage, a new capacity allocation is assigned to each core according to the capacity demand predicted from the utilization of its current allocation. The rollback stage determines whether to cancel operations taken in the repartitioning stage. Capacity partitioning is accomplished by controlling the allocation of shared data resources in VH-PAD. Running PARSEC benchmark programs on a Simics platform, our experiments show that, for the same total capacity, VH-PAD reduces the average L2 miss rate by 41.33% compared with a conventional private cache.

Thirdly, we propose a probability-controlled capacity-partitioning mechanism (VH-PS). VH-PS allocates resources among cores according to capacity utilization, and uses probabilities to control each core's competition for shared resources, thereby accomplishing capacity partitioning. Providing a monitoring scheme to evaluate the marginal gain from extra assigned resources, VH-PS assigns each core a corresponding probability level for using the shared resources. The total miss rate is reduced by upgrading the probabilities of cores with large marginal gains and downgrading the probabilities of cores with small marginal gains. Probability-controlled partitioning is implemented with pseudo-random-number generators or with a PSR scheme. Running PARSEC benchmark programs on a Simics platform, our experiments show that, for the same total capacity and relative to a conventional private cache, VH-PS with pseudo-random-number generators reduces the average L2 miss rate by 46.78%, and VH-PS with a PSR scheme reduces it by 43.05%.

Fourthly, we propose two set-balancing mechanisms based on the saturation levels of tag sets. To relieve the limits that the private tag array places on the maximum set associativity and the upper bound of usable capacity, intra-core and inter-core set-balancing mechanisms are proposed. Classifying replacements into tag-induced and data-induced replacements, we use the number of tag-induced replacements to evaluate set saturation levels, and allow over-saturated sets to use resources from other sets of the same core, or from the corresponding sets of other cores, that are not over-saturated. Running PARSEC benchmark programs on a Simics platform, our experiments show that, for the same total capacity and with CMP-VH as the baseline organization, intra-core set balancing reduces the average L2 miss rate by 11.04% and inter-core set balancing reduces it by 18.94%.

Fifthly, we propose a heterogeneous variable-way cache (HV-Way cache) and a heterogeneous variable-way hybrid cache (CMP-VHR). Adopting a heterogeneous tag array to optimize the V-Way cache, the HV-Way cache reduces area and energy overheads. Besides, to meet the low-power and scalability demands of future many-core processors, CMP-VHR, composed of private heterogeneous tag arrays and a configurable data array, is also proposed; CMP-VHR supports power optimization according to application demands. Experimental results show that the HV-Way cache can greatly reduce area and power overhead at the expense of a little performance loss.
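The probability-controlled partitioning idea can be illustrated with a small sketch: each core holds a probability level, a monitor-estimated marginal miss-rate gain promotes or demotes that level, and a pseudo-random draw against the level decides whether the core wins a contested shared entry. The level table, thresholds, and function names below are invented for the example; the thesis's actual VH-PS parameters are not specified in the abstract.

```python
import random

# Hypothetical level -> win-probability table (not from the thesis).
LEVELS = {0: 0.25, 1: 0.5, 2: 0.75, 3: 1.0}

def wins_allocation(level, rng=random):
    """A core at `level` claims a contested shared entry with the
    corresponding probability (pseudo-random-number implementation)."""
    return rng.random() < LEVELS[level]

def adjust_level(level, marginal_gain, promote_th=0.05, demote_th=0.01):
    """Raise the level of a core whose monitored miss-rate gain from
    extra capacity is large; lower it when the gain is small.
    Thresholds here are illustrative placeholders."""
    if marginal_gain > promote_th and level < 3:
        return level + 1
    if marginal_gain < demote_th and level > 0:
        return level - 1
    return level
```

A core at the top level always wins the draw, so cores that profit most from extra capacity gradually crowd out those that profit least, which is the mechanism by which the total miss rate is driven down.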
