CC-NUMA系统存储体系结构关键技术研究

Research on Key Technologies of CC-NUMA Based Memory Architecture

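【Author】 Pan Guoteng (潘国腾)

【Advisor】 Xie Lunguo (谢伦国)

【Author information】 National University of Defense Technology, Computer Science and Technology, 2007, PhD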

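【Abstract (translated from the Chinese original)】 Distributed shared memory (DSM) systems provide a single, system-wide address space for programming. They combine the strengths of traditional shared-memory multiprocessors and distributed-memory systems, offering both good programmability and high scalability, and have become the preferred hardware platform in research on large-scale parallel high-performance computers. CC-NUMA is an effective way to implement a DSM system, but maintaining cache coherence efficiently is one of the main difficulties in building one: it determines the correctness of the system and also has a major effect on its performance. Current research on cache coherence, in China and abroad, concentrates on two areas: the scalability of directory structures and the efficient implementation of protocols. Because the processors in a CC-NUMA system communicate through shared memory, memory access latency strongly affects system performance, and remote memory access latency does so in particular when the number of processors is very large. Raising memory bandwidth, lowering memory access latency, and narrowing the gap between remote and local access latency are therefore the key to making a CC-NUMA system usable and practical. To address these problems, this dissertation focuses on how to realize an efficient CC-NUMA memory architecture and studies the following key technologies: the scalability of directory-based cache coherence protocols, optimization techniques for directory protocols, raising memory bandwidth and lowering memory access latency, and a simulation and verification environment for large-scale CC-NUMA systems. The main work and contributions are as follows. 1. A scalable CC-NUMA architecture model based on SMP nodes, called SCDSM, is proposed, and an efficient, deadlock-free, directory-based cache coherence protocol is implemented on it. In the protocol implementation, a forced write-back (FWB) method is proposed for the case in which a shared read hits a dirty line on the bus and the cache state and directory state become inconsistent; this resolves the difficulty of making the directory protocol compatible with the snooping protocol. A local memory request direct forwarding (LMRDF) technique is also proposed, which removes the request delay that an SMP-based CC-NUMA system incurs while waiting for the bus snoop result; it improves SCDSM system performance by 10%-15%. 2. A Markov model is built for the distribution of shared data in multiprocessor systems, and the distribution pattern of shared data is analyzed. The conclusion is that the average number of cache copies of shared data in a CC-NUMA system is generally small. This theoretical result is a useful guide for designing more effective directory organizations. 3. Because directory storage overhead limits the scalability of cache coherence protocols, a two-level directory organization based on a directory cache is proposed. It effectively reduces the storage needed for directory information and gives the protocol implementation good scalability. The two-level directory model was simulated and its performance verified; the results show that the run times of the parallel benchmarks all decrease to varying degrees. 4. The memory wall is the bottleneck that prevents further gains in system performance, and reducing memory access latency is a major challenge in memory system design. Four memory access scheduling algorithms with different constraint strengths are proposed and their performance is analyzed. The analysis shows that the greedy heuristic scheduling algorithm with bank address conflict resolution and a starvation prevention mechanism offers the best cost-performance ratio. A DDR2 memory controller that uses this algorithm was implemented. 5. To simulate and verify the correctness of complex and large-scale systems more effectively, a multi-node simulation and verification platform for distributed environments, CoSim, is proposed. To support the simulation test tasks and the functional verification of the cache coherence protocol, the CMCV model is proposed. The SCDSM system, written in Verilog, was fully functionally verified on the CoSim platform. In addition, a QSCV model whose behavior resembles the Stream Copy program was built in Verilog and used to evaluate and analyze the LMRDF technique and the memory bandwidth of the SCDSM system. All of these key technologies and solutions have been applied in real engineering projects, and they have theoretical significance and reference value for further research on efficient CC-NUMA memory architectures.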

【Abstract】 A distributed shared memory (DSM) system provides a global shared address space and combines the advantages of shared-memory multiprocessors and distributed-memory systems. With good programmability and scalability, DSM has become the preferred hardware platform for massively parallel high-performance computer systems. CC-NUMA is an effective mechanism for implementing DSM systems. Maintaining cache coherence, which determines system correctness and also greatly affects system performance, is one of the primary difficulties in implementing CC-NUMA systems. Current research focuses on the scalability of directory structures and on high-performance implementations of directory-based cache coherence protocols. Processors in CC-NUMA systems communicate with each other through shared memory, so memory access latency, and remote access latency in particular when the number of processors is large, dramatically affects system performance. The key to an effective CC-NUMA implementation lies in improving memory bandwidth, shortening memory access latency, and reducing the gap between local and remote memory access latency. This dissertation is devoted to implementing an effective memory architecture for CC-NUMA systems. It studies the scalability of directory-based cache coherence, the optimization of directory protocols, techniques for improving memory bandwidth and reducing access latency, and a simulation and verification platform for CC-NUMA systems. The main work and contributions of the dissertation are as follows: 1. A new scalable CC-NUMA architecture based on SMP nodes, called SCDSM, is proposed. A deadlock-free, high-performance, directory-based cache coherence protocol is implemented on SCDSM. An FWB strategy is proposed to resolve the inconsistency between cache state and directory state that arises when a read request hits a dirty cache block on the bus of an SMP node. The strategy solves the difficult problem of making directory protocols compatible with snooping protocols. An LMRDF strategy is proposed to reduce the request delay caused by waiting for the snoop result on the bus in an SMP-based CC-NUMA system. This technique improves the performance of the SCDSM system by 10%-15%. 2. A Markov chain model is built for the distribution of shared data in CC-NUMA systems, and the distribution pattern of shared data is analyzed with this model. The analysis concludes that the average number of cache copies of shared data in CC-NUMA systems is generally small. This theoretical result can help in proposing more effective directory organization methods. 3. A two-level directory organization scheme based on a directory cache is proposed to address the directory memory overhead that limits the scalability of cache coherence protocols. The scheme reduces the memory overhead of directory information and improves the scalability of the protocol. Simulation and analysis show that the execution times of a number of parallel benchmarks are shortened to various degrees. 4. The memory wall is the bottleneck of system performance, and reducing memory access latency is a major challenge in memory system design. Four memory scheduling algorithms with different constraint strengths are presented. Simulation and analysis show that the greedy memory scheduling algorithm with bank address conflict resolution and starvation avoidance offers the best cost-performance trade-off. A DDR2 memory controller using this algorithm is implemented in hardware. 5. A distributed multi-node simulation and verification platform named CoSim is proposed to verify the correctness of complex or large systems effectively. To assist simulation tests and the verification of the cache coherence protocol, the CMCV model is proposed. A QSCV model, written in the Verilog hardware description language and similar in behavior to Stream Copy, is built to evaluate the LMRDF technique and the memory bandwidth of the SCDSM system. In summary, the dissertation provides feasible solutions to a number of challenging problems in CC-NUMA systems, and these solutions have been applied in engineering practice. The research lays useful groundwork for further research and engineering on CC-NUMA based memory architecture.
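The abstract names a greedy memory scheduling algorithm with bank conflict resolution and starvation avoidance but gives no details, so the following is only a minimal sketch of that general idea and not the dissertation's algorithm. Everything in it is an assumption made for illustration: the `Request` fields, the `schedule` function, the timing parameters `hit_cycles` and `miss_cycles`, the `max_wait` threshold, and the row-hit preference, which is borrowed from the widely known first-ready, first-come-first-served style of DRAM scheduling.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical memory request; field names are illustrative only."""
    req_id: int
    bank: int
    row: int
    arrival: int  # cycle at which the request enters the queue


def schedule(requests, num_banks=2, hit_cycles=1, miss_cycles=3, max_wait=4):
    """Issue at most one request per cycle; return a list of (cycle, req_id).

    Greedy rule (assumed): among requests whose bank is idle, prefer one
    that hits the currently open row, breaking ties by age.
    Bank conflicts: requests to a busy bank are skipped, not stalled on.
    Starvation guard (assumed): once the oldest request has waited
    max_wait cycles, nothing else may issue before it.
    """
    pending = sorted(requests, key=lambda r: (r.arrival, r.req_id))
    busy_until = [0] * num_banks    # first cycle at which each bank is idle
    open_row = [None] * num_banks   # row currently open in each bank
    issued = []
    cycle = 0
    while pending:
        arrived = [r for r in pending if r.arrival <= cycle]
        ready = [r for r in arrived if busy_until[r.bank] <= cycle]
        choice = None
        if arrived and cycle - arrived[0].arrival >= max_wait:
            # Starvation guard: only the oldest request is eligible.
            if arrived[0] in ready:
                choice = arrived[0]
        elif ready:
            hits = [r for r in ready if open_row[r.bank] == r.row]
            choice = (hits or ready)[0]  # oldest row hit, else oldest ready
        if choice is not None:
            hit = open_row[choice.bank] == choice.row
            busy_until[choice.bank] = cycle + (hit_cycles if hit else miss_cycles)
            open_row[choice.bank] = choice.row
            issued.append((cycle, choice.req_id))
            pending.remove(choice)
        cycle += 1
    return issued
```

In this toy model, a request to a different row of a busy bank can be overtaken repeatedly by younger row hits; lowering `max_wait` bounds how long that can go on, at the cost of some extra row misses. The cycle counts are placeholders and do not correspond to real DDR2 timing parameters.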
