Research and Implementation of the Cache Coherence Protocol for the Large Scale System of the SMP-based CC-NUMA Category

Original Chinese title: 基于SMP的CC-NUMA类大规模系统中Cache一致性协议研究与实现

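【Author】 Pang Zhengbin (庞征斌)

【Advisor】 Zhou Xingming (周兴铭)

【Author information】 National University of Defense Technology, Computer Science and Technology, 2007, PhD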

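【Abstract (translated from the Chinese original)】 As demand for high-performance computing grows, ever higher requirements are placed on the architecture and implementation of high-performance computers. Improving programmability, usability, and overall system efficiency has become the design goal of current high-performance computers. Distributed shared-memory multiprocessor systems, with their convenient programming environment and good scalability, have become the mainstream of high-performance computer architecture, and the CC-NUMA (Cache Coherent Non-Uniform Memory Access) structure has become an important architecture for achieving high productivity in high-performance computing. Building a large-scale CC-NUMA system is constrained by many factors. Among them, the cache coherence protocol is the key factor limiting system scalability, and it also has a significant effect on system performance. Because cache coherence is complex to implement, most current CC-NUMA systems are small and have limited scalability. Many high-performance computing platforms build clusters out of CC-NUMA machines, but this seriously harms the programmability of the large system. It is therefore necessary to design a scalable, simple, and efficient cache coherence protocol for large-scale CC-NUMA systems. The main work of this thesis addresses the requirements of SCCMP (Scalable Cache Coherence Multi-Processors), a new large-scale CC-NUMA system built from SMP (Symmetric Multi-Processors) nodes. It analyzes the system's architectural characteristics, designs a scalable, low-complexity, efficient cache coherence protocol, designs a scalable directory structure, implements and optimizes the directory accesses closely tied to coherence processing, provides efficient message-passing communication support with cache coherence, and finally verifies the correctness and efficiency of the protocol. The specific work and contributions are as follows:

(1) Studied the hierarchical composition and structural characteristics of SCCMP, and designed and implemented a scalable, efficient hybrid cache coherence protocol, HYSCC (HYbrid Scalable Cache Coherence). HYSCC is realized as a scalable directory protocol that incorporates features of snooping protocols. It effectively supports the coherence requirements of the two different levels inside an SCCMP system, reduces protocol design complexity, and keeps the protocol simple and efficient. HYSCC achieves its own efficiency through multi-virtual-channel network transport, non-blocking concurrent processing, and a reduced set of protocol message types. HYSCC adds a class of commands and protocol handling dedicated to dirty-data sharing inside an SMP node, which reduces the protocol-processing complexity caused by write-back of dirty copies due to sharing within an SMP node and greatly simplifies protocol design inside the SCCMP node controller.

(2) By analyzing how distributed shared I/O accesses affect the implementation of cache coherence in SCCMP, designed and implemented in HYSCC the coherence command types and protocol flows that support accesses with I/O attributes, designed and implemented a hardware mechanism that maintains the consistency of I/O access data, and efficiently realized concurrent access to globally shared I/O.

(3) Studied scalable implementations of directory structures and designed Dir5NB+CCV, a hybrid representation suited to SCCMP that combines limited pointers (Dir5NB) with a combined coarse vector (CCV, Combined Coarse Vector). This directory structure has the advantages of both pointer and bit-vector representations: it uses the sharing-information format that matches the current degree of sharing, which reasonably reduces directory storage overhead. Through its mixed representation, Dir5NB+CCV reduces the imprecision of sharing information to some extent, reduces superfluous invalidation overhead, and lends itself to high-speed hardware implementation.

(4) To relieve data-access conflicts caused by directory accesses, designed a dual-bank parallel-access memory structure and a dual directory cache access structure to optimize directory access and processing. SCCMP does not use a separate directory memory. Instead, the dual-bank parallel-access memory structure lets accesses to stored data and to the corresponding directory proceed in parallel. To relieve the resulting pressure on memory access, designed and implemented a dual directory cache structure matching the dual-bank memory, introducing a directory cache level and exploiting the locality of program accesses to optimize directory access. Experimental results confirm that the dual-bank parallel-access memory and the dual directory cache structure improve performance substantially.

(5) To support the message-passing programming model efficiently, studied a communication method that effectively combines shared memory and message passing in SCCMP, and proposed a hierarchical coherent message communication model. A message-passing communication interface is provided at the level of the SCCMP node controller, a deadlock-free message communication protocol is implemented, and a hardware-based coherent block transfer mechanism is implemented, supporting efficient message-passing communication.

(6) Completed the logic design and protocol verification of the SCCMP node controller on an FPGA implementation. Applications such as NAS NPB were run on a four-node FPGA prototype system, verifying the correctness of the HYSCC protocol. The verified SCCMP node controller was then implemented as an ASIC, and performance tests were run on a 64-node ASIC prototype system. The results show that NAS NPB and other applications run correctly; that EP, SP, FT, MG, and other applications with high memory-bandwidth demands scale well on the ASIC prototype; that the maximum point-to-point communication bandwidth is above 1.3 GB/s and the maximum bandwidth in application tests is above 1.1 GB/s; and that the hardware coherent block transfer implementation gives the NPB MPI application tests higher performance.

(7) The results of this research apply to large-scale CC-NUMA systems based on SMP super-nodes and have been successfully applied in a key engineering project.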

【Abstract】 As demand for high-performance computing grows, the architecture and implementation of high-performance computers become more challenging. Programmability, usability, and system performance have become the objectives when designing a high-performance computer system. The distributed shared-memory multiprocessor system, which features easy programming and good scalability, has become the main platform for high-performance computing. As a popular scalable approach, CC-NUMA (Cache Coherent Non-Uniform Memory Access) is becoming an important architecture for high productivity in high-performance computing. Many factors affect CC-NUMA system performance, and among them the cache coherence protocol is the key to system scalability. Most existing CC-NUMA computers are small and of limited scale because cache coherence is complex to implement. CC-NUMA clusters are commonly used as high-performance computers, but they offer poor programmability. It is therefore important to design and develop a scalable, efficient cache coherence protocol for large-scale CC-NUMA systems. This thesis studies the efficient implementation of the cache coherence protocol for Scalable Cache Coherence Multi-Processors (SCCMP), a large-scale SMP-based CC-NUMA system. The main work includes designing an efficient, scalable cache coherence protocol suited to the architecture's features, designing and implementing a scalable directory scheme, implementing directory access efficiently, supporting cache-coherent message-passing communication, and validating the protocol. The primary contributions are summarized as follows:

(ⅰ) After analyzing the hierarchy and structural features of the SCCMP system, we designed and implemented the efficient HYbrid Scalable Cache Coherence (HYSCC) protocol. HYSCC meets the needs of the different hierarchy levels in SCCMP and simplifies its own design and implementation by combining the advantages of snooping bus protocols with the characteristics of directory protocols. HYSCC ensures system scalability through our scalable directories. It achieves high efficiency through multiple virtual channels, non-blocking concurrent processing, and a compact set of message types. HYSCC also provides special messages and handling for the case in which dirty data becomes shared among the processors of an SMP node, which reduces the complexity of dirty-data write-back and simplifies protocol design within an SMP node.

(ⅱ) We analyzed the impact of distributed shared I/O accesses on cache coherence and provided special messages and coherence-handling procedures to support cache-coherent accesses with I/O attributes. We also proposed an effective method for processing I/O accesses concurrently and implemented a coherence maintenance mechanism for I/O-attribute data in SCCMP.

(ⅲ) We studied feasible, scalable directory schemes and proposed the Dir5NB+CCV directory scheme for SCCMP. Dir5NB+CCV combines a modified limited-pointer directory scheme (Dir5NB) with the combined coarse vector (CCV) directory scheme, keeping the advantages of both pointer and full-map vector representations. Its hybrid representation reduces directory memory overhead, and by drawing on the strengths of both Dir5NB and CCV it reduces the inaccuracy of sharing information, reduces superfluous invalidations, and suits an efficient hardware implementation.

(ⅳ) We proposed a dual storage module structure and a dual directory cache (DC) structure to relieve access collisions and improve directory performance. SCCMP has no dedicated directory storage; instead, the dual storage module structure allows data and the corresponding directory entry to be accessed concurrently. To relieve the memory access bottleneck, we designed and implemented the dual directory cache structure, which corresponds to the dual storage module structure and introduces a cache level for directory accesses. It optimizes directory access by exploiting program locality and relieves memory access pressure. Experiments show that the dual storage module and directory cache structures improve system performance greatly.

(ⅴ) We studied how to integrate the message-passing communication paradigm with shared memory in SCCMP. We proposed a hierarchical coherent communication model, provided a communication interface in the SCCMP node controller, and implemented a deadlock-free communication protocol and a coherent block data transfer mechanism to support multi-domain MPI communication.

(ⅵ) We designed the SCCMP node controller and implemented an FPGA prototype for validation. The HYSCC protocol was validated on a 4-node FPGA prototype, and an ASIC chip of the SCCMP node controller was fabricated. Experiments were run on a 64-node ASIC system. All tested applications, including the NAS NPB benchmarks, produced correct results. Memory-intensive applications such as EP, SP, FT, and MG scaled well. Communication tests showed a maximum communication bandwidth of more than 1.3 GB/s, and the maximum communication bandwidth achieved by applications exceeded 1.1 GB/s.

(ⅶ) Our results are applicable to large-scale SMP-based CC-NUMA systems and have been successfully used in an important project.
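
The trade-off behind a hybrid pointer and coarse-vector directory can be illustrated with a small sketch. The abstract does not describe how the Dir5NB+CCV entry is actually encoded, so everything below is an assumption made for illustration: the class name `DirectoryEntry`, the group size of 8 nodes per coarse bit, and the overflow rule (switch to the coarse vector when a sixth sharer arrives) are hypothetical. Only the five-pointer limit and the 64-node system size come from the abstract.

```python
from dataclasses import dataclass, field
from typing import List

NUM_NODES = 64     # size of the ASIC prototype mentioned in the abstract
MAX_POINTERS = 5   # the "5" in Dir5NB
GROUP_SIZE = 8     # assumption: nodes covered by one coarse-vector bit


@dataclass
class DirectoryEntry:
    """Hypothetical sharing record for one memory block."""
    pointers: List[int] = field(default_factory=list)  # exact sharer IDs
    coarse_bits: int = 0                                # one bit per node group
    coarse_mode: bool = False

    def add_sharer(self, node: int) -> None:
        if not 0 <= node < NUM_NODES:
            raise ValueError("node id out of range")
        if self.coarse_mode:
            # Imprecise mode: only remember which group the sharer is in.
            self.coarse_bits |= 1 << (node // GROUP_SIZE)
        elif node not in self.pointers:
            if len(self.pointers) < MAX_POINTERS:
                # Precise mode: record the exact node.
                self.pointers.append(node)
            else:
                # Assumed overflow rule: fold all sharers into group bits.
                for n in self.pointers + [node]:
                    self.coarse_bits |= 1 << (n // GROUP_SIZE)
                self.pointers.clear()
                self.coarse_mode = True

    def invalidation_targets(self) -> List[int]:
        """Nodes that must receive an invalidation on a write."""
        if not self.coarse_mode:
            return sorted(self.pointers)
        targets: List[int] = []
        for group in range(NUM_NODES // GROUP_SIZE):
            if (self.coarse_bits >> group) & 1:
                start = group * GROUP_SIZE
                targets.extend(range(start, start + GROUP_SIZE))
        return targets
```

In this sketch, a block shared by nodes 1 to 5 is tracked exactly and a write invalidates five nodes. Adding node 9 as a sixth sharer switches the entry to the coarse vector, and a write then invalidates all 16 nodes in the two affected groups. That gap between real sharers and invalidation targets is the kind of imprecision and superfluous invalidation the abstract says the hybrid scheme reduces.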

  • 【Classification number】 TP302.1
  • 【Times cited】 4
  • 【Downloads】 526