节点文献
超立方体互连网络中的组播算法研究
Multicast Algorithms on Hypercube Interconnection
【作者】 陆松;
【导师】 杨晓东;
【作者基本信息】 国防科学技术大学 , 计算机科学与技术, 2008, 博士
【摘要】 互连网络是构成高性能计算机系统和决定系统通信性能的关键部分。随着VLSI技术的飞速发展以及多核、多线程和片上系统等新技术的应用,高性能计算机系统的规模不断扩大,处理结点的性能持续增长。相比之下,互连网络性能增长速度十分缓慢,已逐渐成为制约系统整体性能发挥的瓶颈。因此,在研发新型互连技术(光互连)的同时,如何提高以组播和归约等操作为基础的聚合通信性能也是当前互连网络研究的热点之一。高性能计算机系统规模的不断扩大,导致互连网络中结点出错的概率也大大增加。若某些处理结点出错而导致解决挑战性应用的高性能计算机系统宕机,造成的计算能力的巨大浪费是不可忍受的。因此,系统应具有降级重组能力,保证正确的处理结点能正常工作。此时,互连网络提供容错通信能力,保证结点间的可靠、高效通信至关重要。超立方体是人们最早研究且仍是目前最重要的互连网络拓扑结构之一,具有正规性、对称性、强容错性、直径短、可嵌入性等诸多优点。并且随着高维度路由器和高速串行传输技术的出现,因高维超立方体网络有巨大的结点容量,其备受诟病的可扩展性在很长一段时间内将不成问题。本文研究了超立方体网络上的组播算法。包括对无错误结点的常规超立方体网络的组播算法研究,和对包含错误结点的超立方体网络上的容错组播算法研究。组播的通信开销是评价组播性能的重要标准之一。我们根据超立方体拓扑结构的规整性和组播目标结点间的局部性特征,提出了一个分簇组播模型。分析表明:簇的度较小且簇内目标结点间局部性特征好,使得簇内通信开销小,从而减小组播通信开销。基于分簇组播模型,我们设计了一个组播树算法和一个组播路径算法。这两个算法对组播消息的路由是完全分布的。和已有的同类组播算法相比,基于分簇的组播树和组播路径算法对组播目标结点间局部性特征的利用率更高,能显著减小组播通信开销。针对目前超立方体网络上容错能力最强的容错模型——局部子立方连通,我们提出了一个结点可达性模型。该可达性模型是完全分布的,不需要集中式的控制。并且,我们证明了:仅使用结点上的可达性信息,任意两个正确结点之间均是可路由的。此外,我们设计了三个算法用于结点间交换、更新可达性信息,一个算法用于检测局部子立方体的连通性。这些算法都是简洁高效且完全分布的,算法的主要操作是与、或等逻辑操作,复杂度低,利于硬件实现。基于可达性模型,我们设计了一个简洁高效的容错路由算法。与局部子立方连通超立方体网络上已有的容错路由算法相比,我们的算法是完全分布的,复杂度更低。并且,算法的主要操作是与、或等逻辑操作,利于硬件实现,有很高的实用价值。基于可达性模型,我们设计了局部子立方连通超立方体网络上的首个容错组播算法。该算法也是简洁高效且完全分布的,复杂度低,主要操作是与、或等逻辑操作,利于硬件实现,有很高的实用价值。
【Abstract】 The interconnection network is one of the most essential subsystem in high performance computing systems. It determines the performance of the inter-node communication. Due to rapid development of VLSI technology and the new technology such as multicore, multithreaded and SoCs, both the size of interconnection network and the performance of processor increase dramatically. However, the performance increase of interconnection network is much slower. It becomes the new bottleneck gradually. Therefore, besides the research of new interconnecting technology such as optical interconnecting, the collective communication based on multicast and reduction becomes the hot spot. With the increase of network size, the possibility of node failure also increases. The high performance computing system should degrade gracefully when node failure occurred. So it’s important guarantee the fault-tolerant, reliable communication between non-faulty nodes.Hypercube is one of the most popular, versatile and efficient topological structures of interconnection networks. With high-radix router and serial transport, the hypercube with large dimension can connect a huge number of nodes. Therefore, the scalability won’t be a shortcoming of hypercube for a long time.This dissertation is the research of multicast algorithms on hypercube interconnection, including the multicast on non-faulty hypercube and the multicast on faulty hypercube.The multicast overhead is one the metric of performance. Based on the locality between multicast destination nodes, we propose the clustering model. Due to the small degree of cluster and nice locality between destination nodes inside cluster, the overhead is small for inner-cluster. It decreases the overhead of multicast traffic.We present a multicast tree algorithm and a multicast path algorithm based on the clustering model, in which message routing can be fully distributed. Compared with existing works, our multicast algorithms are with higher utility of locality and reduce the multicast traffic significantly.The locally subcube-connected hypercube is the fault tolerant model for hypercube, whose capacity is much greater than others. We propose the reachability model for locally subcube-connected hypercube. This model is fully distributed without centric controller. We proofs that it is routable for arbitrary two non-faulty nodes with reachability. We present three algorithms for exchanging, updating locality, and an algorithm for local subcube-connectivity detecting. All these algorithms are efficient and fully distributed. It is suitable for hardware implementation due to that the major operations are bitwise.We present an efficient fault tolerant routing algorithm based on the reachability model. Compared with existing works, our routing algorithm is fully distributed, with lower complexity. It is suitable for hardware implementation due to that the major operations are bitwise.We also present the first fault tolerant multicast algorithm based on the reachability model for locally subcube-connected hvpercube. It is efficient, fully distributed and with low complexity. It is suitable for hardware implementation due to that the major operations are bitwise.
【Key words】 Interconnection network; Hypercube; Multicast; Fault tolerant; Locally subcube-connected; Reachability;