Research on Key Technologies for Intra-Node Multi-CPU and Multi-GPU Collaborative Parallel Rendering

Multi-CPU and Multi-GPU Collaborative Parallel Rendering in Cluster Node

【Author】 Liu Huahai

【Advisor】 Li Sikun

【Author Information】 National University of Defense Technology, Computer Science and Technology, 2012, PhD

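【Abstract, translated from the Chinese】 Parallel rendering separates the rendering pass from the main loop of a unified program execution model, extends it into multiple independent graphics pipelines, and achieves collaborative parallel rendering by dispatching rendering tasks in parallel. It is an effective technical approach to improving rendering performance for large-scale, complex scenes. A parallel rendering system generally consists of multiple distributed rendering nodes, each of which typically uses the CPU as a general-purpose computing unit and the GPU as a graphics coprocessor. In early parallel rendering systems, the data produced by the CPUs in a node could hardly meet the demand of a single GPU, so a node was generally configured with only one GPU. With the development of commercial multi-core processors and graphics hardware, a node in a current parallel rendering system can be configured with multiple CPUs and multiple GPUs. Many studies and applications show that in-depth research on intra-node collaborative parallel rendering, which fully exploits the collaborative computing performance of the multiple CPUs and GPUs in a rendering node, is both an effective way to improve single-machine rendering efficiency and an important foundation for building efficient distributed parallel rendering systems for large-scale, complex scenes. Existing intra-node multi-CPU, multi-GPU parallel rendering techniques do not fully account for the hardware architecture of the rendering node, so systems struggle to exploit the collaborative parallel rendering capability of the node's CPUs and GPUs. With the goal of fully exploiting that capability, this thesis addresses the asymmetric computing and memory-access architecture of the CPUs and GPUs in a rendering node, and studies an intra-node multi-CPU, multi-GPU collaborative parallel rendering model together with methods for optimizing its performance in the sort-last parallel rendering mode. The main work and results are as follows:

(1) Existing intra-node parallel rendering models couple hardware rendering serially with the compositing-and-display stage, which stalls the GPUs. To address this, and with the aim of exploiting the node's multi-core CPUs and improving its multi-GPU parallel rendering capability, a parallel hybrid rendering model for the intra-node multi-core CPU, multi-GPU architecture is proposed. On the one hand, the model separates application event logic from rendering logic, which keeps the system easy to configure and extend. On the other hand, it combines CPU software rendering with GPU hardware rendering to decouple hardware rendering from image compositing, and uses asynchronous DMA transfer to build a three-stage intra-node parallel pipeline of rendering, readback, and compositing, which keeps the system efficient. Theoretical analysis and experiments show that the model is easy to configure and extensible, and that it can greatly improve intra-node parallel rendering performance.

(2) Existing CPU-side image compositing within a node is inefficient and performs a large number of redundant operations. To address this, an efficient intra-node multi-GPU image compositing method accelerated by GPGPU is proposed. The method uses GPGPU computation to generate an index list of active pixels to be composited, which completely avoids redundant CPU-side compositing computation during intra-node multi-GPU image compositing. Theoretical analysis shows that, under ideal load balance, the speedup of the method is the ratio of the percentage of active pixels in the image to the number of GPUs in the node. Experimental results show that, with 4 GPUs in the node and high-resolution images whose active-pixel ratio is 12% to 76%, the method improves compositing performance by a factor of 3 to 5 over the original method.

(3) Existing intra-node image compositing methods based on the CPU-GPU communication model incur high data communication and computation costs. To address this, a compositing strategy based on an intra-node P2P direct communication model is proposed. It avoids a large amount of data exchange between GPU and CPU, and it makes efficient use of the GPU's high-speed on-chip communication bandwidth and its computing power. Based on this strategy, an image compositing method that combines push compositing and pull compositing operations is proposed. It optimizes access to local and remote video memory during multi-GPU image compositing and lays a solid theoretical foundation for efficient parallel image compositing algorithms. In addition, a GPU-side compositing optimization based on bitmap masks is proposed. The method generates a mask bitmap from the active pixels of each image and applies set operations to the mask bitmaps of different GPUs to quickly obtain the mask bitmap of the overlapping region, so that compositing takes place only within the active-pixel region. This effectively reduces both the amount of data transferred and the cost of the per-pixel compositing tests. Experimental results show that the mask-bitmap method improves image compositing efficiency by about 40%.

(4) The rendering pipelines of existing parallel rendering frameworks struggle to exploit the performance of multi-CPU, multi-GPU rendering nodes. To address this, a hierarchical inter-node sort-last parallel rendering framework for multi-CPU, multi-GPU rendering nodes was studied and implemented. The framework organizes its rendering pipeline around hierarchical compositing, dividing the GPUs in the system into an intra-node level and an inter-node level, and selects an efficient compositing communication model for each level according to the topology of its GPU interconnect. It also uses an intra-node inactive-pixel culling algorithm to eliminate redundant compositing and transfer of image data. Experimental results show that the framework effectively avoids transferring inactive pixels between nodes and delivers high image rendering and compositing performance.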
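The active-pixel index list described in item (2) can be illustrated with a minimal CPU-only sketch. It is not the thesis implementation: the thesis generates the list with GPGPU computation, whereas here a plain Python list comprehension stands in for that step. The image layout (flat color and depth lists), the use of an infinite depth to mark inactive pixels, and the names `active_index_list` and `composite` are all assumptions made for illustration.

```python
import math

# Assumption of this sketch: an inactive (background) pixel has infinite depth.
BACKGROUND = math.inf


def active_index_list(depth):
    """Return the indices of pixels that hold rendered geometry."""
    return [i for i, z in enumerate(depth) if z != BACKGROUND]


def composite(images):
    """Depth-composite partial images, visiting only their active pixels.

    Each image is a (color, depth) pair of equal-length flat lists.
    Returns the final color list, the final depth list, and the number
    of per-pixel depth tests that were performed.
    """
    n = len(images[0][1])
    out_color = [0] * n
    out_depth = [BACKGROUND] * n
    tests = 0
    for color, depth in images:
        for i in active_index_list(depth):
            tests += 1
            if depth[i] < out_depth[i]:  # the nearer fragment wins
                out_depth[i] = depth[i]
                out_color[i] = color[i]
    return out_color, out_depth, tests


# Two hypothetical 8-pixel partial images, one per GPU.
INF = BACKGROUND
img_a = ([1, 1, 1, 0, 0, 0, 0, 0], [0.5, 0.5, 0.9, INF, INF, INF, INF, INF])
img_b = ([0, 0, 2, 2, 0, 0, 0, 0], [INF, INF, 0.3, 0.4, INF, INF, INF, INF])

final_color, final_depth, tests = composite([img_a, img_b])
print(final_color)  # [1, 1, 2, 2, 0, 0, 0, 0]
print(tests)        # 5 depth tests instead of 16 for a full scan of both images
```

In this toy case only 5 of the 16 pixels across the two images are active, so the compositing loop touches 5 pixels rather than 16. The saving grows as the active-pixel ratio falls.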

【Abstract】 Parallel rendering separates graphics rendering from the unified basic execution model of a rendering application and extends it into multiple graphics pipelines; rendering tasks are dispatched to the rendering resources and computed in parallel. It is an efficient technical approach to improving rendering performance for large-scale, complex scenes. A parallel rendering system is generally composed of multiple distributed rendering nodes, which normally use the CPU as a general-purpose processor and the GPU as a graphics coprocessor. In early rendering nodes the CPUs could hardly produce enough data to make full use of the GPU shaders, so only one GPU was deployed per node. With the development of COTS multi-core processors and graphics hardware, rendering nodes can now have multiple CPUs and multiple GPUs. Many studies and applications show that research on collaborative parallel rendering that fully uses the computing resources of a multi-CPU, multi-GPU cluster node is not only a technical approach to improving the performance of graphics workstations, but also a foundational 'building block' for composing larger systems capable of rendering very large data.

Existing parallel rendering technology does not fully consider the hardware architecture of the multi-CPU, multi-GPU rendering node, so parallel rendering systems cannot carry out rendering tasks collaboratively in parallel with high performance. To make full use of the computing resources in a node, our research takes into account the non-uniform computing units and memory access architecture of the multi-CPU, multi-GPU rendering node. It focuses on a multi-CPU, multi-GPU collaborative parallel rendering model and on approaches to improving the model's performance when rendering in sort-last mode.

The main research achievements are as follows:

(1) To solve the problem that the composition and display stage is coupled with the hardware rendering stage in existing parallel rendering models for cluster nodes, a novel parallel hybrid rendering model is introduced that makes full use of the multi-core CPUs and multiple GPUs in a cluster node. To ensure easy configuration and good scalability, the model separates graphics rendering from the application's main event loop. By using asynchronous DMA transfer and by decoupling the hardware rendering stage from the composition stage through hybrid software and hardware rendering, a parallel rendering pipeline with rendering, readback, and composition stages is constructed in the node to obtain high rendering performance. Theoretical analysis and experimental results show that the model is easy to configure and scales well, and that it efficiently improves the parallel rendering performance of a multi-CPU, multi-GPU rendering node.

(2) To solve the problem of low efficiency and redundant operations in composition on the CPU, a novel composition method accelerated by GPGPU computing is introduced. The method generates a composition index list of active pixels with GPGPU technology and completely avoids composition operations on inactive pixels on the CPU. Theoretical analysis shows that the speedup of the method equals the ratio of the active-pixel percentage of the image to the number of GPUs deployed in the node. Experimental results show that the method is about 3 to 5 times as fast as the original one when compositing high-resolution images with active-pixel percentages of 12% to 76% in a node with 4 GPUs.

(3) To solve the problem of the high computing and communication cost of composition based on the CPU-GPU communication model, a novel composition method based on a GPU P2P direct communication model is introduced. It not only avoids a large amount of data exchange between GPU and CPU, but also makes full use of the GPU's high-speed memory bandwidth and powerful computing ability. To optimize local and remote GPU memory access efficiency in the implementation of the method, a Push Compositing Operation and a Pull Compositing Operation are presented. A novel bitmap-based composition method is also proposed to reduce data transfer and the per-pixel tests required during composition. It restricts composition to the overlap regions of the GPU images, which are obtained by set operations on the active-pixel lists. Experimental results show that image composition with the bitmap-based method raises efficiency by about 40%.

(4) To solve the problem that the parallel graphics pipelines of existing parallel rendering frameworks cannot make full use of the computing resources in a multi-CPU, multi-GPU rendering node, a novel hierarchical sort-last parallel rendering framework across multi-CPU, multi-GPU rendering nodes is introduced. The framework classifies GPUs as in-node or out-of-node and composites the image in two steps with a hierarchical composition pipeline. In each step, the composition communication model is chosen according to the topology of the GPU interconnect. Inactive pixels are never composited or transmitted, thanks to an inactive-pixel rejection algorithm. Experimental results show that the framework efficiently avoids the transfer of inactive pixels and achieves good rendering and composition performance.
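The bitmap-based composition in item (3) can be sketched as follows. This is an illustrative CPU-only approximation, not the GPU implementation from the thesis. It assumes one mask bit per pixel packed into a Python integer, a bitwise AND as the set operation that finds the overlap region, and an infinite depth value for inactive pixels. The names `make_mask`, `set_bits`, and `composite_pair` are hypothetical.

```python
def make_mask(depth, background=float("inf")):
    """Pack one bit per pixel into an int: bit i is set when pixel i is active."""
    mask = 0
    for i, z in enumerate(depth):
        if z != background:
            mask |= 1 << i
    return mask


def set_bits(mask):
    """Yield the positions of the set bits in ascending order."""
    i = 0
    while mask:
        if mask & 1:
            yield i
        mask >>= 1
        i += 1


def composite_pair(local, remote):
    """Merge `remote` into `local` in place, using masks to limit the work.

    Both arguments are (color, depth) pairs of equal-length flat lists.
    Returns (pixels_copied, pixels_depth_tested).
    """
    l_color, l_depth = local
    r_color, r_depth = remote
    l_mask, r_mask = make_mask(l_depth), make_mask(r_depth)

    overlap = l_mask & r_mask       # both images have geometry: depth test needed
    remote_only = r_mask & ~l_mask  # only the remote image has geometry: plain copy

    for i in set_bits(remote_only):
        l_color[i], l_depth[i] = r_color[i], r_depth[i]
    for i in set_bits(overlap):
        if r_depth[i] < l_depth[i]:  # the nearer fragment wins
            l_color[i], l_depth[i] = r_color[i], r_depth[i]
    return bin(remote_only).count("1"), bin(overlap).count("1")


# Two hypothetical 8-pixel partial images held by two GPUs.
INF = float("inf")
local_img = ([1, 1, 1, 0, 0, 0, 0, 0], [0.5, 0.5, 0.9, INF, INF, INF, INF, INF])
remote_img = ([0, 0, 2, 2, 0, 0, 0, 0], [INF, INF, 0.3, 0.4, INF, INF, INF, INF])

copied, tested = composite_pair(local_img, remote_img)
print(local_img[0])    # [1, 1, 2, 2, 0, 0, 0, 0]
print(copied, tested)  # 1 1
```

Only the single overlapping pixel needs a depth comparison here; the pixel covered by the remote image alone is copied without a test, and pixels inactive in both images are never visited.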
