
Research on High-Efficiency Heterogeneous Parallel Computing Based on CPU+GPU in Image Matching

(Original title: 基于CPU+GPU的影像匹配高效能异构并行计算研究)

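【Author】 Xiao Han (肖汉)

【Advisors】 Zhang Zuxun (张祖勋); Zhang Jianqing (张剑清)

【Author information】 Wuhan University, Photogrammetry and Remote Sensing, 2011, PhD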

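【Abstract (Chinese version, translated)】 The rapid development of multi-core CPUs and graphics processing units (Graphics Processing Unit, GPU) has not only driven fast progress in fields such as image processing, virtual reality, and computer simulation, but has also provided a good platform for cost-effective, energy-efficient general-purpose computing on GPUs beyond graphics. General-purpose GPU computing has therefore become one of the hot research topics in high-performance computing. With continuing advances in sensor technology, the means of acquiring information about the Earth's surface have become increasingly diverse and fast. Faced with diversified data sources and multiplying data volumes, many conventional algorithms struggle to meet the demand for high-speed computation on massive data. The growing programmability and computing capability of modern GPU hardware leave considerable room for accelerating parallelizable algorithms in photogrammetry and remote sensing. This dissertation analyzes in detail several problems in GPU-based massively parallel computing for image matching and proposes corresponding solutions. The work is summarized as follows:

(1) By studying the parallel processing on the GPU of four algorithms related to image matching in photogrammetry and remote sensing, a common image processing solution based on a heterogeneous many-core CPU+GPU architecture is proposed, and design patterns for massively parallel GPU computing in image processing are explored. A general GPU-based parallel solution for image processing requires a prior assessment of the expected GPU speedup in terms of data precision, latency, and computational load. Algorithm design and optimization must also apply parallel computing methods such as functional and data decomposition and thread mapping, together with optimization strategies such as memory access optimization, communication optimization, and instruction stream optimization. The design and performance tuning of such a solution are tied to the GPU architecture and to the characteristics of the problem being solved; reaching the desired performance usually requires weighing multiple factors together and repeated experimentation. Addressing the differences between GPU and CPU, the dissertation analyzes and discusses the GPU acceleration principle and the architecture and characteristics of the relatively mature general-purpose computing model Compute Unified Device Architecture (CUDA).

(2) A multi-GPU accelerated parallel algorithm for Wallis transform image enhancement is proposed. Exploiting the GPU's computing power, a fast Wallis image filtering algorithm is implemented on a personal computer (PC) with the CUDA parallel computing architecture. The work covers task decomposition on the GPU and a method for decomposing the large-scale computation kernel, and uses shared memory together with global memory to accelerate the algorithm. Shared memory within a thread block is used to handle synchronization among the threads working on the same computing subspace. Comparing CPU and GPU times for the Wallis transform, the experiments show that the parallel algorithm raises computing speed by two orders of magnitude. The method has good real-time performance and greatly reduces the processing time of image enhancement.

(3) A multi-device control parallel algorithm for GPU-based Harris corner detection is studied. Many threads are used to convert the time-consuming Gaussian convolution smoothing step into single-instruction, multiple-thread (Single Instruction Multiple Thread, SIMT) form, and the whole corner detection process is carried out in CUDA using the GPU's shared memory, constant memory, and page-locked (pinned) host memory. The experiments show that the multi-GPU Harris corner detection algorithm achieves hardware acceleration and runs nearly 60 times faster than the Harris corner detection algorithm on the CPU.

(4) A fast parallel algorithm for correlation coefficient image matching based on the CUDA architecture is proposed; it performs high-performance parallel computing in SIMT mode. Designed around the GPU's parallel structure and hardware characteristics, the algorithm applies three acceleration techniques (execution configuration, high-speed storage, and global storage), which optimize the data storage structure and improve data access efficiency. The experiments show that the parallel algorithm makes full use of the GPU's parallel processing capability: it is nearly 20 times faster than the CPU implementation and attains the highest multiprocessor warp occupancy.

(5) A parallel algorithm for scale-invariant feature transform (Scale Invariant Feature Transform, SIFT) feature matching on the many-core CPU+GPU architecture is studied; it optimizes the data storage structure and improves data access efficiency. The experiments show that, compared with a serial CPU implementation of SIFT feature matching, the CUDA implementation achieves a speedup of more than 27 times, greatly improving the real-time performance of SIFT feature matching in practical applications.

(6) The integration of a CPU+GPU image matching system is studied. It comprises a Wallis-Harris-correlation coefficient (WHR) image matching system and a Wallis-SIFT (WS) image matching system, each accelerated by a single GPU or by multiple GPUs. The experiments show that the GPU-accelerated WHR system achieves an overall speedup of up to 37 times over the CPU implementation, and the GPU-accelerated WS system an overall speedup of up to 39 times.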
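Item (2) concerns the Wallis transform, but the abstract does not state its formula. As an illustration only, one commonly cited form of the Wallis filter is written out below; whether the dissertation uses exactly this variant is an assumption. Here g is the input gray value, f the enhanced output, m_g and s_g the mean and standard deviation of g in a neighborhood of the pixel, m_f and s_f the target mean and standard deviation, c in [0, 1] a contrast expansion constant, and b in [0, 1] a brightness coefficient.

```latex
% One commonly cited form of the Wallis filter (assumed variant, not taken from the abstract)
f(x,y) = \bigl[g(x,y) - m_g\bigr]\,
         \frac{c\,s_f}{c\,s_g + (1 - c)\,s_f}
         + b\,m_f + (1 - b)\,m_g
```

In this form each output pixel depends only on the input pixel and on the local statistics of its neighborhood, so pixels can be processed independently of one another, which is the property a data-parallel implementation relies on.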

【Abstract】 The rapid advance of multi-core CPUs and Graphics Processing Units (GPUs) has not only driven progress in related applied technologies such as image processing, virtual reality, and computer simulation, but has also provided a platform for low-power general-purpose computing with a good price/performance ratio beyond graphics processing. General-purpose computing on GPUs has therefore become a very active research topic in high-performance computing. With the continuous development of sensor technology, the means of obtaining information about the Earth's surface have become increasingly diverse and fast. Faced with diverse data sources and multiplying data volumes, many conventional algorithms cannot meet the challenge of high-speed computation on large-scale data. The increasing programmability and computational power of the GPUs in modern graphics hardware offer great scope for accelerating photogrammetry and remote sensing algorithms that can be parallelized. This dissertation analyzes in detail several issues in GPU-based massively parallel computing for image matching and proposes corresponding solutions. The specific work is outlined below.

(1) By studying the parallel processing on the GPU of four algorithms associated with image matching in photogrammetry and remote sensing, a common image processing scheme based on a heterogeneous many-core architecture consisting of CPU and GPU is given, and design patterns for GPU-based massively parallel computing in image processing are explored. General GPU-based parallel schemes for image processing need to be pre-evaluated in terms of data accuracy, latency, and computational load. In addition, algorithm design and optimization should adopt parallel computing methods such as function and data partitioning and thread mapping, together with optimization strategies such as memory access optimization, communication optimization, and instruction stream optimization. In designing and optimizing general GPU-based schemes for image processing, factors such as the GPU architecture and the characteristics of the problem being solved must be considered as a whole; the desired performance is usually reached only through repeated trial. With regard to the differences between GPU and CPU, the acceleration principles of the GPU are analyzed, and the general-purpose computing model of the relatively mature Compute Unified Device Architecture (CUDA) framework and its characteristics are discussed.

(2) A parallel image enhancement algorithm for the Wallis transform based on multi-GPU acceleration is proposed. Using the computing power of the GPU and the CUDA parallel computing architecture, a fast Wallis image filtering algorithm is implemented on a personal computer. A method for large-scale thread division is put forward along with the task division on the GPU. The algorithm is accelerated through the use of shared memory and coalesced global memory access. Threads computing the same subspace are synchronized by means of shared memory within the thread block. The computing times of the GPU and CPU implementations of the Wallis image transform are compared. The experimental results show that the parallel Wallis transform algorithm achieves a speedup of two orders of magnitude. The method has good real-time performance; it accelerates the image enhancement process and reduces computing time significantly.

(3) A multi-device control parallel algorithm for GPU-based Harris corner detection is presented, in which the time-consuming Gaussian convolution filtering step of the corner detection process is executed by many parallel threads in Single Instruction Multiple Thread (SIMT) mode. The implementation of this parallel algorithm in CUDA, using the GPU's shared memory, constant memory, and pinned host memory, is detailed. The experiments show that the multi-GPU parallel Harris corner detection algorithm successfully achieves hardware acceleration and is nearly 60 times faster than the Harris corner detection algorithm implemented on the CPU.

(4) A fast parallel algorithm for correlation coefficient image matching based on the CUDA architecture is presented. The algorithm performs high-performance parallel computing in SIMT mode. On the basis of the parallel architecture and hardware characteristics of the GPU, the parallel algorithm introduces three acceleration techniques (execution configuration, high-speed storage, and global storage), which optimize the data storage structure and improve data access efficiency. The experimental results show that the parallel algorithm takes full advantage of the GPU's parallel processing capability and attains the highest multiprocessor warp occupancy; its processing speed is nearly 20 times that of the CPU-based implementation.

(5) A parallel algorithm for Scale Invariant Feature Transform (SIFT) feature matching on the many-core architecture of CPU and GPU is proposed, which optimizes the data storage structure and improves data access efficiency. Experimental results show that the CUDA implementation achieves a speedup of more than 27 times over a serial CPU implementation of SIFT feature matching. With the GPU, the real-time processing ability of the SIFT feature matching algorithm in practical applications is greatly improved.

(6) The integration of a CPU and GPU-based image matching system is studied. It includes a Wallis-Harris-correlation coefficient (WHR) image matching system and a Wallis-SIFT (WS) image matching system, each accelerated by a single GPU or by multiple GPUs. Experimental results show that the GPU-accelerated WHR image matching system achieves a speedup of up to 37 times over the CPU version, and the GPU-accelerated WS image matching system achieves a speedup of up to 39 times over the CPU version.
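Item (4) names correlation coefficient matching without giving the computation. The sketch below is a plain serial reference in Python that uses the standard definition of the correlation coefficient between two equal-size gray-value windows; it is not the dissertation's CUDA implementation. The function names `correlation_coefficient` and `best_match` and the exhaustive search strategy are assumptions made for illustration.

```python
import math


def correlation_coefficient(target, search):
    """Correlation coefficient of two equal-size windows (lists of rows)."""
    if len(target) != len(search) or any(
        len(a) != len(b) for a, b in zip(target, search)
    ):
        raise ValueError("windows must have the same shape")
    t = [v for row in target for v in row]
    s = [v for row in search for v in row]
    n = len(t)
    mean_t = sum(t) / n
    mean_s = sum(s) / n
    # Covariance and variances (unnormalized; the 1/n factors cancel).
    cov = sum((a - mean_t) * (b - mean_s) for a, b in zip(t, s))
    var_t = sum((a - mean_t) ** 2 for a in t)
    var_s = sum((b - mean_s) ** 2 for b in s)
    denom = math.sqrt(var_t * var_s)
    # A constant window has zero variance; report no correlation.
    return cov / denom if denom > 0 else 0.0


def best_match(target, image):
    """Slide target over image; return (rho, row, col) of the best window."""
    th, tw = len(target), len(target[0])
    best = (-2.0, -1, -1)  # rho is always >= -1, so any window beats this
    for r in range(len(image) - th + 1):
        for c in range(len(image[0]) - tw + 1):
            window = [row[c:c + tw] for row in image[r:r + th]]
            rho = correlation_coefficient(target, window)
            if rho > best[0]:
                best = (rho, r, c)
    return best
```

Every candidate position in `best_match` is evaluated independently of the others. A parallel version could plausibly assign one candidate position to each thread, though the abstract does not say how the dissertation maps the work.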

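  • 【Online publication contributor】 Wuhan University
  • 【Online publication issue】 2012, Issue 04
  • 【Classification codes】 TP391.41; TP338.6
  • 【Citation count】 27
  • 【Download count】 2734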