面向计算密集型嵌入式应用的VLIW编译优化技术研究

Research on VLIW Compile Optimization Technique for Compute-intensive Embedded Application

【Author】 Guan Maolin (管茂林)

【Advisor】 Zhang Chunyuan (张春元)

【Author information】 National University of Defense Technology, Computer Science and Technology, 2012, PhD

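【Abstract, translated from the Chinese】 As science and technology advance, applications demand ever more computation. Unlike traditional desktop applications, compute-intensive applications are becoming the main workload of microprocessors in many fields, such as scientific research, defense, business, and entertainment, and they are attracting growing attention. High-end embedded computing, driven by strong application demand, has been developing very rapidly. The demands of embedded applications for computing performance and power efficiency already exceed the capabilities of current embedded processors. As an effective way to exploit parallelism, VLIW technology still plays a very important role in current microprocessor design. VLIW can reduce hardware complexity, raise chip frequency, and lower power consumption, but it also poses a severe challenge to the compiler: processor performance depends more heavily on compiler performance. As VLIW processor architectures keep evolving and application domains keep expanding, the question of how a compiler can make full use of an architecture's performance and power advantages and exploit as much instruction-level parallelism as possible raises different problems for different microprocessor architectures. Against this background, the author chose "Research on VLIW Compile Optimization Technique for Compute-intensive Embedded Application" as the dissertation topic. The dissertation focuses on several key problems that compute-intensive embedded applications face when running on VLIW processors: software integration of data-level parallel multithreading on VLIW processors, load-balanced scheduling for distributed register files, the design of and instruction scheduling for a partially connected interconnect in stream processors, and compiler optimization techniques for energy-efficient microprocessors. The main work and contributions are as follows:

1. A method for software-integrated execution of lightweight data-level parallel threads on VLIW processors. Multithreading under the OpenCL specification is data-parallel, and each thread carries a light workload, so executing it on a VLIW processor makes it hard to exploit the processor's performance fully. The dissertation implements software-integrated parallel execution of OpenCL threads on the compute clusters of the MASA stream processor. Because data-level parallel threads share the same program structure, the compiler merges the operations of corresponding basic blocks from different threads. This enlarges the instruction window available to the scheduler and converts data-level parallelism into instruction-level parallelism, which the VLIW processor can exploit fully. Experimental results show that integrating a suitable number of threads effectively improves program performance while keeping the program's demand on processor hardware resources within an acceptable range.

2. A VLIW scheduling algorithm that balances register file load in a distributed register file structure. Unbalanced load across distributed register files prevents registers from being used effectively, and an excessive peak demand on one register file often causes spills to memory, which lowers performance and offsets the benefit of the structure. By analyzing the control structure of the program and the producer-consumer relationships of variables, the dissertation proposes a method for computing variable lifetimes exactly and quantitatively during instruction scheduling. After each operation is scheduled, the load of every register file is computed exactly, and variables are preferentially assigned to the more lightly loaded register file to balance the pressure among register files. Experimental results show that the method effectively reduces the program's peak demand on the distributed register files and reduces spill memory accesses.

3. A partially shared interconnect design for stream processors and an instruction scheduling optimization algorithm targeting it. The large number of functional units and the full crossbar interconnect in a stream processor make the shared interconnect bus very large, which increases hardware cost, transmission delay, and the difficulty of hardware layout and synthesis. Based on an analysis of program characteristics, the dissertation reduces the size of the shared interconnect bus through I/O unit multiplexing and a partially shared interconnect design, and uses optimized compiler scheduling to make the best possible use of the available interconnect resources and to limit the effect of the partial interconnect on program performance. Experimental results show that the optimized scheduling effectively avoids a large drop in program performance while greatly improving the utilization of interconnect resources, and that the partial interconnect design effectively reduces the hardware cost and energy consumption of the processor.

4. A variable classification scheduling algorithm for a distributed and hierarchical register file structure, together with a Thread-level compiler for an energy-efficient micro-thread processor. The dissertation proposes research on tera-scale embedded processors, introduces the lowest-level micro-thread processor structure and the Thread-level programming model, and designs and implements a Thread-level compiler for the micro-thread processor. To reduce power consumption, the micro-thread processor uses a distributed and hierarchical register file structure. The very small capacity of the TORF forces much of the data to be stored in the ERF, which makes instruction scheduling more difficult. Based on an analysis of program characteristics, the dissertation proposes a variable classification scheduling algorithm for the distributed and hierarchical register file structure, which spares the programmer the difficult task of manual allocation and optimization. Experimental results show that, compared with a distributed register file structure, the algorithm greatly reduces both the energy consumed by register accesses and the energy consumption of the whole processor at the cost of a slight drop in program performance, so that the power advantage of the distributed and hierarchical register file structure is fully realized.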

【Abstract】 With the continuous development of science and technology, the computing needs of applications keep increasing. Unlike traditional desktop applications, compute-intensive applications are becoming the main workload of microprocessors in many fields, such as scientific research, defense, business, and entertainment, and they are attracting increasing attention. Driven by strong application demand, high-end embedded computing has developed very rapidly. The demands that embedded applications place on computing performance and power already exceed the capacity of current embedded processors. As an effective method of exploiting parallelism, VLIW technology still plays a very important role in current microprocessor design. VLIW helps to reduce hardware complexity, raise chip frequency, and reduce power consumption. At the same time, it poses a severe challenge to the compiler, because the performance of the processor depends more heavily on the performance of the compiler. With the continuous innovation of VLIW processor architectures and the unceasing expansion of application domains, how can the compiler take full advantage of an architecture's performance and power benefits and exploit as much instruction-level parallelism as possible? For different microprocessor architectures, the compiler faces different problems. In this context, this dissertation studies VLIW compile optimization techniques for compute-intensive embedded applications.

This dissertation focuses on several key issues that arise when compute-intensive embedded applications run on VLIW processors, including software integration of data-level parallel multithreading on VLIW processors, load-balanced instruction scheduling for distributed register files, partially shared interconnect design and instruction scheduling for stream architectures, and compiler optimization techniques for energy-efficient microprocessors. The main contributions and innovations are as follows:

1. We present a novel approach that integrates lightweight data-level parallel threads through compilation for VLIW processors. A multithreaded program under the OpenCL specification is data-parallel, and the load of a single thread is light, so it cannot fully exploit a VLIW processor. This dissertation implements software integration and parallel execution of OpenCL multithreaded programs on the cluster of the MASA stream processor. Because data-level parallel threads have the same control structure, the compiler merges the operations in corresponding basic blocks of different threads into one basic block, which expands the instruction window that the compiler can schedule. This transforms data-level parallelism into instruction-level parallelism and brings the performance of the VLIW processor into full play. The experimental results show that integrating and executing an appropriate number of threads effectively improves program performance, while the demand on processor hardware resources stays within an acceptable range.

2. We present register file load balanced VLIW scheduling (RFBLS) for distributed register files. Load imbalance among distributed register files prevents the register files from being used effectively. A high peak register demand on a register file often leads to spills, which reduces performance and weakens the advantages of the structure. This dissertation presents a register file load balanced VLIW scheduling algorithm for processors with a distributed register file structure. By analyzing the control structure of the program and the producer-consumer relationships of the variables, it presents a method of calculating the lifetimes of variables exactly during instruction scheduling. The load of each register file is calculated exactly after each step of instruction scheduling, and variables are assigned first to the register file with the lighter pressure, which balances the pressure among the register files. The experimental results show that this method effectively reduces the peak demand of the program on the distributed register files and reduces spill memory accesses.

3. We design a partially shared interconnect architecture for stream architectures and present an instruction scheduling optimization algorithm for it. In a stream processor, the large number of functional units and the full crossbar interconnect make the shared interconnect bus very large, which increases hardware resource overhead, transmission delay, and the difficulty of hardware layout. Based on an analysis of program characteristics, this dissertation reduces the size of the shared interconnect bus through a partially shared interconnect design and I/O unit multiplexing. At the same time, it limits the effect of the partially shared interconnect on program performance through optimized compiler scheduling, which uses the existing interconnect resources as much as possible. The experimental results show that the optimized compiler scheduling effectively avoids a sharp decline in program performance, that the utilization of interconnect resources is improved greatly, and that the partially shared interconnect design effectively reduces the hardware cost and energy consumption of the processor.

4. We present variable classification scheduling algorithms for the distributed and hierarchical register file (DHRF) structure and design a Thread-level compiler for an energy-efficient micro-thread processor. This dissertation proposes research on embedded tera-scale processors, introduces the bottom-level micro-thread processor architecture and the Thread-level programming model, and designs the Thread-level compiler for the micro-thread processor. To reduce power consumption, the micro-thread processor employs a distributed and hierarchical register file structure. Because of the small capacity of the TORF, much of the data must be stored in the ERF, which makes instruction scheduling for a processor with DHRF much more difficult. Based on an analysis of program characteristics, this dissertation presents variable classification scheduling algorithms for the distributed and hierarchical register file structure, which spare the programmer manual allocation and optimization. The experimental results show that, compared with the distributed register file structure and at the cost of a slight reduction in program performance, the variable classification scheduling algorithm significantly reduces the energy consumption of register accesses and of the entire processor, so that the power advantages of the distributed and hierarchical register file structure can be fully tapped.
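
The load-balancing idea in contribution 2 can be illustrated with a minimal sketch. This is not the dissertation's RFBLS algorithm, whose details the abstract does not give. It assumes that variable lifetimes are already known as closed cycle intervals, and the function names and the tie-breaking rule are hypothetical choices made only for illustration.

```python
from typing import Dict, List, Tuple

LiveRange = Tuple[int, int]  # (first_cycle, last_cycle), inclusive; assumed known


def _pressure(load: List[int], rng: LiveRange) -> Tuple[int, int]:
    """Return (peak, total) occupancy of one register file over a live range."""
    window = load[rng[0]:rng[1] + 1]
    return max(window), sum(window)


def assign_register_files(
    live_ranges: Dict[str, LiveRange],
    num_files: int,
    num_cycles: int,
) -> Tuple[Dict[str, int], List[int]]:
    """Greedy sketch: place each variable in the least-loaded register file.

    Hypothetical illustration only. Variables are visited in order of their
    first live cycle, mimicking a scheduler that places values as it goes.
    """
    # load[f][c] = number of variables live in register file f at cycle c
    load = [[0] * num_cycles for _ in range(num_files)]
    placement: Dict[str, int] = {}
    ordered = sorted(live_ranges.items(), key=lambda kv: (kv[1][0], kv[0]))
    for name, rng in ordered:
        # Prefer the file with the lowest peak over this range; break ties by
        # the lowest total occupancy, then by the lowest file index.
        best = min(range(num_files), key=lambda f: (_pressure(load[f], rng), f))
        for cycle in range(rng[0], rng[1] + 1):
            load[best][cycle] += 1
        placement[name] = best
    peaks = [max(row) for row in load]
    return placement, peaks


if __name__ == "__main__":
    ranges = {"a": (0, 3), "b": (1, 4), "c": (2, 5), "d": (4, 6)}
    placement, peaks = assign_register_files(ranges, num_files=2, num_cycles=7)
    print(placement, peaks)
```

In this toy input at most three variables are live in any cycle, so with two register files a per-file peak of two is the best achievable, and the greedy placement reaches it. A real scheduler would also have to account for functional-unit connectivity and control flow, which this sketch ignores.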
