Research on the Microarchitecture of the Media Digital Signal Processor MediaDSP6410

【Author】 Wang Xing

【Advisor】 Liu Peng

【Author information】 Zhejiang University, Information and Communication Engineering, 2010, Master's thesis

【Abstract】 RISC/DSP is a highly cost-effective programmable solution for embedded media processing. The author took part in developing the media digital signal processor MediaDSP6410 (MD6410), based on the RISC/DSP architecture, at the MediaProcessor Lab of the Department of Information Science and Electronic Engineering, Zhejiang University. As part of that work, this thesis focuses on the design of a 2-issue out-of-order superscalar microarchitecture and a dual-threaded extension.

Benchmarking provides useful guidance for processor design: the requirements on the processor are derived from application needs, and parallelism is exploited at three levels. An 8-way SIMD extension maximizes the data-level parallelism in the kernels of video compression algorithms. Compound media processing instructions exploit instruction-level parallelism while keeping good code density. Thread-level parallelism is exploited in addition by running scalar program sections and vectorizable program sections in parallel as threads.

Embedded processor design is constrained by area, power budget, and design and verification complexity. A minimal-complexity out-of-order superscalar processor is therefore designed to improve the performance of scalar code. A register renaming mechanism is proposed that combines a rename map table with an issue buffer that holds no operands. To simplify the design without sacrificing much performance, media instructions and store instructions are not renamed, and complex media instructions execute serially with respect to the MIPS (RISC) instruction pipeline. The data hazard detection logic for compound media instructions is improved to avoid the critical path introduced by a global stall. Experiments show that, under TSMC 130nm worst-case conditions, the MD6410 pipeline reaches 300MHz and scalar performance improves by a factor of 1.6 to 2 at an area cost of 3.3%.

The multithreaded extension aims to exploit parallel algorithms and to improve processor resource utilization and instruction throughput. To maximize hardware resource utilization, a mapping between parallel algorithms and the multicore, multithreaded hardware architecture is proposed. The microarchitectural design tradeoffs are discussed in detail. The design includes a decode stage that facilitates thread-priority scheduling, instruction issue logic that takes the utilization of shared pipeline resources into account, and an improved interface between direct memory access (DMA) and scratch-pad memory. A non-blocking message-passing mechanism is proposed for thread synchronization, enabling flexible switching between multi-issue (superscalar) and multithreaded modes. Experimental results show that the dual-threaded MD6410 design achieves a 26% to 35% throughput improvement at an area overhead of 5.9%.
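
The abstract mentions a register renaming mechanism built from a rename map table and an issue buffer without operands. The following is a minimal, generic sketch of map-table renaming in Python, meant only to illustrate the idea; it is not the MD6410 design. The register counts (32 architectural, 48 physical), the names `RenameMap` and `IssueEntry`, and the reading of "without operands" as "entries carry register tags rather than values" are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IssueEntry:
    """Hypothetical issue-buffer entry: holds register tags only, no operand values."""
    op: str
    phys_dst: int
    phys_srcs: List[int]
    old_phys_dst: int  # previous mapping of the destination, freed at retirement


@dataclass
class RenameMap:
    num_arch: int = 32  # assumed number of architectural registers
    num_phys: int = 48  # assumed number of physical registers
    table: List[int] = field(init=False)
    free: List[int] = field(init=False)

    def __post_init__(self) -> None:
        # Initially architectural register i is mapped to physical register i.
        self.table = list(range(self.num_arch))
        self.free = list(range(self.num_arch, self.num_phys))

    def rename(self, op: str, dst: int, srcs: List[int]) -> IssueEntry:
        # Look up sources first, so "r1 = r1 + r2" reads the old mapping of r1.
        phys_srcs = [self.table[s] for s in srcs]
        if not self.free:
            raise RuntimeError("no free physical register; the front end would stall")
        phys_dst = self.free.pop(0)
        old = self.table[dst]
        self.table[dst] = phys_dst
        return IssueEntry(op, phys_dst, phys_srcs, old)

    def retire(self, entry: IssueEntry) -> None:
        # The overwritten mapping can no longer be referenced, so recycle it.
        self.free.append(entry.old_phys_dst)


def demo() -> List[IssueEntry]:
    rm = RenameMap()
    return [
        rm.rename("add", 1, [2, 3]),  # r1 = r2 + r3
        rm.rename("add", 4, [1, 5]),  # r4 = r1 + r5 (true dependence on r1)
        rm.rename("sub", 1, [6, 7]),  # r1 = r6 - r7 (writes r1 again)
    ]
```

In `demo()`, the second instruction reads the physical register allocated by the first, so the true dependence is preserved, while the third instruction receives a fresh physical register for `r1`, which removes the write-after-write and write-after-read hazards on that register.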

  • 【Online publication contributor】 Zhejiang University
  • 【Online publication issue】 2010, Issue 08