
可编程密码处理器关键技术研究与实现

Research and Implementation on Key Technologies of Programmable Cryptographic Processors

【Author】 Zhao Xuemi (赵学秘)

【Advisors】 Wang Zhiying (王志英); Dai Kui (戴葵)

【Author information】 National University of Defense Technology, Computer Science and Technology, 2006, PhD

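【Abstract (translated from the Chinese)】 Cryptographic algorithms are the basic means of guaranteeing security requirements such as the confidentiality, integrity and availability of information. For reasons of performance and implementation security, cryptographic algorithms need to be implemented in hardware. Application-specific integrated circuits (ASICs) and fine-grained reconfigurable structures are the two traditional approaches to implementing cryptographic algorithms in hardware. The ASIC approach is efficient but cannot meet the need for flexible implementation of cryptographic algorithms in application environments. Fine-grained reconfigurable structures are highly flexible, but their generality brings a high design cost.

Because cryptographic algorithms have relatively fixed processing patterns, researchers have proposed, on the basis of spatial programmability and temporal programmability respectively, a variety of cryptography-specific reconfigurable structures and cryptographic processors, which to some extent balance the trade-off between performance and flexibility. However, existing cryptography-specific reconfigurable structures generally suffer from difficult algorithm mapping, which limits their application. Current cryptographic processors allow cryptographic algorithms to be developed conveniently with compiler tools, but they are constrained by traditional architectures: both the complexity and the number of custom functional units that can be added are limited, and data-path efficiency is low.

Starting from temporal programmability, this thesis moves the hardware/software interface of traditional architectures downwards so that software sees the data transports and the interconnection network inside the processor. This supports complex yet efficient data paths, matches cryptographic processing patterns more easily, and ultimately yields efficient programmable cryptographic processors. The main work and results are as follows:

1. An automatic generation method for application-specific instruction-set processors (ASIPs) guided by the transport triggered architecture (TTA) is proposed. In TTA, what software sees is the data transports between functional units (FUs), so the hardware design can support partitioned register files and a larger number of more complex custom FUs, while also solving problems such as instruction-set generation and retargetable compilation. The configuration stream driven computing architecture (CSDCA) is proposed, which moves the hardware/software interface further down: the compiler performs transport routing inside the processor in order to support efficient yet complex interconnection networks. Using segmented-bus interconnection, it largely solves the problems that, as the number of FUs grows, data-transport delay becomes the clock-frequency bottleneck and redundant bus power consumption becomes severe. A method of improving code density through dual-mode computing is proposed: the critical loops of a program execute in CSDCA mode to improve performance, while the remainder runs in RISC mode to reduce code redundancy. This work establishes an ASIP design flow that supports efficient data paths.

2. A high-performance modular exponentiation processor is proposed and implemented. A high-radix Montgomery algorithm whose processing word length equals the radix length (RBHRMMM) is proposed. Combined with a parallel modular exponentiation algorithm, large-number modular exponentiation is decomposed into a sequence of atomic-operation matrices, and a column-sharing super-pipelined array (CSSA) is designed according to the column-sharing principle. With the CSSA as a special functional unit, the ASIP design flow above yields the complete modular exponentiation processor SEA-II, whose circuit amounts to 923k equivalent gates. 1024-bit RSA decryption on SEA-II reaches 6,353 Kbps.

3. A scalable dual-field processor for complete public-key cryptographic algorithms is proposed and implemented. A dual-field unified RBHRMMM algorithm is proposed, and on this basis a row-sharing pipeline unit (RSSA) is designed. Coupling the RSSA into the existing ASIP design flow and adding large-number registers yields the complete public-key algorithm processor SPKP. SPKP has the following features: ① with the software tools, a complete public-key cryptosystem can be developed quickly; ② the RSSA scales well; ③ the pipeline units implement vector multiplication and support both GF(p) and GF(2^n); ④ different performance/area constraints can be met by adjusting the bus width and the number of pipeline units in the RSSA.

4. A high-performance secure hash processor is proposed and implemented. A new way of partitioning the computation of hash algorithms into modules is proposed: a compression module and an expansion module, each comprising three sub-modules for queueing, shuffling and accumulation. Reconfigurable functional units are designed accordingly and coupled into the existing ASIP design flow, yielding the secure hash processor PSHP. Compared with fine-grained reconfigurable structures, PSHP has high logic utilization, fast configuration and computation, and is convenient to develop for; compared with ASIC implementations, it flexibly supports the commonly used hash algorithms at a small cost in performance and area.

5. A high-performance block cipher processor, PSCP, is proposed and implemented. Two principles for optimizing block cipher processors are proposed: ① add a substitution unit and a sub-key storage unit, reducing the number of memory accesses during the core computation to zero; ② regroup the basic operations to balance the delay distribution. Compared with ASIC implementations, PSCP achieves similar performance in block-dependent encryption modes such as CBC, OFB and CFB, but is more flexible. Compared with cryptography-specific reconfigurable structures, PSCP is convenient to develop for and can implement the complete algorithm, including key expansion, which gives better security.

The research above first establishes an ASIP design flow that supports complex data paths, and then, for specific classes of cryptographic algorithms and the needs of real application environments, investigates and implements four efficient and highly usable programmable cryptographic processors. All of the processors target a 0.18 μm 1P6M CMOS process, and the modular exponentiation processor has already been put into practical use.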
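As background for the modular exponentiation processor, the sketch below shows textbook word-serial (high-radix) Montgomery multiplication and a square-and-multiply exponentiation built on it. It is not the thesis's RBHRMMM algorithm or the CSSA hardware schedule, whose details the abstract does not give; the function names and the 32-bit word size are assumptions chosen for illustration.

```python
# Textbook high-radix Montgomery arithmetic (illustrative only, not RBHRMMM).
# Requires Python 3.8+ for pow(x, -1, m).

def montgomery_setup(n: int, word_bits: int = 32):
    """Precompute constants for an odd modulus n (hypothetical helper)."""
    if n % 2 == 0:
        raise ValueError("Montgomery reduction needs an odd modulus")
    words = -(-n.bit_length() // word_bits)      # ceil(bits / word_bits)
    radix = 1 << word_bits                       # one digit = one machine word
    n_prime = (-pow(n, -1, radix)) % radix       # -n^{-1} mod radix
    r = 1 << (words * word_bits)                 # R = radix ** words
    return words, radix, n_prime, r


def mont_mul(a: int, b: int, n: int, words: int, radix: int, n_prime: int) -> int:
    """Return a * b * R^{-1} mod n for 0 <= a, b < n, one word of a per step."""
    word_bits = radix.bit_length() - 1
    mask = radix - 1
    t = 0
    for i in range(words):
        a_i = (a >> (i * word_bits)) & mask      # i-th radix digit of a
        t += a_i * b                             # partial product
        m = ((t & mask) * n_prime) & mask        # makes t + m*n divisible by radix
        t = (t + m * n) >> word_bits             # exact division by radix
    return t - n if t >= n else t                # single conditional subtraction


def mont_pow(base: int, exp: int, n: int, word_bits: int = 32) -> int:
    """Left-to-right square-and-multiply using Montgomery multiplication."""
    words, radix, n_prime, r = montgomery_setup(n, word_bits)
    x = (base % n) * r % n                       # base in Montgomery form
    acc = r % n                                  # 1 in Montgomery form
    for bit in bin(exp)[2:]:
        acc = mont_mul(acc, acc, n, words, radix, n_prime)
        if bit == "1":
            acc = mont_mul(acc, x, n, words, radix, n_prime)
    return mont_mul(acc, 1, n, words, radix, n_prime)   # leave Montgomery form
```

Each loop iteration in `mont_mul` consumes one word of the operand and performs one exact division by the radix, which is the property that makes the operation easy to pipeline in hardware. A call such as `mont_pow(m, e, n)` should agree with Python's built-in `pow(m, e, n)` for any odd `n`.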

【Abstract】 Cryptographic algorithms (CAs) are widely used to meet security requirements such as confidentiality, integrity and availability. For reasons of performance and implementation security, CAs often have to be realized in hardware. Application-specific integrated circuits (ASICs) and fine-grained reconfigurable structures (FRSs) are the two traditional approaches. A well-known drawback of the ASIC solution is its low flexibility. FRSs are sufficiently flexible but suffer from significant overhead due to their generic nature.

CAs have relatively fixed granularity and similar processing modes. Researchers have proposed several cryptography-specific reconfigurable structures based on spatial programmability and several cryptographic processors based on temporal programmability, and these achieve good trade-offs between performance and flexibility. However, current reconfigurable structures see limited practical use because it is difficult to map CAs onto them. Cryptographic processors make it convenient to develop algorithms with a compiler, but their data paths are constrained by traditional architectures and cannot accelerate CAs efficiently.

Starting from temporal programmability, this thesis shifts the hardware/software interface downwards and lets the software specify data transports and the routing path of every transport. This addresses the difficulty of designing complex but efficient data paths within traditional architectures. For different classes of cryptographic algorithms and application environments, several practical programmable cryptographic processors are proposed and implemented. The main work and results are:

1. We propose an automatic generation method for application-specific instruction-set processors (ASIPs) guided by the transport triggered architecture (TTA). In TTA, software specifies data transports among function units (FUs), so application-specific hardware can support more sophisticated FUs, and the problems of instruction-set generation and retargetable compilation are solved at the same time. We propose the configuration stream driven computing architecture (CSDCA), in which routing is performed by the compiler in order to support efficient but complex interconnections. Combined with segmented buses, this solves the problem that, as the number of FUs increases, the TTA interconnection network becomes a bottleneck for clock frequency and consumes considerable extra power for each data transport. We propose RISC/CSDCA dual-mode computing to improve code density: computation-intensive loops, which account for most of the computing time, run in CSDCA mode for higher performance, and the rest of the program runs in RISC mode to reduce code redundancy. Together, this work builds an ASIP design flow that supports efficient but complex data paths.

2. We propose and implement a high-performance modular exponentiation (ME) processor. We propose a radix-length based high-radix Montgomery modular multiplication algorithm (RBHRMMM), with which an ME can be decomposed into a series of primitive operation (PO) matrices. A column sharing super-pipelining array (CSSA) is designed to execute these PO matrices. Combined with the ASIP design flow above, a complete ME processor, SEA-II, is implemented. SEA-II achieves a decryption rate of 6.35 Mbps for 1024-bit RSA.

3. We propose a scalable dual-field processor that implements complete public-key cryptosystems. We propose a dual-field unified RBHRMMM algorithm and, based on it, design a row sharing super-pipelining array (RSSA). By embedding the RSSA into the ASIP design flow above, a scalable public-key processor, SPKP, is implemented. SPKP has the following characteristics: (I) complete ECC algorithms can be developed conveniently with the TTA tool chain; (II) the RSSA is scalable; (III) the pipeline elements perform vector multiplication and support the Galois fields GF(p) and GF(2^n); (IV) different performance/area constraints can be met by adjusting the bus width and the number of pipeline elements in the RSSA.

4. We propose a high-performance cryptographic hash processor. We propose a novel method of partitioning hash algorithms: the kernel of a hash algorithm can be split into a compression module and an expansion module, each with the same structure, consisting of a queue, a shuffle sub-module and an accumulator. Custom reconfigurable FUs are designed on this basis and, by integrating them into the ASIP design flow, a cryptographic hash processor, PSHP, is implemented. Compared with fine-grained reconfigurable architectures, PSHP is faster and more area-efficient; compared with ASICs, it supports the widely used hash algorithms with little overhead.

5. We propose a high-performance block cipher processor, PSCP. We propose two optimization principles: (I) reduce the number of memory accesses in kernels to zero by adding a substitution unit and a sub-key storage unit; (II) reorganize the basic operations to balance the delay distribution. Compared with ASIC solutions, PSCP achieves similar performance in CBC, CFB and OFB modes while being more flexible. Compared with custom reconfigurable structures, PSCP is more convenient to develop for and supports the complete algorithm, including key expansion, so it is safer and more usable.

These processors all target a 0.18 μm 1P6M CMOS technology, and the ME processor has been sold on the market.
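To illustrate what supporting both GF(p) and GF(2^n) in one data path can mean, the following sketch runs the same bit-serial multiply loop in either field and swaps only the addition and reduction steps (integer add with carries versus carry-free XOR). It is a plain interleaved multiplier written for illustration under that assumption, not the dual-field unified RBHRMMM algorithm or the RSSA design; the function name and parameters are hypothetical.

```python
# Dual-field modular multiplication sketch (illustrative, not the thesis design).

def dual_field_mul(a: int, b: int, modulus: int, binary_field: bool) -> int:
    """Multiply a and b in GF(p) or GF(2^n) with one shared control loop.

    GF(p):   modulus is the prime p, and 0 <= a, b < p.
    GF(2^n): modulus is the irreducible polynomial encoded as an integer
             (bit n set), and a, b encode polynomials of degree < n.
    """
    if binary_field:
        nbits = modulus.bit_length() - 1          # field degree n
    else:
        nbits = modulus.bit_length()              # enough bits to cover b < p

    def add(x: int, y: int) -> int:
        # Carry-free XOR in GF(2^n), ordinary integer addition in GF(p).
        return x ^ y if binary_field else x + y

    def reduce(x: int) -> int:
        # One conditional reduction step is enough after a shift or an add.
        if binary_field:
            return x ^ modulus if (x >> nbits) & 1 else x
        return x - modulus if x >= modulus else x

    acc = 0
    for i in reversed(range(nbits)):              # scan b from the top bit down
        acc = reduce(acc << 1)                    # double the accumulator
        if (b >> i) & 1:
            acc = reduce(add(acc, a))             # conditionally add a
    return acc


# Usage: the AES field GF(2^8) with polynomial x^8 + x^4 + x^3 + x + 1 (0x11B),
# and a small prime field.
product_gf2 = dual_field_mul(0x57, 0x83, 0x11B, binary_field=True)
product_gfp = dual_field_mul(123456, 654321, 1000003, binary_field=False)
```

Only `add` and `reduce` differ between the two fields, which is why a single arithmetic unit with switchable carry propagation can serve both.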
