Optimization of a Vector Function Library Based on the Domestic PuDianNao Chip
【Abstract】 On the domestic artificial intelligence processor PuDianNao, vector math functions can currently be implemented only by calling scalar functions in a loop, which performs poorly. Three optimization methods are proposed for the PuDianNao chip. Method 1 is interpolation. Method 2 is SIMD with masking. Method 3 builds on PuDianNao's hardware array structure: VLIW instructions operate each processing unit in the array, a SIMT programming model is encapsulated on top of them, and two programming techniques, exposing branch ranges and branch flattening, are proposed. Accuracy and performance tests of the three methods show that Method 3 performs best on both counts. Method 3 is then used to implement PuDianNao-VecMath, a vector math function library for the domestic PuDianNao chip, which addresses the difficulty of vectorizing the multi-branch structure of math functions. The library offers good accuracy and performance, is functionally stable, and runs correctly. Its interfaces cover rounding functions, transcendental functions, comparison functions, activation functions, and other common basic math library functions. For accuracy, every value in each function's domain interval is used as input, and the results are compared with those of the scalar functions run on an i7 CPU: the maximum error is 2 ULP for the single-precision version and 1 ULP for the half-precision version. Performance improves substantially over the scalar loop: the single-precision version achieves a mean speedup of 18.26 and a maximum speedup of 35.90, and the half-precision version achieves a mean speedup of 15.65 and a maximum speedup of 30.11.
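The abstract names two ideas that a short example may make more concrete: branch flattening and ULP-based accuracy checking against a scalar reference. The following is only a minimal sketch in plain Python, not the PuDianNao-VecMath implementation. The ELU activation, the list-of-lanes model, and every function name below are assumptions chosen for illustration, and the VLIW/SIMT layer described in the paper is not modelled.

```python
import math
import struct


def ordered_bits(x):
    """Map a float32 value to an integer so that adjacent floats differ by 1."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # Negative floats are sign-magnitude; fold them so the mapping is monotonic.
    return bits if bits < 0x80000000 else 0x80000000 - bits


def ulp_distance(a, b):
    """Number of representable float32 values between a and b."""
    return abs(ordered_bits(a) - ordered_bits(b))


def elu_scalar(x, alpha=1.0):
    """Branching scalar reference (hypothetical example function)."""
    if x > 0.0:
        return x
    return alpha * math.expm1(x)


def elu_flattened(xs, alpha=1.0):
    """Branch-flattened form: every lane runs the same steps, then a mask selects."""
    mask = [x > 0.0 for x in xs]
    # Clamp the argument so lanes that will discard this result cannot overflow.
    neg = [alpha * math.expm1(min(x, 0.0)) for x in xs]
    return [x if m else n for x, m, n in zip(xs, mask, neg)]


def max_ulp(candidate, reference):
    """Largest float32 ULP distance between two result sequences."""
    return max(ulp_distance(c, r) for c, r in zip(candidate, reference))


if __name__ == "__main__":
    xs = [-3.0, -0.5, 0.0, 0.25, 800.0]
    reference = [elu_scalar(x) for x in xs]
    print(max_ulp(elu_flattened(xs), reference))
```

The clamp in `elu_flattened` illustrates a typical cost of flattening: both sides of the branch are evaluated for every lane, so the side whose result is discarded must still be safe to compute. An exhaustive accuracy test of the kind the abstract describes would replace the five sample inputs with every representable value in the function's domain interval.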
【Key words】 vectorized functions; PuDianNao-VecMath; domestic artificial intelligence processor; exposing branch ranges and branch flattening
- 【Source】 Journal of Zhengzhou University (Engineering Science), 2023, Issue 01
- 【Classification No.】 TP311.13; TN40
- 【Online publication time】 2022-08-15 09:15:00
- 【Downloads】 51