This paper recommendation comes from the AI研习社 CVPR group.

Google Research Proposes the Indirect Convolution Algorithm: A More Memory-Efficient Implementation of Convolution
Why We Recommend It
This work from Google Research was presented at the Efficient Deep Learning for Computer Vision Workshop at CVPR 2019 and proposes a more memory-efficient implementation of the convolution operation. (Oddly, it was not submitted to arXiv until July 3.)
Convolution is the core operation in convolutional neural networks, and mainstream implementations build it on top of the GEMM (matrix-matrix multiplication) functions of highly optimized BLAS libraries. When the kernel is larger than 1x1, the data must first be rearranged to fit the GEMM interface by a memory layout transformation, im2col or im2row, which consumes an extra memory buffer (see the sketch below). This paper instead proposes a modified GEMM algorithm that needs no im2col transformation, introducing in its place an indirection buffer: a far smaller buffer of pointers into the input. Compared with the GEMM-based algorithm it delivers up to a 62% speedup on convolutions that would otherwise require im2col, at the cost of a minor slowdown on 1x1, stride-1 convolutions.
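To make the memory overhead concrete, here is a minimal sketch of im2col for a single-channel, valid-padding convolution (the function name and layout are illustrative assumptions, not the paper's code). Each output pixel's receptive field is copied into one column of a patch matrix, so the convolution becomes a single GEMM call, but the patch matrix materializes roughly kh*kw copies of the input:

```c
/* Copy each kh x kw receptive field of a single-channel image into one
 * column of the patch matrix `cols` (shape [kh*kw] x [out_h*out_w]).
 * The convolution then reduces to one GEMM: kernel [M x kh*kw] times
 * cols, at the cost of holding kh*kw copies of the input in memory. */
static void im2col(const float *img, int h, int w,
                   int kh, int kw, int stride, float *cols) {
    int out_h = (h - kh) / stride + 1;   /* valid padding for simplicity */
    int out_w = (w - kw) / stride + 1;
    for (int ky = 0; ky < kh; ky++)
        for (int kx = 0; kx < kw; kx++)
            for (int oy = 0; oy < out_h; oy++)
                for (int ox = 0; ox < out_w; ox++)
                    cols[(ky * kw + kx) * out_h * out_w + oy * out_w + ox] =
                        img[(oy * stride + ky) * w + (ox * stride + kx)];
}
```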
Abstract
Deep learning frameworks commonly implement convolution operators with GEMM-based algorithms. In these algorithms, convolution is implemented on top of matrix-matrix multiplication (GEMM) functions, provided by highly optimized BLAS libraries. Convolutions with 1x1 kernels can be directly represented as a GEMM call, but convolutions with larger kernels require a special memory layout transformation - im2col or im2row - to fit into GEMM interface. The Indirect Convolution algorithm provides the efficiency of the GEMM primitive without the overhead of im2col transformation. In contrast to GEMM-based algorithms, the Indirect Convolution does not reshuffle the data to fit into the GEMM primitive but introduces an indirection buffer - a buffer of pointers to the start of each row of image pixels. This broadens the application of our modified GEMM function to convolutions with arbitrary kernel size, padding, stride, and dilation. The Indirect Convolution algorithm reduces memory overhead proportionally to the number of input channels and outperforms the GEMM-based algorithm by up to 62% on convolution parameters which involve im2col transformations in GEMM-based algorithms. This, however, comes at cost of minor performance reduction on 1x1 stride-1 convolutions.
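As a rough illustration of the indirection buffer the abstract describes, the sketch below builds the pointer table for an NHWC input with valid padding (the name and layout are my assumptions, not the paper's code). Each entry points at the start of one pixel's run of c contiguous channel values, so the table's size is independent of the channel count, which is why the memory overhead shrinks proportionally to the number of input channels:

```c
#include <stdlib.h>

/* Build a buffer of pointers, one per (output pixel, kernel element),
 * each pointing at the start of c contiguous channel values of an NHWC
 * image. A modified GEMM micro-kernel can then read its input through
 * this table instead of through an im2col-expanded matrix. */
const float **build_indirection_buffer(const float *img, int h, int w, int c,
                                       int kh, int kw, int stride) {
    int out_h = (h - kh) / stride + 1;   /* valid padding for simplicity */
    int out_w = (w - kw) / stride + 1;
    const float **buf =
        malloc((size_t)out_h * out_w * kh * kw * sizeof *buf);
    size_t i = 0;
    for (int oy = 0; oy < out_h; oy++)
        for (int ox = 0; ox < out_w; ox++)
            for (int ky = 0; ky < kh; ky++)
                for (int kx = 0; kx < kw; kx++)
                    buf[i++] = img + ((size_t)(oy * stride + ky) * w
                                      + (ox * stride + kx)) * c;
    return buf;
}
```

Because the buffer stores pointers rather than copied pixels, it only needs to be rebuilt when the input shape or address changes, not on every inference.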
Paper Link
https://arxiv.org/pdf/1907.02129.pdf