# Convolutional Neural Network Performance Optimization
Author: 黎明灰烬

Source: https://zhuanlan.zhihu.com/p/80361782
## Introduction
Convolution is one of the core computations in neural networks, and its breakthroughs in computer vision set off the deep learning boom. Convolution comes in many variants and is computationally demanding: a neural network spends most of its running time computing convolutions, and model design keeps increasing network depth, so optimizing the convolution computation is especially important.
As the field developed, researchers proposed a variety of optimization algorithms, including Im2col, Winograd, and others. This article first defines the convolution used in neural networks, then briefly introduces several common optimization methods, and shares some of the author's experience in this area.

- "Most of the running time is spent computing convolutions": https://arxiv.org/abs/1807.11164
- Im2col: https://www.mathworks.com/help/images/ref/im2col.html
- Winograd: https://www.intel.ai/winograd/#gs.avmb0n
## The Concept of Convolutional Neural Networks
The convolution in convolutional neural networks (CNN) extends the convolution of signal processing, where it is defined as
$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau \tag{1}$$
By symmetry, this is equivalent to

$$(f * g)(t) = \int_{-\infty}^{\infty} f(t - \tau)\, g(\tau)\, d\tau$$
Convolution is not easy to grasp intuitively; Figure 1 visualizes it. As the red slider moves, the triangle traced out by its product with the blue block gives the value of the convolution result $(f * g)(t)$ at each point.
*Figure 1: Convolution in signal processing* (source: https://jackwish.net/convolution-neural-networks-optimization.html)
The discrete form of Equation (1) is:
$$(f * g)(t) = \sum_{\tau = -\infty}^{\infty} f(\tau)\, g(t - \tau) \tag{2}$$
Extending this convolution to two-dimensional space yields the convolution used in neural networks, which can be written as:
$$O_{oh,ow} = \sum_{kh} \sum_{kw} I_{oh+kh,\,ow+kw} \cdot K_{kh,kw} \tag{3}$$
where $O$ is the convolution output, $I$ the input, and $K$ the kernel. An animated visualization of this computation can be found in conv_arithmetic and is not repeated here.

- conv_arithmetic: https://github.com/vdumoulin/conv_arithmetic
When applied to images in computer vision, the two-dimensional convolution can simply be stacked over the image channels, i.e.:
$$O_{oh,ow} = \sum_{c} \sum_{kh} \sum_{kw} I_{c,\,oh+kh,\,ow+kw} \cdot K_{c,\,kh,\,kw} \tag{4}$$
where $c$ indexes the input channels. This is how a two-dimensional convolution is applied to a three-dimensional tensor.
Formulas are often not very intuitive; Figure 2 visualizes the channel-stacked two-dimensional convolution. Symbols related to the input, output, and kernel carry the prefixes I, O, and K respectively. In addition, the figures in this article generally color the output, input, and kernel orange, yellow, and green.
*Figure 2: Definition of the convolution computation*
When the tensor memory layout is NHWC, the convolution corresponds to the code below. The outer three loops walk over every point of the output, and each output point is obtained by the inner three loops accumulating a dot product.

```c
// Direct convolution in NHWC layout, stride 1, no padding, batch 1.
// input: IH x IW x IC, kernel: KH x KW x IC x OC, output: OH x OW x OC.
for (int oh = 0; oh < OH; oh++) {
  for (int ow = 0; ow < OW; ow++) {
    for (int oc = 0; oc < OC; oc++) {
      float acc = 0;
      for (int kh = 0; kh < KH; kh++) {
        for (int kw = 0; kw < KW; kw++) {
          for (int ic = 0; ic < IC; ic++) {
            acc += input[((oh + kh) * IW + (ow + kw)) * IC + ic] *
                   kernel[((kh * KW + kw) * IC + ic) * OC + oc];
          }
        }
      }
      output[(oh * OW + ow) * OC + oc] = acc;
    }
  }
}
```
As with matrix multiplication optimization, we can apply the basic optimizations of vectorization, parallelization, and loop unrolling to this computation. The difficulty with convolution is that $KH$ and $KW$ are usually no larger than 5, which makes vectorization awkward, while the computation pattern reuses input data across the spatial dimensions. This complexity has given rise to several optimization methods, some of which are introduced below.

- Matrix multiplication optimization: https://jackwish.net/gemm-optimization.html
## The Im2col Optimization Algorithm
Caffe, an early deep learning framework, implemented convolution with the im2col-based method, which remains one of the important convolution optimizations to this day.
Im2col is the computer-vision procedure that converts an image into matrix columns. As the previous section showed, two-dimensional convolution is relatively complex and hard to optimize, so early in the development of deep learning frameworks, Caffe used the Im2col method to convert the three-dimensional tensor into a two-dimensional matrix, thereby reusing the already well-optimized GEMM libraries on each platform to accelerate convolution. Finally, the two-dimensional result of the matrix multiplication is converted back into the three-dimensional output with Col2im.

- Caffe: http://caffe.berkeleyvision.org/
- How Caffe uses Im2col to convert a 3D tensor into a 2D matrix: https://github.com/Yangqing/caffe/wiki/Convolution-in-Caffe:-a-memo
## Algorithm Procedure
Unless stated otherwise, this article assumes the NHWC memory layout. Other layouts and the exact transformed matrix shapes may differ slightly, but this does not affect the description of the algorithm itself.
*Figure 3: Computing convolution with the Im2col algorithm*
Figure 3 illustrates computing a convolution with Im2col. The procedure is as follows (for simplicity, padding is ignored, i.e., $OH = H$ and $OW = W$); a minimal sketch of the transform follows the list:

1. Expand the input from $H \times W \times IC$, according to the convolution's access pattern, into a two-dimensional matrix of shape $(OH \times OW) \times (KH \times KW \times IC)$. Clearly, the transformed input uses roughly $KH \times KW - 1$ times more memory than the original.
2. The weights are normally a four-dimensional tensor of shape $KH \times KW \times IC \times OC$, which can be treated directly as a two-dimensional matrix of shape $(OC) \times (KH \times KW \times IC)$.
3. With the two matrices prepared, run a matrix multiplication with $(KH \times KW \times IC)$ as the reduction dimension, producing an output matrix of shape $(OH \times OW) \times (OC)$. This step can be handed directly to any of the optimized BLAS (Basic Linear Algebra Subprograms) libraries.
4. From the memory-layout point of view, the output matrix $(OH \times OW) \times (OC)$ is already the expected output tensor $OH \times OW \times OC$.
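Below is a minimal im2col sketch for the NHWC layout under the same stride-1, no-padding assumptions; the function name and the row-major flattening are illustrative rather than taken from any particular library.

```c
// Expand an NHWC input (IH x IW x IC) into a row-major matrix of shape
// (OH*OW) x (KH*KW*IC), stride 1, no padding, so OH = IH - KH + 1 and
// OW = IW - KW + 1. Each matrix row gathers the KH x KW x IC patch one
// output position reads; neighboring rows copy overlapping input data.
void im2col_nhwc(const float *input, float *matrix,
                 int IH, int IW, int IC, int KH, int KW) {
  const int OH = IH - KH + 1;
  const int OW = IW - KW + 1;
  for (int oh = 0; oh < OH; oh++) {
    for (int ow = 0; ow < OW; ow++) {
      float *row = matrix + (oh * OW + ow) * (KH * KW * IC);
      for (int kh = 0; kh < KH; kh++) {
        for (int kw = 0; kw < KW; kw++) {
          const float *patch = input + ((oh + kh) * IW + (ow + kw)) * IC;
          for (int ic = 0; ic < IC; ic++) {
            *row++ = patch[ic];  // flatten the patch into one matrix row
          }
        }
      }
    }
  }
}
```

A single GEMM of shapes $(OH \times OW) \times (KH \times KW \times IC)$ by $(KH \times KW \times IC) \times (OC)$ then produces the output matrix of step 3.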
The price of computing convolution with Im2col plus GEMM is the extra memory. In the original convolution, as the kernel slides over the input, neighboring output computations reuse part of the input. When Im2col flattens the three-dimensional tensor into a two-dimensional matrix, this otherwise reusable data is laid out flat in the matrix, replicating the input roughly $KH \times KW - 1$ times.
When the kernel size $KH \times KW$ is $1 \times 1$, the $KH$ and $KW$ in the steps above cancel: the transformed input has shape $(OH \times OW) \times (IC)$, the kernel has shape $(OC) \times (IC)$, and the convolution degenerates into a plain matrix multiplication. Note that in this case the Im2col step does not actually change the input's shape, so the matrix multiplication can run directly on the original input, skipping the memory reorganization entirely (steps 1, 2, and 4 above).
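To make the degenerate case concrete, here is a minimal sketch (a naive stand-in for the BLAS call a real implementation would make) showing that a 1×1 NHWC convolution is exactly one matrix multiplication over the untouched input:

```c
// 1x1 convolution in NHWC: output[OH*OW][OC] = input[OH*OW][IC] *
// kernel[IC][OC]. No im2col buffer is needed; the input is used as-is.
void conv_1x1_nhwc(const float *input, const float *kernel, float *output,
                   int OHW /* = OH*OW */, int IC, int OC) {
  for (int i = 0; i < OHW; i++) {
    for (int oc = 0; oc < OC; oc++) {
      float acc = 0;
      for (int ic = 0; ic < IC; ic++)
        acc += input[i * IC + ic] * kernel[ic * OC + oc];
      output[i * OC + oc] = acc;
    }
  }
}
```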
## Memory Layout and Convolution Performance
Convolution in neural networks mainly uses two memory layouts, NCHW and NHWC. The layout affects the memory access patterns at run time, especially while running the matrix multiplication. This subsection analyzes how performance relates to memory layout when the Im2col optimization is used.

- "Especially while running the matrix multiplication": https://jackwish.net/gemm-optimization.html
After the Im2col transform, we have the input matrix and the kernel matrix for the matrix multiplication. Apply to them the partitioning, vectorization, and other optimizations commonly used for matrix computation (for the definitions, see General Matrix Multiplication (GEMM) Optimization). Below we focus on how the different memory layouts affect performance in this setting.

- General Matrix Multiplication (GEMM) Optimization: https://jackwish.net/gemm-optimization.html
First consider the NCHW layout. Mapping an NCHW convolution onto the matrix multiplication $C = AB$, $A$ is the kernel (filter) and $B$ is the input, with the matrix dimensions shown in Figure 4. The $M$, $N$, $K$ in the figure label the matrix multiplication, i.e.

$$C^{M \times N} = A^{M \times K} \cdot B^{K \times N}$$

and the figure also marks how they relate to the dimensions of the convolution.
*Figure 4: The matrix multiplication converted from an NCHW-layout convolution*
After partitioning this matrix multiplication, we can analyze the locality behavior in detail, as annotated in Figure 4. There, Inside denotes locality within a 4×4 sub-block multiplication, and Outside denotes locality between sub-blocks along the reduction dimension.

- For the output, locality inside a sub-block is poor: each vectorized load fetches four elements, and every load misses the cache (cache miss). The outside behavior depends on the global traversal direction: row-major traversal has good locality, column-major poor. The output's behavior is not the focus here, since it has no memory reuse anyway.
- For the kernel, locality inside a sub-block is poor, much like the output. However, when a sub-block load misses, the processor fetches a whole cache line, so the following several sub-block accesses hit in cache (cache hit) until the line is used up and the next cache line begins. Locality outside the sub-blocks is therefore good.
- For the input, locality inside a sub-block is likewise poor. Unlike the kernel, though, a cache line brought in by a miss is used only once, because in this layout the next sub-block's address range usually lies beyond one cache line. Hence almost every load of the input misses: the cache provides no speedup, and every access goes out to memory.
Therefore, when convolution is computed with Im2col, the NCHW layout is very unfriendly to memory.
Figure 5 shows the contrasting NHWC layout. Note that the tensors represented by the $A$ and $B$ matrices are swapped relative to NCHW; the input now plays the role of $A$ and the kernel the role of $B$ (swapped only to avoid drawing another figure). The partitioning scheme is unchanged, and these are exactly the matrices built by the steps of the previous subsection.
*Figure 5: The matrix multiplication converted from an NHWC-layout convolution*
Analyzing the access behavior of the three tensors in the same way:

- For the output, NHWC behaves the same as NCHW.
- For the input, locality inside a sub-block is not great, since several vector loads will miss the cache; locality outside the sub-blocks is good, because the memory walked along the reduction dimension is contiguous. This matches the kernel's behavior under NCHW and is, overall, a cache-friendly layout.
- For the kernel, NHWC resembles the input under NCHW: locality both inside and outside the sub-blocks is poor.
The kernel's cache behavior is not a real problem in either layout, because the kernel stays constant while the network runs: its memory layout can be transformed when the model is loaded so that it is cache-friendly outside the sub-blocks, for example converting a $(KH \times KW \times IC) \times (OC)$ layout into $(OC) \times (KH \times KW \times IC)$. It is worth noting that a framework or engine generally runs in at least two stages, a preparation stage and an execution stage. Preprocessing of data that stays constant during execution, such as reordering the kernel's memory layout, can be done in the preparation stage.
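A minimal sketch of such a preparation-stage repack, assuming plain row-major storage (the function name is illustrative); production GEMMs usually pack weights into block-interleaved panels rather than a flat transpose, but the idea is the same:

```c
// Preparation stage: repack weights from (KH*KW*IC) x OC into
// OC x (KH*KW*IC) so the run-time GEMM walks contiguous memory along
// the reduction dimension. Runs once, at model-load time.
void repack_weights(const float *src, float *dst,
                    int K /* = KH*KW*IC */, int OC) {
  for (int oc = 0; oc < OC; oc++)
    for (int k = 0; k < K; k++)
      dst[oc * K + k] = src[k * OC + oc];  // a plain transpose
}
```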
Therefore, when computing with the Im2col method, the overall access behavior is determined by the input, so the NHWC layout is friendlier to memory than NCHW. In one experiment from our practice, for a convolution with a 1×1 kernel and similar optimizations applied, switching from NCHW to NHWC lowered the cache miss rate from about 50% to around 2%. An improvement of this magnitude can greatly speed up the software (here, specifically, in the absence of specially designed matrix-multiplication optimizations).
## The Spatial Pack Optimization Algorithm
Im2col is a fairly naive convolution optimization which, without careful handling, brings large memory overhead. Spatial pack is an optimization similar to the memory repacking used in matrix multiplication.

- Memory repacking in matrix multiplication: https://github.com/flame/how-to-optimize-gemm/wiki#packing-into-contiguous-memory
*Figure 6: How the spatial pack optimization partitions the computation*
Spatial pack is a divide-and-conquer method: it partitions the convolution into several pieces along the spatial dimensions and handles them separately. Figure 6 shows the output and input each divided into four parts in space.
After partitioning, one large convolution becomes several small ones. The total amount of computation is unchanged, but the small convolutions have better access locality and can draw performance from the memory hierarchy. This is carried out through the steps in Figure 7, which resemble the Im2col memory reorganization of the previous section (a tiling sketch follows the list):

1. Split the input tensor of shape $N \times H \times W \times C$ along H and W into $h \times w$ pieces ($h$ splits in one direction, $w$ in the other), i.e., tensors of shape $N \times H/h \times W/w \times C$, and lay each small tensor out contiguously in memory;
2. Run a two-dimensional convolution of each of the $h \times w$ input tensors with the kernel, yielding $h \times w$ output tensors of shape $N \times H/h \times W/w \times OC$;
3. Repack these output tensors into the final output of shape $N \times H \times W \times OC$.
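The sketch below shows the idea as one self-contained routine, assuming NHWC, stride 1, VALID padding, and batch 1; it tiles the output plane and runs a small convolution per tile (a real implementation would first pack each tile's input patch into contiguous memory, as step 1 describes):

```c
// Spatial pack, schematic: tile the OH x OW output plane into TH x TW
// blocks so each block's working set fits in cache. NHWC layout,
// stride 1, VALID padding, batch 1.
void conv2d_spatial_pack(const float *in, const float *ker, float *out,
                         int IH, int IW, int IC, int KH, int KW, int OC,
                         int TH, int TW) {
  const int OH = IH - KH + 1, OW = IW - KW + 1;
  for (int oh0 = 0; oh0 < OH; oh0 += TH) {
    for (int ow0 = 0; ow0 < OW; ow0 += TW) {
      const int th = (oh0 + TH <= OH) ? TH : OH - oh0;  // edge tiles
      const int tw = (ow0 + TW <= OW) ? TW : OW - ow0;
      // One small convolution per tile; its inputs span only
      // (th + KH - 1) x (tw + KW - 1) x IC elements.
      for (int oh = oh0; oh < oh0 + th; oh++)
        for (int ow = ow0; ow < ow0 + tw; ow++)
          for (int oc = 0; oc < OC; oc++) {
            float acc = 0;
            for (int kh = 0; kh < KH; kh++)
              for (int kw = 0; kw < KW; kw++)
                for (int ic = 0; ic < IC; ic++)
                  acc += in[((oh + kh) * IW + (ow + kw)) * IC + ic] *
                         ker[((kh * KW + kw) * IC + ic) * OC + oc];
            out[(oh * OW + ow) * OC + oc] = acc;
          }
    }
  }
}
```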
*Figure 7: Steps of the spatial pack computation*
Note that, for convenience, the description above ignores padding. In step 1, when the input tensor is split into small tensors, besides copying each block's own data, the boundary data of the neighboring blocks must also be copied. Specifically, as Figure 8 shows, the small tensors produced by the spatial split actually have shape:
$$N \times (H/h + 2(KH - 1)) \times (W/w + 2(KW - 1)) \times C \tag{5}$$
Here the $2(KH - 1)$ and $2(KW - 1)$ terms follow the padding rule. When the rule is VALID, they can be ignored; when the rule is SAME, the side of a small tensor that lies on the source tensor's boundary is padded with zeros, while padding not on the source boundary takes its values from the neighboring tensors. Recalling how convolution is computed, this is self-evident.
*Figure 8: Partitioning details of the spatial pack algorithm*
The three example figures above all split into 4 pieces; in practice one can split into many more, for example into tiles with side length 4 or 8, so the compiler can conveniently vectorize the computation. The smaller the tiles, the better the locality, but the downside is that more extra memory is consumed. This extra memory is introduced by the padding. When splitting into $h \times w$ pieces, the memory consumed by padding after the split is:
$$N \cdot C \cdot \bigl(2(KH - 1)\,h\,W + 2(KW - 1)\,w\,H + 4(KH - 1)(KW - 1)\,h\,w\bigr)$$

(which follows from summing Equation (5) over all $h \times w$ tiles and subtracting the original $N \cdot H \cdot W \cdot C$).
As can be seen, the finer the partition, the more extra memory is consumed. Notably, at the finest granularity, i.e., when the output of shape $N \times H \times W \times OC$ is split into $H \times W$ pieces in space, spatial pack degenerates into the Im2col method: the convolution at a single element is then exactly the dot product that computes one output element of the matrix multiplication.
When partitioning in space alone, the partition is independent of the kernel. If one partitions along the output channel dimension as well, the kernel can be split accordingly. A channel partition amounts to simple stacking on top of a fixed spatial partition; it does not affect memory consumption, but it does affect locality. For convolutions of different sizes, finding a good partition is not easy. As with many problems in computing, this one can also be automated: AutoTVM, for example, can search for good partitions in such cases.
- AutoTVM: https://arxiv.org/abs/1805.08166
## The Winograd Optimization Algorithm
Of the two algorithms above, Im2col organizes the three-dimensional tensor into a matrix and then calls a GEMM library, which relies heavily on locality-based optimizations, while spatial pack is itself a locality-based optimization. The Winograd algorithm introduced in this section is instead an application of the Coppersmith-Winograd line of matrix-multiplication optimizations: a method based on algorithm analysis.
This part involves too many formulas to typeset comfortably here; interested readers can consult the original post or other related literature.
- Original post: https://jackwish.net/convolution-neural-networks-optimization.html
## The Indirect Convolution Optimization Algorithm
Marat Dukhan introduced the Indirect Convolution Algorithm in QNNPACK (Quantized Neural Network PACKage); as of this writing (mid-2019), it still appears to be the fastest of all published methods. The author has recently released a paper describing its main ideas.
Although the implementation in QNNPACK mainly targets quantized neural networks (other quantization efforts in industry are more conventional: TensorFlow Lite uses the Im2col optimization, Tencent's NCNN uses the Winograd optimization, and Open AI Lab's Tengine uses the Im2col optimization), it is a generally applicable optimization design.
When this article was first written, the design paper was not yet public, and many details of the design are best understood together with the code. My QNNPACK fork contains an explained branch that adds comments to parts of the code and may serve as a reference.
- QNNPACK: https://github.com/pytorch/QNNPACK
- TensorFlow Lite's Im2col convolution: https://github.com/tensorflow/tensorflow/blob/v2.0.0-beta1/tensorflow/lite/kernels/internal/optimized/integer_ops/conv.h
- NCNN's Winograd convolution: https://github.com/Tencent/ncnn/blob/20190611/src/layer/arm/convolution_3x3_int8.h
- Tengine's Im2col convolution: https://github.com/OAID/Tengine/blob/v1.3.2/executor/operator/arm64/conv/conv_2d_fast.cpp
- My QNNPACK fork: https://github.com/jackwish/qnnpack
## Computation Workflow
The indirect convolution algorithm works thanks to one key premise: across successive runs of the network, the memory address of the input tensor stays the same. This property is fairly easy to satisfy; even if the address really must change, the input can simply be copied into a fixed memory region.
*Figure 9: Workflow of the indirect convolution algorithm*
Figure 9 details the workflow of the indirect convolution algorithm. The leftmost part shows that multiple inputs share the same input buffer. On top of this input buffer, the algorithm builds the "indirect buffer" of Figure 10, which is the core of the algorithm. As shown on the right, at run time each step computes an output block of size $M \times N$, where $M$ is the vectorization width over $OH \times OW$ viewed as one dimension; $M \times N$ is typically 4×4, 8×8, or 4×8. To compute an $M \times N$ output block, the inputs it needs are fetched through the indirect buffer, the weights are fetched, and the result is computed; the process is equivalent to multiplying an $M \times K$ matrix by a $K \times N$ matrix.
In the implementation, execution is split into two parts:

- Preparation stage: load the model and set up the input buffer; reorder the weights so their memory layout suits the later computation;
- Run stage: for each input, run the core loop $\lceil (OH \times OW)/M \rceil \times \lceil OC/N \rceil$ times, each iteration computing an $M \times N$ output block with a GEMM-style method.
*Figure 10: The indirect buffer*
As discussed in the earlier sections, the Im2col optimization has two problems: it occupies a lot of extra memory, and it requires an extra copy of the input. How can both be solved? The indirect convolution algorithm's answer is the indirect buffer, shown in the right half of Figure 10.
Figure 10 contrasts the ordinary Im2col optimization with the indirect convolution optimization. As introduced earlier, Im2col first copies the input into a matrix, as the arrows from Input indicate, and then runs the matrix multiplication. The indirect buffer used by indirect convolution instead stores pointers into the input (which is why the algorithm requires the input's memory address to be fixed); at run time, following these pointers simulates an Im2col-like matrix computation.
## Indirect Buffer Layout
The indirect buffer can be understood as a group of kernel-sized buffers: there are $OH \times OW$ of them, each of size $KH \times KW$, and each corresponds to the input addresses some output position uses. Computing the output at one spatial position uses one indirect buffer; outputs at the same spatial position but different channels share that buffer, and each pointer in a buffer indexes $IC$ input elements. During computation, as the output position moves, selecting the matching indirect buffer yields the corresponding input addresses; there is no need to derive the input address from the output coordinates, which amounts to precomputing the addresses.
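A minimal sketch of building such a buffer for an NHWC input, directly in the regrouped final layout described below ($KH \times KW$ groups of $M$ pointers per block of $M$ consecutive output positions); the names and exact packing are illustrative, not QNNPACK's actual code. Note how padding is handled by pointing at a shared zero block, a trick discussed at the end of this section:

```c
#include <stdlib.h>

// Build an indirect buffer for an NHWC input. For each block of M
// consecutive output positions there are KH*KW groups of M pointers;
// the m-th pointer of group (kh, kw) addresses the IC input values
// that output position block*M + m reads at kernel tap (kh, kw).
// Out-of-bounds taps point at zero_buf (IC zeros), so the compute
// kernel never branches on boundaries and the input is never copied.
const float **build_indirect_buffer(
    const float *input, const float *zero_buf, int M,
    int IH, int IW, int IC, int OH, int OW, int KH, int KW,
    int stride_h, int stride_w, int pad_top, int pad_left) {
  const int blocks = (OH * OW + M - 1) / M;
  const float **buf = malloc(sizeof(float *) * (size_t)blocks * KH * KW * M);
  for (int b = 0; b < blocks; b++)
    for (int kh = 0; kh < KH; kh++)
      for (int kw = 0; kw < KW; kw++)
        for (int m = 0; m < M; m++) {
          int idx = b * M + m;
          if (idx >= OH * OW) idx = OH * OW - 1;  // clamp the tail block
          const int ih = (idx / OW) * stride_h - pad_top + kh;
          const int iw = (idx % OW) * stride_w - pad_left + kw;
          buf[((b * KH + kh) * KW + kw) * M + m] =
              (ih >= 0 && ih < IH && iw >= 0 && iw < IW)
                  ? input + ((size_t)ih * IW + iw) * IC
                  : zero_buf;  // padding reads all zeros
        }
  return buf;
}
```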
Figure 11 shows how the indirect buffer is actually used and how the computation proceeds when $M$ and $N$ are 4 and $KH$ and $KW$ are both 3. The "local indirect buffer" in the figure means that, for now, we are looking at the core computation of one $M \times N$ block.
When computing an $M \times N$ output block, the inputs used are those covered by the kernel sliding $M$ steps from the corresponding input position, i.e., an input region of size $KH \times ((M - 1) \cdot \text{stride} + KW) \times IC$. In the indirect convolution algorithm, this input memory is indexed by the pointers of $M$ indirect buffers, $M \times KH \times KW$ pointers in total. Figure 11 marks the correspondence between the top-left input position and its input buffer. Of the four input buffers A, B, C, D here, the address regions pointed to by two adjacent buffers overlap by $(KW - \text{stride})/KW$, in this case 2/3, and the pointer coordinates inside each buffer are also labeled.
*Figure 11: The indirect buffer in detail*
The figure draws the flat buffers in three-dimensional form (adding the IC dimension) to convey that each pointer in the indirect buffer can index $IC$ input elements, and that what each indirect buffer indexes is exactly the input memory region matching the weights.
Furthermore, the top-left arrangement of the input buffers is not the final one; these pointers are actually processed into the form of the middle indirect buffer in Figure 11. Scattering the pointers of each top-left buffer gives $KH \times KW$ pointers, and gathering the pointers of A, B, C, D that share a spatial position yields the buffer arrangement in the upper middle of the figure: $KH \times KW \times M$. That is, the pointers at the same spatial position inside A, B, C, D are grouped together. The upper middle is a visualization; the lower middle shows the true organization of the indirect buffer. Brown and dark yellow mark the same input memory or pointers. Note that the example uses stride 1; when the stride is not 1, the regrouped coordinates of A, B, C, D at the same spatial position (that is, their coordinates in the input) are not necessarily contiguous: adjacent spatial positions differ by the stride in the horizontal coordinate.
## Computing with the Indirect Buffer
We now know how the indirect buffer is organized and which input memory its pointers address; next, let us look at how these buffers are used during the computation.
As in the previous subsection, the discussion centers on computing an $M \times N$ output block, which uses $M$ inputs of size $KH \times KW$ with data reuse among them. First recall the Im2col algorithm (lower left of Figure 11): when running the computation vectorized, an $M \times N$ matrix multiplication repeatedly takes an $M \times S$ block of input and an $S \times N$ block of weights and multiplies them to accumulate partial results, where $S$ is the vectorized step factor along the K dimension.
Convolution can use the Im2col optimization in the first place because, once unrolled and with memory reuse ignored, its computation is equivalent to matrix multiplication, and the indirect buffer makes it possible to simulate those input accesses through pointers. When the micro kernel computing the $M \times N$ output actually runs, $M$ pointers scan the input. Each time, the $M$ pointers are taken from the indirect buffer structure in the lower middle of Figure 11 and address an $M \times IC$ region of input memory, laid out as in the upper right of the figure. Within that region, stepping still proceeds in the $M \times S$ form, with $S$ moving along the $IC$ dimension. When this part of the input has been scanned, the $M$ pointers take their next values from the indirect buffer and traverse the next $M \times IC$ region of input, each pass computing $1/(KH \cdot KW)$ of the output partial sums. After this process runs $KH \times KW$ times, the $M \times N$ output is complete; the pseudocode in the lower right of Figure 11 corresponds to this process. Because the indirect buffer has already been organized into this particular shape, each update of the $M$ pointers is just a read from the indirect buffer pointer (p_indirect_buffer in the figure's pseudocode).
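The scalar sketch below follows that description, reading pointers from the buffer built earlier; it assumes $M = N = 4$, weights packed as $(KH \cdot KW \cdot IC) \times N$ for one output-channel strip, and the surrounding variables (`KH`, `KW`, `IC`, `indirect`, `weights`) in scope. A real micro kernel would vectorize the two inner loops with SIMD:

```c
// Micro kernel: one M x N output block via the indirect buffer.
// `indirect` points at this block's KH*KW groups of M input pointers
// (the layout built by build_indirect_buffer above); `weights` holds
// the packed (KH*KW*IC) x N weight strip for these N output channels.
enum { M = 4, N = 4 };
float acc[M][N] = {{0}};
const float **p = indirect;
for (int k = 0; k < KH * KW; k++) {
  const float *a[M];
  for (int m = 0; m < M; m++) a[m] = *p++;  // next M input pointers
  for (int ic = 0; ic < IC; ic++) {         // S steps along IC
    const float *w = weights + (k * IC + ic) * N;
    for (int m = 0; m < M; m++)
      for (int n = 0; n < N; n++)
        acc[m][n] += a[m][ic] * w[n];       // rank-1 update
  }
}
// acc[][] now holds the M x N output block, ready to be stored.
```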
The logic of this process is not easy to follow at first, so a little more explanation. As the $M$ pointers above keep moving and scanning memory, they are in effect scanning the matrix that Im2col would have produced from the three-dimensional input; the particular property of the input buffer is that it turns that scan of a two-dimensional matrix into a scan of the three-dimensional tensor. Scanning the input means scanning the visualization in the upper part of Figure 11; relating the upper-left and lower-left input views, it is not hard to see that each pass scans an $M \times IC$ block of input memory. Note that this $M \times IC$ block consists of $M$ tensors of shape $1 \times IC$, spaced one stride apart along the $W$ dimension.
With this, running the compute kernel $\lceil (OH \times OW)/M \rceil \times \lceil OC/N \rceil$ times yields the entire output.
The indirect convolution optimization solves three problems of convolution computation: spatial vectorization, complicated address computation, and memory copying. Computing a convolution normally requires zero-padding the input (whenever $KH \times KW$ is not 1×1), and traditional methods copy memory to do it. The indirect buffer solves this neatly through its indirection: when constructing the indirect buffer, allocate one extra $1 \times IC$ memory block filled with zeros, and for the spatial positions that need padding, point the corresponding indirect pointers at this block; later computation then behaves as if the input had already been padded. For example, the top-left of A in Figure 11 corresponds to input position (0,0); when padding is needed, that position must read zero, and it suffices to change the address stored in the indirect buffer.
## Discussion, Summary, and Outlook
This article has surveyed a number of published or open-source optimization methods for convolutional neural networks. To varying degrees, these optimizations have driven the application of deep learning in the cloud and on mobile devices and help our daily lives. For instance, the garbage-sorting question that has recently vexed people in Shanghai, and indeed across China, can also be approached with deep learning methods that help people learn how to sort.

- Using deep learning to help people learn garbage sorting: https://news.mydrivers.com/1/633/633858.htm
From the several methods covered here, one can see that optimization algorithms for convolutional networks still fall into two families: methods based on algorithm analysis and methods based on software optimization. In modern computer systems, the software-based methods, especially those built on the memory hierarchy, appear the more important, because they are easier to discover and apply. Another reason is that the specialized instructions added to modern processors can already erase the timing differences between different operations. In such settings, combining algorithm-analysis methods with software-optimization methods may achieve even better results.
Finally, the methods discussed in this article are all general-purpose ones. With the development of neural network processors (such as the Cambricon MLU and Google TPU) and the extension of general-purpose processors (e.g., dot-product instructions: Nvidia GPU DP4A, Intel AVX-512 VNNI, Arm SDOT/UDOT), the optimization of deep learning is likely still worth continued investment.
- Cambricon MLU: https://www.jiqizhixin.com/articles/2019-06-20-12
- Nvidia GPU DP4A: https://devblogs.nvidia.com/mixed-precision-programming-cuda-8/
- Intel AVX-512 VNNI: https://devblogs.nvidia.com/mixed-precision-programming-cuda-8/
**This article drew on the following materials (among others):**
- QNNPACK: https://github.com/pytorch/QNNPACK
- Convolution in Caffe: a memo: https://github.com/Yangqing/caffe/wiki/Convolution-in-Caffe:-a-memo
- TensorFlow
- Wikipedia
- Fast Algorithms for Convolutional Neural Networks
- NCNN: https://github.com/Tencent/ncnn
- Tengine: https://github.com/OAID/Tengine
- 神经网络量化简介 (Introduction to Neural Network Quantization): https://jackwish.net/neural-network-quantization-introduction-chn.html
- 通用矩阵乘(GEMM)优化算法 (General Matrix Multiplication (GEMM) Optimization): https://jackwish.net/gemm-optimization.html
- The Indirect Convolution Algorithm