Large-Scale Neural Network Optimization: What Does the Neural Network Loss Landscape Look Like?
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><span style="color: black;"><span style="color: black;">转载</span> PaperWeekly 作者 郑奘巍</span></span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;">©作者 |</span></strong><span style="color: black;"> 郑奘巍</span></span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;">单位 | </span></strong><span style="color: black;">新加坡国立大学</span></span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;"><span style="color: black;">科研</span>方向 | </span></strong><span style="color: black;"><span style="color: black;">有效</span><span style="color: black;">设备</span>学习、神经网络优化</span></span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><span style="color: black;"><span style="color: black;">We should use the special structure properties of f(for example f is a given by a neural network)to optimize it faster, instead of purely relying on optimization algorithms.</span></span></span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><span style="color: black;"><span style="color: black;">― Yuanzhi Li</span></span></span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><span style="color: black;">Adam 虽然<span style="color: black;">叫作</span>作”通用“优化算法,虽然在大语言模型等 Transformer 架构上表现的好,但它<span style="color: black;">亦</span>并不是万能的,在视觉模型的泛化性,以及<span style="color: black;">有些</span>凸问题上收敛的表现都<span style="color: black;">无</span> SGD 要好。正如本篇开头所引,<span style="color: black;">咱们</span>要关注特殊网络结构的特殊性质(<span style="color: black;">例如</span> Transformer)。</span></span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><span style="color: black;"><span style="color: black;">怎样</span>刻画网络的优化性质呢?在优化<span style="color: black;">关联</span>的论文中,<span style="color: black;">一般</span><span style="color: black;">经过</span>分析 Hessian 矩阵及其特征值,<span style="color: black;">或</span>将损失函数进行一维或二维的可视化来分析网络的优化性质。<span style="color: black;">咱们</span><span style="color: black;">期盼</span>这些指标能够<span style="color: black;">帮忙</span><span style="color: black;">咱们</span>更好的理解网络损失的 landscape,优化器优化轨迹的性质等等。<span style="color: black;">咱们</span><span style="color: black;">期盼</span>将这些指标刻画的性质与优化器的设计<span style="color: black;">相关</span>起来。<span style="color: black;">咱们</span><span style="color: black;">亦</span><span style="color: black;">期盼</span>找到合适的指标来反应随着网络参数、数据量、批量<span style="color: black;">体积</span>等参数的变化,网络的优化性质<span style="color: black;">怎样</span>变化「R1」。</span></span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><span style="color: black;">本文讨论的方向<span style="color: black;">倘若</span>有新的动向,<span style="color: black;">亦</span>会<span style="color: black;">连续</span>更新。<span style="color: black;">倘若</span>有任何错误/问题/讨论,欢迎在评论区留言。</span></span></p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/tos-cn-i-axegupay5k/b217496dfa524b678a0003f1d977537f~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1725628907&x-signature=EXxXm%2BBAns%2BNRWWjC%2BwA9%2FJPjyM%3D" style="width: 50%; margin-bottom: 20px;"></div>
<h1 style="color: black; text-align: left; margin-bottom: 10px;">Hessian 阵</h1>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><span style="color: black;"><span style="color: black;">海量</span>对神经网络损失空间的<span style="color: black;">科研</span>都是基于 Hessian 阵的。Hessian 阵是损失函数的二阶导数。<span style="color: black;">因为</span> Hessian 阵是实对<span style="color: black;">叫作</span>矩阵,<span style="color: black;">能够</span>应用特征分解。它的特征值和特征向量<span style="color: black;">能够</span><span style="color: black;">帮忙</span><span style="color: black;">咱们</span>理解损失函数的 landscape。<span style="color: black;">这儿</span><span style="color: black;">咱们</span>不太严谨的介绍<span style="color: black;">有些</span>优化<span style="color: black;">文案</span>中<span style="color: black;">平常</span>的概念。</span></span></p>
<h1 style="color: black; text-align: left; margin-bottom: 10px;">1.1 <span style="color: black;">平常</span>概念</h1>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;">Curvature 曲率:</span></strong><span style="color: black;">数学定义为一点处密切圆弧的半径的倒数,它刻画了函数与直线的差距。在梯度为 0 的点时,Hessian 阵行列式为曲率。</span></span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;">Eigenvalue 特征值:</span></strong><span style="color: black;">Hessian 阵的特征值越大,对应的特征向量(Eigenvector)方向就越陡(凸性越强)。最大特征值对应的方向为最大曲率的方向。<span style="color: black;">倘若</span><span style="color: black;">思虑</span>以特征向量对应方向的单位向量<span style="color: black;">做为</span>基张成的空间,该空间<span style="color: black;">叫作</span>为特征空间 Eigenspace,对应的单位向量为 Eigenbasis。</span></span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;">Conditioning number <span style="color: black;">要求</span>数:</span></strong><span style="color: black;">Hessian 阵的最大特征值与最小特征值的比值。条件数越大,系统越不稳定。</span></span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;">Axis-aligned 坐标轴对齐:</span></strong><span style="color: black;">Hessian 阵的特征向量<span style="color: black;">是不是</span>与坐标轴平行。</span></span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;">Smoothness 光滑</span></strong><span style="color: black;">(<span style="color: black;">通常</span>指 Lipschitz 梯度<span style="color: black;">要求</span>):。<span style="color: black;">倘若</span>是二阶可导凸函数,则有 。即 Smoothness 限制了 Hessian 阵特征值的<span style="color: black;">体积</span>(和变化幅度的上限),<span style="color: black;">不可</span>大于 。</span></span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;">Strong Convexity 强凸性:</span></strong><span style="color: black;">虽然神经网络<span style="color: black;">通常</span>都是非凸优化,但<span style="color: black;">这儿</span><span style="color: black;">咱们</span>引入凸性用以和光滑对应,以便于理解。强凸性的定义为 。即 Strong Convexity 限制了 Hessian 阵特征值的<span style="color: black;">体积</span>(和变化幅度的下限),<span style="color: black;">不可</span><span style="color: black;">少于</span> m。</span></span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;">Sharpness / Flatness 平坦程度:</span></strong><span style="color: black;">对 Hessian 特征谱分布<span style="color: black;">状况</span>的描述。Sharp 指 Hessian 阵特征值含有<span style="color: black;">海量</span>大值,Flat 指 Hessian 阵特征值含有<span style="color: black;">海量</span>小值(绝大部分接近 0)。另一种定义<span style="color: black;">办法</span>是直接将 Sharpness 定义为最大的特征值 。Hessian 阵</span><strong style="color: blue;"><span style="color: black;">退化(degenerate)/ 奇异(singular)/ 低秩 / 线性<span style="color: black;">关联</span></span></strong><span style="color: black;">是 Flat 的一种表现。<span style="color: black;">一般</span>收敛点平坦有利于泛化性能(Hessian 阵是实对<span style="color: black;">叫作</span>矩阵,特征值和奇异值相等)。</span></span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;"><span style="color: black;">平常</span>地形(terrain, landscape):</span></strong></span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><span style="color: black;">驻点(stationary point),临界点(critical point):</span></span></p><span style="color: black;"><span style="color: black;">鞍点:H 非正定的驻点</span></span><span style="color: black;"><span style="color: black;">plateaus:平坦地,<span style="color: black;">持有</span>小曲率的平面</span></span><span style="color: black;"><span style="color: black;">basins:盆地,局部最优点的集合</span></span><span style="color: black;"><span style="color: black;">wells:井,鞍点的集合</span></span>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><span style="color: black;">局部的泰勒展开到二阶:</span></span></p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/tos-cn-i-6w9my0ksvp/6095fcf6737141c699a377b237c7d24b~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1725628907&x-signature=vd9lnKBFEUIL7OJpU0xxTD9BvLk%3D" style="width: 50%; margin-bottom: 20px;"></div>
<h1 style="color: black; text-align: left; margin-bottom: 10px;">1.2 费雪信息</h1>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><span style="color: black;">Fisher information matrix 费雪信息:</span></span></p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/tos-cn-i-6w9my0ksvp/bb81ac074ecc4f25bac8b3a455c09484~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1725628907&x-signature=Rejg3LSpEn0WQr9GK9SH4l1AeIE%3D" style="width: 50%; margin-bottom: 20px;"></div><span style="color: black;"><span style="color: black;">费雪信息总是对<span style="color: black;">叫作</span>半正定阵。<span style="color: black;">因为</span> ,<span style="color: black;">因此</span> Fisher information (二阶矩)衡量了不<span style="color: black;">一样</span>本对应导数的方差()。</span></span><span style="color: black;"><span style="color: black;">,<span style="color: black;">因此</span> Fisher information <span style="color: black;">同期</span>是<span style="color: black;">运用</span> log-likelihood 衡量的函数的 Hessian 阵的期望。证明可参考此处:</span><span style="color: black;">https://mark.reid.name/blog/fisher-information-and-log-likelihood.html</span></span><span style="color: black;"><span style="color: black;"><span style="color: black;">归类</span>模型建模的分布是 。在训练初期,网络的输出 与真实分布的差距很大。<span style="color: black;">倘若</span><span style="color: black;">按照</span>网络输出的标签采样,<span style="color: black;">能够</span>得到对应分布的 Hessian 阵(半正定)但并不是<span style="color: black;">咱们</span>关心的真实分布的 Hessian 阵。</span></span><span style="color: black;"><span style="color: black;">当模型分布与真实分布的差距很小时,Fisher information <span style="color: black;">能够</span>近似为 Hessian 阵的期望。</span></span>
<h1 style="color: black; text-align: left; margin-bottom: 10px;">1.3 Gauss-Newton 阵</h1>
<h1 style="color: black; text-align: left; margin-bottom: 10px;">Gauss-Newton decomposition:记神经网络输出为,其中 为损失函数,则网络的黑塞阵有如下分解。以一维 f 为例,高维可参考此处:</h1>
<h1 style="color: black; text-align: left; margin-bottom: 10px;">https://andrew.gibiansky.com/blog/machine-learning/gauss-newton-matrix/</h1>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/tos-cn-i-6w9my0ksvp/2d6c437f012147d0bccc2e41ef4c8cda~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1725628907&x-signature=6VHvLDY%2FfdQXeiRy11ON65MfK6c%3D" style="width: 50%; margin-bottom: 20px;"></div>
<h1 style="color: black; text-align: left; margin-bottom: 10px;"><span style="color: black;">经过</span>链式法则<span style="color: black;">就可</span>证明。<span style="color: black;">这儿</span>的<span style="color: black;">第1</span>项被<span style="color: black;">叫作</span>作 Gauss-Newton 阵,它衡量了<span style="color: black;">因为</span>网络特征 的变动贡献的 Hessian 阵值。第二项则是<span style="color: black;">因为</span>输入值的变动贡献的 Hessian 阵值。<span style="color: black;">因为</span>第二项<span style="color: black;">一般</span>很小,<span style="color: black;">因此</span><span style="color: black;">咱们</span><span style="color: black;">能够</span>近似认为黑塞阵由 Gauss-Newton 阵决定。<span style="color: black;">经过</span>变形,<span style="color: black;">咱们</span><span style="color: black;">能够</span>得到 Gauss-Newton 阵的另一种形式,衡量了样本的带权二阶矩。可见 Gauss-Newton 阵与 Fisher information 有着密切的联系。</h1>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/tos-cn-i-6w9my0ksvp/409096b766aa4e45a433836e9891b7c9~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1725628907&x-signature=9f%2FJHLhZcqbxsC7t90GMMIZgpro%3D" style="width: 50%; margin-bottom: 20px;"></div>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/tos-cn-i-6w9my0ksvp/9416415a87624b309d59b352234ddd8f~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1725628907&x-signature=WI7X%2BrB7FxS9od7%2B%2BK2aG6VBtzo%3D" style="width: 50%; margin-bottom: 20px;"></div>
<h1 style="color: black; text-align: left; margin-bottom: 10px;">临界点的数量与性质</h1>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><span style="color: black;">为了更直观的理解,不会介绍太多理论推导。讨论网络优化性质的<span style="color: black;">文案</span><span style="color: black;">常常</span>观点鲜明,不少<span style="color: black;">文案</span>得出的结论<span style="color: black;">乃至</span>是相互矛盾。从今天的视角往前看,<span style="color: black;">咱们</span><span style="color: black;">重点</span>介绍<span style="color: black;">日前</span>被广泛接受的观点。但<span style="color: black;">亦</span>要<span style="color: black;">重视</span>,部分结论是在 CNN,MLP,LSTM 上用监督学习进行,对应的是 over-paramterized 网络(参数量远大于数据量)。<span style="color: black;">日前</span>的 LLM 则<span style="color: black;">必定</span>程度是 under-parameterized 网络。</span></span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;">1. 局部最优点的数量随着网络参数的<span style="color: black;">增多</span>而指数级<span style="color: black;">增多</span>,但<span style="color: black;">区别</span>局部最优点的训练损失相差不大</span></strong></span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><span style="color: black;"> 用单神经元网络说明了局部最优点的数量随着网络参数的<span style="color: black;">增多</span>而指数级<span style="color: black;">增多</span>。LeCun 引领的一系列工作 <span style="color: black;">运用</span> spin-glass <span style="color: black;">理学</span>模型阐释了简化的神经网络中差局部最优解随着模型参数增大指数级减小,以及局部最优解<span style="color: black;">周边</span>相当平坦。 则证明了多层神经网络中<span style="color: black;">类似</span>的结论。</span></span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><span style="color: black;"> 中<span style="color: black;">说到</span>,在全连接层中交换任意两个神经元<span style="color: black;">能够</span>得到等价的神经网络。这个性质让随机初始化的网络可能收敛到非常<span style="color: black;">区别</span>的局部最优值。 则用过实验<span style="color: black;">发掘</span>,<span style="color: black;">区别</span>的局部最小值之间<span style="color: black;">能够</span><span style="color: black;">经过</span>平坦的路径相连。 <span style="color: black;">一样</span>支持了这点:任意两个局部极小值之间都<span style="color: black;">能够</span><span style="color: black;">经过</span>一条不会太”扭曲“的路径相连,这条路径上的损失最多比局部最优处的损失高一点。</span></span></p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/tos-cn-i-6w9my0ksvp/9ef003d946554d46805a9710734d01b4~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1725628907&x-signature=LfHEH6N5APtjt6fEp0lgvRTz4tc%3D" style="width: 50%; margin-bottom: 20px;"></div>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;">2. 鞍点的数量比上局部最优点的数量随着网络参数的<span style="color: black;">增多</span>而指数级<span style="color: black;">增多</span></span></strong></span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><span style="color: black;"> 理论分析了鞍点数量和局部最优点数量的比例随着网络参数的<span style="color: black;">增多</span>而指数级<span style="color: black;">增多</span>,<span style="color: black;">同期</span>鞍点<span style="color: black;">一般</span>被 plateaus 所<span style="color: black;">包裹</span>。 则<span style="color: black;">经过</span>理论分析<span style="color: black;">弥补</span>,训练时不会陷入鞍点,<span style="color: black;">然则</span>会减慢速度。 同意该看法,并认为训练神经网络的过程<span style="color: black;">能够</span>看作避开鞍点,并<span style="color: black;">经过</span><span style="color: black;">选取</span>鞍点的一端从而打破“对<span style="color: black;">叫作</span>性”的过程。在鞍点处优化器的<span style="color: black;">选取</span>会影响到最终收敛点的性质。</span></span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;">3. Hessian 的低秩性质</span></strong></span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><span style="color: black;"> <span style="color: black;">经过</span>计算 CNN 网络训练时的特征谱,<span style="color: black;">发掘</span> Hessian 阵特征值谱分布特点:</span></span></p><span style="color: black;"><span style="color: black;">在训练初期,Hessian 特征值以 0 为中心对<span style="color: black;">叫作</span>分布。随着训练进行<span style="color: black;">持续</span>向 0 集中,H 退化严重。训练末期极少量<span style="color: black;">少于</span> 0。</span></span><span style="color: black;"><span style="color: black;">接近收敛时 Hessian 特征谱分为两部分,<span style="color: black;">海量</span>接近 0,离群点远离大部分特征值,离群点的数量接近于训练数据中的类别数量(或聚类数量)。即 Hessian 阵此时是低秩的。</span></span><span style="color: black;"><span style="color: black;">数据不变的<span style="color: black;">状况</span>下,增大参数仅仅使 Hessian 阵接近 0 的部分更宽;参数不变的<span style="color: black;">状况</span>下,数据量的变化会改变 Hessian 阵特征谱的离群点。</span></span><span style="color: black;"><span style="color: black;">参数量越大,批量<span style="color: black;">体积</span>越小,训练结束时 Hessian 阵特征值的负值越多(但总体比例都很小。 <span style="color: black;">弥补</span>,在训练<span style="color: black;">初期</span> Hessian 阵就会变得基本<span style="color: black;">无</span>负值。</span></span>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><span style="color: black;"> <span style="color: black;">一样</span><span style="color: black;">发掘</span>了离群点数量(近似 Hessian 阵的秩)与<span style="color: black;">归类</span>任务中的类别数量<span style="color: black;">关联</span>。<span style="color: black;">近期</span>的一篇工作 则说明,CNN 中 Hessian 阵的秩随着参数<span style="color: black;">增多</span>而增大。 分析了每层的 Hessian 特征值谱,<span style="color: black;">发掘</span>与<span style="color: black;">全部</span>网络的 Hessian 特征值谱<span style="color: black;">类似</span>。</span></span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><span style="color: black;"> 则进一步描绘了<span style="color: black;">全部</span>训练过程中的低秩现象:Hessian 最大的几个特征值对应的特征向量张成的子空间尽管会变化,但变化程度不大。训练的梯度会<span style="color: black;">火速</span>的进入到这个子空间中,但在这个子空间中并<span style="color: black;">无</span>什么规律。故<span style="color: black;">全部</span>神经网络的训练似乎是<span style="color: black;">出现</span>在一个低维的子空间上的。</span></span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;">4. Gauss-Newton 阵<span style="color: black;">能够</span>近似 Hessian 阵</span></strong></span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><span style="color: black;"> 都指出了这一点。实验如图所示。Sophia 等二阶优化器正是利用了这点进行对角黑塞阵的计算。 则验证了梯度的二阶矩的大特征值与 Hessian 阵的大特征值近似。</span></span></p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/tos-cn-i-6w9my0ksvp/511aa422d3a248fc9b9e79da658bd405~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1725628907&x-signature=ba354p1B5KbUqfynhhv9zyVDA0g%3D" style="width: 50%; margin-bottom: 20px;"></div>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><span style="color: black;">▲ <span style="color: black;">区别</span>层以及<span style="color: black;">全部</span>网络 Hessian 阵与 Gauss-Newton 阵特征值分布对比</span></span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;">5. 平坦的收敛点<span style="color: black;">拥有</span>更好的泛化性,大的批量容易收敛到 sharp 的收敛点</span></strong></span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><span style="color: black;"> 指出 Flat minima 有利于泛化,大批量训练容易收敛到 sharp minima,小批量训练容易收敛到 flat minima。 同意这一点,但反对 中认为 flat minima 和 sharp minima 被 well 分割开。<span style="color: black;">经过</span>实验证明能够从 sharp minima 平坦的“走”到 flat minima。 分析了<span style="color: black;">区别</span>批量<span style="color: black;">体积</span>收敛点的 Hessian 特征谱,<span style="color: black;">发掘</span>了相同的结论。</span></span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><span style="color: black;"><span style="color: black;">通常</span>认为平坦的损失空间,既有利于优化算法的梯度下降,得到的收敛点性质<span style="color: black;">亦</span>会更好。故<span style="color: black;">非常多</span><span style="color: black;">文案</span>从该<span style="color: black;">方向</span>解释网络设计的成功<span style="color: black;">原由</span>。譬如残差链接,BN,更宽的网络(over-paramterized)都有助于让损失空间更平滑。</span></span></p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/tos-cn-i-6w9my0ksvp/d7dc65d6e66e4f159ff0e4aa0e47cf84~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1725628907&x-signature=2%2FEOEpB879IS0ExsaFc88Va7JLQ%3D" style="width: 50%; margin-bottom: 20px;"></div>
<h1 style="color: black; text-align: left; margin-bottom: 10px;">优化路径性质</h1>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><span style="color: black;"><span style="color: black;">这儿</span>先介绍两篇近期的工作,之后有机会再<span style="color: black;">弥补</span>。</span></span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><span style="color: black;"> 沿着前人的工作,探讨了初始权重与训练完成后的权重,两者插值得到的一系列权重的性质。<span style="color: black;">能够</span>看到<span style="color: black;">区别</span>初始化的网络权重<span style="color: black;">常常</span>收敛到同一个盆地。虽然与训练路径<span style="color: black;">区别</span>,插值路径<span style="color: black;">亦</span>是单调递减,<span style="color: black;">无</span><span style="color: black;">凸出</span>的现象。</span></span></p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/tos-cn-i-6w9my0ksvp/5dcaa005badc426ab2a8abd9664bcb13~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1725628907&x-signature=6niZO8EAqK7DS13CQdMNPZSTRHc%3D" style="width: 50%; margin-bottom: 20px;"></div>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><span style="color: black;"> 讨论了好的优化路径(<span style="color: black;">经过</span>模型设计如<span style="color: black;">增多</span> BN <span style="color: black;">或</span><span style="color: black;">选取</span>好的优化超参)应该让优化器快速进入到平坦的区域,从而能够<span style="color: black;">运用</span><span style="color: black;">很强</span>的学习率。本文<span style="color: black;">选取</span>观察 即最大的特征值以衡量网络的 sharpness。<span style="color: black;">文案</span><span style="color: black;">发掘</span>网络初始化<span style="color: black;">办法</span>,Warmup,BatchNorm 等<span style="color: black;">办法</span>都有效的减小了 ,从而<span style="color: black;">能够</span><span style="color: black;">运用</span>更高的学习率。作者<span style="color: black;">经过</span>实验<span style="color: black;">发掘</span>,SGD 算法能够稳定收敛的<span style="color: black;">要求</span>是 和学习率的乘积应该<span style="color: black;">少于</span> 2。(<span style="color: black;">因为</span> Hessian 最小的值接近 0,<span style="color: black;">因此</span><span style="color: black;">掌控</span> <span style="color: black;">亦</span><span style="color: black;">能够</span>看做<span style="color: black;">掌控</span> Hessian 阵的<span style="color: black;">要求</span>数)</span></span></p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/tos-cn-i-6w9my0ksvp/eeb5d9bab46d4370a4c83c3c64af4b66~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1725628907&x-signature=0Mq%2FEO9L78OXxbiiQN5iQDJKK%2FM%3D" style="width: 50%; margin-bottom: 20px;"></div>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><span style="color: black;">▲ 左:<span style="color: black;">区别</span>学习率下<span style="color: black;">区别</span><span style="color: black;">办法</span>的表现;右:最大特征值和学习率的变化</span></span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><span style="color: black;"> 工作则是将优化器每步进行分解。<span style="color: black;">按照</span>局部泰勒展开,作者定义 gradient correlation 和 directional sharpness 两项:</span></span></p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/tos-cn-i-6w9my0ksvp/9a2eebd4dae24e369ef6951dd5a96cfe~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1725628907&x-signature=hx1QY%2BzBOYUE1Xo9mIOmpAoOdkw%3D" style="width: 50%; margin-bottom: 20px;"></div>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><span style="color: black;">▲ 函数的单次更新的泰勒展开</span></span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><span style="color: black;">作者<span style="color: black;">发掘</span>,<span style="color: black;">区别</span>的优化器的 direction sharpness 的表现很不相同。Sharpness 刻画的是一个点局部的平坦程度,而 directional sharpness 则刻画了优化方向上的平坦程度。在语言模型上,作者<span style="color: black;">发掘</span> SGD 优化器的 direction sharpness <span style="color: black;">明显</span>高于 Adam 优化器。作者认为这是 Adam 优化器收敛速度快的<span style="color: black;">原由</span>。作者还<span style="color: black;">经过</span>梯度的值裁剪进一步减小 Sharpness 程度,从而加速收敛。</span></span></p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p26-sign.toutiaoimg.com/tos-cn-i-6w9my0ksvp/984420d2702c4a56a2de205a78eeaea0~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1725628907&x-signature=TqXC7%2FLSEdLzb1zQppJYEAmcHh4%3D" style="width: 50%; margin-bottom: 20px;"></div>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><span style="color: black;">▲ <span style="color: black;">区别</span><span style="color: black;">办法</span>的 directional sharpness 对比</span></span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><span style="color: black;"><span style="color: black;">咱们</span><span style="color: black;">已然</span><span style="color: black;">晓得</span>收敛点是需要越平坦泛化性越好的。但从这两篇<span style="color: black;">文案</span>中<span style="color: black;">能够</span>看到,<span style="color: black;">咱们</span><span style="color: black;">期盼</span>优化路径始终能够在平坦的区域探索。以往的<span style="color: black;">文案</span>会认为这些区域可能会<span style="color: black;">引起</span>优化陷入鞍点,但<span style="color: black;">日前</span>看来,优化空间中存在<span style="color: black;">海量</span><span style="color: black;">这般</span>平坦的空间,优化器应该在这些较为平坦的空间中“大踏步”的进行探索。</span></span></p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/tos-cn-i-6w9my0ksvp/0474e30e540d4063991b64cfab969160~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1725628907&x-signature=zOdg0dXCtt9VpQtKF%2F1xFW0kT40%3D" style="width: 50%; margin-bottom: 20px;"></div>
<h1 style="color: black; text-align: left; margin-bottom: 10px;">Transformer 优化性质</h1>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><span style="color: black;">前面的分析都是比较泛化的,为了更好的设计针对 Transformer 的优化器,<span style="color: black;">咱们</span>还需要分析 Transformer 的特殊性质。<span style="color: black;">例如</span><span style="color: black;">文案</span> 就指出了 Transformer 训练时的<span style="color: black;">有些</span>困难。<span style="color: black;">一样</span>的,Adam 在 Transformer 上表现<span style="color: black;">明显</span>好于 SGD,<span style="color: black;">亦</span>暗示了 Transformer 的优化性质与 CNN 等网络<span style="color: black;">区别</span>。</span></span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><span style="color: black;">Transformer 的分析<span style="color: black;">文案</span><span style="color: black;">非常多</span>都是对模型设计的分析,<span style="color: black;">倘若</span>有机会,我会单独写一篇<span style="color: black;">文案</span>进行介绍。<span style="color: black;">这儿</span>只介绍一篇<span style="color: black;">近期</span>的工作 ,以补完本节的完整性。</span></span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><span style="color: black;">该工<span style="color: black;">功效</span><span style="color: black;">海量</span>实验揭示了非常直接但有趣的一个现象:Transformer 架构的激活层非常的稀疏。在 T5-Base 中,<span style="color: black;">仅有</span> 3% 的神经元被激活了(ReLU 输出值非 0)。并且增大模型会使其更加稀疏,以及和数据似乎无关。无论<span style="color: black;">运用</span>视觉还是语言,<span style="color: black;">乃至</span>是随机数据,以及不重复的数据,都有该现象<span style="color: black;">显现</span>。作者简单的讨论了这种现象可能和优化动力学<span style="color: black;">关联</span>。笔者认为这个现象可能会给 Transformer 的优化器设计<span style="color: black;">有些</span>启示。</span></span></p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/tos-cn-i-6w9my0ksvp/645c54a12e6d416099f2f3a9a3c61840~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1725628907&x-signature=1Pc%2BeuLV6Vg%2F0KvxKbQK0wQinB0%3D" style="width: 50%; margin-bottom: 20px;"></div>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><span style="color: black;">▲ 激活值的稀疏性现象</span></span></p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/tos-cn-i-6w9my0ksvp/6e1a99ba915b4e5f841395bfd0a8fb6d~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1725628907&x-signature=QIUnuZrrLMzUPSIrkrKBsa0fO00%3D" style="width: 50%; margin-bottom: 20px;"></div>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;">参考文献</span></strong></span></p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/tos-cn-i-6w9my0ksvp/97ee53127a7c487cbde6ae0561060fd3~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1725628907&x-signature=KSZvz8gekTgBQiDQspdWRvoMApQ%3D" style="width: 50%; margin-bottom: 20px;"></div>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><span style="color: black;"> Exponentially many local minima for single neurons</span></span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><span style="color: black;"> The Loss Surfaces of Multilayer Networks</span> Open Problem: The landscape of the loss surfaces of multilayer networks Explorations on High Dimensional Landscapes No bad local minima: Data independent training error guarantees for multilayer neural networks Deep Learning without Poor Local Minima Empirical Analysis of the Hessian of Over-Parametrized Neural Networks Essentially No Barriers in Neural Network Energy Landscape Identifying and attacking the saddle point problem in high-dimensional non-convex optimization Gradient Descent Converges to Minimizers Eigenvalues of the Hessian in Deep Learning: Singularity and Beyond Explaining Landscape Connectivity of Low-cost Solutions for Multilayer Nets Gradient Descent Happens in a Tiny Subspace A Deeper Look at the Hessian Eigenspectrum of Deep Neural Networks and its Applications to Regularization Hessian based analysis of SGD for Deep Nets: Dynamics and Generalization Measurements of Three-Level Hierarchical Structure in the Outliers in the Spectrum of Deepnet Hessians The Hessian perspective into the Nature of Convolutional Neural Networks On large-batch training for deep learning generalization gap and sharp minima Hessian-based Analysis of Large Batch Training and Robustness to Adversaries<span style="color: black;"> A Loss Curvature Perspective on Training Instability in Deep Learning</span> Toward Understanding Why Adam Converges Faster Than SGD for Transformers Understanding the Difficulty of Training Transformers The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers Analyzing Monotonic Linear Interpolation in Neural Network Loss Landscapes</span></p>