Posted by fny5jt9 on 2024-08-31 05:51:49

Optimizing Large-Scale Neural Networks: Hyperparameter Best Practices and Scaling Laws


©Author | Zangwei Zheng (郑奘巍)
Affiliation | National University of Singapore
Research focus | Efficient machine learning and neural network optimization
Starting from theoretical analysis to understand the laws governing large-scale neural network optimization can guide the choice of hyperparameters in practice. Conversely, the hyperparameter choices made in practice can also inform the theory. This article focuses on large language models and reviews how the commonly used training hyperparameters have evolved since GPT.
Scaling laws study how hyperparameters and performance change as the scale of a neural network grows. They are a deep characterization of the relationship between model, data, and optimizer, and they reveal general regularities of large-model optimization. With scaling laws, we can cheaply validate hyperparameter choices and performance trends on small models and then extrapolate to large ones.
In LLM work, scaling studies typically vary model size and data scale and tune many hyperparameters while keeping the optimizer fixed. For a large-model optimizer, the scaling law is therefore a good reflection of its performance (an upper bound). Designing a better optimizer, one that reaches the same performance with less data, amounts to challenging the existing scaling law.

Hyperparameter Best Practices

We first review the hyperparameters used in the important papers since GPT; the settings of the different models are listed at the end of this article. First, with the exception of Google's T5 and PaLM, all of the models use an Adam-family optimizer (Adam or AdamW). Second, changes to the hyperparameters have been incremental refinements of earlier recipes that were then adopted by later work, including dropout, gradient-norm clipping (Megatron-LM), dynamic batch-size schedules (GPT-3), and the change of Adam's beta2 from 0.999 to 0.95 (GPT-3).

Learning rate: we find that the learning rate decreases as models grow. It shows no clear relationship with data size or batch size, and values on the order of 1e-4 to 1e-3 are typical. Learning-rate schedules all have two phases, warmup and decay. The most widely used schedule today is GPT-3's cosine decay to one tenth of the peak learning rate; Google tends to use square-root decay, one advantage of which is that the total number of training steps need not be known in advance.

Batch size: the training batch size has also kept growing with model size, from 32k tokens for GPT and 128k for BERT to 3.2M for GPT-3 and 4M for LLaMA. Notably, GPT-3 starts at a batch size of 32k tokens and ramps it up gradually to the full batch over the first 4-12B training tokens, roughly a hundredfold increase. OpenAI argue in the paper that the batch size a model can exploit grows quickly as training progresses. Much later work simply uses the large batch from the start; this is probably fine because the ramp-up covers only about 2% of the total data, so using the maximum batch directly causes little harm.

Weight decay / L2 regularization: GPT and BERT both use it; among later models, some do and some do not. Note first that in the GPT and BERT era, models were still in the over-parameterized regime relative to their data and were trained for multiple epochs (multi-epoch). As the importance of data became better appreciated, the amount of data came to exceed the number of model parameters (GPT-3: 680B tokens vs. 175B parameters, under-parameterized) and training switched to a single epoch (single-epoch). According to the analysis in "Why do we need weight decay in modern deep learning?" (see references), weight decay in an over-parameterized network acts as an implicit regularizer on the optimizer, whereas in an under-parameterized network it merely changes the effective learning rate: as the weights evolve during training, it amounts to an adaptive learning-rate schedule.

The hyperparameter choices of the different models are listed at the end of this article. There, the numbers in parentheses after Adam are the (beta1, beta2) values, sch is the learning-rate schedule, bs is the batch size, L2 is the weight-decay coefficient, and init is the initialization scheme.
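To make the schedule described above concrete, here is a minimal sketch of a warmup-plus-cosine-decay learning-rate function in the spirit of the GPT-3 recipe. The peak learning rate, warmup length, total step count, and the 0.1 floor are placeholder values chosen for illustration, not numbers taken from any particular paper.

```python
import math

def lr_at_step(step, peak_lr=3e-4, warmup_steps=2000, total_steps=250_000, min_ratio=0.1):
    """Linear warmup, then cosine decay to min_ratio * peak_lr (GPT-3-style schedule sketch)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps            # linear warmup from 0 to peak
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)                       # hold the floor after total_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress)) # goes from 1 down to 0
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)

# print the schedule at a few representative steps
for s in (0, 1000, 2000, 125_000, 250_000):
    print(s, f"{lr_at_step(s):.2e}")
```

A square-root decay variant would simply replace the cosine term with 1/sqrt(step), which is why it does not need the total step count in advance.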
Neural Scaling Laws

Neural scaling laws use cheap, small-scale experiments to predict the behavior of large models, and thereby help choose the best architecture, algorithm, dataset, hyperparameters, and so on. In the broadest sense any factor can be studied this way: model width, amount of data, compute (FLOPs), and more.
[Figure: scaling-law fits on several reinforcement-learning benchmarks]
The figure shows examples from reinforcement learning: black points are experimental data, the red line is the fitted scaling law, and green points are validation data. When the fit is good, it can be used to predict the performance of large-scale models. Besides such monotonic scaling laws there are also non-monotonic ones, as in the next figure: a Transformer's performance first rises, then falls, then rises again as the model width increases.
[Figure: examples of non-monotonic scaling behavior]
One focus of scaling-law research is which family of curves can fit the phenomena above. A simple strategy is to fit a plain power law, which handles many cases but cannot capture the non-monotonic behavior above. The BNSL paper (broken neural scaling laws, see references) proposes its own functional form, roughly

y = a + b * x^(-c0) * prod_{i=1..n} (1 + (x / d_i)^(1/f_i))^(-c_i * f_i)

where x is the quantity on the horizontal axis and the remaining parameters are fitted. Here n means the curve consists of n + 1 segments; with n = 0 it reduces to an ordinary power law. There is no need to dwell on the exact form: the formula is deliberately all-encompassing, trying to cover every kind of scaling behavior. It allows the three qualitative shapes shown in the next figure and is therefore very flexible.
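As a minimal sketch of how such curves are fitted in practice, the snippet below fits the plain power law with an offset (the n = 0 case of BNSL) and extrapolates it. The data points and initial guesses are synthetic, made up purely for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, a, b, c):
    """L(x) = a + b * x**(-c): the n = 0 special case of BNSL."""
    return a + b * np.power(x, -c)

# synthetic "loss vs. model size" data, for illustration only
sizes = np.array([1e6, 1e7, 1e8, 1e9, 1e10])
losses = 1.7 + 40.0 * sizes**-0.3 + np.random.default_rng(0).normal(0, 0.01, sizes.shape)

params, _ = curve_fit(power_law, sizes, losses, p0=(1.0, 10.0, 0.2), maxfev=10_000)
a, b, c = params
print(f"fit: L(x) = {a:.2f} + {b:.1f} * x^(-{c:.2f})")
# extrapolate the fitted law to a larger scale
print("predicted loss at 1e11 params:", power_law(1e11, *params))
```

The same workflow (fit on small-scale points, check against held-out larger runs, then extrapolate) is how the green validation points in the figure above are produced.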
Scaling Laws for Large Language Models

The two most important papers on LLM scaling laws are arguably OpenAI's scaling-law paper (Kaplan et al.) and DeepMind's Chinchilla. We summarize their main conclusions.

Define N as the number of model parameters, D as the amount of data (tokens), C as the compute (FLOPs), and L as the loss. Hyperparameters split into optimization hyperparameters (learning rate, etc.) and architectural ones (depth, width). Let B be the batch size and S the number of training steps; for single-epoch training, D = B * S. For large language models, once N and D are fixed, C can be estimated, commonly as C ≈ 6 N D.

In practice, given a compute budget C, we want to choose N and D so as to obtain the lowest loss L. Write N_opt(C) and D_opt(C) for the best choices at that budget, i.e.

N_opt(C), D_opt(C) = argmin over {N, D : FLOPs(N, D) = C} of L(N, D)

1. Model performance depends strongly on N, D, and C, and only weakly on the architectural hyperparameters.

2. L follows a power law in each of N, D, and C, i.e. roughly L(X) = (X_c / X)^(alpha_X) for X in {N, D, C}, when the other two factors are not the bottleneck.
[Figure: loss as a power law in compute, data, and parameters]
Here L(X) denotes the best achievable loss at a given X, i.e. with the other factors unconstrained. Since the loss cannot keep decreasing indefinitely, this law must eventually break down, but the scale of the experiments in the OpenAI paper is too small for that to show.

3. For a fixed compute budget, approximately N_opt ∝ C^0.5 and D_opt ∝ C^0.5. In other words, when the model size doubles, the amount of data should also double to remain compute-optimal. This is the main correction that Chinchilla makes to the OpenAI scaling-law paper. In the next figure, the black dashed line is the OpenAI result, and the three colored lines are the same conclusion derived by Chinchilla with three different methods; the Chinchilla model itself was trained according to this allocation.
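Here is a small numerical sketch of conclusion 3. It assumes the common compute estimate C ≈ 6 N D and the often-quoted Chinchilla rule of thumb of roughly 20 training tokens per parameter; both are assumptions for illustration rather than numbers derived in this article.

```python
def chinchilla_allocation(flops, tokens_per_param=20.0):
    """Split a compute budget C into (N, D) assuming C ~= 6*N*D and D ~= 20*N."""
    n_params = (flops / (6.0 * tokens_per_param)) ** 0.5   # C = 6*N*(20*N) -> N = sqrt(C/120)
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for c in (1e21, 1e22, 1e23, 1e24):
    n, d = chinchilla_allocation(c)
    print(f"C={c:.0e}  ->  N ~ {n/1e9:.1f}B params, D ~ {d/1e12:.2f}T tokens")
```

Because N and D both scale as C^0.5 under this allocation, quadrupling the compute doubles both the model size and the data, which is exactly the "scale them together" rule.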
[Figure: compute-optimal model size vs. compute, OpenAI vs. Chinchilla]
The OpenAI paper instead concluded that, roughly, a 10x increase in compute should grow the model about 5x but the data only about 2x. Chinchilla attributes the discrepancy to two factors: (1) the OpenAI experiments did not adapt the learning-rate schedule to the amount of training data of each run, and a properly matched schedule has a large effect on training; (2) the compute scales used were relatively small. The scaling curve has curvature, so conclusions drawn from too small a compute range are inaccurate (the curvature also implies that the law must eventually fail).
[Figure: IsoFLOP curves, loss of different model sizes at fixed compute budgets]
Shown here is one of Chinchilla's arguments: for each fixed compute budget C, plot the final loss of different model sizes N and take the minimum; collecting these minima across budgets yields the optimal configuration.
[Figure: Chinchilla's fitted parametric loss]
Chinchilla's final fit takes the parametric form L(N, D) = E + A / N^alpha + B / D^beta. Substituting C ≈ 6 N D, one can compute the corresponding optimal N and D for a given budget, which again shows that data and model size should be scaled up together. In addition, under Chinchilla's setting the optimal allocation works out to roughly 20 training tokens per parameter.

4. The critical batch size follows a power law in the loss, B_crit(L) ≈ B* / L^(1/alpha_B), and is only weakly related to other factors. The critical batch size was introduced in the earlier post in this series, "Optimizing Large-Scale Neural Networks: Batch Size and Noise"; it can be understood as the largest batch size at which roughly the same compute still reaches the same loss. The OpenAI paper's fit gives approximately B* ≈ 2e8 tokens and alpha_B ≈ 0.21 (so 1/alpha_B ≈ 4.8). The smaller the loss, the larger the usable batch, which also explains the batch-size ramp-up in GPT-3 described above.
[Figure: LLaMA training-loss curves]
On the other hand, the training loss as a function of steps shows three phases: a rapid initial drop, a roughly linear phase, and a plateau (see the LLaMA training curves). Because the loss falls quickly early in training, and the critical batch size follows a power law in the loss, the critical batch size rises very quickly during the first part of training. Plugging LLaMA's loss into the fit: very early in training the loss already drops to about 2.2, giving a critical batch size of about 4.7M tokens, which matches the 4M batch size LLaMA actually uses. This also explains why the batch-size ramp-up can be skipped.
If the loss gets down to 1.5, the critical batch size grows to about 30M tokens, so LLaMA could have increased its batch size further during training. By the same reasoning, GPT-4's reported final batch size of 60M would correspond to a training loss of around 1.3.
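The back-of-the-envelope numbers above can be reproduced from the critical-batch-size fit quoted earlier. This is a sketch assuming B_crit(L) = B* / L^(1/alpha_B) with B* ≈ 2e8 tokens and alpha_B ≈ 0.21, which are the OpenAI fit values as I recall them; treat the constants as assumptions, and note the outputs only roughly match the 4.7M / 30M / 60M figures quoted in the text.

```python
def critical_batch_tokens(loss, b_star=2.0e8, inv_alpha_b=1 / 0.21):
    """Critical batch size (in tokens) as a power law in the training loss."""
    return b_star / loss ** inv_alpha_b

for loss in (2.2, 1.5, 1.3):
    print(f"loss={loss}: B_crit ~ {critical_batch_tokens(loss) / 1e6:.1f}M tokens")
```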
5. A model's transfer generalization correlates positively with its generalization on the training distribution.

As the figure shows, the lower the test loss on the training distribution, the lower the loss on other datasets as well (for example, training on Wikipedia and evaluating on WebText2). The figure also shows that the test loss decreases as the parameter count grows, and that the test loss on a different dataset differs from that on the training distribution only by a constant offset.
[Figure: test loss on different evaluation sets vs. model size]
6. Larger models converge faster (they reach the same loss with less data)

Brighter lines in the figure denote larger models. The left panel shows that to reach a given test loss, a larger model needs to see less data. The right panel compares models at equal compute: the crossing point of two curves separates the regimes in which each model size is preferable; to its left the smaller model is the better choice, to its right the larger one.
[Figure: loss vs. data and vs. compute for different model sizes]
Another important observation in the figure is that the loss falls more slowly late in training. So rather than training a small model to convergence, it is more efficient to spend the same resources training a larger model without running it to convergence.
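As an illustration of the crossing-point behavior, the sketch below evaluates a Chinchilla-style parametric loss for two fixed model sizes as a function of compute (this borrows the parametric form quoted earlier, with published constants quoted from memory, so treat the numbers as assumptions; the two model sizes are arbitrary examples).

```python
import numpy as np

# Chinchilla-style parametric loss; constants are assumptions quoted from memory
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss_at_compute(n_params, flops):
    """Loss of a fixed-size model after spending `flops`, assuming C ~= 6*N*D."""
    n_tokens = flops / (6.0 * n_params)
    return E + A / n_params**alpha + B / n_tokens**beta

small, large = 1e9, 1e10   # two hypothetical model sizes (1B vs. 10B parameters)
for c in np.logspace(20.5, 22.5, 5):
    ls, ll = loss_at_compute(small, c), loss_at_compute(large, c)
    better = "small" if ls < ll else "large"
    print(f"C={c:.1e}: small={ls:.3f}  large={ll:.3f}  -> {better} model wins")
```

At small budgets the small model is ahead (the large one has not seen enough tokens yet); past the crossover the large model wins, matching the figure's message.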
Further Notes on LLM Scaling Laws

Beyond the two classic papers above, a number of other papers offer their own insights.
3.1 "Emergence" is an artifact of metric choice; continuous metrics follow a power law in model size

Emergence refers to certain capabilities improving suddenly and unpredictably once the parameter count passes some scale, and it is considered an important hallmark of large models. What we study here is how metric performance relates to model size, which is itself a kind of scaling law.
[Figure: emergent vs. smooth metrics across model scales]
The "mirage" paper (see references) points out that most reported emergent abilities appear under two kinds of metrics: multiple-choice accuracy and exact-string-match accuracy. Switching metrics makes the scaling of model capability much more predictable. From the scaling laws above we already know that the loss falls smoothly with model size (panel A), so the probability of predicting an individual token correctly also rises smoothly (panel B). If the non-linear metric "exact string match" is replaced by "number of wrongly predicted tokens", the same power-law behavior reappears; likewise, replacing discontinuous multiple-choice accuracy with a continuous version of the score again yields a power law.
In my view, this paper should not be read as a refutation of the importance of "emergence". In the real world, in everyday life and in the market, the metrics we care about are exactly the non-linear, discontinuous ones. The real significance of the paper is that continuous metrics let us model the scaling behavior better and thereby predict how the discontinuous metrics will change. It also demystifies the "quantitative change produces qualitative change" story behind large models: no appeal to complex whole-system interactions is needed.
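A tiny sketch of the mechanism: if per-token accuracy improves smoothly with scale, an exact-string-match metric over a multi-token answer (roughly accuracy raised to the answer length) still looks like a sudden jump, while the continuous per-token metric does not. The accuracy-vs-size curve below is synthetic, invented purely for illustration.

```python
import numpy as np

sizes = np.logspace(7, 11, 9)                          # hypothetical model sizes
per_token_acc = 1.0 - 0.5 * (sizes / 1e7) ** -0.35     # smooth, made-up improvement with scale

answer_len = 20
exact_match = per_token_acc ** answer_len              # sequence-level metric: looks "emergent"
for n, te, em in zip(sizes, per_token_acc, exact_match):
    print(f"N={n:.0e}  per-token={te:.3f}  exact-match={em:.3f}")
```

The per-token column climbs steadily, while the exact-match column sits near zero for most sizes and then shoots up, even though nothing discontinuous is happening underneath.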
3.2 Larger models need smaller learning rates
From the hyperparameter record above it is easy to see that larger models use smaller learning rates; the Tensor Programs V (µP) paper shows this in its left figure. The argument there is that the learning rate must shrink to keep the total variance of the updates at a fixed level (the variance grows with the parameter count); I have not yet found a fully detailed theoretical account of this point. The same work also proposes a new initialization and parameterization scheme under which models of different scales can share the same learning rate, which we will not expand on here.
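The full µP recipe is more involved; as a rough illustration of the direction of the effect only, here is a sketch of one commonly quoted µP-style rule for Adam, scaling the hidden-weight learning rate inversely with width relative to a small tuned proxy model. The base values are placeholders, and this is not the complete parameterization from the paper.

```python
def mup_style_hidden_lr(base_lr, base_width, width):
    """Scale the hidden-layer Adam LR roughly as 1/width relative to a tuned base model."""
    return base_lr * base_width / width

base_lr, base_width = 3e-3, 256   # hypothetical values tuned on a small proxy model
for width in (256, 1024, 4096, 16384):
    print(width, f"{mup_style_hidden_lr(base_lr, base_width, width):.2e}")
```

The point of the µP parameterization is precisely to make this re-scaling unnecessary, so that the learning rate tuned on the proxy transfers unchanged to the large model.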
3.3 When training on repeated data (multi-epoch), train a smaller model for more epochs

The data-constrained scaling paper (see references) studies how to train large models when data is limited. In its left figure, up to about 4 epochs, repeating data is nearly as good as using fresh data (GPT-4 reportedly repeated text twice and code four times, consistent with this result); beyond roughly 40 epochs there is almost no further gain. In the right figure, using the fit from the left panel, one finds that compared with the Chinchilla allocation, training on repeated data should use more data (more repetitions) and a smaller model.
3.4 Training on repeated data helps very little

The "To Repeat or Not To Repeat" paper (see references) runs a large set of experiments to verify a series of claims. In the left figure, the authors confirm that the Chinchilla scaling law also holds for encoder-decoder models (data and parameters should be scaled together). The right figure shows that training on repeated data does not help performance. The paper also tries higher-quality data, the UL2 training objective, and various regularization methods, and finds that apart from dropout, none of them helps when training on repeated data.
3.5 Training models smaller than the Chinchilla-optimal size
The starting point of the Chinchilla scaling law is: given a compute budget, allocate parameters and data so as to minimize the loss; equivalently, given a target loss, minimize the compute. In practice, however, training a smaller model brings benefits beyond the training compute itself:

- a smaller model is cheaper to run at inference time after deployment;
- a smaller model needs a smaller cluster to train.

The "Go smol or go home" post (see references) therefore proposes shrinking the model as much as possible without increasing the training compute too much. Concretely, on top of the Chinchilla law, let the model have some fraction of the compute-optimal parameter count and compute the amount of data needed to reach the same loss. One can show that the resulting trade-off does not depend on the compute budget: the relationship between the shrink factor and the required extra data (and compute) is the same no matter how large the training run. The next figure shows the compute overhead as a function of the shrink factor.
LLaMA-7B, for example, uses a smaller model and more compute (more data) than the corresponding Chinchilla-optimal point. Because the required compute rises sharply once the parameter count is reduced too far, the author argues the model should not be shrunk below a critical model size: at about 30% of the optimal parameter count the required compute already grows by about 100%, and shrinking further is not worthwhile. We can see something similar with Llama-2. By the Chinchilla law, 2T tokens corresponds to roughly a 50B-parameter model, so Llama-2-7b is a comparatively small model trained with extra compute, whereas Llama-2-70b, being larger than the compute-optimal size for that much data, is on the inefficient side.
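Here is a sketch of the kind of calculation behind such trade-off curves. It plugs the Chinchilla-style parametric loss L(N, D) = E + A/N^alpha + B/D^beta into C ≈ 6 N D, shrinks the model to a fraction of the compute-optimal size, and solves for the extra data (hence compute) needed to reach the same loss. The constants are the published Chinchilla fit as I recall them; treat both the constants and the printed percentages as assumptions rather than numbers taken from this article.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Chinchilla-style parametric fit (values quoted from memory; assumptions)
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params, n_tokens):
    return E + A / n_params**alpha + B / n_tokens**beta

def optimal_n(flops):
    """Compute-optimal model size for budget C, assuming C ~= 6*N*D."""
    obj = lambda log_n: loss(np.exp(log_n), flops / (6.0 * np.exp(log_n)))
    res = minimize_scalar(obj, bounds=(np.log(1e6), np.log(1e13)), method="bounded")
    return np.exp(res.x)

def overhead_for_smaller_model(flops, shrink):
    """Extra compute needed if the model is `shrink` x the optimal size but must hit the same loss."""
    n_opt = optimal_n(flops)
    target = loss(n_opt, flops / (6.0 * n_opt))
    n_small = shrink * n_opt
    d_needed = (B / (target - E - A / n_small**alpha)) ** (1.0 / beta)  # solve L(n_small, d) = target
    return 6.0 * n_small * d_needed / flops - 1.0

c = 1e23  # an arbitrary example budget
for k in (0.7, 0.5, 0.3):
    print(f"model at {k:.0%} of optimal size -> ~{overhead_for_smaller_model(c, k):+.0%} compute")
```

With these assumed constants the overhead at 30% of the optimal size comes out near +100%, in line with the claim above, and the result is essentially independent of the budget c.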
Werra argues that we should keep training even smaller models on more data. The difficulties are:

- the required amount of training data may not exist (as pointed out in the references, we are close to using up all the tokens available on the internet);
- training a small model on a small cluster takes a long time (Llama-2 took about 500k iterations); training it on a large cluster is also awkward (for example, a much larger batch size would be needed to stay efficient).
Hyperparameter Settings of LLMs
4.1 GPT (117M):
- Adam
- lr: 2.5e-4
- sch: warmup linear 2k, cosine decay to 0
- bs: 32k = 64 x 512
- its: 3M (100e)
- L2: 0.01
- init: N(0, 0.02)

4.2 BERT (330M):
- Adam(0.9, 0.999)
- lr: 1e-4
- sch: warmup 10k, linear decay to 0
- bs: 128k = 256 x 512
- its: 1M (40e)
- L2: 0.01
- dropout: 0.1
4.3 Megatron-LM (GPT-2 8.3B & BERT 3.9B):
- Adam
- lr: 1.5e-4
- sch: warmup 2k, cosine decay to 1e-5
- bs: 512k = 512 x 1024
- its: 300k
- L2: 0.01
- dropout: 0.1
- gradient norm clipping: 1.0
- init: N(0, 0.02), with weights before residual layers scaled by 1/sqrt(2N), N = number of layers

4.4 T5 (11B):
- AdaFactor
- lr: 1e-2
- sch: warmup constant 10k, sqrt decay
- bs: 65k = 128 x 512
- its: 500k (1e)

4.5 GPT-3:
- Adam(0.9, 0.95, eps=1e-8)
- lr & final bs: set per model size (table omitted)
- sch: linear warmup over 375M tokens, cosine decay to 0.1x lr over 260B tokens, then continue training at 0.1x lr
- bs sch: 32k to final bs, ramped gradually over 4-12B tokens
- seq length: 2048
- data: 680B
- gradient norm clipping: 1.0

4.6 Gopher:
- Adam (Adafactor unstable beyond 7.1B)
- lr & final bs: set per model size (table omitted)
- sch: warmup 1.5k, cosine decay to 0.1x lr
- gradient norm clipping: 0.25 for 7.1B & 280B, 1.0 for the rest

4.7 Chinchilla (70B):
- AdamW
- lr: 1e-4
- bs: 1.5M to 3M
- others follow Gopher

4.8 OPT:
- Adam(0.9, 0.95) (SGD plateaus quickly)
- lr & bs: set per model size (table omitted)
- sch: warmup linear 2k, decay to 0.1x lr
- L2: 0.1
- dropout: 0.1
- gradient norm clipping: 1.0
- init: N(0, 0.006), output layers scaled down further (Megatron-style)

4.9 PaLM:
- Adafactor(0.9, 1 - k^(-0.8)), k = step
- lr: 1e-2
- sch: warmup 10k, then decay as 1/sqrt(k)
- bs: 1M (<50k steps), 2M (<115k), 4M (<255k)
- L2: coupled to the learning rate (lr^2)
- dropout: 0.1
- gradient norm clipping: 1.0
- its: 255k
- init: fan-in variance scaling, embeddings N(0, 1)

4.10 LLaMA (RMSNorm, SwiGLU, RoPE):
- AdamW(0.9, 0.95)
- lr & bs: set per model size (table omitted)
- sch: warmup 2k, decay to 0.1x lr
- L2: 0.1
- gradient norm clipping: 1.0

4.11 LLaMA-2:
- AdamW(0.9, 0.95, eps=1e-5)
- lr: set per model size (table omitted)
- sch: warmup 2k, decay to 0.1x lr
- L2: 0.1
- gradient norm clipping: 1.0
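To make the recurring recipe concrete, here is a minimal PyTorch-style sketch wiring together LLaMA-2-like settings from the list above: AdamW(0.9, 0.95, eps=1e-5), weight decay 0.1, 2k-step warmup, decay to 0.1x the peak learning rate, and gradient-norm clipping at 1.0. The model, peak learning rate, and step counts are placeholders, and the cosine shape of the decay is an assumption.

```python
import math
import torch

model = torch.nn.Linear(512, 512)            # placeholder for an actual transformer
peak_lr, warmup, total_steps = 3e-4, 2000, 500_000

optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                              betas=(0.9, 0.95), eps=1e-5, weight_decay=0.1)

def lr_lambda(step):
    # linear warmup, then cosine decay to 0.1 * peak_lr
    if step < warmup:
        return step / warmup
    progress = min(1.0, (step - warmup) / (total_steps - warmup))
    return 0.1 + 0.9 * 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

def training_step(batch_loss):
    optimizer.zero_grad()
    batch_loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # gradient norm clipping
    optimizer.step()
    scheduler.step()
```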
References

Why do we need weight decay in modern deep learning?
Broken Neural Scaling Laws
Training Compute-Optimal Large Language Models
Scaling Laws for Neural Language Models
Are Emergent Abilities of Large Language Models a Mirage?
Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer
Scaling Data-Constrained Language Models
To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis
Go smol or go home