1fy07h posted on 2024-7-9 14:35:51

Optimized for Large-Model Training: Baidu's Collective Communication Library BCCL Quickly Localizes Faults in 10,000-GPU Clusters


    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><img src="https://mmbiz.qpic.cn/mmbiz_gif/0MlRpv01lCHEOx2NBmHjXaGmTqzVI9sEP5VyAmy5TGWwjIXHeANK8sk6OPNQ1b5H3XicIJCaib5jewxdccs5gicRg/640?wx_fmt=gif&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1" style="width: 50%; margin-bottom: 20px;"></p><span style="color: black;"><strong style="color: blue;">1&nbsp;&nbsp;&nbsp;&nbsp;Collective communication is critical for distributed training</strong></span><span style="color: black;">In distributed training, each GPU handles only part of the model or data. <strong style="color: blue;">GPUs across the cluster use collective communication to synchronize gradients and update parameters, so that all GPUs accelerate model training as a single unit.</strong> If one GPU runs into trouble during a collective operation, the other GPUs are left waiting; only after that GPU finishes synchronizing its data does the whole cluster resume work.</span><span style="color: black;"><strong style="color: blue;">Collective communication performance therefore directly determines the speed of a distributed job, and whether all GPUs in the cluster can pull together to accelerate training.</strong></span>
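The gradient synchronization described above can be sketched in a few lines. This is a minimal pure-Python illustration of what an allreduce-style collective computes (an element-wise average across workers), not BCCL's or NCCL's actual implementation:

```python
def allreduce_mean(local_grads):
    """Element-wise average across workers; every worker gets the same result."""
    n = len(local_grads)
    return [sum(vals) / n for vals in zip(*local_grads)]

# 3 workers, each holding a 2-parameter local gradient
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(allreduce_mean(grads))  # [3.0, 4.0]
```

Because every worker must contribute before the result exists, one slow or stuck worker stalls all of them, which is exactly the waiting behavior described above.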
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">To maximize collective communication performance, clusters typically use a high-performance RDMA-based physical network at the infrastructure level, and a collective communication library to accelerate jobs at runtime.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;">2&nbsp; &nbsp; Large models place new demands on system operability and stability</span></strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Large-model training jobs run for weeks or months, on clusters of thousands or even tens of thousands of GPUs. Over such a long run, all kinds of faults occur, lowering resource utilization or interrupting the job. <strong style="color: blue;">A large-model training job therefore cannot focus on cluster scale and performance alone; it must also pay attention to the system's operability and stability.</strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">If operability and stability are poor, the cluster's "<strong style="color: blue;">effective training time</strong>" drops, and the project schedule stretches at great cost. For example, if a 30-day training run spends 10 days troubleshooting faults, that is unacceptable.</span></p>
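The cost in the example above is easy to quantify. A toy calculation, using the numbers from the example:

```python
# 30-day job, 10 days lost to troubleshooting: only two thirds of the
# wall-clock time was actually spent training.
total_days, fault_days = 30, 10
effective_ratio = (total_days - fault_days) / total_days
print(f"effective training time: {effective_ratio:.1%}")  # 66.7%
```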
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">As one of the core components of a distributed training system, the collective communication library likewise needs to be optimized for operability and stability in large-model scenarios.</strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;">3&nbsp; &nbsp; Overview of BCCL, Baidu's collective communication library</span></strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">BCCL (Baidu Collective Communication Library) is a collective communication library from Baidu AI Cloud, optimized for large-model training scenarios, and a key component of 百度百舸 (Baidu Baige) 3.0.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">BCCL extends and enhances the open-source NCCL, optimizing observability, fault diagnosis, and stability for large-model training scenarios to further improve the collective communication library's operability. BCCL also tunes collective communication performance for the specific GPU chips offered by Baidu AI Cloud, further raising resource utilization. Compared with NCCL, BCCL's key features are:</span></p><span style="color: black;">Observability: real-time collective communication bandwidth statistics;</span><span style="color: black;">Fault diagnosis: diagnosis of hangs in collective communication;</span><span style="color: black;">Stability: stronger network stability and fault tolerance;</span><span style="color: black;">Performance: improved collective communication performance on the mainstream GPU chips used for large-model training.</span><span style="color: black;">The next sections walk through these four capabilities.</span>
    <h2 style="color: black; text-align: left; margin-bottom: 10px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;">4 &nbsp;&nbsp; Observability: real-time collective communication bandwidth statistics</span></strong></span></h2><span style="color: black;"><strong style="color: blue;"><span style="color: black;">4.1&nbsp;&nbsp;&nbsp;&nbsp;Background</span></strong></span><span style="color: black;">During training, a job sometimes keeps running normally while the cluster's end-to-end performance drops. Any component in the cluster can be the cause, so an operations engineer has to inspect the cluster comprehensively.</span><span style="color: black;"><strong style="color: blue;"><span style="color: black;">4.2&nbsp;&nbsp;&nbsp;&nbsp;Problem</span></strong></span><span style="color: black;">The storage system, the RDMA network, and the GPU cards usually come with real-time observability platforms, so anomalies in them can be detected without interrupting the job. By contrast, there is no real-time, direct way to assess collective communication performance. Today, if collective communication is suspected of a performance problem, only two approaches are available:</span><span style="color: black;">Use the RDMA traffic monitoring platform for troubleshooting. This can only indirectly infer whether cross-node collective performance is abnormal.</span><span style="color: black;">Stop the training job to free the GPUs, then run nccl-test in a binary search until the faulty device is isolated.</span><span style="color: black;">Although the second approach can diagnose collective communication anomalies, its test coverage is limited: it can only reveal common hardware faults. It also interrupts training throughout, at great cost in time.</span><span style="color: black;"><strong style="color: blue;">4.3&nbsp;&nbsp;&nbsp;&nbsp;Features and effect</strong></span><span style="color: black;"><strong style="color: blue;">BCCL's real-time bandwidth statistics observe collective communication performance while training runs, accurately showing how it behaves in different phases and providing data support for fault diagnosis and training performance tuning</strong>. Even under complex communication patterns, BCCL's precise instrumentation still yields accurate bandwidth statistics.</span><span style="color: black;">For diagnosing abnormal collective performance, the per-communication-group statistics narrow the fault scope; under hybrid parallelism, intersecting several underperforming communication groups further pinpoints the faulty node.</span><span style="color: black;">For training performance optimization, the statistics show whether the bandwidth saturates the hardware limit and whether other optimization strategies apply, giving model tuning more monitoring data to work with.</span>
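The bandwidth figure behind such statistics can be sketched as follows. This is a minimal pure-Python illustration, not BCCL's actual instrumentation; the 2(n-1)/n allreduce factor follows the convention used by the open-source nccl-tests benchmark:

```python
def allreduce_bus_bw_gbps(gbytes_per_rank, seconds, n_ranks):
    """Bus bandwidth of an allreduce, nccl-tests style:
    busbw = algbw * 2 * (n - 1) / n, where algbw = bytes moved / time."""
    alg_bw = gbytes_per_rank / seconds
    return alg_bw * 2 * (n_ranks - 1) / n_ranks

# e.g. an 8 GB-per-rank allreduce finishing in 0.1 s across 8 ranks
print(allreduce_bus_bw_gbps(8.0, 0.1, 8))  # 140.0 GB/s
```

Comparing this number per communication group against the link's hardware limit is what lets an engineer tell "slow collective" apart from "saturated network".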
    <h2 style="color: black; text-align: left; margin-bottom: 10px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;">5 &nbsp;&nbsp; Fault diagnosis: collective communication fault diagnosis</span></strong></span></h2><span style="color: black;"><strong style="color: blue;"><span style="color: black;">
                <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">5.1&nbsp;&nbsp;&nbsp;&nbsp;Background</p>
            </span></strong></span><span style="color: black;">Training jobs aborting because of device faults are also a frequent occurrence in large-model training. When such a fault occurs there is usually an error log or an inspection alert, for example flagging a particular GPU as abnormal. When a job fails, matching the failure time against these events or alerts is enough to confirm the root cause.</span><span style="color: black;">Beyond these, there is a class of "silent faults" that raise no alert. <strong style="color: blue;">When one occurs, the whole training job hangs and cannot make progress, yet no process exits abnormally, and there is no way to tell which GPU or which faulty node caused the hang.</strong> What makes these faults hard to investigate is that they do not strike immediately: the job starts and trains normally, then hangs suddenly after some time (possibly hours or days). They are also hard to reproduce stably, which raises the difficulty further.</span><span style="color: black;"><strong style="color: blue;"><span style="color: black;">5.2&nbsp; &nbsp; Problem</span></strong></span><span style="color: black;">Because collectives are synchronous, when one GPU fails the other GPUs still believe they are waiting normally. So when communication stalls, no GPU emits an error log, which makes it very hard to locate the specific faulty GPU quickly. When the application hangs in a multi-GPU collective operation, it only perceives that some communication group (the faulty comm) has a problem; it cannot tell precisely which GPU caused the failure.</span><span style="color: black;">Operations engineers usually try nccl-test to reproduce and localize the problem, but its short runs and simple scenarios rarely reproduce a collective communication hang.</span><span style="color: black;">Inside Baidu, investigating such a problem meant first stopping the production training job and then running long stress tests, for example partitioning the current model and stress-testing the cluster machines in batches, progressively narrowing the fault scope until the bad machine is confirmed. That usually takes 2 days or more, and this kind of troubleshooting imposes an enormous cluster downtime cost.</span><span style="color: black;"><strong style="color: blue;">5.3&nbsp;&nbsp;&nbsp;&nbsp;Features and effect</strong></span><span style="color: black;">To address this challenge, BCCL records the internal state of each collective in real time while the job runs normally. When the job hangs, BCCL dumps the collective communication state of every rank, and operations engineers use the patterns in these data to narrow down the faulty GPU. In this way BCCL locates faulty machines quickly and almost without overhead, greatly improving troubleshooting efficiency.</span>
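One way such a per-rank state dump can narrow the suspects is to compare how many collectives each rank has completed: a rank that never entered the operation everyone else reached is the prime suspect. This is a hypothetical illustration of the idea, not BCCL's actual dump format:

```python
def find_lagging_ranks(op_counters):
    """Return ranks whose completed-collective counter trails the maximum."""
    latest = max(op_counters.values())
    return sorted(rank for rank, done in op_counters.items() if done < latest)

# Hypothetical dump after a hang: rank 2 never entered collective #1042.
counters = {0: 1042, 1: 1042, 2: 1041, 3: 1042}
print(find_lagging_ranks(counters))  # [2]
```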
    <h2 style="color: black; text-align: left; margin-bottom: 10px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;">6 &nbsp;&nbsp; Stability: network stability and fault-tolerance enhancements</span></strong></span></h2><span style="color: black;"><strong style="color: blue;">6.1&nbsp;&nbsp;&nbsp;&nbsp;Background</strong></span><span style="color: black;">During model training, an occasional up/down flap of a single network port makes the affected process fail, which in turn causes the whole training job to exit. Yet occasional single-port flaps are unavoidable in a physical network.</span><span style="color: black;"><strong style="color: blue;">6.2&nbsp;&nbsp;&nbsp;&nbsp;Features and effect</strong></span><span style="color: black;">For these transient fault scenarios, BCCL adds fault tolerance so that the job does not exit, improving training stability.</span>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Control-plane fault tolerance: at job startup, transient network faults or other glitches often make the launch fail. BCCL adds retry mechanisms for the common transient faults, ensuring that the training job starts successfully.</span></p>
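The retry idea behind such control-plane tolerance can be sketched generically. A minimal sketch, assuming a hypothetical flaky setup step (`connect` and the error type are illustrative, not BCCL's API):

```python
import time

def with_retries(step, attempts=3, backoff_s=0.5):
    """Run a flaky setup step, retrying on transient errors before giving up."""
    for i in range(attempts):
        try:
            return step()
        except OSError:
            if i == attempts - 1:
                raise          # out of retries: surface the error
            time.sleep(backoff_s)

# Hypothetical flaky step: fails twice with a transient error, then succeeds.
calls = {"n": 0}
def connect():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("transient link flap")
    return "connected"

print(with_retries(connect, attempts=3, backoff_s=0.0))  # connected
```

The point is that a fault which is transient by nature (a port flap, a dropped handshake) should cost a retry, not the whole job.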
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Data-plane fault tolerance: while a job is running normally, transient network jitter can exhaust RDMA retransmission retries and abort the whole training job. BCCL optimizes the RDMA retransmission-limit handling, making training jobs more robust.</span></p>
    <h2 style="color: black; text-align: left; margin-bottom: 10px;"><span style="color: black;"><strong style="color: blue;">7 &nbsp;&nbsp; Performance: collective communication optimization</strong></span></h2><span style="color: black;">On the mainstream GPU chips used for large-model training, collective communication performance still has headroom, leaving room to accelerate jobs further.</span><span style="color: black;">BCCL is deeply optimized for the mainstream GPU chips offered by Baidu AI Cloud. On a two-node H800 test setup, for example, BCCL improves bandwidth utilization by up to 10% over NCCL.</span><img src="https://mmbiz.qpic.cn/sz_mmbiz_png/0MlRpv01lCFKDZVvFf6gZ7fdwmFXePBgDyBA74LAKibV6ahFicGAAibovOwB1ABbMFzHfrk6KsIx8lMKdvtJT7qgw/640?wx_fmt=png&amp;from=appmsg&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1&amp;wx_co=1" style="width: 50%; margin-bottom: 20px;"><img src="https://mmbiz.qpic.cn/sz_mmbiz_png/0MlRpv01lCFKDZVvFf6gZ7fdwmFXePBgAsrgogIjlRIwwxMFP3JXX4q5aXB0CTkehLXyDw2QHwsicV5ibS8xTq6g/640?wx_fmt=png&amp;from=appmsg&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1&amp;wx_co=1" style="width: 50%; margin-bottom: 20px;"><img src="https://mmbiz.qpic.cn/sz_mmbiz_png/0MlRpv01lCFKDZVvFf6gZ7fdwmFXePBgfUPhWQcgf8PC0OmIicP0ibbOxqtP14HT238P3CCNVdlammYIbXCpYnzQ/640?wx_fmt=png&amp;from=appmsg&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1&amp;wx_co=1" style="width: 50%; margin-bottom: 20px;"><span style="color: black;"><strong style="color: blue;">8&nbsp;&nbsp;&nbsp;&nbsp;Summary</strong></span><span style="color: black;">On December 20, 2023, 百度百舸 (Baidu Baige) AI Heterogeneous Computing Platform 3.0 was released: intelligent infrastructure purpose-built for large models.</span><span style="color: black;">With the operability and stability optimizations contributed by BCCL, the platform achieves an effective training time of 98%, and effective bandwidth utilization can reach 95%.</span><span style="color: black;">-&nbsp;-&nbsp;-&nbsp;-&nbsp;-&nbsp;-&nbsp;-&nbsp;-&nbsp;-&nbsp;-&nbsp;END&nbsp;-&nbsp;-&nbsp;-&nbsp;-&nbsp;-&nbsp;-&nbsp;-&nbsp;-&nbsp;-&nbsp;-</span><span style="color: black;">Click "Read the original" to learn more about BCCL.</span>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;">Related reading</span></strong></span></p><a style="color: black;"><span style="color: black;">Large Models Reshape Cloud Computing</span></a><a style="color: black;"><span style="color: black;">Cloud Computing in the AI-Native Era</span></a><a style="color: black;"><span style="color: black;">Design and Practice of Large-Scale High-Performance AI Networks</span></a><a style="color: black;"><span style="color: black;">New GPU Container Virtualization Capabilities and Full-Scenario Practice</span></a><a style="color: black;"><span style="color: black;">Storage Acceleration Solution Design and Practice for Large Models</span></a><a style="color: black;"><span style="color: black;">Vector Search Technology and Practice in Large-Model Applications</span></a><a style="color: black;"><span style="color: black;">Heterogeneous Computing Platforms in the Era of Large Models</span></a><a style="color: black;"><span style="color: black;">High-Performance Network Construction Guide: the "AI Computing Center Network Architecture White Paper" Is Open for Download</span></a>




b1gc8v posted on 2024-9-27 14:03:43

Discussion sparkles like starlight, lighting up the night sky of ideas.

wrjc1hod posted on 2024-10-22 07:54:39

Your perspective is distinctive and I learned a lot — many thanks.