Notes on a Bizarre Episode of Frenzied Crawling by Baiduspider
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">This bizarre episode lasted more than ten days. During that time the load on my (Mingyue's) web server repeatedly spiked to its limit, and each time the only relief was to forcibly stop the php-fpm processes, which seriously disrupted the blog's normal operation. At first I assumed some bored person was once again using my blog for CC/DDoS target practice. But several days of log analysis, together with my years of experience on the receiving end of CC/DDoS attacks, ruled out an attack. The reason is simple: have you ever seen anyone launch a CC/DDoS attack from Baiduspider IPs? I certainly haven't!</span></p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/fa72f42f42484a67b5992759150ba193~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1725097459&x-signature=%2BDe2jZxwOIgZ39DKtim9zbmM5J0%3D" style="width: 50%; margin-bottom: 20px;"></div>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">At first I did not believe Baiduspider could be responsible either. But after filtering the IPs in several days of Nginx logs, it turned out that almost all of these crawler IPs were genuine Baiduspider IPs, not just requests spoofing the Baiduspider UA. The result was honestly depressing: </span><strong style="color: blue;"><span style="color: black;">I was being mobbed by the very Baiduspider that other webmasters dream of attracting</span></strong><span style="color: black;">!</span></p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/07e41666c8654431a3e9a41f24a6425f~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1725097459&x-signature=qjGCCQh2R2cGBtdRH6iFpYLVPqg%3D" style="width: 50%; margin-bottom: 20px;"></div>
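<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The post does not say how the IPs were checked. One common way to tell a genuine Baiduspider IP from a spoofed UA is a reverse DNS lookup followed by a forward confirmation. The sketch below assumes that genuine crawler hostnames end in <code>.baidu.com</code> or <code>.baidu.jp</code>; the function names are hypothetical, and the lookup functions are injectable so the logic can be tried without network access.</span></p>

```python
import socket

# Assumption: genuine Baiduspider hosts reverse-resolve to these suffixes.
BAIDU_SUFFIXES = (".baidu.com", ".baidu.jp")


def is_baidu_hostname(hostname):
    """Return True if a reverse-DNS name ends with a Baidu crawler suffix."""
    return hostname.lower().rstrip(".").endswith(BAIDU_SUFFIXES)


def verify_baiduspider(ip,
                       reverse_lookup=socket.gethostbyaddr,
                       forward_lookup=socket.gethostbyname_ex):
    """Reverse-resolve the IP, check the suffix, then forward-confirm it."""
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:
        # No PTR record or lookup failure: treat as unverified.
        return False
    if not is_baidu_hostname(hostname):
        return False
    try:
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False
    # The hostname must resolve back to the same IP, otherwise the PTR
    # record could have been forged by whoever controls that address block.
    return ip in addresses
```

<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Calling <code>verify_baiduspider(ip)</code> for each distinct IP pulled from the logs would separate real crawler traffic from requests that merely carry a Baiduspider UA string.</span></p>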
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">As the saying goes, "when something is out of the ordinary, something is up." With that in mind I began a week-long investigation. Because of two articles, "Not Staying Up Late Is the Highest Form of Self-Discipline" and "How Staying Up Late Changes Our Bodies," I am trying to break my habit of staying up late (hopefully with the same success as quitting smoking!), so the investigation was slow and done in spare moments. It took many random spot checks of the User Agent, IP, URL, host domain, and other data of these crawler requests, until today, when I finally found where the problem was.</span></p>
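<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Spot checks like these can be partly automated. The following is a minimal sketch, assuming the access log uses Nginx's default <code>combined</code> format (which does not record the host domain, so that field would need a custom log format). The regular expression and the function name <code>tally_spider_hits</code> are illustrative, not taken from the original investigation.</span></p>

```python
import re
from collections import Counter

# Assumption: Nginx "combined" log format, e.g.
# IP - user [time] "METHOD /path PROTO" status bytes "referer" "user agent"
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)


def tally_spider_hits(lines, ua_keyword="Baiduspider"):
    """Count requests per IP and per URL whose UA contains ua_keyword."""
    by_ip = Counter()
    by_path = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        if ua_keyword.lower() not in match.group("ua").lower():
            continue  # not a request claiming to be the crawler
        by_ip[match.group("ip")] += 1
        by_path[match.group("path")] += 1
    return by_ip, by_path
```

<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Passing an open log file to <code>tally_spider_hits</code> and reading <code>by_ip.most_common(20)</code> gives a quick view of which addresses are crawling hardest.</span></p>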
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The main cause of this massive, sustained crawling was that the "crawl frequency" in Baidu Webmaster Platform was too high. The site's crawl frequency as shown on the platform:</span></p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/1e12769f44bc441f978307338d5a0334~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1725097459&x-signature=fN3h8pcY%2BQWEsbBXcYA0H2wQP48%3D" style="width: 50%; margin-bottom: 20px;"></div>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">That is 21912 crawls per day; you can imagine how much pressure this put on the server! Only at the very end did I discover that the high-frequency crawling of two sites was landing on my single server. The 21912 crawls/day above is for the blog.ymanz.com domain; the other is my blog's domain, imydl.com, at 17982 crawls/day. Added together (21912 + 17982 = 39894), that is close to 40000 crawls a day, or nearly 30 requests every minute on average. Speechless!</span></p>
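<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The per-minute figure can be checked with a few lines of arithmetic. This sketch uses only the two crawl frequencies quoted above; the variable names are illustrative.</span></p>

```python
MINUTES_PER_DAY = 24 * 60  # 1440

# Daily crawl frequencies reported by Baidu Webmaster Platform in the post.
crawls_per_day = {
    "blog.ymanz.com": 21912,
    "imydl.com": 17982,
}

# Both domains resolve to the same server, so the loads add up.
total_per_day = sum(crawls_per_day.values())

# Average rate, assuming the crawling were spread evenly over the day.
per_minute = total_per_day / MINUTES_PER_DAY

print(f"{total_per_day} crawls/day, about {per_minute:.1f} per minute")
```

<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">This is only an average; real crawling tends to arrive in bursts, so the peak rate the server sees can be noticeably higher.</span></p>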
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/21f46c951d3f40458577ff7ce50e733c~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1725097459&x-signature=u2kOUAxG3N4RnjmzzDCedYQz1pU%3D" style="width: 50%; margin-bottom: 20px;"></div>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">That pushed the load to the max. Keep in mind that my server is an early Alibaba Cloud ECS instance with the lowest spec: 1 core and 1 GB of RAM.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">My little donkey cart cannot take that kind of punishment, so once I found the problem I acted quickly. First, I removed the DNS record for blog.ymanz.com (this was my blog's original domain; for now it looks like I simply have to give up resolving it). Second, I lowered the daily crawl frequency cap for both blog.ymanz.com and imydl.com in Baidu Webmaster Platform:</span></p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/4cbedc683d97419b983e0a24e3779836~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1725097459&x-signature=jl%2B2r1XFmu73ypALPry%2ByevvusU%3D" style="width: 50%; margin-bottom: 20px;"></div>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Since blog.ymanz.com is an abandoned domain, I set it straight to the minimum value.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">After these changes, Baiduspider's visit frequency dropped over the next few hours, and the server load finally returned to normal:</span></p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/pgc-image/2202738b2e7e4166a73dd0d08b9cb806~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1725097459&x-signature=0v%2FxJexzfXaBHOeLrQ08AWAS7bA%3D" style="width: 50%; margin-bottom: 20px;"></div>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Seeing that long-missed load figure, I know the past few days of work were not wasted. The experience deepened my understanding of operations work: it is a job where challenges can arrive at any moment, and when they do, you need to calmly analyze, organize, and think before solving the problem, then draw up and implement careful preventive measures. If you run a website but are not well versed in server operations, I suggest considering an outsourced ops service; I offer this kind of paid service myself.</span></p>