Baiyang: What Is a Crawler? A Plain-Language Look at the SEO's Good Friend, the Crawler. Do You Really Know It?
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/tos-cn-i-qvj2lq49k0/3a10c6ee57a340a5b594c8961c7d820d~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1725089408&x-signature=ZckSUkJcCAxPjvUtTfs8qlPIl%2BM%3D" style="width: 50%; margin-bottom: 20px;"></div>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Outline of this article:</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">1. What is a crawler? What is anti-crawling?</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">2. What types of crawlers are there?</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">3. Crawler workflow and search engine workflow</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">4. The HTTP/HTTPS protocols and status codes</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">5. The robots protocol</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">What is a crawler? What is anti-crawling?</span></strong></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The crawler here is not the creepy-crawly of everyday life, such as a spider. It refers to a web crawler, which is also called a web spider or web robot. In SEO, "web spider" is the more common name.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">A web crawler is a program that automatically fetches information from the internet according to a set of rules. It is also known as a spider: Baidu's web spider, for example, is called Baiduspider, and Sogou's is called Sogou spider.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">It is also a word SEO practitioners hear constantly when working on site rankings. Why hasn't the site been indexed? Because the spider never came to crawl it! How do we tell whether our spider friend has visited? Ask the developers to download the site's access logs for us, and we can check for ourselves. Doesn't that make it a good friend?</span></p>
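<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">To make the log check concrete, here is a minimal Python sketch that counts spider visits in access-log lines by looking for spider names in the User-Agent field. The token list, the function name count_spider_hits, and the sample log lines are all assumptions made up for illustration; your server's log format may differ.</span></p>

```python
from collections import Counter

# Substrings that identify a few well-known spiders in the User-Agent field.
# Illustrative only; this list is an assumption and is not exhaustive.
SPIDER_TOKENS = ("Baiduspider", "Sogou", "Googlebot", "bingbot")


def count_spider_hits(log_lines):
    """Count log lines per spider token, using a case-insensitive match."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in SPIDER_TOKENS:
            if token.lower() in lowered:
                hits[token] += 1
                break  # count each log line at most once
    return hits


# Hypothetical log lines in the common "combined" format.
sample_log = [
    '192.0.2.10 - - [01/Jan/2024:00:00:01 +0800] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"',
    '192.0.2.20 - - [01/Jan/2024:00:00:05 +0800] "GET /about HTTP/1.1" 200 1024 "-" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
    '192.0.2.30 - - [01/Jan/2024:00:00:09 +0800] "GET /post/1 HTTP/1.1" 200 2048 "-" '
    '"Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"',
]

if __name__ == "__main__":
    for spider, count in count_spider_hits(sample_log).items():
        print(spider, count)
```

<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Keep in mind that a User-Agent string can be faked, so a match like this only suggests that a spider visited; it does not prove it.</span></p>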
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">What is the Baidu crawler? What is Baiduspider?</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Baiduspider is an automated program of the Baidu search engine. Its job is to visit pages on the internet and build an index database so that users can find a site's pages in Baidu search. What other spiders does Baidu have? See the figure below. The one circled is the most common, so remember it.</span></p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/tos-cn-i-qvj2lq49k0/b19b96e4cf0141308e1a889187e0f34c~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1725089408&x-signature=%2F8LdB%2FoHYD083%2Fn3k8N%2Bm09Smjc%3D" style="width: 50%; margin-bottom: 20px;"></div>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">What is anti-crawling?</span></strong></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Take a portal site as the example; the same applies to corporate sites. When a portal uses strategies and technical measures to stop crawler programs from scraping its data, that is anti-crawling.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">There is also anti-anti-crawling: the crawler program uses its own strategies and techniques to defeat the portal's anti-crawling measures and scrape the data anyway.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">In plain terms: you come to scrape my content (crawling); I refuse and put up scraping defenses (anti-crawling); you then bring better techniques, break through my defenses, and scrape anyway (anti-anti-crawling). Clear now?</span></p>
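<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">One common anti-crawling measure is limiting how many requests a single client may make in a short period. The sketch below is only one possible illustration of that idea, not how any particular site does it; the class name RateLimiter and its parameters are hypothetical.</span></p>

```python
from collections import defaultdict, deque


class RateLimiter:
    """Allow at most max_requests per client within a sliding time window."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # client id (for example an IP address) -> timestamps of accepted requests
        self._hits = defaultdict(deque)

    def allow(self, client_id, now):
        """Return True if the request at time `now` (in seconds) is accepted."""
        recent = self._hits[client_id]
        # Forget timestamps that have fallen out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False  # too many recent requests: treat as a likely crawler
        recent.append(now)
        return True


if __name__ == "__main__":
    limiter = RateLimiter(max_requests=3, window_seconds=10.0)
    for t in (0.0, 1.0, 2.0, 3.0):
        print(t, limiter.allow("192.0.2.10", t))
```

<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">A crawler that slows down or spreads its requests across many addresses slips past a limit like this, which is the anti-anti-crawling side of the contest.</span></p>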
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">What types of crawlers are there?</span></strong></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Crawlers are usually divided into two types: general-purpose crawlers and focused crawlers.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">General-purpose crawler: put simply, it downloads as many pages from the web as it can, stores them on servers, processes them, and finally serves them to users through search. This usually means search engine crawlers, such as those of Google, Baidu, Sogou, and 360.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Focused crawler: it fetches data from specified sites according to specified requirements, for example the view count and number of answers for one question on Zhihu, rather than all the data on every page. You can also think of it as a special-purpose crawler.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The anti-crawling and anti-anti-crawling described above are mostly aimed at this kind of focused crawler. You can think of it as a crawler arms race.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">Crawler workflow and search engine workflow</span></strong></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">General crawler workflow: pick a URL → send a request → receive the response → extract data → save data.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Search engine spider workflow: crawl pages → store data → preprocess data → serve ranked pages for user searches.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Does that seem hard to follow? What is sending a request, and what is a response? Read the section on the HTTP protocol and status codes below and it will make sense.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">As for what search engine preprocessing actually does and how to understand it, see the article published on the Baiyang SEO WeChat official account two years ago, "Baiyang SEO: A Plain-Language Guide to Understanding How Search Engines Work and How to Apply It". Read it and you will understand.</span></p>
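<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The general crawler workflow above (pick a URL, send a request, receive the response, extract data, save data) can be sketched in Python using only the standard library. This is a simplified illustration under stated assumptions, not how a real search engine spider is built: the function names, the example-crawler/0.1 User-Agent, and the choice to extract only the page title and links are all made up for the example.</span></p>

```python
import json
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TitleAndLinkParser(HTMLParser):
    """Collect the page title and the href of every link on the page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch(url, user_agent="example-crawler/0.1"):
    """Send the request and return the response body as text."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract(html):
    """Extract data: here, just the title and the links."""
    parser = TitleAndLinkParser()
    parser.feed(html)
    return {"title": parser.title.strip(), "links": parser.links}


def save(record, path):
    """Save data: write the extracted record to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, ensure_ascii=False, indent=2)


def crawl(url, path):
    """Run the whole workflow for one URL."""
    record = extract(fetch(url))
    record["url"] = url
    save(record, path)
    return record
```

<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Calling crawl with a URL and an output file name runs all five steps once. A search engine spider repeats the same loop at enormous scale, feeding the extracted links back in as new URLs to visit.</span></p>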
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">The HTTP/HTTPS protocols and status codes</span></strong></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">HTTP stands for Hypertext Transfer Protocol. It is the protocol used to transfer hypertext from World Wide Web (WWW) servers to the local browser. Default port: 80.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">HTTPS (Hypertext Transfer Protocol Secure) adds an SSL/TLS encryption layer on top of HTTP and encrypts the data in transit; it is the secure version of HTTP. Default port: 443.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Still not sure what HTTP actually is? In plain terms, it is what sends and receives web pages, making sure documents reach your computer quickly so you can view them.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The HTTP request and response headers are all technical detail that Baiyang SEO will not cover here; if you really want to understand them, search for them yourself. Here we only cover the HTTP response status codes you will see in SEO. The common ones are as follows:</span></p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/tos-cn-i-qvj2lq49k0/93e72ad5d9ec4af8801da9da88dd1393~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1725089408&x-signature=bvshY17YqJ34ctEen9TwRuEx%2FoU%3D" style="width: 50%; margin-bottom: 20px;"></div>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Any of the codes above that start with 2 or 3 are generally fine. For example, checking the Baiyang SEO blog:</span></p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/tos-cn-i-qvj2lq49k0/d587aedd1012489b8ed142e3d088aac9~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1725089408&x-signature=Mz7IcryKcnoQEEpg7aPHXz%2FuzO4%3D" style="width: 50%; margin-bottom: 20px;"></div>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/tos-cn-i-qvj2lq49k0/7e568c273f3e413ab842edc90eb46e58~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1725089408&x-signature=Qm5Hjah20ENL%2FkwHxRg%2BLhnXb6A%3D" style="width: 50%; margin-bottom: 20px;"></div>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Entering http://www.baiyangseo.com returns 301, while entering https://www.baiyangseo.com returns a normal 200. Do you know why?</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">In SEO terms, it is because the two different URLs serve exactly the same content. To keep search engines from treating that as cheating, a 301 permanent redirect was set up. Put simply, opening the HTTP address without the "s" takes you to the one with it.</span></p>
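<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">If you would rather check status codes yourself than rely on an online tool, the following Python sketch shows one way to do it with the standard library. The helper names describe_status and head_status are hypothetical. head_status sends a HEAD request without following redirects, so a 301 shows up as 301 together with the address it points to.</span></p>

```python
import http.client
from http import HTTPStatus
from urllib.parse import urlsplit

# The first digit of a status code gives its class.
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code):
    """Return a readable description such as '301 Moved Permanently (redirection)'."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown"  # not a code the standard library knows
    return f"{code} {phrase} ({STATUS_CLASSES.get(code // 100, 'unknown class')})"


def head_status(url):
    """Send a HEAD request without following redirects.

    Returns (status code, value of the Location header or None).
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        conn.request("HEAD", parts.path or "/")
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


if __name__ == "__main__":
    for code in (200, 301, 404, 503):
        print(describe_status(code))
```

<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Calling head_status on the http:// address of a site that redirects to HTTPS should return 301 and the https:// address in the Location header, though what you actually get depends on how the server is configured at the time.</span></p>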
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">If you want to study status codes in more depth, you can also read this article on the Baiyang SEO official account: "Baiyang SEO: SEO Basics: Search Engine Spiders and Website HTTP Status Codes".</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">The robots protocol</span></strong></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Finally, we come to the robots protocol, which is aimed at search engine web spiders. If you are learning SEO, this is a must.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">What is the robots protocol? Put simply, a site uses it to tell search engines which pages may be crawled and which may not. However, it is only a convention on the internet, not an enforcement mechanism. That is why some people complain that they clearly blocked spider XXX and still got crawled.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">What does it look like, and what is it actually for?</span></p>
<div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/tos-cn-i-qvj2lq49k0/17db5250c56b4868bb7e1c1cb67f4a26~noop.image?_iz=58558&from=article.pc_detail&lk3s=953192f4&x-expires=1725089408&x-signature=ifvkmxWcI3hDmizT2zlZx%2BL%2BgGM%3D" style="width: 50%; margin-bottom: 20px;"></div>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">What it looks like is shown in the figure above. Its purpose is what was described earlier: in SEO it tells spiders to come and crawl here. Nearly every site sets this up, because this file is the first thing a spider fetches before crawling any page, and spiders re-fetch it repeatedly.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Don't ask me why you should let spiders crawl. What is the point of building a site, if not to have spiders crawl it so that users see you in search results and bring you traffic? Unless, of course, you built the site only to store things for yourself.</span></p>
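<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">To see how a well-behaved spider reads these rules, here is a small Python sketch using the standard library's urllib.robotparser. The robots.txt content below is a made-up example, not the file of any real site, and the helper name allowed is hypothetical.</span></p>

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: every spider is kept out of /admin/ and /search,
# and Baiduspider has its own group that keeps it out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /search

User-agent: Baiduspider
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(user_agent, path):
    """Return True if the rules let this spider fetch the given path."""
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for agent, path in [
        ("OtherBot", "/admin/login"),
        ("OtherBot", "/post/1"),
        ("Baiduspider", "/private/data"),
        ("Baiduspider", "/admin/login"),
    ]:
        print(agent, path, allowed(agent, path))
```

<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Notice that a spider with its own group follows only that group, so in this example Baiduspider is not bound by the rules under the * group. And since the protocol is only a convention, a crawler that never runs a check like this simply ignores the file.</span></p>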
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">About the author:</span></strong></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Baiyang SEO has focused on SEO research for ten years, is a hands-on SEO and traffic practitioner, and has studied targeted internet traffic in depth.</span></p>