nykek5i posted on 2024-8-25 13:25:08

Essential Guide: Understanding Web Crawlers in One Article


    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><img src="https://mmbiz.qpic.cn/mmbiz_png/heS6wRSHVMn2o2HDd6UXlnFIvQpPYUYtTg3zqNO7EfwSPTAo8Gwjdc3cSmZTU70sInBAiaPQ0XNcc4OOy79doLw/640?wx_fmt=png&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1&amp;wx_co=1" style="width: 50%; margin-bottom: 20px;"></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;">Preface</span></strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">In the current era of exploding data, the data analysis industry is growing strongly and more and more people are entering it. What newcomers want most is a large volume of data to support their analysis, but how do you obtain useful information from the internet? This need has driven the rapid development of "crawler" technology.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">A web crawler (also known as a web spider or web robot, and in the FOAF community more often as a web chaser) is a program or script that automatically fetches information from the World Wide Web according to a set of rules.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">A traditional crawler starts from the URLs of one or more seed pages and collects the URLs found on them. While fetching pages it keeps extracting new URLs from the current page and adding them to a queue, until a stop condition of the system is met. A focused crawler has a more complex workflow: it uses a page-analysis algorithm to filter out links unrelated to its topic, keeps the useful links, and puts them into the queue of URLs waiting to be fetched. It then selects the next URL to fetch from the queue according to a search strategy, and repeats the process until a stop condition is reached.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">In addition, every page the crawler fetches is stored by the system, analyzed, filtered, and indexed for later query and retrieval. For a focused crawler, the results of this analysis may also feed back into and guide later fetching.</span></p>
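    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The queue-driven loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions: FAKE_WEB is a made-up in-memory link graph standing in for real pages, and the names crawl, get_links, and keep are hypothetical. The keep filter plays the role of a focused crawler's topic filter.</span></p>

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}


def crawl(seeds, get_links, max_pages=100, keep=lambda url: True):
    """Breadth-first crawl: take a URL from the queue, record it,
    then enqueue every unseen link that passes the `keep` filter."""
    queue = deque(seeds)
    seen = set(seeds)
    visited = []
    while queue and len(visited) < max_pages:  # stop condition
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen and keep(link):
                seen.add(link)
                queue.append(link)
    return visited


def fake_links(url):
    """Stand-in for 'download the page and extract its links'."""
    return FAKE_WEB.get(url, [])


# Traditional crawler: follow every link.
all_pages = crawl(["http://example.com/"], fake_links)

# Focused crawler: drop links that the (toy) topic filter rejects.
focused_pages = crawl(["http://example.com/"], fake_links,
                      keep=lambda url: not url.endswith("/b"))
```

    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">A real crawler would replace fake_links with code that downloads each page and extracts its links, and would usually add politeness delays between requests.</span></p>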
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">I am a beginner with crawlers, and this overview records what I have learned.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">The main contents of this article are as follows:</strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><img src="https://mmbiz.qpic.cn/mmbiz_png/heS6wRSHVMn2o2HDd6UXlnFIvQpPYUYt3m7KA4eomibjqQ03Gd9eIA6h2ibC3etib1PO0TPXjqcTsl7zUWKayCCYA/640?wx_fmt=png&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1&amp;wx_co=1" style="width: 50%; margin-bottom: 20px;"></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">1. First Look at Crawlers</strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">We use the third-party Requests library for Python. Of the 7 main methods in Requests, get() is the most commonly used. It constructs a Request object that asks the server for a resource and returns a Response object containing that resource. From the Response object you can read the status of the request, the HTTP response body as a string (the page content at the URL), the page encoding, and the page content in binary form.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Before looking at get(), let's first look at the HTTP protocol to understand what actually happens when we visit a web page.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">1.1 A Brief Look at the HTTP Protocol</strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The HyperText Transfer Protocol (HTTP) is one of the most widely used network protocols on the internet. All WWW documents must follow this standard. HTTP has the following main characteristics:</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">Client/server model</strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">Simple and fast:</strong> to make a request, the client only needs to send the request method and the path. Common methods are GET, HEAD, and POST, each specifying a different kind of interaction between client and server. Because the protocol is simple, HTTP server programs are small and communication is fast.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">Flexible:</strong> HTTP allows data objects of any type to be transferred.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">Connectionless:</strong> each connection handles only one request. The server disconnects after handling the client's request and receiving the client's acknowledgment, which saves transmission time. (This describes HTTP/1.0; HTTP/1.1 keeps connections open by default.)</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">Stateless:</strong> HTTP is a stateless protocol, meaning it has no memory of earlier transactions. If later processing needs earlier information, that information must be sent again, which may increase the amount of data sent per connection. On the other hand, the server responds faster when it does not need earlier information.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The figure below shows what happens when we visit a web page:</span></p><img src="https://mmbiz.qpic.cn/mmbiz_png/heS6wRSHVMn2o2HDd6UXlnFIvQpPYUYtU0gfE89XSvBU1054CbrUQqGdyGaeaic4CGCfcM5XNQAY7KBb7Qy4axQ/640?wx_fmt=png&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1&amp;wx_co=1" style="width: 50%; margin-bottom: 20px;">
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">1.&nbsp;</strong>First, the browser takes the URL and extracts the host name. For example, for http://www.baidu.com/index.html it extracts the host name www.baidu.com.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">2.&nbsp;</strong>Look up the IP address. Given the host name, the hosts file is checked first; on a match the corresponding IP address is returned. If there is no match, a DNS server is queried; on success the IP address is returned, otherwise a connection error is reported.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">3.&nbsp;</strong>Send the HTTP request. The browser packages information about itself and about the request into an HTTP request message and sends it to the server.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">4.&nbsp;</strong>The server handles the request. It reads the HTTP request, resolves the host, the site name, and the requested resource, then looks the resource up. If the lookup succeeds it returns status code 200; if it fails it returns the famous 404. When the server detects a request for a missing resource, it can redirect to another page configured by the developer, which is why there are so many distinctive 404 error pages.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">5.&nbsp;</strong>The server returns the HTTP response. The browser extracts the data from it, calls its rendering engine to interpret it, and displays the page. It then repeats the process above for referenced files such as images, CSS, and JS, and the page is fully displayed once all files have been downloaded.</span></p>
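    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Step 1 above (pulling the host name out of a URL) can be illustrated with Python's standard urllib.parse module. This is a minimal sketch; the helper name split_url and the default-port logic are illustrative assumptions, not what a browser literally runs. Step 2 (the IP lookup) needs network access, so it is only mentioned in a comment.</span></p>

```python
from urllib.parse import urlsplit


def split_url(url):
    """Return (host, port, path) for an http/https URL."""
    parts = urlsplit(url)
    host = parts.hostname
    # Fall back to the scheme's default port when the URL gives none.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    path = parts.path or "/"
    return host, port, path


host, port, path = split_url("http://www.baidu.com/index.html")

# Step 2 would then be an address lookup, for example:
#   import socket
#   ip = socket.gethostbyname(host)   # consults the hosts file, then DNS
```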
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">An HTTP request consists of three parts: the request line, the message headers, and the request body. There are several request methods (all method names are uppercase), explained below:</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">GET</strong>&nbsp;&nbsp; Requests the resource identified by the Request-URI</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">POST</strong>&nbsp;&nbsp; Submits new data to the resource identified by the Request-URI</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">HEAD</strong>&nbsp;&nbsp; Requests only the response headers of the resource identified by the Request-URI</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">PUT</strong>&nbsp;&nbsp; Asks the server to store a resource and use the Request-URI as its identifier</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">DELETE</strong>&nbsp;&nbsp; Asks the server to delete the resource identified by the Request-URI</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">TRACE</strong>&nbsp;&nbsp; Asks the server to echo back the request it received; mainly used for testing or diagnostics</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">CONNECT</strong>&nbsp;&nbsp; Reserved for use with a proxy that can switch to acting as a tunnel (older descriptions list it as reserved for future use)</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">OPTIONS</strong>&nbsp;&nbsp; Asks about the server's capabilities, or about the options and requirements associated with a resource</span></p>

    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">GET example:</strong> when you visit a page by typing its address into the browser's address bar, the browser uses GET to fetch the resource from the server, e.g. GET /form.html HTTP/1.1 (CRLF)</span></p>
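    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">To make the three-part request structure concrete, here is a small sketch that assembles a raw HTTP/1.1 request message by hand. The function name build_request, the host www.example.com, and the User-Agent value are made up for illustration; in practice a library such as Requests builds this message for you.</span></p>

```python
def build_request(method, path, host, headers=None, body=b""):
    """Assemble request line + headers + blank line + optional body."""
    lines = [f"{method} {path} HTTP/1.1",   # request line
             f"Host: {host}"]               # header required by HTTP/1.1
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    if body:
        lines.append(f"Content-Length: {len(body)}")
    # Every line ends with CRLF; an empty line separates headers from body.
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("ascii") + body


raw = build_request("GET", "/form.html", "www.example.com",
                    {"User-Agent": "demo-crawler"})
```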
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">An HTTP response also consists of three parts: the status line, the message headers, and the response body.</strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The status line has the format HTTP-Version Status-Code Reason-Phrase CRLF, where HTTP-Version is the server's HTTP protocol version, Status-Code is the response status code sent back by the server, and Reason-Phrase is a text description of the status code.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">A status code consists of three digits. The first digit defines the class of the response and has five possible values:</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">1xx: Informational. The request has been received and processing continues</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">2xx: Success. The request was successfully received, understood, and accepted</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">3xx: Redirection. Further action is needed to complete the request</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">4xx: Client error. The request has a syntax error or cannot be fulfilled</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">5xx: Server error. The server failed to fulfil a valid request</span></p>
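    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Since the class is determined by the first digit alone, a crawler can decide how to react to a response with simple integer division. A minimal sketch (the names STATUS_CLASSES and status_class are illustrative):</span></p>

```python
# Map the first digit of a status code to its response class.
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def status_class(code):
    """Return the class name for a three-digit HTTP status code."""
    if not 100 <= code <= 599:
        raise ValueError(f"not a valid HTTP status code: {code}")
    return STATUS_CLASSES[code // 100]
```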
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Common status codes, reason phrases, and their meanings:</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">200 OK&nbsp;&nbsp;//the client request succeeded</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">400 Bad Request&nbsp;&nbsp;//the client request has a syntax error and cannot be understood by the server</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">401 Unauthorized&nbsp;&nbsp;//the request is not authorized; this status code must be used together with the WWW-Authenticate header field</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">403 Forbidden&nbsp;&nbsp;//the server received the request but refuses to serve it</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">404 Not Found&nbsp;&nbsp;//the requested resource does not exist, e.g. a wrong URL was entered</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">500 Internal Server Error&nbsp;&nbsp;//the server encountered an unexpected error</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">503 Service Unavailable&nbsp;&nbsp;//the server cannot handle the request right now and may recover after a while</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">e.g. HTTP/1.1 200 OK (CRLF)</span></p>
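    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">A status line such as the one above splits cleanly into its three fields, and Python's standard http.HTTPStatus enum can supply the standard reason phrase for a code. The helper name parse_status_line is an illustrative assumption:</span></p>

```python
from http import HTTPStatus


def parse_status_line(line):
    """Split 'HTTP-Version Status-Code Reason-Phrase' into its parts."""
    parts = line.rstrip("\r\n").split(" ", 2)
    version, code = parts[0], int(parts[1])
    # The reason phrase may contain spaces, or be missing entirely.
    reason = parts[2] if len(parts) > 2 else ""
    return version, code, reason


version, code, reason = parse_status_line("HTTP/1.1 200 OK\r\n")

# Standard reason phrases are available from the enum.
phrase_404 = HTTPStatus(404).phrase
phrase_503 = HTTPStatus(503).phrase
```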
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">For a detailed description of the HTTP protocol, see this article:</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">http://www.cnblogs.com/li0803/archive/2008/11/03/1324746.html</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">We have now looked at the HTTP protocol and at the process of visiting a web page. So what does a web page look like, and what does it look like to a crawler?</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The web is static but the crawler is dynamic, so the basic idea of a crawler is to follow the links on web pages (the nodes of the spider web) and collect useful information. Of course, some pages are dynamic (usually written in PHP, ASP, or similar; a user login page is a dynamic page), but if a web is shaky the spider feels less secure on it, so search engines generally rank dynamic pages behind static ones.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Now that we know the basic idea, how does it work in practice? We need to start from the basic concepts of a web page. A page has three main components: the HTML file, the CSS files, and the JavaScript files. If you think of a page as a house, HTML is the shell of the house; CSS is the tiles and paint that decorate it inside and out; JavaScript is the furniture, appliances, and bathtub that add functionality. The analogy shows that HTML is the foundation of a page: tiles and paint can be bought on the market, and furniture and appliances can be set up anywhere, but the shell of the house is unique.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Here is an example of a simple web page:</span></p><img src="https://mmbiz.qpic.cn/mmbiz_jpg/heS6wRSHVMn2o2HDd6UXlnFIvQpPYUYt7oj0RnvO15ricuTn5kqbf2Licoc6POV5MQhpoFVKdju66NoicmNmLHfWA/640?wx_fmt=jpeg&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1&amp;wx_co=1" style="width: 50%; margin-bottom: 20px;">
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">To a crawler, the same page looks like this:</span></p><img src="https://mmbiz.qpic.cn/mmbiz_jpg/heS6wRSHVMn2o2HDd6UXlnFIvQpPYUYtEicg9Kw3wF2FoaOwS8Nb5WICsrA4UaibGCJt3Mh4iatQFicyMS9sKEdc5Q/640?wx_fmt=jpeg&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1&amp;wx_co=1" style="width: 50%; margin-bottom: 20px;">
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">So a web page is essentially hypertext: all of its content sits inside tags of the form "&lt;&gt;...&lt;/&gt;". To collect all the hyperlinks on a page, we only need to find every string preceded by "href=" inside the tags and check whether it starts with "http" (HyperText Transfer Protocol; https is the secure version of http). If a link does not start with "http", it is most likely a local (relative) path on the same site or uses another protocol such as ftp or smtp (file or mail transfer), and this simple approach filters it out.</span></p>
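    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The link-collection idea can be sketched with Python's standard html.parser module. Note the substitution: instead of searching for the literal string "href=", this sketch lets the parser hand over each tag's attributes, which is a little more robust. The class name LinkCollector and the SAMPLE markup are made up for illustration.</span></p>

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that start with 'http'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Keep http/https links; drop relative paths, ftp:, mailto:, etc.
            if name == "href" and value and value.startswith("http"):
                self.links.append(value)


# Hypothetical page fragment used only for this demonstration.
SAMPLE = (
    '<p><a href="http://example.com/a">A</a> '
    '<a href="/local.html">B</a> '
    '<a href="ftp://example.com/file">C</a> '
    '<a href="https://example.com/s">D</a></p>'
)

collector = LinkCollector()
collector.feed(SAMPLE)
links = collector.links
```

    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">If relative links matter for your crawl, urllib.parse.urljoin can resolve them against the page URL instead of discarding them.</span></p>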
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">In Python we use the methods of the Requests library to send requests to web pages and so implement a crawler.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">1.2 The 7 main methods of the Requests library:</strong></span></p><img src="https://mmbiz.qpic.cn/mmbiz_png/heS6wRSHVMn2o2HDd6UXlnFIvQpPYUYtMTSOLpRmKkNTS06V388oUicTg4oibmsFDMEaTCWib9S37V3WzTzGNOWzQ/640?wx_fmt=png&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1&amp;wx_co=1" style="width: 50%; margin-bottom: 20px;">
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The most commonly used method, get, is enough to build a simple little crawler, as the sample code shows:</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><img src="https://mmbiz.qpic.cn/mmbiz_png/heS6wRSHVMn2o2HDd6UXlnFIvQpPYUYt3GErldb4myBjsr2DFnKYsu8O6VNVDwKtWlBuMnYUkiarlLd3vto81hQ/640?wx_fmt=png&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1&amp;wx_co=1" style="width: 50%; margin-bottom: 20px;"></p>
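    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">For readers who cannot see the image, here is a text version of a typical get()-based fetch. It is a sketch of a common pattern and not necessarily identical to the code in the screenshot; the function name fetch_html and the 10-second timeout are assumptions. It requires the third-party requests package.</span></p>

```python
import requests


def fetch_html(url, timeout=10):
    """Fetch a page with requests.get() and return its text, or None on failure."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()              # raise for 4xx / 5xx responses
        # r.status_code -> return status of the request
        # r.content     -> page content in binary form
        # r.encoding    -> encoding taken from the response headers
        r.encoding = r.apparent_encoding  # encoding guessed from the body
        return r.text                     # page content as a string
    except requests.RequestException:
        # Covers connection errors, timeouts, bad URLs, and HTTP errors.
        return None
```

    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Calling fetch_html("http://www.baidu.com") would return the page source as a string when the request succeeds.</span></p>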
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">2. The Robots Protocol</strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The Robots protocol (also called the crawler protocol or robot protocol) is formally the Robots Exclusion Protocol. A site uses it to tell search engines which pages may be crawled and which may not. A few small examples illustrate the contents of robots.txt. By default robots.txt sits in the site's root directory; a site without a robots.txt file allows all crawlers to fetch its content by default.</span></p><img src="https://mmbiz.qpic.cn/mmbiz_png/heS6wRSHVMn2o2HDd6UXlnFIvQpPYUYtibTTUHC3khIiakON3S3OjJbnVicbDXvc5uNce7IhFC1PJTkLv8grOu7VQ/640?wx_fmt=png&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1&amp;wx_co=1" style="width: 50%; margin-bottom: 20px;">
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Our understanding of the Robots protocol: if you crawl for commercial purposes you must comply with it, otherwise you may bear legal liability. If you are only experimenting or practicing on your own, compliance is still recommended, because it makes your crawler friendlier.</span></p>
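    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Python's standard urllib.robotparser module can check a URL against robots.txt rules before crawling. The rules, the crawler names MyCrawler and BadBot, and the example.com URLs below are made up for illustration; a real crawler would load the site's actual file with set_url() and read().</span></p>

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everyone is kept out of /private/,
# and the crawler named BadBot is kept out of the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())  # parse rules from text, no network needed

allowed = rp.can_fetch("MyCrawler", "http://example.com/index.html")
blocked = rp.can_fetch("MyCrawler", "http://example.com/private/data.html")
banned = rp.can_fetch("BadBot", "http://example.com/index.html")
```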
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">3. Parsing Web Pages</strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">BeautifulSoup tries to turn the mundane into magic: it formats and organizes messy web content by locating HTML tags, and presents the XML structure as easy-to-use Python objects.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">BeautifulSoup is a library for parsing, traversing, and maintaining the "tag tree".</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">3.1 BeautifulSoup Parsers</strong></span></p><img src="https://mmbiz.qpic.cn/mmbiz_png/heS6wRSHVMn2o2HDd6UXlnFIvQpPYUYtrWmVMRx1mjBib25daayAjvYWX5vrEmvIrQicUJS9iboeMMnVyQ3ibjFNYQ/640?wx_fmt=png&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1&amp;wx_co=1" style="width: 50%; margin-bottom: 20px;">
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">BeautifulSoup uses the four parsers above to parse the page content we fetch. Let's look at the parse result using the example from the official site:</span></p><img src="https://mmbiz.qpic.cn/mmbiz_png/heS6wRSHVMn2o2HDd6UXlnFIvQpPYUYt9M4qmH6GywqPAociaWv2vj8YwibsibYrr07c8jlYCxRfZKLJpjhkSBZHw/640?wx_fmt=png&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1&amp;wx_co=1" style="width: 50%; margin-bottom: 20px;">
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Starting from the HTML snippet above, we parse it with BeautifulSoup and print the parsed result for comparison:</span></p><img src="https://mmbiz.qpic.cn/mmbiz_png/heS6wRSHVMn2o2HDd6UXlnFIvQpPYUYtbwZJuppILCiaza3mjZnmrAlWo7kPW2fz6mvpIFdgMm7qQIfJdp8LicjA/640?wx_fmt=png&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1&amp;wx_co=1" style="width: 50%; margin-bottom: 20px;">
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Once the page is parsed, BeautifulSoup's methods make it easy to get the key information from it:</span></p><img src="https://mmbiz.qpic.cn/mmbiz_png/heS6wRSHVMn2o2HDd6UXlnFIvQpPYUYtqYu4xQN2f9IjssRUzWCxpHomMWuat5vOIoWliaQpZ3Vay00hfp8Uu9Q/640?wx_fmt=png&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1&amp;wx_co=1" style="width: 50%; margin-bottom: 20px;">
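    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">In text form, parsing and extraction might look like the sketch below. The HTML is a shortened variant modelled on the "three sisters" sample from the BeautifulSoup documentation, so it may differ from the snippet in the screenshots. It uses Python's built-in "html.parser" backend and requires the third-party beautifulsoup4 package.</span></p>

```python
from bs4 import BeautifulSoup

# Shortened sample document (an assumption, not the exact one pictured).
HTML_DOC = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""

soup = BeautifulSoup(HTML_DOC, "html.parser")

title = soup.title.string                            # text of the <title> tag
links = [a["href"] for a in soup.find_all("a")]      # every hyperlink target
first_id = soup.find("a", class_="sister")["id"]     # first tag matching a class
pretty = soup.prettify()                             # indented view of the tree
```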
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">3.2 Basic Elements of the BeautifulSoup Class</strong></span></p>
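    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The figure for this section did not survive, so the sketch below assumes it covered the elements BeautifulSoup usually exposes: Tag, its name and attrs, NavigableString, and Comment. The one-line HTML is made up for illustration, and the code requires the beautifulsoup4 package.</span></p>

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

soup = BeautifulSoup('<p class="intro" id="p1">Hello<!-- a note --></p>',
                     "html.parser")

tag = soup.p                 # Tag: a pair of tags and everything between them
tag_name = tag.name          # Name: the tag's name as a string
tag_attrs = tag.attrs        # Attributes: a dict; class is a list of values

text = tag.contents[0]       # NavigableString: plain text inside the tag
comment = tag.contents[1]    # Comment: a special kind of NavigableString
```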
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">3.3 Traversal with BeautifulSoup</strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Traversal comes in three kinds: upward, downward, and sideways (between siblings).</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">Downward traversal:</strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">Upward traversal:</strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">Sideways traversal:</strong></span></p>
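    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The figures for the three traversal directions are not available here, so the sketch below shows the attributes BeautifulSoup provides for each direction on a tiny made-up document. It assumes the beautifulsoup4 package and is not a reconstruction of the original tables.</span></p>

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><body><div><p>one</p><p>two</p><p>three</p></div></body></html>",
    "html.parser",
)
div = soup.div
second = div.find_all("p")[1]

# Downward: .contents (list), .children (direct), .descendants (all levels).
child_names = [child.name for child in div.children]
descendant_text = [d for d in div.descendants if isinstance(d, str)]

# Upward: .parent (one level) and .parents (all ancestors up to the document).
parent_name = second.parent.name
ancestor_names = [p.name for p in second.parents]

# Sideways: .next_sibling / .previous_sibling among nodes with the same parent.
next_text = second.next_sibling.string
prev_text = second.previous_sibling.string
```

    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">In real pages the whitespace between tags also counts as a sibling node, so sibling traversal often yields text nodes as well as tags.</span></p>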
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">&nbsp;<span style="color: black;">4. Regular Expressions</span></strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">A regular expression (often abbreviated in code as regex, regexp, or RE) is a concept from computer science. Regular expressions are typically used to find and replace text that matches a given pattern (rule).</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The author is also new to regular expressions and does not feel able to explain them concisely and clearly. The online tutorial at <span style="color: black;">(&nbsp;http://deerchao.net/tutorials/regex/regex.htm#mission&nbsp;)</span> is recommended instead; it is well illustrated and explains regular expressions in detail.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Once mastered, regular expressions can also help us extract the key information from a web page.</span></p>
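    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">As a small illustration of pulling information out of a page with regular expressions, the sketch below uses Python's re module on a made-up HTML fragment. The fragment, the URLs, and both patterns are assumptions chosen for the example; real pages are messier, and an HTML parser is usually the more robust choice for markup.</span></p>

```python
import re

# Hypothetical HTML fragment standing in for a downloaded page.
html = '''
<a href="http://example.com/page1">First</a>
<a href="/page2" class="nav">Second</a>
<p>Contact: admin@example.com</p>
'''

# Capture the href value and the link text of every <a> tag.
link_pattern = re.compile(r'<a\s+[^>]*href="([^"]+)"[^>]*>(.*?)</a>', re.S)
links = link_pattern.findall(html)

# A deliberately simple e-mail pattern; real addresses can be more varied.
emails = re.findall(r'[\w.+-]+@[\w-]+\.[\w.]+', html)

print(links)
print(emails)
```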
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">5. The Scrapy Crawler Framework</strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python. It is used to crawl websites and extract structured data from their pages. Scrapy has a wide range of uses, including data mining, monitoring, and automated testing.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">What makes Scrapy attractive is that it is a framework, so anyone can adapt it to their own needs with little effort. It also provides base classes for several kinds of spiders, such as BaseSpider and sitemap spiders, and the latest version adds support for crawling web 2.0 sites.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">5.1 Scrapy architecture</strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">Engine:&nbsp;</strong>controls the data flow between all components and triggers events when certain conditions are met.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">Downloader:&nbsp;</strong>downloads web pages in response to requests.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">Scheduler:&nbsp;</strong>schedules and manages all crawl requests.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">Spider:&nbsp;</strong>parses the responses returned by the Downloader, produces scraped items, and generates additional crawl requests.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">Item Pipelines:&nbsp;</strong>process the items produced by the Spider in pipeline fashion. Typical steps include cleaning, validating, and de-duplicating the HTML data in the items, and storing the data in a database.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">5.2 Data flow</strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">1.&nbsp;</strong>The engine opens a website (open a domain), finds the Spider that handles it, and asks that Spider for the first URL(s) to crawl.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">2.&nbsp;</strong>The engine receives the first URL(s) from the Spider and schedules them in the Scheduler as Requests.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">3.&nbsp;</strong>The engine asks the Scheduler for the next URL to crawl.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">4.&nbsp;</strong>The Scheduler returns the next URL to the engine, and the engine forwards it to the Downloader through the downloader middleware (request direction).</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">5.&nbsp;</strong>Once the page has been downloaded, the Downloader generates a Response for it and sends it to the engine through the downloader middleware (response direction).</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">6.&nbsp;</strong>The engine receives the Response from the Downloader and sends it to the Spider through the Spider middleware (input direction).</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">7.&nbsp;</strong>The Spider processes the Response and returns the scraped Items and any new (follow-up) Requests to the engine.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">8.&nbsp;</strong>The engine passes the scraped Items (returned by the Spider) to the Item Pipeline and the Requests (returned by the Spider) to the Scheduler.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">9.&nbsp;</strong>The process repeats (from step 2) until the Scheduler has no more requests, at which point the engine closes the website.</span></p>
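    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The nine steps above can be imitated with a toy loop in plain Python. This is not Scrapy code and does not use Scrapy's API: the in-memory site FAKE_SITE and the functions downloader, spider_parse, and run_engine are hypothetical names invented here, and middleware is left out. It only shows how requests and items move between the components.</span></p>

```python
from collections import deque

# Hypothetical in-memory "website": URL -> (page title, outgoing links).
FAKE_SITE = {
    "http://example.com/": ("Home", ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("Page A", ["http://example.com/b"]),
    "http://example.com/b": ("Page B", []),
}

def downloader(url):
    """Downloader: turn a request (here just a URL) into a response."""
    title, links = FAKE_SITE[url]
    return {"url": url, "title": title, "links": links}

def spider_parse(response):
    """Spider: yield one item plus a follow-up request per link."""
    yield {"item": {"url": response["url"], "title": response["title"]}}
    for link in response["links"]:
        yield {"request": link}

def run_engine(start_urls):
    """Engine: move data between scheduler, downloader, spider and pipeline."""
    scheduler = deque(start_urls)   # Scheduler: queue of pending requests
    seen = set(start_urls)          # avoid scheduling the same URL twice
    pipeline = []                   # Item Pipeline: here it only collects items
    while scheduler:                            # step 9: stop when nothing is left
        url = scheduler.popleft()               # steps 3-4: next request
        response = downloader(url)              # step 5: download
        for result in spider_parse(response):   # steps 6-7: parse
            if "item" in result:                # step 8: items go to the pipeline
                pipeline.append(result["item"])
            elif result["request"] not in seen: # step 8: requests go to the scheduler
                seen.add(result["request"])
                scheduler.append(result["request"])
    return pipeline

items = run_engine(["http://example.com/"])
print([item["title"] for item in items])
```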
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">6. Distributed Crawlers</strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">6.1 Multithreaded crawlers</strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">When only a small amount of data is crawled, pages are downloaded serially: a new download starts only after the previous one has finished. That is adequate for small jobs, but performance falls short on large sites. If several pages can be downloaded at the same time, the total download time improves markedly.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">We therefore extend the serial crawler to download in parallel. Note that if this ability is abused, a multithreaded crawler can request content too quickly, which may overload the server or get its IP address banned. To avoid this, the crawler needs a delay setting that fixes the minimum interval between requests to the same domain.</span></p>
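    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">One possible shape for such a delay setting is sketched below. The class name Throttle and its wait method are assumptions made for this example, not part of any library. It remembers when each domain was last requested and sleeps until the minimum interval has passed.</span></p>

```python
import time
from urllib.parse import urlparse

class Throttle:
    """Enforce a minimum delay between requests to the same domain."""

    def __init__(self, delay):
        self.delay = delay      # minimum seconds between two hits on one domain
        self.last_seen = {}     # domain -> time of the most recent request

    def wait(self, url):
        """Sleep if needed; return the number of seconds slept."""
        domain = urlparse(url).netloc
        last = self.last_seen.get(domain)
        slept = 0.0
        if self.delay > 0 and last is not None:
            remaining = self.delay - (time.monotonic() - last)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self.last_seen[domain] = time.monotonic()
        return slept

throttle = Throttle(delay=0.05)
print(throttle.wait("http://example.com/a"))       # first hit on this domain
print(throttle.wait("http://example.com/b") > 0)   # same domain, has to wait
```

    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">A crawler would call wait(url) just before each download. In a multithreaded crawler the shared dictionary would additionally need a lock, which this sketch leaves out.</span></p>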
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Multithreading is fairly easy to implement in Python. The thread module (named _thread in Python 3) is the low-level module; the threading module wraps thread and is more convenient to use.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">(Table: functions and constants in the thread module.)</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">(Table: commonly used functions and objects in threading.)</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Generally speaking, threads are used in one of two ways. One is to write the function the thread should execute and pass it to a Thread object, which runs it. The other is to subclass Thread directly and put the thread's code in the new class.</span></p>
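    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Both ways of using threads can be sketched with the standard threading module. The function fetch, the class FetchThread, and the URLs are placeholders invented for the example; fetch records a URL instead of downloading anything.</span></p>

```python
import threading

results = []
results_lock = threading.Lock()

def fetch(url):
    """Placeholder for a download; it only records which URL was handled."""
    with results_lock:
        results.append(url)

# Mode 1: pass a function to a Thread object.
t1 = threading.Thread(target=fetch, args=("http://example.com/1",))

# Mode 2: subclass Thread and put the work in run().
class FetchThread(threading.Thread):
    def __init__(self, url):
        super().__init__()
        self.url = url

    def run(self):
        fetch(self.url)

t2 = FetchThread("http://example.com/2")

for t in (t1, t2):
    t.start()
for t in (t1, t2):
    t.join()    # wait for both threads to finish

print(sorted(results))
```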
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">For code and examples of multithreading, see:</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">http://www.jianshu.com/p/86b8e78c418a</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">6.2 Multiprocess crawlers</strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Multithreading in Python (CPython) is not truly parallel: the global interpreter lock (GIL) lets only one thread execute Python bytecode at a time, so threads cannot make full use of a multi-core CPU.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">To use all cores, most Python programs need multiple processes, and the package for that is multiprocessing.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">With it, moving from a single process to concurrent execution is easy. multiprocessing supports child processes, communication and data sharing, and several forms of synchronization, and it provides components such as Process, Queue, Pipe, and Lock.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">Basic use of Process:</strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">In multiprocessing, every process is represented by a Process object. Its constructor takes the parameters group, target, name, args, and kwargs:</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">target&nbsp;</strong>is the callable to invoke; pass the function itself, that is, its name without parentheses.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">args&nbsp;</strong>is the tuple of positional arguments for the callable. For example, if target is a function a with two parameters m and n, pass args=(m, n).</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">kwargs&nbsp;</strong>is the dictionary of keyword arguments for the callable.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">name&nbsp;</strong>is an alias, in effect a name given to the process.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">group&nbsp;</strong>is for grouping; it is not used in practice and should be left as None.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">(Figure: a first example of creating a Process.)</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">In the simplest case, target is given the function name and args the function's arguments as a tuple; if there is only one argument, it is a tuple of length 1.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Calling start() then launches the process; doing this for several Process objects runs several processes at once.</span></p>
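    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">A minimal sketch of creating and starting processes this way follows. The functions square, worker, and main are invented for the example, and squaring a number stands in for real crawling work. A multiprocessing.Queue carries the results back to the parent, and the __main__ guard keeps child processes from re-running the launch code on platforms that spawn a fresh interpreter.</span></p>

```python
import multiprocessing

def square(n):
    """Stand-in for real work such as downloading and parsing a page."""
    return n * n

def worker(n, results):
    """Target function: compute a result and put it on the shared queue."""
    results.put((n, square(n)))

def main():
    results = multiprocessing.Queue()
    # args is a tuple of positional arguments; name is optional.
    procs = [
        multiprocessing.Process(target=worker, args=(n, results), name=f"worker-{n}")
        for n in range(3)
    ]
    for p in procs:
        p.start()
    # Drain the queue before join() so no child blocks on a full pipe.
    collected = sorted(results.get() for _ in procs)
    for p in procs:
        p.join()
    print(collected)

if __name__ == "__main__":
    main()
```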
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">In addition, cpu_count() returns the number of CPU cores on the current machine, and active_children() returns all child processes that are currently running.</span></p>
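    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">These two helpers might be used as in the sketch below. The functions nap, child_names, and main and the process names are made up for the example; each child only sleeps briefly so that it is still alive when it is listed.</span></p>

```python
import multiprocessing
import time

def nap(seconds):
    """Keep a child process alive for a short while."""
    time.sleep(seconds)

def child_names():
    """Names of the child processes that are still running."""
    return sorted(p.name for p in multiprocessing.active_children())

def main():
    print("CPU cores:", multiprocessing.cpu_count())
    procs = [
        multiprocessing.Process(target=nap, args=(0.2,), name=f"napper-{i}")
        for i in range(2)
    ]
    for p in procs:
        p.start()
    print("Running children:", child_names())
    for p in procs:
        p.join()

if __name__ == "__main__":
    main()
```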
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">(Figure: an example using cpu_count() and active_children(), followed by its output.)</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Running the crawler in multiple processes greatly reduces the time needed to crawl the data. For a detailed introduction, see:</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">http://cuiqingcai.com/3335.html</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;">7. Collecting Data from Asynchronously Loaded Sites</span></strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">When collecting page data we run into pages that load in a waterfall (infinite-scroll) layout: the page URL does not change, yet new content keeps loading. In that case we need to analyze some of the JavaScript code in the page and obtain the data we need from it.</span></p>
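    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Such pages usually fetch their extra content through background requests that return structured data, often JSON. Once that request has been identified, for instance in the browser's developer tools, its response can be parsed directly. The sketch below assumes a JSON body with the hypothetical fields items, next_page, and has_more; a real site will use its own field names, and the function parse_feed is invented for the example.</span></p>

```python
import json

# Hypothetical JSON body, shaped like what an infinite-scroll page might
# receive from its background request; the field names are assumptions.
sample_response = '''
{
  "items": [
    {"id": 1, "title": "First post"},
    {"id": 2, "title": "Second post"}
  ],
  "next_page": 2,
  "has_more": true
}
'''

def parse_feed(body):
    """Return (titles, next_page); next_page is None when nothing is left."""
    data = json.loads(body)
    titles = [entry["title"] for entry in data.get("items", [])]
    next_page = data.get("next_page") if data.get("has_more") else None
    return titles, next_page

titles, next_page = parse_feed(sample_response)
print(titles, next_page)
```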
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">For pages rendered with JS, PhantomJS is recommended: a headless, scriptable WebKit browser. See:&nbsp;<span style="color: black;">http://cuiqingcai.com/2577.html</span></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Selenium is an automated testing tool that makes web UI testing convenient. PhantomJS renders and executes the JS, Selenium drives it and provides the bridge to Python, and Python then does the post-processing. See:&nbsp;<span style="color: black;">http://cuiqingcai.com/2599.html</span></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">8. Storing Crawled Data</strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">&nbsp;When first working with crawlers, we tend to print small results to the command line, and watching rows of data appear there is quite satisfying. But as the data grows and needs to be analyzed, printing to the command line is no longer workable. For most web crawlers to be even remotely useful, the collected data has to be stored.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">8.1 Media files</strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Media files are usually stored in one of two ways: keep only the URL, or download the source file itself. The first approach is recommended, for the following reasons:</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The crawler runs faster and uses less bandwidth, because it stores only links and does not download files.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Storage space is saved, because the media files themselves are not kept.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Code that stores URLs is easier to write, and no file-download code is needed.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Not downloading files reduces the load on the target server.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Of course, this approach also has some drawbacks:</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Embedding links to another site's resources in our own pages is called hotlinking. It invites endless trouble, because many sites take anti-hotlinking measures.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Because the linked files live on someone else's server, our application has to run at that server's pace.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Hotlinked content can change easily. If a hotlink is placed on a blog or a similar site and the other side notices, the content may well be swapped as a prank. A stored URL may also have expired by the time it is needed.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">In practice, a web browser not only requests HTML pages and moves between them; it also downloads every resource on the pages it visits. Downloading files makes our crawler look more like a person browsing.</span></p>
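    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The recommended first approach, storing only the URL, could look roughly like this. The class MediaLinkCollector, the sample markup, and the URLs are assumptions made for the example. It uses the standard html.parser module to collect the src attribute of media tags and resolves relative paths against the page address.</span></p>

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class MediaLinkCollector(HTMLParser):
    """Collect absolute URLs of media files without downloading them."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.media_urls = []

    def handle_starttag(self, tag, attrs):
        if tag in ("img", "video", "audio", "source"):
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the page URL.
                self.media_urls.append(urljoin(self.base_url, src))

# Hypothetical page fragment.
page = (
    '<div><img src="/static/logo.png">'
    '<img src="http://cdn.example.com/a.jpg">'
    '<img alt="no source"></div>'
)
collector = MediaLinkCollector("http://example.com/news/")
collector.feed(page)
print(collector.media_urls)
```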
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">8.2 Storing data as CSV</strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">CSV is a common file format for tabular data. Rows are separated by newlines and columns by commas. Python's csv library makes it easy to modify a CSV file or to create one from scratch.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">We can use the csv module to write the information gathered by the crawler to a CSV file.</span></p>
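    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">A minimal sketch of writing crawled rows with the csv module follows. The column names url and title, the sample rows, and the helper write_rows are assumptions for the example. An in-memory buffer is used so that nothing is written to disk.</span></p>

```python
import csv
import io

# Hypothetical crawl results.
rows = [
    {"url": "http://example.com/a", "title": "Page A"},
    {"url": "http://example.com/b", "title": "Page B, with a comma"},
]

def write_rows(file_obj, rows):
    """Write a header line and one CSV line per crawled row."""
    writer = csv.DictWriter(file_obj, fieldnames=["url", "title"])
    writer.writeheader()
    writer.writerows(rows)

buffer = io.StringIO()
write_rows(buffer, rows)
print(buffer.getvalue())
```

    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">For a real file, the buffer would be replaced by open("pages.csv", "w", newline="", encoding="utf-8"), where the file name is again only an example. The writer quotes fields that contain commas.</span></p>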
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">8.3 MySQL</strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">For large amounts of crawled data that will later be filtered and analyzed repeatedly, we choose to store the data in a database.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">MySQL is currently the most popular open-source relational database management system. It is a very flexible, stable, and full-featured DBMS, and many top websites use it, including YouTube, Twitter, and Facebook.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Python has no built-in MySQL support, but many open-source libraries can talk to MySQL; the best known is PyMySQL.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">With the steps above, the data obtained by the crawler can be stored in the database.</span></p>
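    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">PyMySQL follows Python's DB-API 2.0, so the storage step can be sketched against that interface. The function save_pages, the table pages, and its columns are assumptions for the example. To run without a MySQL server, the demonstration substitutes the standard sqlite3 module for PyMySQL; the only difference visible here is the parameter placeholder ("?" for sqlite3, "%s" for PyMySQL).</span></p>

```python
import sqlite3

def save_pages(conn, pages, placeholder="%s"):
    """Insert (url, title) rows through any DB-API 2.0 connection."""
    cursor = conn.cursor()
    try:
        cursor.execute(
            "CREATE TABLE IF NOT EXISTS pages (url VARCHAR(255), title VARCHAR(255))"
        )
        # Parameterized query: values are passed separately, never formatted in.
        sql = f"INSERT INTO pages (url, title) VALUES ({placeholder}, {placeholder})"
        cursor.executemany(sql, pages)
        conn.commit()
    finally:
        cursor.close()

# With PyMySQL the connection would come from something like
#   conn = pymysql.connect(host="localhost", user="crawler",
#                          password="secret", database="scraping")
# (all values hypothetical) and the default "%s" placeholder applies.
# sqlite3 is used here only so that the sketch runs without a server.
conn = sqlite3.connect(":memory:")
save_pages(
    conn,
    [("http://example.com/a", "Page A"), ("http://example.com/b", "Page B")],
    placeholder="?",
)
count = conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
print(count)
```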
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">9. Common Crawler Techniques</strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">9.1 Simulated login</strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Most websites today use cookies to track whether a user is logged in. Once a site has verified your login credentials, it stores them in a cookie in your browser, which usually contains a server-generated token, the expiry time of the login, and state-tracking information. The site treats this cookie as proof of authentication, and the browser presents it to the server on every page we visit.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Using the developer tools built into Chrome and other browsers, we can capture the headers and form data of a page request in the Network panel, and view the login information stored in the cookie under Headers. With Scrapy we can set the request headers and keep the cookie locally to simulate a login. For detailed steps, see the blog post: <span style="color: black;">http://www.jianshu.com/p/b7f41df6202d</span></span></p>
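    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The idea of replaying a captured cookie can be shown without Scrapy, using only the standard urllib.request module as a stand-in. The function build_request, the URL, and the cookie names sessionid and token are hypothetical. The request is only constructed here, not sent.</span></p>

```python
import urllib.request

def build_request(url, cookies, user_agent="Mozilla/5.0"):
    """Build a request that presents previously captured cookies."""
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    headers = {"User-Agent": user_agent, "Cookie": cookie_header}
    return urllib.request.Request(url, headers=headers)

# Hypothetical values copied from the browser's developer tools.
request = build_request(
    "http://example.com/profile",
    {"sessionid": "abc123", "token": "xyz"},
)
print(request.get_header("Cookie"))
# Sending it would be: urllib.request.urlopen(request)
```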
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">9.2 Web page CAPTCHAs</strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Put simply, a CAPTCHA is an image with a string of characters on it. How does a site implement it? Readers with a web background will know that every browser holds a cookie that serves as the unique identifier of the current session. On every visit the browser sends this cookie to the server, and the CAPTCHA is bound to that cookie. As an example, suppose two people, A and B, visit site W at the same time. W returns CAPTCHA X to A and CAPTCHA Y to B. Both CAPTCHAs are valid, but if A submits B's CAPTCHA, verification fails. The server tells A and B apart by their cookies. As another example, some sites log you in automatically on your next visit after you have logged in once. This also relies on the cookie as a unique identity, and clearing the cookie ends the automatic login.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Simple CAPTCHAs can be recognized by machine, but ones that even the human eye struggles to read call for more sophisticated techniques. Recognizing a simple CAPTCHA is essentially a matter of processing the CAPTCHA image:</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Grayscale conversion, for example with the imread method in OpenCV.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Image denoising (mean filter, Gaussian filter, and so on).</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Image binarization (after this step the characters in the CAPTCHA are black and the background is white).</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Apply an image-recognition method to read the string in the image, which completes the CAPTCHA recognition.</span></p>
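    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The three preprocessing steps above (grayscale, denoising, binarization) can be sketched in plain Python on a tiny image stored as nested lists. This is a teaching sketch, not the article's own code: the function names, the 3x3 mean filter, and the fixed threshold of 128 are assumptions chosen for illustration. In practice a library such as OpenCV would do this work on real images, and the final recognition step is not shown.</span></p>

```python
def to_grayscale(rgb_image):
    """Convert rows of (R, G, B) tuples into rows of 0-255 gray levels."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in rgb_image]


def mean_filter(gray):
    """3x3 mean filter; border pixels average over the neighbours that exist."""
    height, width = len(gray), len(gray[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [gray[j][i]
                      for j in range(max(0, y - 1), min(height, y + 2))
                      for i in range(max(0, x - 1), min(width, x + 2))]
            row.append(sum(window) // len(window))
        out.append(row)
    return out


def binarize(gray, threshold=128):
    """Dark pixels become 0 (black text), all others 255 (white background)."""
    return [[0 if p < threshold else 255 for p in row] for row in gray]


def preprocess(rgb_image, threshold=128):
    """Grayscale -> denoise -> binarize, in the order listed in the text."""
    return binarize(mean_filter(to_grayscale(rgb_image)), threshold)


if __name__ == "__main__":
    WHITE, BLACK = (255, 255, 255), (0, 0, 0)
    # A white 3x3 image with a single dark noise pixel in the middle.
    noisy = [[WHITE, WHITE, WHITE],
             [WHITE, BLACK, WHITE],
             [WHITE, WHITE, WHITE]]
    for row in preprocess(noisy):
        print(row)
```

    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">An isolated dark pixel is averaged away by the filter and ends up white, while a solid block of dark pixels survives as black, which is the behaviour wanted before handing the image to a recognizer.</span></p>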
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;">Recommended</span> reading:</strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">http://www.jianshu.com/p/dd699561671b</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">http://www.cnblogs.com/hearzeus/p/5166299.html (Part 1)</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">http://www.cnblogs.com/hearzeus/p/5226546.html (Part 2)</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">9.3 Crawler proxy pool</strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">As a beginner with crawlers, the author has not used a technique this sophisticated, but has certainly felt the pain of having an IP address banned while crawling. Readers with the time and energy are encouraged to study the topic and build a proxy pool of their own.</span></p>
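    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">As a starting point, the core idea of a proxy pool (hand out proxies in rotation and retire the ones that keep failing) might look like the sketch below. It is not a complete proxy pool: the class name ProxyPool, the max_failures parameter, and the proxy addresses are all hypothetical, and fetching and validating real proxies is left out.</span></p>

```python
from collections import deque


class ProxyPool:
    """Round-robin pool that drops a proxy after repeated failures."""

    def __init__(self, proxies, max_failures=3):
        self._queue = deque(proxies)
        self._failures = {proxy: 0 for proxy in proxies}
        self.max_failures = max_failures

    def __len__(self):
        return len(self._queue)

    def get(self):
        """Return the next proxy and move it to the back of the queue."""
        if not self._queue:
            raise RuntimeError("proxy pool is empty")
        proxy = self._queue[0]
        self._queue.rotate(-1)
        return proxy

    def report_success(self, proxy):
        """A successful request resets the failure counter of the proxy."""
        if proxy in self._failures:
            self._failures[proxy] = 0

    def report_failure(self, proxy):
        """Count a failure; retire the proxy once it fails too often."""
        if proxy not in self._failures:
            return
        self._failures[proxy] += 1
        if self._failures[proxy] >= self.max_failures:
            self._queue.remove(proxy)
            del self._failures[proxy]


if __name__ == "__main__":
    # Placeholder addresses for illustration only.
    pool = ProxyPool(["http://10.0.0.1:8080", "http://10.0.0.2:8080"],
                     max_failures=2)
    print(pool.get())
    print(pool.get())
```

    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">A crawler would call get() before each request, route the request through the returned proxy, and then call report_success() or report_failure() depending on the outcome.</span></p>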
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;">Recommended</span> reading:</strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">https://www.zhihu.com/question/47464143</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">10. Anti-crawler measures</strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Because aggressive crawlers put heavy load on a website's servers, websites place restrictions on crawlers. Most websites define a robots.txt file that tells crawlers what those restrictions are. The restrictions are given only as recommendations, but checking this file before crawling minimizes the chance that our crawler gets banned.</span></p>
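    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Checking robots.txt before crawling can be done with urllib.robotparser from the Python standard library. The sketch below parses an inline example file so that it runs without network access; the rules, the user agent name MyCrawler, and the example.com URLs are made up for illustration.</span></p>

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would use the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "MyCrawler"  # assumed name of our crawler


def build_parser(robots_text):
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    for url in ("https://example.com/index.html",
                "https://example.com/private/data.html"):
        allowed = parser.can_fetch(USER_AGENT, url)
        print(url, "allowed" if allowed else "disallowed")
    print("crawl delay:", parser.crawl_delay(USER_AGENT))
```

    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Against a live site, the same parser can instead be pointed at the file with set_url() and loaded with read() before calling can_fetch().</span></p>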
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">An article on anti-crawling: https://segmentfault.com/a/1190000005840672 (from the Ctrip Technology Center)</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">11. Learning resources</strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;">Recommended</span> books:</strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">《Python网络数据采集》, translated by 陶俊杰 and 陈小莉</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">《用Python写网络爬虫》, translated by 李斌</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;">Recommended</span> blogs:</strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">崔庆才's personal blog has a large number of articles on crawlers, and the explanations are fairly detailed.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">http://cuiqingcai.com/</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The WeChat public account 数据挖掘与入门实战 shared an article, 《Python开源爬虫项目代码:抓取淘宝、京东、QQ、知网数据》, which lists nineteen open-source crawler projects that can serve as references. <span style="color: black;">https://github.com/hlpassion/blog/issues/6</span></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;">Recommended</span> videos:</strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">网易云课堂 (NetEase Cloud Classroom): the examples are clear and easy to follow along with.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">http://study.163.com/course/introduction.htm?courseId=1002794001#/courseDetail</span></p>
    <h1 style="color: black; text-align: left; margin-bottom: 10px;"><span style="color: black;">Python网络爬虫与信息提取 (Python Web Crawling and Information Extraction)</span></h1>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">http://www.icourse163.org/course/BIT-1001870001</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">Author: </strong><strong style="color: blue;"><span style="color: black;">朱海龙</span></strong><span style="color: black;">, a computer science graduate student at 杭州师范大学 (Hangzhou Normal University) who works mostly in Python and likes its conciseness. Interests include hiking, mountain climbing, and 东野圭吾's emotionally charged mystery novels. Currently working on data analysis for smart healthcare, with the aim of making a contribution in that field. Looking forward to learning new things together with everyone and exploring the world of data!</span></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">Reposted from the 数据派THU public account.</span></strong></p>




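听听海 posted on 2024-9-1 15:00:47

I completely agree with your point; well said.

听听海 posted on 2024-9-8 08:12:21

The original poster's article really speaks to me. My sincere thanks!

听听海 posted on 2024-9-9 03:54:58

The system has told me my CAPTCHA was wrong 1500 times ~

7wu1wm0 posted on 2024-9-26 18:13:52

The forum's success is built on our sincerity, pragmatism, efficiency, innovation, and teamwork, and we should pass that spirit on.

m5k1umn posted on 2024-9-28 23:02:55

I fully agree with your view; your thinking has real depth.

7wu1wm0 posted on 2024-10-1 08:03:48

The successful running of this backlink forum would not have been possible without the care and support of all our leaders and colleagues. On behalf of the company, I sincerely thank everyone who has cared about and supported the forum!

wrjc1hod posted on 2024-10-7 23:54:37

"NB" (short for 牛×, an expression of amazement)

b1gc8v posted on 2024-11-8 03:38:44

"板凳" ("bench": the third person to reply to a thread)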