33 Open-Source Web Crawlers for Collecting Data
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Source: visiontry</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">www.jianshu.com/p/956de3ca0578</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">You cannot work with big data if you have no data. Here are 33 open-source crawlers worth knowing about.</p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">A crawler, or web crawler, is a program that automatically fetches web page content. It is an important component of a search engine, which is why search engine optimization is largely optimization aimed at crawlers.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">A web crawler is a program that automatically retrieves web pages; it downloads pages from the World Wide Web on behalf of a search engine. A traditional crawler starts from the URLs of one or more seed pages, obtains the URLs found on those pages, and, as it fetches pages, keeps extracting new URLs from the current page and adding them to a queue until the system's stop condition is met. A focused crawler has a more complex workflow: it uses a page-analysis algorithm to filter out links unrelated to its topic, keeps the useful links, and puts them in the queue of URLs waiting to be fetched. It then picks the next URL from the queue according to a search strategy and repeats the process until some system condition is reached. In addition, every page the crawler fetches is stored by the system, analyzed, filtered, and indexed for later query and retrieval; for a focused crawler, the results of this analysis may also provide feedback and guidance for subsequent crawling.</span></p>
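<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The queue-driven loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: <code>FAKE_WEB</code> is a hypothetical in-memory stand-in for the network, and <code>crawl</code>, <code>fetch_links</code>, <code>max_pages</code>, and <code>is_relevant</code> are names invented for this example, not part of any tool listed below. The stop condition here is simply an empty queue or a page limit, and the "focused" variant reduces topic filtering to a single predicate on the URL.</span></p>

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}


def fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return FAKE_WEB.get(url, [])


def crawl(seeds, fetch_links, max_pages=100, is_relevant=lambda url: True):
    """Breadth-first crawl from the seed URLs; returns URLs in visit order."""
    queue = deque(seeds)   # URLs waiting to be fetched
    seen = set(seeds)      # URLs already queued, to avoid revisiting
    visited = []
    # Stop condition: the queue is empty or the page limit is reached.
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            # A focused crawler drops links it judges unrelated to its topic.
            if link not in seen and is_relevant(link):
                seen.add(link)
                queue.append(link)
    return visited


# Traditional crawl: follow every link.
order = crawl(["http://example.com/"], fetch_links)

# Focused crawl: a toy relevance rule that rejects URLs ending in "/b".
focused = crawl(["http://example.com/"], fetch_links,
                is_relevant=lambda u: not u.endswith("/b"))
```

<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">A real crawler would replace <code>fetch_links</code> with an HTTP download plus HTML parsing, and a real focused crawler would score page content rather than test the URL string.</span></p>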
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">There are already hundreds of mature crawler programs in the world. This article surveys the better-known and more common open-source ones, grouped by development language. Search engines have crawlers too, but this list covers only crawler software, not large, complex search engines, because many readers simply want to collect data rather than run a search engine.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><img src="https://mmbiz.qpic.cn/mmbiz_jpg/vqlbVFl5Jn3EJbqnVYwZmvUWyzR2VV4ms8wOC7mQV6GaBXTHFOYqFSMmic5MrMLqlAHuyovXqicFcNFHLL34aasg/640?wx_fmt=jpeg&tp=webp&wxfrom=5&wx_lazy=1&wx_co=1" style="width: 50%; margin-bottom: 20px;"></p>
<h1 style="color: black; text-align: left; margin-bottom: 10px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;">Java Crawlers</span></strong></span></h1>
<h2 style="color: black; text-align: left; margin-bottom: 10px;"><strong style="color: blue;"><span style="color: black;">1. Arachnid</span></strong></h2>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Arachnid is a Java-based web spider framework. It includes a simple HTML parser that can analyze an input stream containing HTML content. By subclassing Arachnid you can build a simple web spider and add a few lines of code that are called after each page on a web site has been parsed. The Arachnid download package includes two sample spider applications that demonstrate how to use the framework.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Features: a micro crawler framework with a small built-in HTML parser</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">License: GPL</span></p>
<h2 style="color: black; text-align: left; margin-bottom: 10px;"><strong style="color: blue;"><span style="color: black;">2. crawlzilla</span></strong></h2>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">crawlzilla is free software that helps you build a search engine easily. With it, you no longer need to rely on commercial companies' search engines or worry about indexing the data on your company's internal web sites.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">It is built around the Nutch project, integrates additional related packages, and adds a purpose-built installation and management UI so that users can get started more easily.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Besides crawling basic HTML, crawlzilla can analyze files found on web pages in many formats (doc, pdf, ppt, ooo, rss, and others), so your search engine is not just a web page search engine but a complete index of a site's data.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">It supports Chinese word segmentation, which makes searches more precise.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">crawlzilla's defining feature and main goal is to give users a search platform that is convenient, easy to use, and easy to install.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">License: Apache License 2</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Language: Java, JavaScript, shell</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Operating system: Linux</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Project home page: https://github.com/shunfa/crawlzilla</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Download: http://sourceforge.net/projects/crawlzilla/</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Features: easy to install; supports Chinese word segmentation</span></p>
<h2 style="color: black; text-align: left; margin-bottom: 10px;"><strong style="color: blue;"><span style="color: black;">3. Ex-Crawler</span></strong></h2>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Ex-Crawler is a web crawler written in Java. The project has two parts: a daemon and a flexible, configurable web crawler. It stores page information in a database.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">License: GPLv3</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Language: Java</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Operating system: cross-platform</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Features: runs as a daemon; stores page information in a database</span></p>
<h2 style="color: black; text-align: left; margin-bottom: 10px;"><strong style="color: blue;"><span style="color: black;">4. Heritrix</span></strong></h2>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Heritrix is an open-source web crawler written in Java that lets users fetch the resources they want from the web. Its greatest strength is its extensibility, which makes it easy for users to implement their own crawl logic.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Heritrix has a modular design. The modules are coordinated by a controller class (the CrawlController class), which is the core of the whole system.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Source code: https://github.com/internetarchive/heritrix3</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">License: Apache</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Language: Java</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Operating system: cross-platform</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Features: strictly follows the exclusion directives in robots.txt and in META robots tags</span></p>
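<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">To show what honoring robots.txt exclusion rules looks like in practice, here is a small sketch using Python's standard <code>urllib.robotparser</code> module. It is not Heritrix code (Heritrix is a Java application); the robots.txt content, the <code>example-bot</code> user agent, and the <code>allowed</code> helper are assumptions made up for the example, and the rules are parsed from a string so that nothing is fetched from the network. META robots tags are a separate, per-page mechanism that this sketch does not cover.</span></p>

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an imaginary site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Parse the rules from memory instead of downloading them.
parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="example-bot"):
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


public_ok = allowed("http://example.com/index.html")
private_ok = allowed("http://example.com/private/data.html")
```

<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">A polite crawler would run a check like this before queueing each URL and skip any URL for which it returns false.</span></p>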
<h2 style="color: black; text-align: left; margin-bottom: 10px;"><strong style="color: blue;"><span style="color: black;">5. heyDr</span></strong></h2>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><img src="https://mmbiz.qpic.cn/mmbiz_png/bLmy0N4HIafbZUkNvRibiaAkRiayGMLn1SqOPUsXvS4UmPqT2GicZictRuPgubdiaU15EWTWFqeaZM44PXTe0WndlAOg/640?wx_fmt=png&tp=webp&wxfrom=5&wx_lazy=1&wx_co=1" style="width: 50%; margin-bottom: 20px;"></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">heyDr is a lightweight, open-source, multithreaded vertical-search crawler framework written in Java and released under the GNU GPL v3.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Users can build their own vertical resource crawlers with heyDr to prepare data ahead of setting up a vertical search engine.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">License: GPLv3</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Language: Java</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Operating system: cross-platform</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Features: lightweight, open-source, multithreaded vertical-search crawler framework</span></p>
<h2 style="color: black; text-align: left; margin-bottom: 10px;"><strong style="color: blue;"><span style="color: black;">6. ItSucks</span></strong></h2>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">ItSucks is an open-source Java web spider (web robot, crawler) project. Download rules can be defined with download templates and regular expressions. It provides a Swing GUI.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Features: provides a Swing GUI</span></p>
<h2 style="color: black; text-align: left; margin-bottom: 10px;"><strong style="color: blue;"><span style="color: black;">7. jcrawl</span></strong></h2>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">jcrawl is a small, high-performance web crawler. It can fetch many types of files from web pages based on user-defined patterns, such as email or qq.</span></p>
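<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">One way to read "user-defined patterns such as email or qq" is as named regular expressions applied to page text. The sketch below illustrates that idea in Python; it is not jcrawl's actual mechanism or configuration format. The <code>PATTERNS</code> table, the <code>extract</code> function, and the sample text are all assumptions, and the two regular expressions are deliberately simple rather than complete validators.</span></p>

```python
import re

# Hypothetical user-defined patterns, keyed by a short name.
PATTERNS = {
    # A simple email shape: local part, "@", domain, dot, top-level domain.
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # A QQ number written as "QQ: 12345678" (5 to 11 digits, no leading zero).
    "qq": re.compile(r"QQ:\s*([1-9]\d{4,10})"),
}


def extract(text):
    """Return every match of each named pattern found in text."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}


page_text = "Contact: alice@example.com or bob@example.org, QQ: 12345678"
found = extract(page_text)
```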
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">License: Apache</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Language: Java</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Operating system: cross-platform</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Features: lightweight and fast; can fetch many types of files from web pages</span></p>
<h2 style="color: black; text-align: left; margin-bottom: 10px;"><strong style="color: blue;"><span style="color: black;">8. JSpider</span></strong></h2>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">JSpider is a web spider implemented in Java. It is invoked as follows:</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">jspider [URL] [ConfigName]</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The URL must include the protocol name, such as http://, or an error is reported. If ConfigName is omitted, the default configuration is used.</span></p>
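<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The protocol requirement is easy to check before handing a URL to any crawler. The following Python sketch uses the standard <code>urllib.parse</code> module; the <code>has_protocol</code> helper is a name invented here, and restricting the accepted schemes to http and https is an assumption of the example, not something JSpider is documented to do.</span></p>

```python
from urllib.parse import urlparse


def has_protocol(url):
    """Return True if url has an http(s) scheme and a host part."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


with_scheme = has_protocol("http://www.example.com/index.html")
without_scheme = has_protocol("www.example.com/index.html")
```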
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">JSpider's behavior is determined by its configuration files: which plugins to use, how results are stored, and so on are all set under the conf\ directory. JSpider ships with few default configurations, and they are of limited use. However, JSpider is very easy to extend, and you can use it to build powerful web scraping and data analysis tools. Doing so requires a solid understanding of how JSpider works, after which you develop plugins and write configuration files to suit your needs.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">License: LGPL</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Language: Java</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Operating system: cross-platform</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Features: powerful and easy to extend</span></p>
<h2 style="color: black; text-align: left; margin-bottom: 10px;"><strong style="color: blue;"><span style="color: black;">9. Leopdo</span></strong></h2>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">A web search engine and crawler written in Java, including full-text search, categorized vertical search, and a word segmentation system.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">License: Apache</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Language: Java</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Operating system: cross-platform</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Features: includes full-text search, categorized vertical search, and a word segmentation system</span></p>
<h2 style="color: black; text-align: left; margin-bottom: 10px;"><strong style="color: blue;"><span style="color: black;">10. MetaSeeker</span></strong></h2>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">MetaSeeker is a complete solution for web content crawling, formatting, data integration, storage management, and search.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Web crawlers can be implemented in several ways. Classified by where they are deployed, they fall into two groups:</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">(1) Server side: usually a multithreaded program that downloads many target HTML pages at once. It can be written in PHP, Java, or Python (currently very popular) and can be made very fast; general-purpose search engine crawlers typically work this way. However, if the target site dislikes crawlers it may well block your IP address, server IP addresses are not easy to change, and the bandwidth consumed is fairly expensive. Beautiful Soup is worth a look.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">(2) Client side: usually used for topic-specific, or focused, crawlers. Building a general-purpose search engine this way is unlikely to succeed, but vertical search, price comparison, or recommendation engines are much easier. This kind of crawler does not fetch every page, only the pages you care about, and only the content you care about on those pages, such as yellow-pages listings, product prices, or competitors' advertising information (look up Spyfu, it is quite interesting). Such crawlers can be deployed in large numbers and can be quite aggressive, and they are hard for the target site to block.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">MetaSeeker's web crawler belongs to the latter group.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The MetaSeeker toolkit builds on the Mozilla platform: anything Firefox can see, it can extract.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The MetaSeeker toolkit is free to use. Download: www.gooseeker.com/cn/node/download/front</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Features: a toolkit for web crawling, information extraction, and data extraction; simple to operate</span></p>
<h2 style="color: black; text-align: left; margin-bottom: 10px;"><strong style="color: blue;"><span style="color: black;">11. Playfish</span></strong></h2>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">playfish is a web scraping tool built with Java that combines several open-source Java components. XML configuration files make it highly customizable and extensible.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The open-source JARs it uses include httpclient (content fetching), dom4j (configuration file parsing), and jericho (HTML parsing); they are already in the lib directory of the WAR package.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The project is still quite immature, but its basic features are complete. Users need to be familiar with XML and regular expressions. The tool can currently scrape all kinds of forums, Tieba boards, and CMS systems; posts from Discuz!, phpbb, other forums, and blogs can all be scraped easily. Scraping rules are defined entirely in XML, which suits Java developers.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Usage: 1. Download the .war package and import it into Eclipse. 2. Use the wcc.sql file under WebContent/sql to create a sample database. 3. Edit dbConfig.txt in wcc.core under the src package, setting the user name and password to your own MySQL user name and password. 4. Run SystemCore from the console; with no argument it executes the default example.xml configuration file, and with an argument, the argument is taken as the configuration file name.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The system ships with three examples: baidu.xml scrapes Baidu Zhidao, example.xml scrapes the author's JavaEye blog, and bbs.xml scrapes content from a Discuz-based forum.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">License: MIT</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Language: Java</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Operating system: cross-platform</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Features: highly customizable and extensible through XML configuration files</span></p>
<h2 style="color: black; text-align: left; margin-bottom: 10px;"><strong style="color: blue;"><span style="color: black;">12. Spiderman</span></strong></h2>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Spiderman is a web spider built on a microkernel plus plugin architecture. Its goal is to make it simple to crawl complex target pages and parse them into the business data you need.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">How do you use it?</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">First, decide on your target site and target pages (that is, a class of pages you want data from, such as article pages on NetEase News).</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Next, open a target page, analyze its HTML structure, and work out the XPath of the data you want; see below for how to obtain the XPath.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Finally, fill in the parameters in an XML configuration file and run Spiderman.</span></p>
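<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The middle step, locating data by XPath, can be tried out in isolation. The sketch below uses Python's standard <code>xml.etree.ElementTree</code>, which supports only a limited subset of XPath and requires well-formed markup; it is not Spiderman itself, which is a Java tool driven by an XML configuration file. The sample page, its class names, and the two XPath expressions are assumptions made up for the illustration.</span></p>

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page standing in for a news article.
PAGE = """
<html>
  <body>
    <div class="article">
      <h1>Sample headline</h1>
      <p class="body">First paragraph.</p>
      <p class="body">Second paragraph.</p>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# XPath for the headline: the h1 directly inside the article div.
title = root.find(".//div[@class='article']/h1").text

# XPath for the body text: every p element with class "body".
paragraphs = [p.text for p in root.findall(".//p[@class='body']")]
```

<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Real-world HTML is rarely well-formed XML, so a production setup would normally pair XPath with a lenient HTML parser rather than a strict XML one.</span></p>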
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">License: Apache</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Language: Java</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Operating system: cross-platform</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Features: flexible and highly extensible; microkernel plus plugin architecture; data can be scraped through simple configuration without writing a line of code</span></p>
<h2 style="color: black; text-align: left; margin-bottom: 10px;"><strong style="color: blue;"><span style="color: black;">13. webmagic</span></strong></h2>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">webmagic is a crawler framework that requires no configuration and is easy to build on. It provides a simple, flexible API, and a crawler can be implemented with only a small amount of code.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><img src="https://mmbiz.qpic.cn/mmbiz_jpg/vqlbVFl5Jn3EJbqnVYwZmvUWyzR2VV4mTvMkdmp1YibYZxliaUibfBXGh62eWdb1PRYRgSgX7YyCCfn8OX87OLj6g/640?wx_fmt=jpeg&tp=webp&wxfrom=5&wx_lazy=1&wx_co=1" style="width: 50%; margin-bottom: 20px;"></p>
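<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">webmagic has a fully modular design whose features cover the entire crawler life cycle (link extraction, page download, content extraction, and persistence). It supports multithreaded and distributed crawling, along with automatic retries and custom User-Agent and cookie settings.</span></p>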
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><img src="https://mmbiz.qpic.cn/mmbiz_jpg/bLmy0N4HIafbZUkNvRibiaAkRiayGMLn1SquX815KnNaqiaEC9KQp8VhB4IervrpHRclicU20W0q50SdQYxYSHvmb2w/640?wx_fmt=jpeg&tp=webp&wxfrom=5&wx_lazy=1&wx_co=1" style="width: 50%; margin-bottom: 20px;"></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">webmagic includes powerful page extraction features. Developers can conveniently use CSS selectors, XPath, and regular expressions to extract links and content, and multiple selectors can be chained.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">webmagic documentation:</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">http://webmagic.io/docs/</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Source code:</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">http://git.oschina.net/flashsword20/webmagic</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">License: Apache</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Language: Java</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Operating system: cross-platform</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Features: covers the entire crawler life cycle; uses XPath and regular expressions to extract links and content.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Note: this is open-source software from China, contributed by Huang Yihua (黄亿华).</span></p>
<h2 style="color: black; text-align: left; margin-bottom: 10px;"><strong style="color: blue;"><span style="color: black;">14. Web-Harvest</span></strong></h2>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Web-Harvest is an open-source Java tool for web data extraction. It collects specified web pages and extracts useful data from them. Web-Harvest relies mainly on technologies such as XSLT, XQuery, and regular expressions to operate on text and XML.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">It works as follows: following a predefined configuration file, it uses httpclient to fetch the full content of a page (httpclient is covered in other posts on this blog), then applies XPath, XQuery, regular expressions, and similar technologies to filter the text/XML content and select exactly the data needed. Vertical search services that were popular a couple of years ago (Kuxun, for example) were built on a similar principle. The key to using Web-Harvest is understanding and writing the configuration file; the rest is the Java code that decides how to process the data. Before a crawl starts, you can also inject Java variables into the configuration file to make the configuration dynamic.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">License: BSD</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Language: Java</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Features: uses XSLT, XQuery, regular expressions, and similar technologies to operate on text or XML; has a visual interface</span></p>
<h2 style="color: black; text-align: left; margin-bottom: 10px;"><strong style="color: blue;"><span style="color: black;">15. WebSPHINX</span></strong></h2>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">WebSPHINX is a Java class library and an interactive development environment for web crawlers. Web crawlers (also called robots or spiders) are programs that automatically browse and process web pages. WebSPHINX consists of two parts: the crawler workbench and the WebSPHINX class library.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">License: Apache</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Language: Java</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Features: consists of two parts, the crawler workbench and the WebSPHINX class library</span></p>
<h2 style="color: black; text-align: left; margin-bottom: 10px;"><strong style="color: blue;"><span style="color: black;">16. YaCy</span></strong></h2>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">YaCy is a P2P-based distributed web search engine that also works as an HTTP caching proxy server. The project is a new approach to building a P2P-based web index network. It can search your own index or the global one, crawl your own pages, or start distributed crawling.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">License: GPL</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Language: Java Perl</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Operating system: cross-platform</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Features: P2P-based distributed web search engine</span></p>
<h1 style="color: black; text-align: left; margin-bottom: 10px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;">Python Crawlers</span></strong></span></h1>
<h2 style="color: black; text-align: left; margin-bottom: 10px;"><strong style="color: blue;"><span style="color: black;">17. QuickRecon</span></strong></h2>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">QuickRecon is a simple information-gathering tool. It can help you find subdomain names, perform zone transfers, collect email addresses, and use microformats to discover relationships between people. QuickRecon is written in Python and supports Linux and Windows.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">License: GPLv3</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Language: Python</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Operating system: Windows Linux</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Features: finds subdomain names, collects email addresses, and discovers relationships between people</span></p>
<h2 style="color: black; text-align: left; margin-bottom: 10px;"><strong style="color: blue;"><span style="color: black;">18. PyRailgun</span></strong></h2>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">This is a very easy-to-use scraping tool: a simple, practical, and efficient Python web-crawling module that supports scraping JavaScript-rendered pages.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">License: MIT</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Language: Python</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Operating system: cross-platform (Windows, Linux, OS X)</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Features: a clean, lightweight, efficient web-scraping framework</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Note: this software was also developed by a Chinese developer</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">GitHub download:</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">https://github.com/princehaku/pyrailgun#readme</span></p>
<h2 style="color: black; text-align: left; margin-bottom: 10px;"><strong style="color: blue;"><span style="color: black;">19. Scrapy</span></strong></h2>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Scrapy is a crawler framework implemented in pure Python on top of the Twisted asynchronous networking framework. Users only need to customize a few modules to build a crawler that scrapes web page content and images of all kinds, which makes it very convenient.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">License: BSD</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Language: Python</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Operating system: cross-platform</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">GitHub source: https://github.com/scrapy/scrapy</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Features: asynchronous framework based on Twisted, with complete documentation</span></p>
<h1 style="color: black; text-align: left; margin-bottom: 10px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;">C++ Crawlers</span></strong></span></h1>
<h2 style="color: black; text-align: left; margin-bottom: 10px;"><strong style="color: blue;"><span style="color: black;">20. hispider</span></strong></h2>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">HiSpider is a fast, high-performance spider.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Strictly speaking, it is only the framework of a spider system, without detailed requirements. At present it can extract URLs, deduplicate URLs, resolve DNS asynchronously, queue tasks, download across N machines in a distributed fashion, and restrict downloading to specific sites (this requires configuring the whitelist in hispiderd.ini).</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Features and usage:</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Developed for Unix/Linux systems</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Asynchronous DNS resolution</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">URL deduplication</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Supports HTTP compressed transfer encoding (gzip/deflate)</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Detects the character set and automatically converts to UTF-8</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Compressed document storage</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Supports distributed downloading across multiple download nodes</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Supports site-targeted downloading (requires configuring the whitelist in hispiderd.ini)</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Download statistics can be viewed, and download tasks controlled (stopped and resumed), at http://127.0.0.1:3721/</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Depends on the base communication libraries libevbase and libsbase (both must be installed first).</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Workflow:</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Fetch a URL from the central node (including the task ID, IP, and port associated with the URL; the download node may also need to resolve these itself)</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Connect to the server and send the request</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Wait for the response headers and check whether the data is wanted (currently mainly text-type data is taken)</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Wait for the data to complete (if there is a length header, wait for exactly that many bytes; otherwise wait for a fairly large amount and set a timeout)</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">When the data is complete or the timeout fires, compress the data with zlib and return it to the central server. The returned data may include self-resolved DNS information, followed by the compressed length and the compressed data. On error, return just the task ID and related information</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The central server receives the data tagged with the task ID and checks whether it contains a payload. If not, it marks the task's status as an error; if so, it extracts the links from the data and stores the data in a document file.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">After finishing, it returns a new task.</span></p>
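<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The return leg of this workflow can be sketched in Python as follows. HiSpider itself is written in C/C++ and its real wire format is not described here, so the header layout (task ID plus compressed length), the function names <code>pack_result</code> and <code>handle_result</code>, and the regex-based link extraction are all assumptions made for illustration.</span></p>

```python
import re
import struct
import zlib

# Hypothetical wire format: 4-byte task id, 4-byte compressed length, then data.
HEADER = struct.Struct("!II")


def pack_result(task_id, body):
    """Download node: compress the page and prefix it with task id and length."""
    if body is None:  # download failed or timed out without usable data
        return HEADER.pack(task_id, 0)
    compressed = zlib.compress(body)
    return HEADER.pack(task_id, len(compressed)) + compressed


def handle_result(message, task_status, store):
    """Central server: update the task state, store the page, return its links."""
    task_id, length = HEADER.unpack_from(message)
    if length == 0:
        task_status[task_id] = "error"
        return []
    payload = message[HEADER.size:HEADER.size + length]
    body = zlib.decompress(payload)
    store[task_id] = body
    task_status[task_id] = "done"
    # Naive link extraction, good enough for a sketch.
    return re.findall(rb'href="([^"]+)"', body)


status, store = {}, {}
links = handle_result(
    pack_result(7, b'<a href="http://example.com/a">a</a>'), status, store
)
failed = handle_result(pack_result(8, None), status, store)
print(links, status)
```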
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">License: BSD</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Language: C/C++</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Operating system: Linux</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Features: supports distributed downloading across multiple machines and site-targeted downloading</span></p>
<h2 style="color: black; text-align: left; margin-bottom: 10px;"><strong style="color: blue;"><span style="color: black;">21. larbin</span></strong></h2>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">larbin is an open-source web crawler/web spider developed independently by the young Frenchman Sébastien Ailleret. Its goal is to follow the URLs on pages to extend the crawl, ultimately providing search engines with a broad data source. larbin is only a crawler: it only fetches pages, and parsing is left to the user. Likewise, larbin does not handle storing pages in a database or building an index. A simple larbin crawler can fetch 5 million pages per day.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">With larbin, we can easily fetch or identify all the links of a single site, or even mirror a site. It can also be used to build URL lists, for example by retrieving the URLs from all pages and then collecting the XML links or the MP3 links. A customized larbin can also serve as an information source for a search engine.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">License: GPL</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Language: C/C++</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Operating system: Linux</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Features: high-performance crawler that only fetches pages and leaves parsing to the user</span></p>
<h2 style="color: black; text-align: left; margin-bottom: 10px;"><strong style="color: blue;"><span style="color: black;">22. Methabot</span></strong></h2>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Methabot is a speed-optimized, highly configurable crawler for the web, FTP, and local file systems.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">License: unknown</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Language: C/C++</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Operating system: Windows Linux</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Features: speed-optimized; can crawl the web, FTP, and local file systems</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Source code:</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">http://www.oschina.net/code/tag/methabot</span></p>
<h1 style="color: black; text-align: left; margin-bottom: 10px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;">C# Crawlers</span></strong></span></h1>
<h2 style="color: black; text-align: left; margin-bottom: 10px;"><strong style="color: blue;"><span style="color: black;">23. NWebCrawler</span></strong></h2>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">NWebCrawler is an open-source web crawler written in C#.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Features:</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Configurable: thread count, wait time, connection timeout, allowed MIME types and priorities, download folder.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Statistics: number of URLs, total downloaded files, total downloaded bytes, CPU utilization, and available memory.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Preferential crawler: users can set priorities for MIME types.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Robust: 10+ URL normalization rules, crawler trap avoiding rules.</span></p>
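<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">As an illustration of what URL normalization rules do, here is a minimal Python sketch. The rules shown (lowercase scheme and host, drop default ports, resolve dot segments, sort query parameters, strip fragments) are common choices and are not taken from NWebCrawler's source; <code>normalize_url</code> is a hypothetical name.</span></p>

```python
import posixpath
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Map equivalent spellings of a URL to one canonical form."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # note: userinfo is dropped in this sketch
    port = parts.port
    # Keep the port only when it is not the default for the scheme.
    netloc = host if port in (None, DEFAULT_PORTS.get(scheme)) else f"{host}:{port}"
    # Resolve "." and ".." segments; an empty path becomes "/".
    path = posixpath.normpath(parts.path) if parts.path else "/"
    if parts.path.endswith("/") and not path.endswith("/"):
        path += "/"  # normpath strips a trailing slash, so restore it
    # Sort query parameters so that their order does not matter.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # The fragment never reaches the server, so drop it.
    return urlunsplit((scheme, netloc, path, query, ""))


seen = set()
for candidate in (
    "HTTP://Example.COM:80/a/./b/../c?z=1&a=2#frag",
    "http://example.com/a/c?a=2&z=1",
):
    seen.add(normalize_url(candidate))
print(seen)
```

<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Both spellings collapse to the same key, so a crawler that tracks visited pages by normalized URL fetches the page only once.</span></p>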
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">License: GPLv2</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Language: C#</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Operating system: Windows</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Project home page: http://www.open-open.com/lib/view/home/1350117470448</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Features: statistics and a visualized execution process</span></p>
<h2 style="color: black; text-align: left; margin-bottom: 10px;"><strong style="color: blue;"><span style="color: black;">24. Sinawler</span></strong></h2>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The first crawler in China aimed at Weibo (microblog) data! Formerly named "Sina Weibo Crawler".</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">After logging in, you can specify a user as the starting point and, using that user's followees and followers as leads, follow the social graph to collect basic user information, Weibo posts, and comments.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The data obtained by this application can support research and Sina Weibo-related development, but must not be used for commercial purposes. The application is based on the .NET 2.0 framework and requires SQL Server as the back-end database; a database script file for SQL Server is provided.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">In addition, because of limits in the Sina Weibo API, the crawled data may be incomplete (for example, limits on the number of followers or the number of posts that can be retrieved).</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Copyright of this program belongs to the author. You are free to copy, distribute, display, and perform the work, and to make derivative works. You may not use the work for commercial purposes.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Version 5.x has been released! This version has six background worker threads: a robot that crawls basic user information, one for user relationships, one for user tags, one for Weibo post content, one for Weibo comments, and one that adjusts the request rate. Higher performance, squeezing the most out of the crawler! Judging from current test results, it is already adequate for personal use.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Features of this program:</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Six background worker threads that maximize crawler performance</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Parameter settings available in the UI, flexible and convenient</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Drops the app.config configuration file in favor of its own encrypted storage of configuration data, protecting database account information</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Automatically adjusts the request rate to avoid exceeding limits, while also avoiding running too slowly and losing efficiency</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Full control over the crawler: pause, resume, or stop it at any time</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Good user experience</span></p>
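<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">One simple way to adjust the request rate automatically is to spread the remaining API quota evenly over the time left in the current rate-limit window. The Python sketch below shows that idea only; it is an assumption about how such a regulator could work, not Sinawler's actual algorithm, and <code>next_delay</code> and its parameters are hypothetical.</span></p>

```python
def next_delay(remaining_calls, seconds_until_reset, min_delay=0.2, max_delay=60.0):
    """Return how long to sleep before the next API request, in seconds."""
    if remaining_calls <= 0:
        # Quota exhausted: wait for the window to reset.
        return max(seconds_until_reset, min_delay)
    # Spread the remaining calls evenly over the remaining time.
    delay = seconds_until_reset / remaining_calls
    # Clamp: never hammer the API, never stall longer than max_delay.
    return max(min_delay, min(delay, max_delay))


# 100 calls left for the next hour: one request every 36 seconds.
print(next_delay(100, 3600))
# Plenty of quota left: fall back to the minimum spacing.
print(next_delay(100000, 3600))
# Nothing left: sleep until the window resets.
print(next_delay(0, 120))
```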
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">License: GPLv3</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Language: C# .NET</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Operating system: Windows</span></p>
<h2 style="color: black; text-align: left; margin-bottom: 10px;"><strong style="color: blue;"><span style="color: black;">25. spidernet</span></strong></h2>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">spidernet is a multithreaded web crawler modeled on a recursive tree. It fetches text/html resources, lets you set the crawl depth and a maximum download size in bytes, supports gzip decoding, handles resources encoded in GBK (GB2312) and UTF-8, and stores results in a SQLite data file.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">TODO: markers in the source describe unfinished features; code contributions are welcome.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">License: MIT</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Language: C#</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Operating system: Windows</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">GitHub source: https://github.com/nsnail/spidernet</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Features: multithreaded web crawler modeled on a recursive tree; supports GBK (GB2312) and UTF-8 encoded resources; stores data in SQLite</span></p>
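<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The recursive-tree model with a depth limit, a size limit, and SQLite storage can be sketched in a few lines of Python. This is a single-threaded illustration, not spidernet's C# code: the <code>SITE</code> dictionary is a hypothetical in-memory stand-in for the network, and the table layout is an assumption.</span></p>

```python
import sqlite3

# Hypothetical in-memory "web": URL -> (body, outgoing links).
SITE = {
    "/": ("<h1>home</h1>", ["/a", "/b"]),
    "/a": ("<p>page a</p>", ["/a/deep", "/"]),
    "/b": ("<p>" + "x" * 500 + "</p>", []),
    "/a/deep": ("<p>deep</p>", []),
}


def crawl(url, depth, max_depth, max_bytes, db, seen):
    """Visit url, store it, then recurse into its children (a tree walk)."""
    if depth > max_depth or url in seen or url not in SITE:
        return
    seen.add(url)
    body, links = SITE[url]
    if len(body.encode("utf-8")) > max_bytes:
        return  # too large: do not store it and do not descend further
    db.execute(
        "INSERT INTO pages (url, depth, body) VALUES (?, ?, ?)", (url, depth, body)
    )
    for child in links:
        crawl(child, depth + 1, max_depth, max_bytes, db, seen)


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, depth INTEGER, body TEXT)")
crawl("/", 0, max_depth=1, max_bytes=100, db=db, seen=set())
stored = db.execute("SELECT url, depth FROM pages ORDER BY url").fetchall()
print(stored)
```

<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">With a maximum depth of 1 and a 100-byte limit, only the root and the small first-level page are stored; the oversized page and the second-level page are skipped.</span></p>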
<h2 style="color: black; text-align: left; margin-bottom: 10px;"><strong style="color: blue;"><span style="color: black;">26. Web Crawler</span></strong></h2>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Smart and Simple Web Crawler is a web crawler framework with integrated Lucene support. The crawler can start from a single link or an array of links and offers two traversal modes: maximum iterations and maximum depth. Filters can be set to restrict which links are crawled. Three filters are provided by default, ServerFilter, BeginningPathFilter, and RegularExpressionFilter, and they can be combined with AND, OR, and NOT. Listeners can be added during parsing and before or after a page is loaded. (Description from Open-Open.)</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Language: Java</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Operating system: cross-platform</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">License: LGPL</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Features: multithreaded; supports crawling document sources such as PDF, DOC, and Excel</span></p>
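<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">How link filters can be combined with AND, OR, and NOT is easiest to see in code. The Python sketch below mirrors the three filter kinds named above, but it is not the framework's Java API: the <code>Filter</code> class and the helper functions are hypothetical analogues written for illustration.</span></p>

```python
import re
from urllib.parse import urlsplit


class Filter:
    """Wraps a URL predicate and supports & (AND), | (OR), and ~ (NOT)."""

    def __init__(self, predicate):
        self.predicate = predicate

    def __call__(self, url):
        return self.predicate(url)

    def __and__(self, other):
        return Filter(lambda url: self(url) and other(url))

    def __or__(self, other):
        return Filter(lambda url: self(url) or other(url))

    def __invert__(self):
        return Filter(lambda url: not self(url))


def server_filter(host):
    """Accept only links on the given host."""
    return Filter(lambda url: urlsplit(url).hostname == host)


def beginning_path_filter(prefix):
    """Accept only links whose path starts with prefix."""
    return Filter(lambda url: urlsplit(url).path.startswith(prefix))


def regex_filter(pattern):
    """Accept only links that match the regular expression."""
    compiled = re.compile(pattern)
    return Filter(lambda url: compiled.search(url) is not None)


# Stay on example.com, under /docs, and skip PDF files.
allowed = (
    server_filter("example.com")
    & beginning_path_filter("/docs")
    & ~regex_filter(r"\.pdf$")
)

for link in (
    "http://example.com/docs/intro.html",
    "http://example.com/docs/manual.pdf",
    "http://other.org/docs/intro.html",
):
    print(link, allowed(link))
```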
<h2 style="color: black; text-align: left; margin-bottom: 10px;"><strong style="color: blue;"><span style="color: black;">27. 网络矿工 ("Network Miner")</span></strong></h2>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Website data collection software: the 网络矿工 collector (formerly Soukey采摘)</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Soukey采摘 is an open-source website data collection program based on the .NET platform, and is described as the only open-source program in the website data collection category. Being open source does not limit its functionality; it is even richer in features than some commercial products.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">License: BSD</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Language: C# .NET</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Operating system: Windows</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Features: feature-rich, on par with commercial software</span></p>
<h1 style="color: black; text-align: left; margin-bottom: 10px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;">PHP Crawlers</span></strong></span></h1>
<h2 style="color: black; text-align: left; margin-bottom: 10px;"><strong style="color: blue;"><span style="color: black;">28. OpenWebSpider</span></strong></h2>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">OpenWebSpider is an open-source multithreaded web spider (robot, crawler) and a search engine with many interesting features.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">License: unknown</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Language: PHP</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Operating system: cross-platform</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Features: open-source multithreaded web crawler with many interesting features</span></p>
<h2 style="color: black; text-align: left; margin-bottom: 10px;"><strong style="color: blue;"><span style="color: black;">29. PhpDig</span></strong></h2>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">PhpDig is a web crawler and search engine written in PHP. It builds a vocabulary by indexing dynamic and static pages. When a query is run, it displays the result pages that contain the keywords, ordered by its ranking rules. PhpDig includes a template system and can index PDF, Word, Excel, and PowerPoint documents. PhpDig suits more specialized, deeper, personalized search engines, and is well suited to building a vertical search engine for a particular domain.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Demo: http://www.phpdig.net/navigation.php?action=demo</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">License: GPL</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Language: PHP</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Operating system: cross-platform</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Features: can collect web page content and submit forms</span></p>
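<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Building a vocabulary from indexed pages and then ranking the pages that contain the query keywords is the classic inverted-index pattern. The Python sketch below illustrates it under stated assumptions: PhpDig's actual ranking rules are not described here, so ranking by summed term frequency is an assumption, and <code>build_index</code>, <code>search</code>, and the sample <code>PAGES</code> are hypothetical.</span></p>

```python
import re
from collections import Counter, defaultdict

# Hypothetical crawled pages: URL -> extracted text.
PAGES = {
    "/crawler": "A web crawler fetches pages. A crawler follows links.",
    "/search": "A search engine ranks pages found by a crawler.",
    "/php": "PHP is a scripting language.",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Build the vocabulary: word -> Counter of {url: occurrences}."""
    index = defaultdict(Counter)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word][url] += 1
    return index


def search(index, query):
    """Return pages containing every query word, most occurrences first."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(term, Counter()) for term in terms]
    hits = set(postings[0]).intersection(*postings[1:])
    score = {url: sum(p[url] for p in postings) for url in hits}
    # Sort by descending score, then by URL to keep the order deterministic.
    return sorted(hits, key=lambda url: (-score[url], url))


index = build_index(PAGES)
print(search(index, "crawler pages"))
print(search(index, "php"))
```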
<h2 style="color: black; text-align: left; margin-bottom: 10px;"><strong style="color: blue;"><span style="color: black;">30. ThinkUp</span></strong></h2>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">ThinkUp is a social media insights engine that can collect data from social networks such as Twitter and Facebook. It is an interactive analysis tool that gathers data from an individual's social network accounts, archives and processes it, and presents it graphically for easier viewing.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">License: GPL</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Language: PHP</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Operating system: cross-platform</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">GitHub source: https://github.com/ThinkUpLLC/ThinkUp</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Features: a social media insights engine that collects data from social networks such as Twitter and Facebook, supports interactive analysis, and presents the results visually</span></p>
<h2 style="color: black; text-align: left; margin-bottom: 10px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;">31. 微购</span></strong></span></h2>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">微购 is an open-source social shopping and sharing system built on the ThinkPHP framework. It also serves as an open-source Taobao affiliate (淘宝客) site program aimed at webmasters: it integrates product data collection interfaces for more than 300 sources, including Taobao, Tmall, and the Taobao affiliate program, and gives affiliate webmasters a turnkey way to build a site. Anyone who knows HTML can create templates, and the program is free to download. The project presents itself as a first choice for Taobao affiliate webmasters.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Demo site: http://tlx.wego360.com</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">License: GPL</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Language: PHP</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Operating system: cross-platform</span></p>
<h1 style="color: black; text-align: left; margin-bottom: 10px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;">Erlang crawlers</span></strong></span></h1>
<h2 style="color: black; text-align: left; margin-bottom: 10px;"><strong style="color: blue;"><span style="color: black;">32. Ebot</span></strong></h2>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Ebot is a scalable, distributed web crawler written in Erlang. The URLs it collects are stored in a database and can be queried through RESTful HTTP requests.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">License: GPLv3</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Language: Erlang</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Operating system: cross-platform</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">GitHub source: https://github.com/matteoredaelli/ebot</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Project homepage: http://www.redaelli.org/matteo/blog/projects/ebot</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Features: scalable, distributed web crawler</span></p>
<h1 style="color: black; text-align: left; margin-bottom: 10px;"><span style="color: black;"><strong style="color: blue;"><span style="color: black;">Ruby crawlers</span></strong></span></h1>
<h2 style="color: black; text-align: left; margin-bottom: 10px;"><strong style="color: blue;"><span style="color: black;">33. Spidr</span></strong></h2>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Spidr is a web crawling library for Ruby. It can crawl an entire site, several sites, or a single link and fetch the content locally.</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Language: Ruby</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">License: MIT</span></p>
<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Features: fetches one or more sites, or a single link, to local storage in full</span></p>
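<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">To make the idea of crawling a whole site concrete, here is a minimal sketch in Python using only the standard library. It is not Spidr's API and not how Spidr is implemented; the names <code>LinkParser</code>, <code>fetch</code>, <code>crawl_site</code>, and the <code>max_pages</code> limit are assumptions made for illustration. It follows links breadth-first and stays on the host of the start URL.</span></p>

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Default fetcher (hypothetical helper): download a page as text."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl_site(start_url, fetch_page=fetch, max_pages=100):
    """Breadth-first crawl restricted to the host of start_url.

    Returns a dict mapping each visited URL to its HTML text.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to download
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop the #fragment part.
            link, _ = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

<p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Passing the fetcher in as <code>fetch_page</code> keeps the traversal logic separate from the network, so the crawl can be tried against an in-memory set of pages before pointing it at a real site. A real crawler would also need to respect robots.txt and rate limits, which this sketch leaves out.</span></p>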