This post rounds up crawler frameworks for Node.js, PHP, Go, Java, Ruby, Python, and more. Which crawler frameworks have you used? What do they get right, and where do they fall short?
Node.js
https://github.com/bda-research/node-crawler
GitHub stars = 3802
Built by the data team at BDA, a Beijing-based research company.
PHP
https://github.com/jae-jae/QueryList
GitHub stars = 1016
Go
https://github.com/gocolly/colly
GitHub stars = 5065
https://github.com/henrylee2cn/pholcus
GitHub stars = 4089
It supports three run modes (standalone, server, and client) and three interfaces (web, GUI, and command line). Rules are simple and flexible, batch tasks run concurrently, and output options are plentiful (MySQL, MongoDB, Kafka, CSV, Excel, and so on), with a large number of shared demos. It also supports both horizontal and vertical crawl modes, simulated login, and advanced task control such as pausing and canceling.
Java
https://github.com/code4craft/webmagic
GitHub stars = 6643
WebMagic's four components (a miniature sketch of how they fit together follows the list):
1. Downloader
The Downloader fetches pages from the internet for later processing. WebMagic uses Apache HttpClient as the default download tool.
2. PageProcessor
The PageProcessor parses pages, extracts useful information, and discovers new links. WebMagic uses Jsoup for HTML parsing and builds Xsoup, an XPath extraction tool, on top of it.
Of the four components, the PageProcessor differs for every site and every page; it is the part users have to implement themselves.
3. Scheduler
The Scheduler manages the URLs waiting to be crawled, along with deduplication. By default WebMagic provides a JDK in-memory queue for URL management and a set for deduplication; Redis is also supported for distributed crawls.
Unless a project has special distributed requirements, there is no need to write a custom Scheduler.
4. Pipeline
The Pipeline handles extracted results: computation, persistence to files or databases, and so on. By default WebMagic ships two result handlers: print to console and save to file.
The Pipeline defines how results are saved; to write to a particular database, you implement the corresponding Pipeline. One Pipeline per kind of output requirement is usually enough.
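WebMagic itself is Java, but the four-component loop is easy to see in miniature. The sketch below is plain Python written for this post (the function names and crawl loop are illustrative, not WebMagic's API); it shows one page flowing through Downloader, PageProcessor, Scheduler, and Pipeline:

```python
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def downloader(url):
    # Downloader: fetch the raw page
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def page_processor(url, html):
    # PageProcessor: extract fields and discover new links (site-specific)
    m = re.search(r"<title>(.*?)</title>", html, re.S | re.I)
    fields = {"url": url, "title": m.group(1).strip() if m else None}
    links = [urljoin(url, h) for h in re.findall(r'href="([^"]*)"', html)]
    return fields, links

def pipeline(fields):
    # Pipeline: persist the result (here: print to console)
    print(fields)

def crawl(seed, max_pages=10):
    # Scheduler: a FIFO queue plus a seen-set for deduplication
    queue, seen = deque([seed]), {seed}
    while queue and max_pages > 0:
        url = queue.popleft()
        try:
            html = downloader(url)
        except OSError:
            continue  # a real scheduler would record and retry failures
        fields, links = page_processor(url, html)
        pipeline(fields)
        for link in links:
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
        max_pages -= 1

crawl("https://example.com/")
```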
https://github.com/yasserg/crawler4j
GitHub stars = 2944
No documentation; only the Git repo.
https://github.com/CrawlScript/WebCollector
GitHub stars = 1883
No documentation; only the Git repo.
WebCollector is a Java crawler framework that needs no configuration and is easy to build on; it offers a lean API with which a powerful crawler takes only a little code. WebCollector-Hadoop is the Hadoop edition of WebCollector and supports distributed crawling.
https://github.com/apache/nutch
GitHub stars = 1703
Pros and cons of Nutch
Pros:
Nutch supports distributed crawling on Hadoop, so fetching, storage, and indexing can be spread across multiple machines. Another attractive point is its plugin framework, which makes it easy to extend parsing of different page content and the collection, querying, clustering, and filtering of data. This framework makes Nutch plugin development very easy, third-party plugins keep appearing, and both Nutch's functionality and its reputation have benefited greatly.
Cons:
Nutch's crawling behavior is relatively hard to customize.
https://github.com/internetarchive/heritrix3
GitHub stars = 1192
https://github.com/xtuhcy/gecco
GitHub stars = 1171
Ruby
Wombat
https://github.com/felipecsl/wombat
GitHub stars = 1083
Wombat is a simple Ruby DSL for scraping web pages, built on top of the excellent Mechanize and Nokogiri gems. It aims to be a higher-level abstraction for when you don't want to dig into the specifics of fetching a page and parsing it into your own data structure, which can be a decent amount of work depending on what you need.
Finally, Python
https://github.com/scrapy/scrapy
GitHub stars = 27682
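Scrapy is by far the most-starred framework in this roundup, and a working spider takes only a few lines. A minimal sketch (the target site and selectors are illustrative):

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    # A minimal Scrapy spider; site and selectors are illustrative
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").extract_first(),
                "author": quote.css("small.author::text").extract_first(),
            }
        next_page = response.css("li.next a::attr(href)").extract_first()
        if next_page is not None:
            # follow pagination and parse the next page the same way
            yield response.follow(next_page, callback=self.parse)
```

Running `scrapy runspider quotes_spider.py -o quotes.json` gives you scheduling, deduplication, throttling, and JSON export for free.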
https://github.com/binux/pyspider
GitHub stars = 11418
Architecture
Scheduler
The Scheduler receives new tasks from the processor via the newtask_queue, decides whether a task is new or needs to be re-crawled, sorts tasks by priority, and feeds them to the fetcher under traffic control (a token bucket algorithm). It also takes care of periodic, lost, and failed tasks, retrying them later.
Note that in the current implementation, only one scheduler instance is allowed.
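For illustration, the token-bucket idea behind that traffic control fits in a few lines; this is a generic sketch of the algorithm, not pyspider's actual scheduler code:

```python
import time

class TokenBucket:
    # Generic token-bucket rate limiter (illustrative, not pyspider's code)
    def __init__(self, rate, burst):
        self.rate = rate            # tokens refilled per second
        self.burst = burst          # maximum bucket capacity
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # refill in proportion to elapsed time, capped at the burst size
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A task is handed to the fetcher only when allow() returns True, capping the sustained rate at `rate` requests per second while still permitting short bursts.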
Processor
The Processor is responsible for running the user-written script that parses pages and extracts information. The script runs in an unrestricted environment: although various tools (such as PyQuery) are provided for extracting information and links, you can use anything you want to handle the response. See the Script Environment and API Reference docs for more about scripts.
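To get a feel for such a script, here is essentially the example from pyspider's quickstart documentation (the seed URL is just an example):

```python
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)
    def on_start(self):
        # seed request, re-issued once a day
        self.crawl('http://scrapy.org/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # queue every outbound link for detail parsing
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # whatever is returned here is passed to the result worker
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
```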
Result Worker (optional)
The result worker receives results from the Processor. pyspider ships with a built-in result worker that saves results to resultdb; override it to handle results however your application needs.
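Overriding it takes little code. A sketch, assuming pyspider's `ResultWorker.on_result(task, result)` hook and import path (treat both as assumptions to verify against the docs):

```python
import json
from pyspider.result import ResultWorker  # assumed import path

class StdoutResultWorker(ResultWorker):
    # Dump each result as a JSON line instead of writing to resultdb
    def on_result(self, task, result):
        print(json.dumps({"taskid": task.get("taskid"), "result": result}))
```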
WebUI
The WebUI is a web frontend for everything: a script editor and debugger, a project manager, a task monitor, and a result viewer.
It may be the most attractive part of pyspider. With this powerful UI you can debug scripts step by step exactly as pyspider runs them, start or stop a project, and find out which project is going wrong and which request failed, then retry it in the debugger.
https://github.com/codelucas/newspaper
GitHub stars = 6386
The demo site below shows it extracting the title, article text, keywords, and other information from a page.
http://newspaper-demo.herokuapp.com
Features