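This article walks through installing and using Scrapy, a popular web crawling tool, and covers common problems encountered when running Scrapy along with their solutions. I hope it helps with your learning.
Scrapy is a fast, high-level screen scraping and web crawling framework for crawling websites and extracting structured data from their pages. It serves a wide range of purposes, from data mining to monitoring and automated testing.
Official homepage: http://www.scrapy.org/
Python official homepage: http://www.python.org/
Download: http://www.python.org/ftp/python/2.7.3/python-2.7.3.msi
Install directory: D:\Python27
Installation steps: omitted. Then add the Python directories to the Path: System Properties -> Advanced -> Environment Variables -> System Variables -> Path -> Edit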
T:\>set Path
Path=C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;D:\Rational\common;D:\Rational\ClearCase\bin;D:\Python27;D:\Python27\Scripts
PATHEXT=.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH
T:\>python
Python 2.7.3 (default, Apr 10 2012, 23:31:26) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> exit()
T:\>
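The Path check above can also be scripted. The following is a minimal sketch and not part of the original setup: `missing_path_entries` is a hypothetical helper that splits a Windows-style Path value on `;` and reports which required directories are absent. The directory names are the ones used in this article.

```python
def missing_path_entries(path_value, required):
    """Return the required directories that are not listed in path_value."""
    # Windows separates Path entries with ';' and ignores letter case.
    entries = set()
    for entry in path_value.split(";"):
        entry = entry.strip().rstrip("\\").lower()
        if entry:
            entries.add(entry)
    return [d for d in required if d.rstrip("\\").lower() not in entries]


# Path value shortened from the console output above.
sample_path = r"C:\WINDOWS\system32;C:\WINDOWS;D:\Python27;D:\Python27\Scripts"
required_dirs = [r"D:\Python27", r"D:\Python27\Scripts"]

print(missing_path_entries(sample_path, required_dirs))  # prints []
```

An empty list means both the interpreter directory and the Scripts directory (where the scrapy command ends up) are on the Path.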
Scrapy official homepage: http://scrapy.org/
Download: http://pypi.python.org/packages/source/S/Scrapy/Scrapy-0.14.4.tar.gz
Extraction: omitted
Installation:
T:\Scrapy-0.14.4>python setup.py install
……
Installing easy_install-2.7-script.py script to D:\Python27\Scripts
Installing easy_install-2.7.exe script to D:\Python27\Scripts
Installing easy_install-2.7.exe.manifest script to D:\Python27\Scripts
Using d:\python27\lib\site-packages
Finished processing dependencies for Scrapy==0.14.4
T:\Scrapy-0.14.4>
Verify the installation:
T:\>scrapy
Scrapy 0.14.4 - no active project
Usage:
  scrapy <command> [options] [args]
Available commands:
  fetch         Fetch a URL using the Scrapy downloader
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy
Use "scrapy -h" to see more info about a command
T:\>
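Scrapy provides a tool for generating a project. The generated project comes with a few preset files, to which you add your own code.
Open a command line and run: scrapy startproject tutorial. The generated project has a structure like this: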
tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            …
scrapy.cfg is the project's configuration file.
Spiders you write yourself go in the spiders directory. Create a file named dmoz.py there with the following content:
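from scrapy.spider import BaseSpider


class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["pangjiuzala.github.io"]  # domain of the start URL
    start_urls = [
        "http://pangjiuzala.github.io/"
    ]

    def parse(self, response):
        # Name the output file after the second-to-last URL segment.
        filename = response.url.split("/")[-2]
        # Write the raw page body and close the file when done.
        with open(filename, 'wb') as f:
            f.write(response.body)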
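The name attribute is important: no two spiders may share the same name.
start_urls is where the spider starts crawling; it may contain multiple URLs.
parse is the callback invoked by default once the spider has fetched a page, so avoid using this name for your own unrelated methods.
Once the spider has the content of a URL, it calls parse and passes it a response argument holding the fetched page. Inside parse you can extract data from that page. The code above simply saves the page content to a file.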
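To see which file name the spider above ends up using, the naming rule can be tried on its own. This is a small sketch; `filename_for` is a hypothetical helper that repeats the `split("/")[-2]` expression, and the example.com URLs are made up for illustration.

```python
def filename_for(url):
    """Repeat the spider's rule: the second-to-last '/'-separated piece."""
    return url.split("/")[-2]


print(filename_for("http://pangjiuzala.github.io/"))  # pangjiuzala.github.io
print(filename_for("http://example.com/books/"))      # books
print(filename_for("http://example.com/books"))       # example.com
```

The last line shows that a URL without a trailing slash yields the host name instead of the final path segment, so two such pages on the same site would overwrite each other's file.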
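Start crawling
Open a command line, change to the generated project's root directory tutorial/, and run scrapy crawl dmoz, where dmoz is the spider's name.
Parsing page content
Scrapy offers a convenient way to extract data from pages, which relies on HtmlXPathSelector: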
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["pangjiuzala.github.io"]  # domain of the start URL
    start_urls = [
        "http://pangjiuzala.github.io/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Every <li> that sits directly under a <ul>.
        sites = hxs.select('//ul/li')
        for site in sites:
            title = site.select('a/text()').extract()
            link = site.select('a/@href').extract()
            desc = site.select('text()').extract()
            print title, link, desc
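HtmlXPathSelector uses XPath to extract data:
//ul/li selects every li element directly under a ul element
a/@href selects the href attribute of every a element
a/text() selects the text of the a elements
a[@href="abc"] selects every a element whose href attribute is abc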
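The expressions above can be tried without Scrapy. The sketch below swaps in the standard library's xml.etree.ElementTree, which supports only a subset of XPath: it has no `text()` or `@href` selection steps, so those are expressed through `.text` and `.get("href")`. The sample markup is hypothetical and must be well-formed XML, which real HTML pages often are not.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup.
SAMPLE = """
<html><body>
  <ul>
    <li><a href="/archives">Archives</a> all posts</li>
    <li><a href="abc">ABC</a> a sample link</li>
  </ul>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# './/ul/li' is ElementTree's spelling of '//ul/li'.
for li in root.findall(".//ul/li"):
    a = li.find("a")
    title = a.text                 # counterpart of a/text()
    link = a.get("href")           # counterpart of a/@href
    desc = (a.tail or "").strip()  # text inside the li that follows the link
    print(title, link, desc)

# Counterpart of a[@href="abc"]: only anchors whose href equals "abc".
matches = root.findall(".//a[@href='abc']")
print([m.text for m in matches])
```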
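We can store the extracted data in objects that Scrapy understands, and Scrapy can then save those objects for us, so we do not have to write the data to a file ourselves. To do this, add classes to items.py that describe the data we want to save: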
from scrapy.item import Item, Field


class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()
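An item class like this behaves much like a dictionary restricted to its declared fields. The following sketch imitates that idea in plain Python; `SketchItem` and `DmozSketchItem` are hypothetical stand-ins, not Scrapy classes, and the assumption that an undeclared field is rejected with a KeyError reflects my understanding of Scrapy's Item rather than anything stated in this article.

```python
class SketchItem(dict):
    """Hypothetical stand-in for an item: a dict limited to declared fields."""
    fields = ()

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError("%s does not support field: %s"
                           % (self.__class__.__name__, key))
        dict.__setitem__(self, key, value)


class DmozSketchItem(SketchItem):
    fields = ("title", "link", "desc")


item = DmozSketchItem()
item["title"] = ["Archives"]
item["link"] = ["/archives"]
print(dict(item))

try:
    item["tilte"] = ["typo"]  # misspelled field name
except KeyError as exc:
    print("rejected:", exc)
```

Declaring the fields up front means a misspelled field name fails loudly instead of silently creating a new key.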
Then, in the spider's parse method, we store the extracted data in DmozItem objects.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from tutorial.items import DmozItem


class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["pangjiuzala.github.io"]  # domain of the start URL
    start_urls = [
        "http://pangjiuzala.github.io/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            # One item per <li>; each field holds a list of extracted strings.
            item = DmozItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items
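When running Scrapy from the command line, we can add two options so that Scrapy writes the items returned by parse to a JSON file:
scrapy crawl dmoz -o items.json -t json
items.json is placed in the project's root directory.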
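Letting Scrapy follow all links on a page automatically
In the examples above, Scrapy only fetched the URLs listed in start_urls. Usually, though, we want Scrapy to discover the links on a page automatically and then fetch the content behind them. To do that, we can extract the links we need inside parse, build Request objects from them, and return those; Scrapy then fetches these links automatically. The code looks something like this: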
from scrapy.http import Request
from scrapy.spider import BaseSpider


class MySpider(BaseSpider):
    name = 'myspider'
    start_urls = (
        'http://example.com/page1',
        'http://example.com/page2',
    )

    def parse(self, response):
        # collect `item_urls`
        for item_url in item_urls:
            yield Request(url=item_url, callback=self.parse_item)

    def parse_item(self, response):
        item = MyItem()
        # populate `item` fields
        yield Request(url=item_details_url, meta={'item': item},
                      callback=self.parse_details)

    def parse_details(self, response):
        item = response.meta['item']
        # populate more `item` fields
        return item
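parse is the default callback. It yields Request objects, and Scrapy fetches the corresponding pages automatically; each time one is fetched, parse_item is called. parse_item in turn yields a Request, which Scrapy fetches as well, calling parse_details once the page arrives.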
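The chain of callbacks can be imitated without any network access. The sketch below is a simplified model and not how Scrapy is implemented: `Request`, `Response`, `PAGES` and `crawl` are hypothetical stand-ins defined here, and the scheduler is just a first-in, first-out queue. It shows the two points made above: a callback may yield further requests, and `meta` carries a half-filled item from one callback to the next.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> page body.
PAGES = {
    "http://example.com/page1": "item-a",
    "http://example.com/item-a": "details-a",
    "http://example.com/details-a": "42 MB",
}


class Request(object):
    """Minimal stand-in for a request: url, callback and meta only."""
    def __init__(self, url, callback, meta=None):
        self.url, self.callback, self.meta = url, callback, meta or {}


class Response(object):
    """Minimal stand-in for a response."""
    def __init__(self, url, body, meta):
        self.url, self.body, self.meta = url, body, meta


def parse(response):
    # The start page names one item; request that item's page.
    yield Request("http://example.com/" + response.body, parse_item)


def parse_item(response):
    item = {"name": response.url.rsplit("/", 1)[-1]}
    # Carry the half-filled item to the next callback through meta.
    yield Request("http://example.com/" + response.body, parse_details,
                  meta={"item": item})


def parse_details(response):
    item = response.meta["item"]
    item["size"] = response.body
    return [item]  # a list, so the loop in crawl() can iterate over it


def crawl(start_urls):
    """Drain a FIFO queue of requests; collect results that are not requests."""
    queue = deque(Request(u, parse) for u in start_urls)
    items = []
    while queue:
        req = queue.popleft()
        resp = Response(req.url, PAGES[req.url], req.meta)
        for result in req.callback(resp):
            if isinstance(result, Request):
                queue.append(result)
            else:
                items.append(result)
    return items


print(crawl(["http://example.com/page1"]))
```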
To make this kind of work easier, Scrapy provides another spider base class that makes automatic link following convenient: CrawlSpider.
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector


class MininovaSpider(CrawlSpider):
    name = 'mininova.org'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']
    # Raw strings keep the \d in the regular expressions intact.
    rules = [Rule(SgmlLinkExtractor(allow=[r'/tor/\d+'])),
             Rule(SgmlLinkExtractor(allow=[r'/abc/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        x = HtmlXPathSelector(response)
        # TorrentItem is not shown in this article; it is assumed to be
        # defined in the project's items.py.
        torrent = TorrentItem()
        torrent['url'] = response.url
        torrent['name'] = x.select("//h1/text()").extract()
        torrent['description'] = x.select("//div[@id='description']").extract()
        torrent['size'] = x.select("//div[@id='info-left']/p[2]/text()[2]").extract()
        return torrent
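Compared with BaseSpider, the new class has an additional rules attribute. It is a list that can hold multiple Rule objects, each describing which links should be crawled and which should not. The documentation for the Rule class is at http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.contrib.spiders.Rule
A rule may or may not have a callback. When it has none, Scrapy simply follows all the links it matches.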
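To get a feel for which links each rule would pick up, the two allow patterns can be tested with the standard re module. This is only an approximation under two assumptions of mine: that the patterns are searched anywhere in the URL, and that a link is handled by the first rule that matches it. `classify` and the sample links are hypothetical.

```python
import re

# The two allow patterns from the rules above.
FOLLOW_ONLY = re.compile(r"/tor/\d+")    # rule without a callback
WITH_CALLBACK = re.compile(r"/abc/\d+")  # rule with callback 'parse_torrent'


def classify(url):
    """Return which rule, if any, a link would fall under; first match wins."""
    if FOLLOW_ONLY.search(url):
        return "follow"
    if WITH_CALLBACK.search(url):
        return "parse_torrent"
    return None


sample_links = [
    "http://www.mininova.org/tor/2657665",
    "http://www.mininova.org/abc/12",
    "http://www.mininova.org/today",
    "http://www.mininova.org/tor/latest",
]
results = [(link, classify(link)) for link in sample_links]
for link, rule in results:
    print(link, rule)
```

The last link is not matched because `\d+` requires at least one digit after `/tor/`.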
Using pipelines.py
In pipelines.py we can add classes that filter out items we do not want or save items to a database.
from scrapy.exceptions import DropItem


class FilterWordsPipeline(object):
    """A pipeline for filtering out items which contain certain words in their
    description"""

    # put all words in lowercase
    words_to_filter = ['politics', 'religion']

    def process_item(self, item, spider):
        for word in self.words_to_filter:
            # DmozItem stores the description in its 'desc' field
            if word in unicode(item['desc']).lower():
                raise DropItem("Contains forbidden word: %s" % word)
        else:
            # no forbidden word was found, so keep the item
            return item
If an item does not meet the requirements, the pipeline raises an exception and the item is not written to the JSON file.
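The keep-or-drop behaviour can be tried outside Scrapy with a few lines of plain Python. In this sketch `DropItem` is a local stand-in for the Scrapy exception, `FilterWordsSketch` and `run_pipeline` are hypothetical names, and the items are plain dicts that use the `desc` field of the DmozItem defined earlier.

```python
class DropItem(Exception):
    """Local stand-in for the exception a pipeline raises to drop an item."""


class FilterWordsSketch(object):
    words_to_filter = ["politics", "religion"]

    def process_item(self, item, spider=None):
        # Join the list of description strings and compare in lowercase.
        text = " ".join(item.get("desc", [])).lower()
        for word in self.words_to_filter:
            if word in text:
                raise DropItem("Contains forbidden word: %s" % word)
        return item


def run_pipeline(pipeline, items):
    """Keep items that pass; skip those for which DropItem is raised."""
    kept = []
    for item in items:
        try:
            kept.append(pipeline.process_item(item))
        except DropItem as exc:
            print("dropped:", exc)
    return kept


samples = [
    {"title": ["Java"], "desc": ["a programming language"]},
    {"title": ["News"], "desc": ["Politics and more"]},
]
kept_items = run_pipeline(FilterWordsSketch(), samples)
print(kept_items)
```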
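To use pipelines, we also need to modify settings.py.
Add one line:
ITEM_PIPELINES = ['tutorial.pipelines.FilterWordsPipeline']
Now run scrapy crawl dmoz -o items.json -t json and the items that do not meet the requirements are filtered out. An items.json file is then generated in the tutorial directory.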
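To save the items to a MySQL database, first create a table:
create table book (title char(15) not null, link varchar(50) COLLATE gb2312_chinese_ci DEFAULT NULL);
If Chinese text comes out garbled, set the database collation to gb2312_chinese_ci.
Add the following code to pipelines.py: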
from twisted.enterprise import adbapi
import MySQLdb
import MySQLdb.cursors


class MySQLStorePipeline(object):

    def __init__(self):
        # Twisted connection pool, so inserts do not block the crawler.
        self.dbpool = adbapi.ConnectionPool('MySQLdb',
            db='test',      # database name
            user='root',    # database user
            passwd='',      # database password
            cursorclass=MySQLdb.cursors.DictCursor,
            charset='utf8',
            use_unicode=False
        )

    def process_item(self, item, spider):
        # Run the insert through the pool and pass the item on unchanged.
        self.dbpool.runInteraction(self._conditional_insert, item)
        return item

    def _conditional_insert(self, tx, item):
        # Insert one row per (title, link) pair; skip items without a title.
        if item.get('title'):
            for title, link in zip(item['title'], item['link']):
                tx.execute('insert into book values (%s, %s)', (title, link))
Add the following line to settings.py:
ITEM_PIPELINES = ['tutorial.pipelines.MySQLStorePipeline']
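The insert logic can be tried without a MySQL server. The sketch below swaps in the standard library's sqlite3 with an in-memory database, so it is an illustration of the same idea and not the pipeline itself; `conditional_insert` is a hypothetical helper, the table definition is simplified, and the sample item is made up. Note that sqlite3 uses `?` placeholders where MySQLdb uses `%s`.

```python
import sqlite3


def conditional_insert(cursor, item):
    """Insert one row per (title, link) pair; skip items without titles."""
    if not item.get("title"):
        return 0
    pairs = list(zip(item["title"], item["link"]))
    # Placeholders keep the values out of the SQL text itself.
    cursor.executemany("insert into book values (?, ?)", pairs)
    return len(pairs)


conn = sqlite3.connect(":memory:")
conn.execute("create table book (title char(15) not null, link varchar(50))")
cur = conn.cursor()

inserted = conditional_insert(
    cur, {"title": ["Java", "GC"], "link": ["/tags/Java/", "/tags/GC/"]})
skipped = conditional_insert(cur, {"title": [], "link": []})
conn.commit()

rows = conn.execute("select title, link from book order by title").fetchall()
print(inserted, skipped, rows)
```

Passing the values as parameters rather than formatting them into the SQL string avoids quoting problems with titles that contain quotes.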
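Problem: the spider fails with an IndentationError.
Solution:
http://www.crifan.com/python_syntax_error_indentationerror/comment-page-1/
Problem: Chinese text in items.json is written as \uXXXX escape sequences, as follows: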
[{"title": ["\u4e3b\u9875"], "tag": [], "link": ["/"], "desc": []},
{"title": ["\u6587\u7ae0\u5217\u8868"], "tag": [], "link": ["/archives"], "desc": []},
{"title": [], "tag": [], "link": [], "desc": ["\n \t\t\t\t\t\n\t\t\t\t\t", "\n\t\t\t\t\t\n\t\t\t\t\t"]},
{"title": ["Java"], "tag": [], "link": ["/tags/Java/"], "desc": []},
{"title": ["\u7b97\u6cd5"], "tag": [], "link": ["/tags/\u7b97\u6cd5/"], "desc": []},
{"title": ["\u6570\u636e\u6316\u6398"], "tag": [], "link": ["/tags/\u6570\u636e\u6316\u6398/"], "desc": []},
{"title": ["\u7269\u8054\u7f51"], "tag": [], "link": ["/tags/\u7269\u8054\u7f51/"], "desc": []},
{"title": ["C++"], "tag": [], "link": ["/tags/C/"], "desc": []},
{"title": ["openHAB"], "tag": [], "link": ["/tags/openHAB/"], "desc": []},
{"title": ["\u4e91\u8ba1\u7b97"], "tag": [], "link": ["/tags/\u4e91\u8ba1\u7b97/"], "desc": []},
{"title": ["C"], "tag": [], "link": ["/tags/C/"], "desc": []},
{"title": ["\u79fb\u52a8\u4e92\u8054\u7f51"], "tag": [], "link": ["/tags/\u79fb\u52a8\u4e92\u8054\u7f51/"], "desc": []},
{"title": ["GC"], "tag": [], "link": ["/tags/GC/"], "desc": []},
{"title": ["\u5927\u6570\u636e"], "tag": [], "link": ["/tags/\u5927\u6570\u636e/"], "desc": []},
{"title": ["\u5fae\u535a"], "tag": [], "link": ["http://weibo.com/jiayou087"], "desc": ["\n \n \t", "\n \n "]},
{"title": ["CSDN"], "tag": [], "link": ["http://blog.csdn.net/pangjiuzala"], "desc": ["\n \n \t", "\n \n "]},
{"title": ["July 2015"], "tag": [], "link": ["/archives/2015/07/"], "desc": []}]
Solution:
Add the following code to pipelines.py:
import json
import codecs


class JsonWriterPipeline(object):

    def __init__(self):
        # Open the output file as UTF-8 so Chinese text is stored readably.
        self.file = codecs.open('items.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        # Python 2 only: turn the \uXXXX escapes back into real characters.
        self.file.write(line.decode('unicode_escape'))
        return item
Add the following line to settings.py:
ITEM_PIPELINES = ['tutorial.pipelines.JsonWriterPipeline']
The converted data looks like this:
[{"title": ["主页"], "tag": [], "link": ["/"], "desc": []},
{"title": ["文章列表"], "tag": [], "link": ["/archives"], "desc": []},
{"title": [], "tag": [], "link": [], "desc": ["\n \t\t\t\t\t\n\t\t\t\t\t", "\n\t\t\t\t\t\n\t\t\t\t\t"]},
{"title": ["Java"], "tag": [], "link": ["/tags/Java/"], "desc": []},
{"title": ["算法"], "tag": [], "link": ["/tags/算法/"], "desc": []},
{"title": ["数据挖掘"], "tag": [], "link": ["/tags/数据挖掘/"], "desc": []},
{"title": ["物联网"], "tag": [], "link": ["/tags/物联网/"], "desc": []},
{"title": ["C++"], "tag": [], "link": ["/tags/C/"], "desc": []},
{"title": ["openHAB"], "tag": [], "link": ["/tags/openHAB/"], "desc": []},
{"title": ["云计算"], "tag": [], "link": ["/tags/云计算/"], "desc": []},
{"title": ["C"], "tag": [], "link": ["/tags/C/"], "desc": []},
{"title": ["移动互联网"], "tag": [], "link": ["/tags/移动互联网/"], "desc": []},
{"title": ["GC"], "tag": [], "link": ["/tags/GC/"], "desc": []},
{"title": ["大数据"], "tag": [], "link": ["/tags/大数据/"], "desc": []},
{"title": ["微博"], "tag": [], "link": ["http://weibo.com/jiayou087"], "desc": ["\n \n \t", "\n \n "]},
{"title": ["CSDN"], "tag": [], "link": ["http://blog.csdn.net/pangjiuzala"], "desc": ["\n \n \t", "\n \n "]},
{"title": ["July 2015"], "tag": [], "link": ["/archives/2015/07/"], "desc": []}]
More details:
http://kevinkelly.blog.163.com/blog/static/21390809320133185748442/
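A related way to get readable Chinese text, offered here as an alternative sketch and not as the method used above, is the `ensure_ascii=False` parameter of the standard `json.dumps`. It leaves non-ASCII characters as they are while still escaping control characters such as `\n` and `\t`, so every line stays valid JSON. The example is written for a current Python 3 interpreter; the sample item and the temporary file name are made up.

```python
import io
import json
import os
import tempfile

item = {"title": [u"\u4e3b\u9875"], "link": ["/"], "desc": ["\n\t"]}

# Default behaviour: non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(item, sort_keys=True)
# With ensure_ascii=False the characters are written as they are.
readable = json.dumps(item, sort_keys=True, ensure_ascii=False)
print(escaped)

# Write one item per line to a UTF-8 file, then read it back.
path = os.path.join(tempfile.mkdtemp(), "items.jl")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(readable + u"\n")
with io.open(path, encoding="utf-8") as f:
    restored = json.loads(f.readline())

print(restored == item)
```

Because the output is still valid JSON, the file can be loaded again with `json.loads`, line by line.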