The Pyspider Framework (Part 3)

Introduction to the Pyspider API

1.self.crawl

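(1) self.crawl(url, **kwargs)

self.crawl is the main interface for telling pyspider which URL(s) to crawl.

(2) Parameters

The first two parameters are the ones you will normally always pass:

url

The URL, or list of URLs, to crawl.

callback

The method that parses the returned response.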

def on_start(self):
    self.crawl('http://scrapy.org/', callback=self.index_page)

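The remaining parameters are optional.

age

The validity period of the task, in seconds. During this period the page is regarded as unmodified. Default: -1 (never recrawl)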

@config(age=10 * 24 * 60 * 60)
def index_page(self, response):
    ...

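Every page parsed by index_page is regarded as unchanged for 10 days. If you submit the same task again within 10 days of its last crawl, it is discarded.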
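The age rule above can be sketched in a few lines of plain Python. This is a simplified model for illustration only, not pyspider's scheduler code; the function name is_still_fresh and the constant NEVER_RECRAWL are assumptions made for this sketch.

```python
import time

NEVER_RECRAWL = -1  # the documented default value of age


def is_still_fresh(last_crawl_time, age, now=None):
    """Return True if a re-submitted task would be discarded as unchanged."""
    if now is None:
        now = time.time()
    if last_crawl_time is None:
        return False  # never crawled before, so it has to be fetched
    if age == NEVER_RECRAWL:
        return True   # already crawled once and never expires
    return now - last_crawl_time < age


TEN_DAYS = 10 * 24 * 60 * 60
DAY = 24 * 60 * 60
crawled_at = 1_000_000.0  # arbitrary timestamp of the last crawl

# Three days later the task is still inside its age, so it is discarded.
after_three_days = is_still_fresh(crawled_at, TEN_DAYS, now=crawled_at + 3 * DAY)
# Eleven days later the age has expired, so the page is fetched again.
after_eleven_days = is_still_fresh(crawled_at, TEN_DAYS, now=crawled_at + 11 * DAY)
print(after_three_days, after_eleven_days)
```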

priority

The priority of the task to be scheduled; the higher the value, the sooner the task runs. Default: 0

def index_page(self, response):
    self.crawl('http://www.example.org/page2.html', callback=self.index_page)
    self.crawl('http://www.example.org/233.html', callback=self.detail_page,
               priority=1)

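The page 233.html will be crawled before page2.html. Use this parameter to perform a breadth-first crawl and reduce the number of tasks waiting in the queue, which keeps memory usage under control.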
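To make the ordering concrete, here is a minimal sketch of a queue in which higher priority values are served first. It uses only the standard library; the TaskQueue class is a hypothetical stand-in for illustration and is not how pyspider's scheduler is implemented.

```python
import heapq
import itertools


class TaskQueue:
    """Toy queue: higher priority first, insertion order breaks ties."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def put(self, url, priority=0):
        # heapq is a min-heap, so the priority is negated to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))

    def get(self):
        return heapq.heappop(self._heap)[2]


queue = TaskQueue()
queue.put('http://www.example.org/page2.html')            # default priority 0
queue.put('http://www.example.org/233.html', priority=1)  # submitted later, higher priority

first = queue.get()
second = queue.get()
print(first)
print(second)
```

Although 233.html is submitted second, it leaves the queue first because of its higher priority.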

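exetime

The time at which the task should run, as a Unix timestamp. Default: 0 (run immediately). In the code below, the page will be crawled 30 minutes from now.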

import time
def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               exetime=time.time()+30*60)
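Since exetime is an ordinary Unix timestamp, it can be computed either relative to the current time or from a fixed calendar moment. The helpers below are a small sketch; the names exetime_in and exetime_at are made up for this example and are not part of pyspider.

```python
import time
from datetime import datetime, timezone


def exetime_in(minutes, now=None):
    """Timestamp that lies `minutes` minutes after `now` (default: current time)."""
    base = time.time() if now is None else now
    return base + minutes * 60


def exetime_at(moment):
    """Timestamp for a timezone-aware datetime."""
    return moment.timestamp()


# 30 minutes after a fixed reference point, as in the example above.
in_half_an_hour = exetime_in(30, now=1000.0)
# A fixed point in time: midnight UTC on 1 January 2024.
new_year = exetime_at(datetime(2024, 1, 1, tzinfo=timezone.utc))
print(in_half_an_hour, new_year)
```

Either value could then be passed as exetime=... to self.crawl.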

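retries

The number of retries when the fetch fails. Default: 3 (this is configurable, as described in an earlier part of this series)

itag

A marker taken from the frontier (listing) page that reveals a potential modification of the task. It is compared with its last recorded value, and the page is recrawled when the value has changed. Default: None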

def index_page(self, response):
    for item in response.doc('.item').items():
        self.crawl(item.find('a').attr.href, callback=self.detail_page,
                   itag=item.find('.update-time').text())

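In the example above, the text of .update-time is used as the itag. If it has not changed, the request is discarded. Alternatively, if you want to restart all tasks, you can set itag through Handler.crawl_config to mark the version of the script.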

class Handler(BaseHandler):
    crawl_config = {
        'itag': 'v223'
    }

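After modifying the script, change the value of itag and click the run button again. It does not matter if itag was not set before: you never call itag yourself, pyspider checks the itag of every task automatically.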
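The itag comparison can be modeled as follows. This is a deliberately simplified sketch that ignores age and every other scheduling rule; the submit function and the seen_itags dictionary are hypothetical names used only for illustration.

```python
seen_itags = {}  # url -> itag recorded when the task was last accepted


def submit(url, itag=None):
    """Return True if the task would be crawled, False if it would be discarded."""
    if url in seen_itags and seen_itags[url] == itag:
        return False  # same marker as last time: treated as unchanged
    seen_itags[url] = itag
    return True


url = 'http://www.example.org/item/1'
first_visit = submit(url, itag='2024-01-01 10:00')   # new task
same_marker = submit(url, itag='2024-01-01 10:00')   # marker unchanged
new_marker = submit(url, itag='2024-01-02 08:30')    # marker changed
print(first_visit, same_marker, new_marker)
```

Setting a new itag in crawl_config has the same effect for every task at once, because every stored marker then differs from the new one.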

auto_recrawl

When enabled, the task is recrawled every time its age expires. Default: False. With the code below, the page is recrawled every 5 hours.

def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               age=5*60*60, auto_recrawl=True)

method

The HTTP method to use. Default: GET

params

A dictionary of URL parameters to append to the URL.

def on_start(self):
    self.crawl('http://httpbin.org/get', callback=self.callback,
               params={'a': 123, 'b': 'c'})
    self.crawl('http://httpbin.org/get?a=123&b=c', callback=self.callback)

The two requests are the same.
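The equivalence of the two requests can be checked with the standard library alone. The helper with_params below is an assumption made for this sketch; it only mirrors the idea of appending a parameter dictionary to a URL and is not pyspider's own code.

```python
from urllib.parse import urlencode, urlsplit


def with_params(url, params):
    """Append a dictionary of parameters to a URL as a query string."""
    separator = '&' if urlsplit(url).query else '?'
    return url + separator + urlencode(params)


built = with_params('http://httpbin.org/get', {'a': 123, 'b': 'c'})
written_by_hand = 'http://httpbin.org/get?a=123&b=c'
print(built == written_by_hand)
```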

data

The body to attach to the request. If a dictionary is provided, it is form-encoded.

def on_start(self):
    self.crawl('http://httpbin.org/post', callback=self.callback,
               method='POST', data={'a': 123, 'b': 'c'})
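For reference, this is what form-encoding a dictionary produces. The sketch uses urllib.parse.urlencode from the standard library to show the general encoding scheme; that pyspider produces exactly the same bytes is an assumption here, not something stated in the text above.

```python
from urllib.parse import urlencode

# The dictionary from the example above, encoded as a form body.
body = urlencode({'a': 123, 'b': 'c'})

# Spaces become '+' in form encoding.
body_with_space = urlencode({'q': 'py spider'})

print(body)
print(body_with_space)
```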

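files

The files to upload, as a dictionary of the form {field: {filename: 'content'}}.

user_agent

The User-Agent of the request.

headers

A dictionary of headers to send.

cookies

A dictionary of cookies to attach to the request.

timeout

The maximum time, in seconds, allowed for fetching the page. Default: 120

allow_redirects

Whether to follow 30x redirects. Default: True

taskid

The unique id that identifies the task. By default it is the MD5 hash of the URL; it can be overridden by defining the method def get_taskid(self, task).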

import json
from pyspider.libs.utils import md5string
def get_taskid(self, task):
    return md5string(task['url']+json.dumps(task['fetch'].get('data', '')))

By default, only the MD5 hash of the URL is used as the taskid. The code above makes the data of a POST request part of the taskid as well.
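The effect of that override can be reproduced without pyspider. In the sketch below, md5_hex is a stand-in that assumes md5string returns the hexadecimal MD5 digest of the UTF-8 encoded string, and the task dictionaries are reduced to the two keys used above.

```python
import hashlib
import json


def md5_hex(text):
    """Stand-in for md5string: hex MD5 digest of a UTF-8 encoded string (assumed)."""
    return hashlib.md5(text.encode('utf-8')).hexdigest()


def default_taskid(task):
    return md5_hex(task['url'])


def taskid_with_data(task):
    return md5_hex(task['url'] + json.dumps(task['fetch'].get('data', '')))


# Two POST tasks to the same URL that differ only in their request data.
task_a = {'url': 'http://httpbin.org/post', 'fetch': {'data': {'page': 1}}}
task_b = {'url': 'http://httpbin.org/post', 'fetch': {'data': {'page': 2}}}

same_by_default = default_taskid(task_a) == default_taskid(task_b)
same_with_data = taskid_with_data(task_a) == taskid_with_data(task_b)
print(same_by_default, same_with_data)
```

With the default id the two tasks collide and are treated as one task; with the data included they are scheduled separately.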

save

An object passed through to the callback method; it can be accessed as response.save.

def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               save={'a': 123})

def callback(self, response):
    return response.save['a']

The callback returns 123. (This works the same way as passing values through meta in the Scrapy framework.)

2.Response

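The attributes of the response object.

Response.url

The final URL.

Response.text

The response content as unicode text (a string).

If Response.encoding is None and the chardet module is available, the encoding of the content is guessed.

Response.content

The response content as raw bytes.

Response.doc

A PyQuery object built from the response content.

See the PyQuery documentation: https://pythonhosted.org/pyquery/

Response.etree

An lxml object built from the response content.

Response.json

The JSON content of the response, parsed into Python objects, if the response contains any.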
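The relationship between content, text and json can be illustrated with a small stand-in class. FakeResponse is hypothetical and exists only for this sketch; it is not pyspider's Response class, and returning None for a body that is not valid JSON is an assumption about the "if any" behavior.

```python
import json


class FakeResponse:
    """Minimal stand-in: raw bytes in, decoded text and parsed JSON out."""

    def __init__(self, url, content, encoding='utf-8'):
        self.url = url
        self.content = content      # raw bytes, like Response.content
        self.encoding = encoding

    @property
    def text(self):
        # Decoded text, like Response.text.
        return self.content.decode(self.encoding)

    @property
    def json(self):
        # Parsed JSON, or None when the body is not valid JSON (assumed).
        try:
            return json.loads(self.text)
        except ValueError:
            return None


api = FakeResponse('http://www.example.org/api', b'{"products": [{"name": "tea"}]}')
page = FakeResponse('http://www.example.org/', b'<html></html>')

product_name = api.json['products'][0]['name']
print(product_name, page.json)
```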

3.self.send_message

self.send_message(project, msg, [url])

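Sends a message to another project. The message can be received by def on_message(self, project, message).
- project - the name of the project to send the message to
- msg - any JSON-serializable object
- url (optional) - the result is overwritten if another one has the same taskid. By default, send_message shares the taskid of the current task. Change this value to return multiple results from a single response.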

def detail_page(self, response):
    for i, each in enumerate(response.json['products']):
        self.send_message(self.project_name, {
                "name": each['name'],
                'price': each['prices'],
             }, url="%s#%s" % (response.url, i))

def on_message(self, project, msg):
    return msg
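The reason for appending "#index" to the URL is that each message then gets its own taskid, so the results do not overwrite one another. The sketch below shows this under the assumption stated earlier that the default taskid is the MD5 hash of the URL; md5_hex and the example URL are illustrative names, not pyspider code.

```python
import hashlib


def md5_hex(text):
    """Hex MD5 digest of a UTF-8 encoded string (assumed default taskid scheme)."""
    return hashlib.md5(text.encode('utf-8')).hexdigest()


response_url = 'http://www.example.org/products.json'  # hypothetical page URL

# One distinct URL per product, built the same way as in detail_page above.
message_urls = ["%s#%s" % (response_url, i) for i in range(3)]
taskids = {md5_hex(url) for url in message_urls}

# Without the suffix, all three messages would share a single taskid.
shared_taskids = {md5_hex(response_url) for _ in range(3)}

print(len(taskids), len(shared_taskids))
```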

4.@catch_status_code_error

Any non-200 response is treated as a failed fetch and is not passed to the callback. Use this decorator to override that behavior.

def on_start(self):
    self.crawl('http://httpbin.org/status/404', callback=self.callback)

@catch_status_code_error  
def callback(self, response):
    ...

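By default, when the request fails (here with status code 404), the callback is not executed. With the @catch_status_code_error decorator, the callback runs even if the request failed.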
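One way to picture this behavior is a decorator that marks the callback and a dispatcher that checks the mark. The code below is a simplified model: the decorator defined here is a local stand-in that merely shares the name, and dispatch is a hypothetical function, not part of pyspider.

```python
def catch_status_code_error(func):
    """Local stand-in: mark the callback as willing to receive failed responses."""
    func._catch_status_code_error = True
    return func


def dispatch(callback, status_code):
    """Call the callback unless the fetch failed and the callback is not marked."""
    failed = status_code != 200
    if failed and not getattr(callback, '_catch_status_code_error', False):
        return None  # treated as a failed fetch, callback skipped
    return callback(status_code)


def plain_callback(status_code):
    return 'handled %d' % status_code


@catch_status_code_error
def tolerant_callback(status_code):
    return 'handled %d' % status_code


plain_result = dispatch(plain_callback, 404)
tolerant_result = dispatch(tolerant_callback, 404)
print(plain_result, tolerant_result)
```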

5.@every(minutes=0, seconds=0)

The decorated method is called once every given number of minutes or seconds.

@every(minutes=24 * 60)
def on_start(self):
    for url in urllist:
        self.crawl(url, callback=self.index_page)

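These URLs are recrawled every 24 hours. Note that if age is also set and its period is longer than that of @every, the crawl request is discarded, because the page is considered unchanged: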

@every(minutes=24 * 60)
def on_start(self):
    self.crawl('http://www.example.org/', callback=self.index_page)

@config(age=10 * 24 * 60 * 60)
def index_page(self, response):
    ...

Even though a crawl request is triggered every day, it is discarded, and the page is actually recrawled only once every 10 days.
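The interaction between @every and age can be simulated with a short loop. This is only a model of the rule described above, with time counted in whole days; the simulate function and its parameters are assumptions made for the sketch.

```python
def simulate(total_days, every_days=1, age_days=10):
    """Return the days on which a page is actually fetched.

    A crawl request is triggered every `every_days` days, but it is
    discarded while the previous fetch is younger than `age_days`.
    """
    fetched_days = []
    last_fetch = None
    for day in range(0, total_days, every_days):
        if last_fetch is None or day - last_fetch >= age_days:
            fetched_days.append(day)
            last_fetch = day
    return fetched_days


# Triggered daily for 30 days with an age of 10 days.
fetched = simulate(30)
print(fetched)  # [0, 10, 20]
```

Of the 30 daily requests in this model, only three lead to a fetch; the other 27 are discarded.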
