[Scrapy FAQ notes] DEBUG: Filtered offsite request to

While using Scrapy to crawl the Douban TOP250 movie list, automatic pagination ran into a problem: the next-page URL was parsed correctly, but the follow-up request fetched no data.

Test code:

# -*- coding: utf-8 -*-
import scrapy
from douDanMovie.items import DoudanmovieItem
from scrapy import Request

class DoubanSpiderSpider(scrapy.Spider):
    name = "douban_spider"
    allowed_domains = ["www.douban.com"]
    start_urls = (
        'https://movie.douban.com/top250',
    )
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36',
    }
    def start_requests(self):
        url = 'https://movie.douban.com/top250'
        yield Request(url, headers=self.headers)

    def parse(self, response):
        movies = response.xpath('//ol[@class="grid_view"]/li')
        for movie in movies:
            # Create a fresh item for each movie; reusing one item object
            # would mutate previously yielded items in the pipelines.
            item = DoudanmovieItem()
            item['ranking'] = movie.xpath('.//div[@class="pic"]/em/text()').extract()[0]
            item['movie_name'] = movie.xpath('.//div[@class="hd"]/a/span[1]/text()').extract()[0]
            item['score'] = movie.xpath('.//div[@class="star"]/span[@class="rating_num"]/text()').extract()[0]
            item['score_num'] = movie.xpath('.//div[@class="star"]/span/text()').re(r'(\d+)人评价')[0]
            yield item
        next_url = response.xpath('//span[@class="next"]/a/@href').extract()
        # next_url is parsed correctly at this point
        if next_url:
            next_url = 'https://movie.douban.com/top250' + next_url[0]
            yield Request(url=next_url, headers=self.headers)
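The spider imports DoudanmovieItem from douDanMovie.items. For reference, a matching items.py might look like the sketch below; the field names are taken from the parse method above, while the class layout itself is an assumption based on standard Scrapy item declarations:

```python
# douDanMovie/items.py (sketch; fields inferred from the spider above)
import scrapy

class DoudanmovieItem(scrapy.Item):
    ranking = scrapy.Field()     # rank number from the "pic" block
    movie_name = scrapy.Field()  # title from the "hd" block
    score = scrapy.Field()       # text of span.rating_num
    score_num = scrapy.Field()   # number of people who rated
```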

Error message:

2018-11-24 12:06:01 [scrapy] DEBUG: Filtered offsite request to 'movie.douban.com': 
2018-11-24 12:06:01 [scrapy] INFO: Closing spider (finished)
2018-11-24 12:06:01 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 301,

Problem analysis:
Because allowed_domains is set to "www.douban.com", the next-page URL resolved during pagination, "https://movie.douban.com/top250?start=25&filter=", lives on a different host ("movie.douban.com") that does not match the allowed domain, so OffsiteMiddleware filters the request out.
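The filtering decision can be illustrated with a small self-contained check. This is only a simplified sketch, not Scrapy's actual OffsiteMiddleware code: a host counts as on-site only if it equals an allowed domain or is a subdomain of one.

```python
from urllib.parse import urlparse

def is_offsite(url, allowed_domains):
    # Simplified version of the host check OffsiteMiddleware performs:
    # on-site means the host equals an allowed domain or ends with
    # "." + an allowed domain (i.e. is a subdomain of it).
    host = urlparse(url).hostname or ''
    return not any(
        host == d or host.endswith('.' + d) for d in allowed_domains
    )

url = 'https://movie.douban.com/top250?start=25&filter='
print(is_offsite(url, ['www.douban.com']))  # True  -> request gets filtered
print(is_offsite(url, ['douban.com']))      # False -> request is allowed
```

With allowed_domains = ['www.douban.com'], 'movie.douban.com' is neither equal to nor a subdomain of the allowed host, which is exactly why the paginated request is dropped.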

Solution:

  1. Change allowed_domains = ['www.douban.com'] to allowed_domains = ['douban.com'], i.e. use the registrable domain, so that movie.douban.com and every other subdomain are accepted.
  2. Pass dont_filter=True when issuing the follow-up Request, so the request is not filtered out (requests with dont_filter set also bypass OffsiteMiddleware, not just the duplicates filter).
    The definition of Request is:
    class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback])
    The dont_filter parameter defaults to False; the Scrapy docs describe it as:
    dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Default to False.
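The duplicates-filter behavior quoted above can be sketched with a toy scheduler. This is purely illustrative and not Scrapy's real code (Scrapy's scheduler deduplicates by request fingerprint, not by raw URL):

```python
class TinyScheduler:
    """Toy model of how dont_filter interacts with the duplicates filter."""

    def __init__(self):
        self.seen = set()   # URLs already scheduled (Scrapy uses fingerprints)
        self.queue = []

    def enqueue(self, url, dont_filter=False):
        # A repeated URL is dropped unless dont_filter=True.
        if not dont_filter and url in self.seen:
            return False    # duplicate -> filtered out
        self.seen.add(url)
        self.queue.append(url)
        return True         # accepted for crawling

sched = TinyScheduler()
print(sched.enqueue('https://movie.douban.com/top250'))                    # True
print(sched.enqueue('https://movie.douban.com/top250'))                    # False (duplicate)
print(sched.enqueue('https://movie.douban.com/top250', dont_filter=True))  # True (filter bypassed)
```

As the docs warn, bypassing the filter this way can cause crawling loops if the same page keeps yielding a request to itself, so option 1 (widening allowed_domains) is usually the cleaner fix here.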
