Scrapy spider gets a 302 when downloading with FilesPipeline

Problem description:

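While crawling QQ Mail with Scrapy, I wanted to download each message's attachments as well, so I used Scrapy's built-in download feature, FilesPipeline.
When I ran it against the mailbox, some attachments downloaded fine while others came back with a 302, and those downloads failed: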
[scrapy] WARNING: File (code: 302): Error downloading file from

Solution

    def __init__(self, store_uri, download_func=None, settings=None):
        if not store_uri:
            raise NotConfigured

        if isinstance(settings, dict) or settings is None:
            settings = Settings(settings)

        cls_name = "FilesPipeline"
        self.store = self._get_store(store_uri)
        resolve = functools.partial(self._key_for_pipe,
                                    base_class_name=cls_name,
                                    settings=settings)
        self.expires = settings.getint(
            resolve('FILES_EXPIRES'), self.EXPIRES
        )
        if not hasattr(self, "FILES_URLS_FIELD"):
            self.FILES_URLS_FIELD = self.DEFAULT_FILES_URLS_FIELD
        if not hasattr(self, "FILES_RESULT_FIELD"):
            self.FILES_RESULT_FIELD = self.DEFAULT_FILES_RESULT_FIELD
        self.files_urls_field = settings.get(
            resolve('FILES_URLS_FIELD'), self.FILES_URLS_FIELD
        )
        self.files_result_field = settings.get(
            resolve('FILES_RESULT_FIELD'), self.FILES_RESULT_FIELD
        )

        super(FilesPipeline, self).__init__(download_func=download_func, settings=settings)

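This is the initializer of FilesPipeline. Its last line calls the parent class's __init__ to finish initialization.
FilesPipeline inherits from MediaPipeline, so let's look at the parent's method: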

    def __init__(self, download_func=None, settings=None):
        self.download_func = download_func

        if isinstance(settings, dict) or settings is None:
            settings = Settings(settings)
        resolve = functools.partial(self._key_for_pipe,
                                    base_class_name="MediaPipeline",
                                    settings=settings)
        self.allow_redirects = settings.getbool(
            resolve('MEDIA_ALLOW_REDIRECTS'), False
        )
        self._handle_statuses(self.allow_redirects)

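From this we can see that if MEDIA_ALLOW_REDIRECTS is not set in the settings file, it defaults to False. In other words, if a download runs into a redirect, the redirect is not followed, and the 3xx response is treated as a failed download.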
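To make the effect of that flag more concrete, here is a simplified, self-contained sketch of how I understand the mechanism from reading the Scrapy source. It does not import Scrapy. `SequenceExclude` mimics a small helper of that name in Scrapy, while `media_request_meta` and `pipeline_sees_response` are hypothetical names I made up for illustration. In the real code this logic is spread across MediaPipeline and the redirect middleware, so treat this as an approximation, not the actual implementation.

```python
class SequenceExclude:
    """Container with an inverted membership test:
    `x in obj` is True when x is NOT in the wrapped sequence."""

    def __init__(self, seq):
        self.seq = seq

    def __contains__(self, item):
        return item not in self.seq


def media_request_meta(allow_redirects):
    """Hypothetical helper: the meta keys a media request would carry."""
    if allow_redirects:
        # Pass every status to the pipeline EXCEPT 3xx, which is left
        # for the redirect middleware to follow.
        return {"handle_httpstatus_list": SequenceExclude(range(300, 400))}
    # Redirects disabled: every status, including 302, reaches the pipeline.
    return {"handle_httpstatus_all": True}


def pipeline_sees_response(status, meta):
    """Hypothetical helper: True if the raw response reaches the pipeline
    instead of being handled (followed) as a redirect."""
    if meta.get("handle_httpstatus_all"):
        return True
    return status in meta.get("handle_httpstatus_list", [])


# With the default (False) the pipeline receives the raw 302 and reports an error.
print(pipeline_sees_response(302, media_request_meta(False)))
# With redirects allowed, the 302 is followed before the pipeline sees a response.
print(pipeline_sees_response(302, media_request_meta(True)))
```

Under these assumptions, the warning in the log is simply the pipeline receiving a non-200 response that nothing upstream was allowed to follow.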

So I set MEDIA_ALLOW_REDIRECTS = True in the settings file, and the problem was solved!
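For reference, a minimal settings.py fragment along these lines might look like the following. The storage path and the pipeline priority are placeholder values I chose for illustration, not the ones from my project; only the last line is the actual fix.

```python
# settings.py (fragment)

# Enable Scrapy's built-in file download pipeline.
# The priority value 1 is an arbitrary example.
ITEM_PIPELINES = {
    "scrapy.pipelines.files.FilesPipeline": 1,
}

# Directory where downloaded files are stored (hypothetical path).
FILES_STORE = "downloads/attachments"

# Let media pipelines follow 3xx redirects instead of failing on them.
MEDIA_ALLOW_REDIRECTS = True
```

Items then need to carry the attachment links in the field that FilesPipeline reads, which is file_urls unless FILES_URLS_FIELD is overridden.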
