Downloading Files with Scrapy

  • Scrapy version used: 1.8.x
  • Official documentation: https://docs.scrapy.org/en/latest/topics/media-pipeline.html

Scrapy provides two built-in pipelines for downloading files:

scrapy.pipelines.images.ImagesPipeline
scrapy.pipelines.files.FilesPipeline

scrapy.pipelines.images.ImagesPipeline is used for downloading images, and scrapy.pipelines.files.FilesPipeline is used for downloading general files.

Add the Item in items.py

Using FilesPipeline

from scrapy import Item, Field

class FileItem(Item):
    # ... other item fields ...
    file_urls = Field()
    files = Field()

Using ImagesPipeline

class FileItem(Item):
    # ... other item fields ...
    image_urls = Field()
    images = Field()

The field names file_urls and files above can be changed in settings.py:

FILES_URLS_FIELD = 'field_name_for_your_files_urls'
FILES_RESULT_FIELD = 'field_name_for_your_processed_files'

Likewise, the field names image_urls and images can be changed in settings.py:

IMAGES_URLS_FIELD = 'field_name_for_your_images_urls'
IMAGES_RESULT_FIELD = 'field_name_for_your_processed_images'
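
For example, here is a minimal sketch of an item that renames the file fields, together with the matching settings entries (the names archive_urls and archives are made up purely for illustration):

# items.py -- hypothetical item with renamed fields
from scrapy import Item, Field

class ArchiveItem(Item):
    archive_urls = Field()   # replaces the default file_urls
    archives = Field()       # replaces the default files

# settings.py -- point FilesPipeline at the renamed fields
FILES_URLS_FIELD = 'archive_urls'
FILES_RESULT_FIELD = 'archives'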

Configure ITEM_PIPELINES in settings.py

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
    'scrapy.pipelines.files.FilesPipeline': 1,
}
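
Enabling the pipelines in ITEM_PIPELINES is not enough by itself: FilesPipeline and ImagesPipeline only become active once a storage location is configured. A minimal sketch, with placeholder paths:

# settings.py -- storage locations; without these the pipelines stay disabled
FILES_STORE = '/path/to/downloaded/files'    # used by FilesPipeline
IMAGES_STORE = '/path/to/downloaded/images'  # used by ImagesPipeline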

Overriding FilesPipeline

The procedure for ImagesPipeline is similar (a sketch follows the FilesPipeline example below).

Define the custom pipeline in pipelines.py

import re

from scrapy.pipelines.files import FilesPipeline


class MyFilesPipeline(FilesPipeline):
    # get_media_requests can be left as the parent implementation
    # item_completed can also be left as the parent implementation

    # Override file_path to customize the saved file name
    def file_path(self, request, response=None, info=None):
        # extract the user id from URLs like .../show/<uid>/timeline/...
        uid = re.findall(r'show/(\S+)/timeline', request.url)[0]
        print(uid)
        # extract the max_position query parameter if present, default to '0'
        max_position = re.findall(r'max_position=(\S+)&', request.url)
        print(max_position)
        max_position = max_position[0] if len(max_position) > 0 else '0'
        file_path = uid + '_' + max_position + '.json'
        return file_path
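
As noted above, the same approach applies to ImagesPipeline; a minimal sketch (the naming scheme here is only an illustration, not part of the original project):

from scrapy.pipelines.images import ImagesPipeline


class MyImagesPipeline(ImagesPipeline):
    # Override file_path to control the saved image name
    def file_path(self, request, response=None, info=None):
        # hypothetical scheme: keep the last path segment of the image URL
        return request.url.split('/')[-1]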

Update ITEM_PIPELINES in settings.py

ITEM_PIPELINES = {
    'my.pipelines.MyFilesPipeline': 200,
    'scrapy.pipelines.files.FilesPipeline': None,  # very important -- most tutorials omit this, and it trips up beginners
}

Note: the default FilesPipeline must be set to None; otherwise the custom pipeline will not receive the files.
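
With the custom pipeline enabled, the spider only needs to yield items whose file_urls field contains the timeline URLs. A minimal sketch (the spider name and parse logic are hypothetical; only the URL pattern comes from the log below):

from scrapy import Spider

from my.items import FileItem  # hypothetical project/module layout


class TimelineSpider(Spider):
    name = 'timeline'
    start_urls = ['https://twitter.com/CCTV']

    def parse(self, response):
        # hand the timeline JSON URL to MyFilesPipeline via file_urls
        item = FileItem()
        item['file_urls'] = [
            'https://twitter.com/i/profiles/show/CCTV/timeline/tweets'
            '?include_available_features=1&include_entities=1'
            '&max_position=&reset_error_state=false'
        ]
        yield item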

Example of a successful file download

2019-12-06 15:31:53 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded file from <GET https://twitter.com/i/profiles/show/CCTV/timeline/tweets?include_available_features=1&include_entities=1&max_position=&reset_error_state=false> referred in <None>
2019-12-06 15:31:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://twitter.com/PDChinese>
{'_id': 'PDChinese',
 'crawl_time': 1575617513,
 'followed_num': 493941,
 'home_url': 'https://twitter.com/PDChinese',
 'like_num': 286,
 'posts_num': 36928,
 'watching_num': 1031}
2019-12-06 15:31:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://twitter.com/CCTV>
{'file_urls': ['https://twitter.com/i/profiles/show/CCTV/timeline/tweets?include_available_features=1&include_entities=1&max_position=&reset_error_state=false'],
 'files': [{'checksum': 'b9b6e2962b65e7ed480533397c789155',
            'path': 'CCTV_0.json',
            'url': 'https://twitter.com/i/profiles/show/CCTV/timeline/tweets?include_available_features=1&include_entities=1&max_position=&reset_error_state=false'}]}
PDChinese
2019-12-06 15:31:53 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded file from <GET https://twitter.com/i/profiles/show/PDChinese/timeline/tweets?include_available_features=1&include_entities=1&max_position=&reset_error_state=false> referred in <None>
2019-12-06 15:31:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://twitter.com/PDChinese>
{'file_urls': ['https://twitter.com/i/profiles/show/PDChinese/timeline/tweets?include_available_features=1&include_entities=1&max_position=&reset_error_state=false'],
 'files': [{'checksum': 'cbb47f455155c9fa56f69e389fc6e2bb',
            'path': 'PDChinese_0.json',
            'url': 'https://twitter.com/i/profiles/show/PDChinese/timeline/tweets?include_available_features=1&include_entities=1&max_position=&reset_error_state=false'}]}
2019-12-06 15:31:53 [scrapy.core.engine] INFO: Closing spider (finished)
2019-12-06 15:31:53 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/response_bytes': 1318841,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 5.055676,
 'file_count': 2,
 'file_status_count/uptodate': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 12, 6, 7, 31, 53, 814403),
 'item_scraped_count': 4,
 'log_count/DEBUG': 28,
 'log_count/INFO': 10,
 'memusage/max': 51351552,
 'memusage/startup': 51351552,
 'response_received_count': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2019, 12, 6, 7, 31, 48, 758727)}
2019-12-06 15:31:53 [scrapy.core.engine] INFO: Spider closed (finished)

In the log above, the following lines show the file download results:

... ... 
{'file_urls': ['https://twitter.com/i/profiles/show/CCTV/timeline/tweets?include_available_features=1&include_entities=1&max_position=&reset_error_state=false'],
 'files': [{'checksum': 'b9b6e2962b65e7ed480533397c789155',
            'path': 'CCTV_0.json',
            'url': 'https://twitter.com/i/profiles/show/CCTV/timeline/tweets?include_available_features=1&include_entities=1&max_position=&reset_error_state=false'}]}
... ...
{'file_urls': ['https://twitter.com/i/profiles/show/PDChinese/timeline/tweets?include_available_features=1&include_entities=1&max_position=&reset_error_state=false'],
 'files': [{'checksum': 'cbb47f455155c9fa56f69e389fc6e2bb',
            'path': 'PDChinese_0.json',
            'url': 'https://twitter.com/i/profiles/show/PDChinese/timeline/tweets?include_available_features=1&include_entities=1&max_position=&reset_error_state=false'}]}
2019-12-06 15:31:53 [scrapy.core.engine] INFO: Closing spider (finished)
... ... 
