Scrapy ships with two built-in pipelines for downloading files:
scrapy.pipelines.images.ImagesPipeline
scrapy.pipelines.files.FilesPipeline
ImagesPipeline is used for downloading images, while FilesPipeline is used for downloading ordinary files.
from scrapy import Item, Field

# Item used with FilesPipeline
class FileItem(Item):
    # ... other item fields ...
    file_urls = Field()
    files = Field()

# Item used with ImagesPipeline
class ImageItem(Item):
    # ... other item fields ...
    image_urls = Field()
    images = Field()
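A minimal spider sketch showing how FilesPipeline consumes these fields; the spider name, start URL, CSS selector, and the myproject module path below are placeholders, not part of the original project:

import scrapy

from myproject.items import FileItem  # hypothetical project path; use your own items module


class FileDownloadSpider(scrapy.Spider):
    name = 'file_download'  # placeholder spider name
    start_urls = ['https://example.com/reports']  # placeholder URL

    def parse(self, response):
        item = FileItem()
        # Put the absolute URLs of the files to download into file_urls;
        # FilesPipeline downloads them and writes the results into files.
        item['file_urls'] = [
            response.urljoin(href)
            for href in response.css('a::attr(href)').getall()
        ]
        yield item

Each entry written back into files is a dict with url, path, and checksum keys, as seen in the log output later in this article.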
The field names file_urls and files above can be renamed in settings.py:
FILES_URLS_FIELD = 'field_name_for_your_files_urls'
FILES_RESULT_FIELD = 'field_name_for_your_processed_files'
Likewise, image_urls and images can be renamed in settings.py:
IMAGES_URLS_FIELD = 'field_name_for_your_images_urls'
IMAGES_RESULT_FIELD = 'field_name_for_your_processed_images'
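For example (the names pdf_urls and pdf_files below are arbitrary, for illustration only), rename the fields in settings.py and define the item to match:

# settings.py -- hypothetical custom field names
FILES_URLS_FIELD = 'pdf_urls'
FILES_RESULT_FIELD = 'pdf_files'

# items.py
from scrapy import Item, Field

class PdfItem(Item):
    pdf_urls = Field()   # the pipeline reads download URLs from this field
    pdf_files = Field()  # the pipeline writes download results to this field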
To use either pipeline, enable it in ITEM_PIPELINES in settings.py:
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
    'scrapy.pipelines.files.FilesPipeline': 1,
}
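Both pipelines also need a storage location: without FILES_STORE / IMAGES_STORE the corresponding pipeline stays disabled. The directories below are placeholders; point them at your own storage:

# settings.py
FILES_STORE = '/path/to/downloaded/files'
IMAGES_STORE = '/path/to/downloaded/images'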
The example below customizes FilesPipeline so that the saved file name can be controlled; customizing ImagesPipeline works the same way.
import re

from scrapy.pipelines.files import FilesPipeline


class MyFilesPipeline(FilesPipeline):
    # get_media_requests can be left unchanged.
    # item_completed can also be left unchanged.
    # Override file_path directly to customize the saved file name.
    def file_path(self, request, response=None, info=None):
        # Extract the user id from the timeline URL, e.g. 'CCTV'.
        uid = re.findall(r'show/(\S+)/timeline', request.url)[0]
        print(uid)
        # max_position may be empty in the URL; fall back to '0'.
        max_position = re.findall(r'max_position=(\S+)&', request.url)
        print(max_position)
        max_position = max_position[0] if len(max_position) > 0 else '0'
        file_path = uid + '_' + max_position + '.json'
        return file_path
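A quick check of this filename logic against one of the timeline URLs that appears in the log below (this standalone snippet just reproduces the two re.findall calls):

import re

url = ('https://twitter.com/i/profiles/show/CCTV/timeline/tweets'
       '?include_available_features=1&include_entities=1'
       '&max_position=&reset_error_state=false')

uid = re.findall(r'show/(\S+)/timeline', url)[0]       # 'CCTV'
positions = re.findall(r'max_position=(\S+)&', url)    # [] -- max_position is empty in this URL
max_position = positions[0] if positions else '0'
print(uid + '_' + max_position + '.json')              # CCTV_0.json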
Then register the custom pipeline in settings.py and disable the built-in FilesPipeline:
ITEM_PIPELINES = {
    'my.pipelines.MyFilesPipeline': 200,
    'scrapy.pipelines.files.FilesPipeline': None,  # Very important: most other tutorials omit this, and it is a real trap for beginners
}
Note: the default FilesPipeline must be set to None, otherwise the custom pipeline will not receive the files.
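As noted earlier, customizing ImagesPipeline follows the same pattern. A minimal sketch, assuming the last path segment of the image URL is an acceptable file name (MyImagesPipeline and its naming scheme are illustrative, not part of the original project):

import os
from urllib.parse import urlparse

from scrapy.pipelines.images import ImagesPipeline


class MyImagesPipeline(ImagesPipeline):
    # Same idea as MyFilesPipeline: override file_path to control the saved name.
    def file_path(self, request, response=None, info=None):
        # Use the last segment of the image URL as the file name.
        return os.path.basename(urlparse(request.url).path)

As with FilesPipeline, register the custom class in ITEM_PIPELINES and set 'scrapy.pipelines.images.ImagesPipeline' to None. Running the spider with the custom FilesPipeline enabled produces log output like the following: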
2019-12-06 15:31:53 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded file from <GET https://twitter.com/i/profiles/show/CCTV/timeline/tweets?include_available_features=1&include_entities=1&max_position=&reset_error_state=false> referred in <None>
2019-12-06 15:31:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://twitter.com/PDChinese>
{'_id': 'PDChinese',
'crawl_time': 1575617513,
'followed_num': 493941,
'home_url': 'https://twitter.com/PDChinese',
'like_num': 286,
'posts_num': 36928,
'watching_num': 1031}
2019-12-06 15:31:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://twitter.com/CCTV>
{'file_urls': ['https://twitter.com/i/profiles/show/CCTV/timeline/tweets?include_available_features=1&include_entities=1&max_position=&reset_error_state=false'],
'files': [{'checksum': 'b9b6e2962b65e7ed480533397c789155',
'path': 'CCTV_0.json',
'url': 'https://twitter.com/i/profiles/show/CCTV/timeline/tweets?include_available_features=1&include_entities=1&max_position=&reset_error_state=false'}]}
PDChinese
2019-12-06 15:31:53 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded file from <GET https://twitter.com/i/profiles/show/PDChinese/timeline/tweets?include_available_features=1&include_entities=1&max_position=&reset_error_state=false> referred in <None>
2019-12-06 15:31:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://twitter.com/PDChinese>
{'file_urls': ['https://twitter.com/i/profiles/show/PDChinese/timeline/tweets?include_available_features=1&include_entities=1&max_position=&reset_error_state=false'],
'files': [{'checksum': 'cbb47f455155c9fa56f69e389fc6e2bb',
'path': 'PDChinese_0.json',
'url': 'https://twitter.com/i/profiles/show/PDChinese/timeline/tweets?include_available_features=1&include_entities=1&max_position=&reset_error_state=false'}]}
2019-12-06 15:31:53 [scrapy.core.engine] INFO: Closing spider (finished)
2019-12-06 15:31:53 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/response_bytes': 1318841,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'elapsed_time_seconds': 5.055676,
'file_count': 2,
'file_status_count/uptodate': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 12, 6, 7, 31, 53, 814403),
'item_scraped_count': 4,
'log_count/DEBUG': 28,
'log_count/INFO': 10,
'memusage/max': 51351552,
'memusage/startup': 51351552,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2019, 12, 6, 7, 31, 48, 758727)}
2019-12-06 15:31:53 [scrapy.core.engine] INFO: Spider closed (finished)
In the log above, the following lines show the download status of each file:
... ...
{'file_urls': ['https://twitter.com/i/profiles/show/CCTV/timeline/tweets?include_available_features=1&include_entities=1&max_position=&reset_error_state=false'],
'files': [{'checksum': 'b9b6e2962b65e7ed480533397c789155',
'path': 'CCTV_0.json',
'url': 'https://twitter.com/i/profiles/show/CCTV/timeline/tweets?include_available_features=1&include_entities=1&max_position=&reset_error_state=false'}]}
... ...
{'file_urls': ['https://twitter.com/i/profiles/show/PDChinese/timeline/tweets?include_available_features=1&include_entities=1&max_position=&reset_error_state=false'],
'files': [{'checksum': 'cbb47f455155c9fa56f69e389fc6e2bb',
'path': 'PDChinese_0.json',
'url': 'https://twitter.com/i/profiles/show/PDChinese/timeline/tweets?include_available_features=1&include_entities=1&max_position=&reset_error_state=false'}]}
2019-12-06 15:31:53 [scrapy.core.engine] INFO: Closing spider (finished)
... ...