A previous post on this blog gave a quick introduction to crawling with Scrapy and scraped data from a news site. This time we will crawl food images from 360 Images and save the image metadata in both MySQL and MongoDB. Make sure MySQL and MongoDB are installed first; there are plenty of installation guides online for that part.
Let's start by looking at the target site: https://image.so.com/z?ch=food. Opening this page shows a long stream of food images. Open Chrome's developer tools, switch to the XHR tab, and keep scrolling down: a series of Ajax requests appears, as shown below:
Open the details of one of these requests:
The response is JSON. The list field holds the details of 30 images per request: each entry carries the image's ID, title, URL, and so on. Looking at the Ajax request parameters, the sn parameter keeps changing: sn=30 returns the first 30 images, sn=60 the next 30, and so on. The ch parameter is the category and listtype is the sort order; the other parameters can be ignored. To page through the results we only need to change sn.
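If you want to poke at this interface before writing any Scrapy code, a quick sketch with the requests library works (the /zjl endpoint path and parameter names come from the captured requests; the exact field names inside each entry are an assumption you should verify against your own capture):

import requests

# Fetch one "page" of the food category directly from the Ajax endpoint.
params = {'ch': 'food', 'listtype': 'new', 'sn': 30}
resp = requests.get('https://image.so.com/zjl', params=params)
data = resp.json()
images = data.get('list') or []
print(len(images))                   # roughly 30 entries per request
if images:
    print(sorted(images[0].keys()))  # inspect which fields each entry carries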
First, create a new project in a folder of your choice. From a cmd window:
cd C:\Users\lixue\Desktop\test
Then create the project and generate a spider:
scrapy startproject image360
scrapy genspider images images.so.com
Run these two commands one after the other; they create the project folder and the spider.
Next, decide how many pages to crawl. I'll crawl 30 pages of 30 images each, 900 images in total. Define a MAX_PAGE variable in settings.py:
MAX_PAGE = 30
Next, implement the spider's start_requests() method to generate the 30 requests:
class ImagesSpider(Spider):
    name = 'images'
    allowed_domains = ['images.so.com']
    start_urls = ['http://images.so.com/']

    def start_requests(self):
        data = {'ch': 'food', 'listtype': 'new'}
        base_url = 'https://image.so.com/zjl?'
        for page in range(1, self.settings.get('MAX_PAGE') + 1):
            data['sn'] = page * 30
            params = urlencode(data)
            url = base_url + params
            yield Request(url, self.parse)
We first define the two fixed parameters, ch and listtype; the sn parameter is generated inside the loop. urlencode() converts the dict into the GET query string, which is appended to the base URL to build the full URL for each Request. We also need these imports:
from scrapy import Spider, Request
from urllib.parse import urlencode
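For reference, urlencode() simply turns the parameter dict into a query string, so each generated URL is just the base URL plus ch, listtype, and sn:

from urllib.parse import urlencode

data = {'ch': 'food', 'listtype': 'new', 'sn': 30}
print(urlencode(data))                                # ch=food&listtype=new&sn=30
print('https://image.so.com/zjl?' + urlencode(data))  # the full request URL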
Before crawling we also have to change ROBOTSTXT_OBEY in settings.py, otherwise nothing will be fetched:
ROBOTSTXT_OBEY = False
Now we can give the crawl a try:
scrapy crawl images
All responses come back with status 200, so the requests are working.
Next we need a new Item, called ImageItem, as shown below:
from scrapy import Item, Field


class ImageItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    collection = table = 'images'
    id = Field()
    url = Field()
    title = Field()
    thumb = Field()
Four fields are defined here: the image's ID, URL, title, and thumbnail. The two extra attributes, collection and table, are both set to the string 'images'; they name the MongoDB collection and the MySQL table used for storage.
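Because collection and table are plain class attributes rather than Fields, the pipelines we write later can read them straight off any item instance, which is exactly how the storage code below picks its collection/table name:

item = ImageItem()
print(item.collection)   # 'images' -> MongoDB collection name
print(item.table)        # 'images' -> MySQL table name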
Next we extract these fields from the response by rewriting the parse() method as follows:
def parse(self, response):
    result = json.loads(response.text)
    for image in result.get('list'):
        item = ImageItem()
        item['id'] = image.get('id')
        item['url'] = image.get('qhimg_url')
        item['title'] = image.get('title')
        item['thumb'] = image.get('qhimg_downurl')
        yield item
We parse the JSON, loop over each entry in list, fill an ImageItem with the relevant fields, and yield the item (the spider also needs import json at the top; it appears in the full listing at the end).
We store the data in both MongoDB and MySQL, but here I'll only walk through MySQL as the example. Before writing the storage code, make sure MySQL is installed and working.
First create a database, again named image360; the SQL statement is:
CREATE DATABASE image360 DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
Then create the table:
CREATE TABLE `images` (
`id` varchar(255) DEFAULT NULL,
`url` varchar(255) NOT NULL,
`title` varchar(255) DEFAULT NULL,
`thumb` varchar(255) DEFAULT NULL,
PRIMARY KEY (`url`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
With the SQL executed, the table is ready and we can start inserting rows.
Now let's implement a MysqlPipeline, as shown below:
class MysqlPipeline(object):
    def __init__(self, host, database, user, password, port):
        self.host = host
        self.database = database
        self.user = user
        self.password = password
        self.port = port

    @classmethod
    def from_crawler(cls, crawler):
        # Pull the connection parameters out of settings.py
        return cls(
            host=crawler.settings.get('MYSQL_HOST'),
            database=crawler.settings.get('MYSQL_DATABASE'),
            user=crawler.settings.get('MYSQL_USER'),
            password=crawler.settings.get('MYSQL_PASSWORD'),
            port=crawler.settings.get('MYSQL_PORT'),
        )

    def open_spider(self, spider):
        self.db = pymysql.connect(host=self.host, user=self.user, password=self.password,
                                  database=self.database, charset='utf8', port=self.port)
        self.cursor = self.db.cursor()

    def close_spider(self, spider):
        self.db.close()

    def process_item(self, item, spider):
        print(item['title'])
        data = dict(item)
        keys = ', '.join(data.keys())
        values = ', '.join(['%s'] * len(data))
        sql = 'insert into %s (%s) values (%s)' % (item.table, keys, values)
        self.cursor.execute(sql, tuple(data.values()))
        self.db.commit()
        return item
The SQL statement is built dynamically from the item's keys, so the pipeline works for any item without hard-coding column names (a small sketch of the generated statement follows the settings below). We also need the MySQL connection settings; add these variables to settings.py:
MYSQL_HOST = 'localhost'
MYSQL_DATABASE = 'image360'
MYSQL_USER = 'root'
MYSQL_PASSWORD = '123456'
MYSQL_PORT = 3306
With the database configuration in place, the MysqlPipeline is complete.
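To make the dynamic SQL construction in process_item() concrete, here is a small standalone sketch of what it produces for an ImageItem (the sample values are made up):

# Mimic process_item() with a hand-built dict standing in for dict(item).
data = {'id': '1', 'url': 'http://example.com/a.jpg',
        'title': 'demo', 'thumb': 'http://example.com/a_thumb.jpg'}
keys = ', '.join(data.keys())
values = ', '.join(['%s'] * len(data))
sql = 'insert into %s (%s) values (%s)' % ('images', keys, values)
print(sql)   # insert into images (id, url, title, thumb) values (%s, %s, %s, %s)

Note that url is the table's primary key, so re-running the spider against an already populated table will raise duplicate-key errors; truncating the table first (or switching to an INSERT ... ON DUPLICATE KEY UPDATE statement) avoids that.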
Now let's look at the Image Pipeline. Scrapy ships with dedicated pipelines for downloading files and images (FilesPipeline and ImagesPipeline). Downloading works the same way as fetching pages, is asynchronous and highly concurrent, and is therefore very efficient.
First define where downloaded files are stored by adding an IMAGES_STORE variable to settings.py:
IMAGES_STORE = './images'
All downloaded images will be saved into this folder. Here is the pipeline I wrote:
class ImagePipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None):
        url = request.url
        file_name = url.split('/')[-1]
        return file_name

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem('Image Downloaded Failed')
        return item

    def get_media_requests(self, item, info):
        yield Request(item['url'])
get_media_requests() receives each scraped Item; we take its url field, build a Request from it, and yield it so it joins the scheduling queue and gets downloaded.
file_path() decides the file name the downloaded image is saved under.
item_completed() runs when all downloads for a single Item have finished. Not every download succeeds, so we collect the successful paths and raise DropItem when there are none; failed items are dropped and never reach the database.
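For reference, results is a list of (success, info) tuples, one per request yielded by get_media_requests(); for a successful download the info dict carries the original URL, the path relative to IMAGES_STORE, and a checksum. Roughly (values are placeholders):

results = [
    (True, {'url': 'https://example.com/some_image.jpg',  # original image URL
            'path': 'some_image.jpg',                     # path under IMAGES_STORE (our file_path() result)
            'checksum': 'abc123...'}),                    # hash of the downloaded file
]
image_paths = [x['path'] for ok, x in results if ok]      # -> ['some_image.jpg']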
The MongoDB pipeline is straightforward, so I'll just give the code without much commentary:
class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        name = item.collection
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
We also need the MongoDB connection address and database name in settings.py, so add these two variables:
MONGO_URI = 'localhost'
MONGO_DB = 'image360'
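After a run, you can verify the MongoDB side from a Python shell; a quick sketch assuming the settings above:

import pymongo

client = pymongo.MongoClient('localhost')
db = client['image360']
print(db['images'].count_documents({}))   # number of stored items
print(db['images'].find_one())            # peek at one document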
That completes the three Item Pipelines. The last step is to enable them by modifying ITEM_PIPELINES in settings.py, as shown below:
ITEM_PIPELINES = {
    'image360.pipelines.ImagePipeline': 300,
    'image360.pipelines.MongoPipeline': 301,
    'image360.pipelines.MysqlPipeline': 302,
}
Finally, run the spider to start crawling:
scrapy crawl images
The crawler's log output shows the requests and items flowing through.
We can also take a look at the downloaded images and the records stored in the databases.
I collected these images purely for fun, so don't judge me when you see them.
And this is what ends up in the database:
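If you prefer to check the MySQL table from code rather than a GUI client, a quick sketch with pymysql (credentials as in settings.py):

import pymysql

db = pymysql.connect(host='localhost', user='root', password='123456',
                     database='image360', charset='utf8', port=3306)
cursor = db.cursor()
cursor.execute('SELECT COUNT(*) FROM images')
print(cursor.fetchone())                        # total number of rows
cursor.execute('SELECT title, url FROM images LIMIT 3')
for row in cursor.fetchall():
    print(row)
db.close()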
Finally, here is the full code of the modules with the most changes, in this order: the spider, settings.py, items.py, and pipelines.py. A few paths and settings should be adjusted to match your own machine, and if you have ideas for improving this crawler, feel free to contact me.
from scrapy import Spider, Request
from urllib.parse import urlencode
import json

from image360.items import ImageItem


class ImagesSpider(Spider):
    name = 'images'
    allowed_domains = ['images.so.com']
    start_urls = ['http://images.so.com/']

    def start_requests(self):
        data = {'ch': 'food', 'listtype': 'new'}
        base_url = 'https://image.so.com/zjl?'
        for page in range(1, self.settings.get('MAX_PAGE') + 1):
            data['sn'] = page * 30
            params = urlencode(data)
            url = base_url + params
            yield Request(url, self.parse)

    def parse(self, response):
        result = json.loads(response.text)
        for image in result.get('list'):
            item = ImageItem()
            item['id'] = image.get('id')
            item['url'] = image.get('qhimg_url')
            item['title'] = image.get('title')
            item['thumb'] = image.get('qhimg_downurl')
            yield item
BOT_NAME = 'image360'
MAX_PAGE = 30
SPIDER_MODULES = ['image360.spiders']
NEWSPIDER_MODULE = 'image360.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'image360 (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'image360.middlewares.Image360SpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'image360.middlewares.Image360DownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'image360.pipelines.Image360Pipeline': 300,
#}
ITEM_PIPELINES = {
    'image360.pipelines.ImagePipeline': 300,
    'image360.pipelines.MongoPipeline': 301,
    'image360.pipelines.MysqlPipeline': 302,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
MONGO_URI = 'localhost'
MONGO_DB = 'image360'
MYSQL_HOST = 'localhost'
MYSQL_DATABASE = 'image360'
MYSQL_USER = 'root'
MYSQL_PASSWORD = '123456'
MYSQL_PORT = 3306
IMAGES_STORE = './images'
from scrapy import Item, Field


class ImageItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    collection = table = 'images'
    id = Field()
    url = Field()
    title = Field()
    thumb = Field()
import pymongo
import pymysql
from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        name = item.collection
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()


class MysqlPipeline(object):
    def __init__(self, host, database, user, password, port):
        self.host = host
        self.database = database
        self.user = user
        self.password = password
        self.port = port

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            host=crawler.settings.get('MYSQL_HOST'),
            database=crawler.settings.get('MYSQL_DATABASE'),
            user=crawler.settings.get('MYSQL_USER'),
            password=crawler.settings.get('MYSQL_PASSWORD'),
            port=crawler.settings.get('MYSQL_PORT'),
        )

    def open_spider(self, spider):
        self.db = pymysql.connect(host=self.host, user=self.user, password=self.password,
                                  database=self.database, charset='utf8', port=self.port)
        self.cursor = self.db.cursor()

    def close_spider(self, spider):
        self.db.close()

    def process_item(self, item, spider):
        print(item['title'])
        data = dict(item)
        keys = ', '.join(data.keys())
        values = ', '.join(['%s'] * len(data))
        sql = 'insert into %s (%s) values (%s)' % (item.table, keys, values)
        self.cursor.execute(sql, tuple(data.values()))
        self.db.commit()
        return item


class ImagePipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None):
        url = request.url
        file_name = url.split('/')[-1]
        return file_name

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem('Image Downloaded Failed')
        return item

    def get_media_requests(self, item, info):
        yield Request(item['url'])