Scrapy CrawlSpider
1. Create the project: scrapy startproject readbook
2. Change into the spiders directory: cd readbook/readbook/spiders
3. Generate the spider file: scrapy genspider -t crawl read https://www.dushu.com/book/1107_1.html
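For orientation, scrapy startproject readbook creates roughly the following layout (read.py only appears after the genspider command in step 3), which is why the spiders directory sits two levels down:

readbook/
    scrapy.cfg
    readbook/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            read.py   (created in step 3)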
Modify the code in read.py:
In a CrawlSpider rule, callback can only be the name of a function given as a string, e.g. callback='parse_item'.
In a basic Spider, when you re-issue a request yourself, the callback is written as callback=self.parse_item (see the comparison sketch after the rules below).
follow=True controls whether to follow links, i.e. whether the link-extraction rule is also applied to the pages it extracts.
rules = (
    Rule(LinkExtractor(allow=r'/book/1107_\d+.html'),
         callback='parse_item',
         follow=True),
)
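For comparison, here is a minimal sketch of the basic-Spider style mentioned above; the spider name read_basic and the hard-coded next-page URL are made up purely for illustration:

import scrapy

class ReadBasicSpider(scrapy.Spider):
    name = 'read_basic'
    start_urls = ['https://www.dushu.com/book/1107_1.html']

    def parse(self, response):
        # in a basic Spider the callback is the bound method itself, not a string
        yield scrapy.Request(url='https://www.dushu.com/book/1107_2.html',
                             callback=self.parse_item)

    def parse_item(self, response):
        pass

In the CrawlSpider rule above, by contrast, the callback is referenced only by its name as the string 'parse_item'.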
Add the following code to items.py:
import scrapy

class ReadbookItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    src = scrapy.Field()
    name = scrapy.Field()
Note: pay attention to the import path when importing from items.py (an equivalent absolute import is shown below):
from ..items import ReadbookItem
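Both forms resolve to the same module because the project package is named readbook; a quick sketch of the two equivalent imports at the top of read.py:

# relative import, as used above
from ..items import ReadbookItem
# equivalent absolute import using the project package name
# from readbook.items import ReadbookItem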
When the crawl moves on to other pages the URL changes; any page whose domain is not covered by allowed_domains will be filtered out and never crawled, so make sure the URL in allowed_domains is set accordingly:
allowed_domains = ['www.dushu.com']
start_urls = ['https://www.dushu.com/book/1107_1.html']
def parse_item(self, response):
    # each book entry is an <img> tag inside the bookslist div
    img_list = response.xpath('//div[@class="bookslist"]//img')
    for img in img_list:
        # the image URL is stored in data-original, the book name in alt
        src = img.xpath('./@data-original').extract_first()
        name = img.xpath('./@alt').extract_first()
        book = ReadbookItem(src=src, name=name)
        yield book
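Putting the pieces above together, read.py ends up looking roughly like this; the class name ReadSpider is assumed to be what genspider generated:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from ..items import ReadbookItem


class ReadSpider(CrawlSpider):
    name = 'read'
    allowed_domains = ['www.dushu.com']
    start_urls = ['https://www.dushu.com/book/1107_1.html']

    rules = (
        Rule(LinkExtractor(allow=r'/book/1107_\d+.html'),
             callback='parse_item',
             follow=True),
    )

    def parse_item(self, response):
        img_list = response.xpath('//div[@class="bookslist"]//img')
        for img in img_list:
            src = img.xpath('./@data-original').extract_first()
            name = img.xpath('./@alt').extract_first()
            yield ReadbookItem(src=src, name=name)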
Add the database connection settings to settings.py:
DB_HOST = '127.0.0.1'
DB_PORT = 3306
DB_USER = 'root'
DB_PASSWORD = 'root'
DB_NAME = 'spider'
DB_CHARSET = 'utf8'
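A quick optional check that these values are picked up from settings.py (run from inside the project directory):

from scrapy.utils.project import get_project_settings

settings = get_project_settings()
# prints the values defined above
print(settings['DB_HOST'], settings['DB_PORT'], settings['DB_NAME'])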
Also create a table named read in the database, with the columns src (varchar(255)) and name (varchar(255)); a CREATE TABLE sketch is included in the pymysql example below.
Connect to the database with pymysql and save the data to it through an item pipeline.
pymysql workflow:
1. conn — open the connection
    1. the port must be an integer
    2. the charset must not contain a hyphen (utf8, not utf-8)
2. cursor — get a cursor from the connection
3. cursor.execute(sql)
4. conn.commit()
5. cursor.close()
6. conn.close()
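A minimal standalone sketch of that workflow, reusing the connection values from settings.py and creating the read table described above (assumes the spider database already exists):

import pymysql

# 1. open the connection (the port is an int, the charset has no hyphen)
conn = pymysql.Connect(host='127.0.0.1',
                       port=3306,
                       user='root',
                       password='root',
                       database='spider',
                       charset='utf8')
# 2. get a cursor
cursor = conn.cursor()
# 3. execute a statement: create the table with the two varchar columns
cursor.execute('create table if not exists `read` (src varchar(255), name varchar(255))')
# 4. commit
conn.commit()
# 5. close the cursor
cursor.close()
# 6. close the connection
conn.close()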
In pipelines.py:
from scrapy.utils.project import get_project_settings
import pymysql


class SaveDataPipeline(object):
    def open_spider(self, spider):
        # every name defined in settings.py becomes a key, and the value on the
        # right of the = sign becomes its value
        settings = get_project_settings()
        db_host = settings['DB_HOST']
        db_port = settings['DB_PORT']
        db_user = settings['DB_USER']
        db_password = settings['DB_PASSWORD']
        db_name = settings['DB_NAME']
        db_charset = settings['DB_CHARSET']
        self.conn = pymysql.Connect(host=db_host,
                                    user=db_user,
                                    password=db_password,
                                    database=db_name,
                                    # the port must be an integer
                                    port=db_port,
                                    # the charset must not contain a hyphen
                                    charset=db_charset)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # the {} placeholders must be wrapped in double quotes inside the SQL string
        sql = 'insert into `read` values ("{}","{}")'.format(item['src'], item['name'])
        self.cursor.execute(sql)
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
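As an aside, pymysql also supports parameterized queries, which avoids having to wrap the placeholders in quotes by hand; a sketch of the same process_item written that way (an alternative, not what the pipeline above uses):

    def process_item(self, item, spider):
        # %s placeholders are escaped and quoted by pymysql itself
        sql = 'insert into `read` (src, name) values (%s, %s)'
        self.cursor.execute(sql, (item['src'], item['name']))
        self.conn.commit()
        return item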
Enable the pipeline in ITEM_PIPELINES in settings.py (the lower the number, the earlier the pipeline runs):
ITEM_PIPELINES = {
    # 'readbook.pipelines.ReadbookPipeline': 300,
    'readbook.pipelines.SaveDataPipeline': 299,
}
Run the crawl with the command: scrapy crawl read
The source code has been put on GitHub; after downloading it you only need to change the database user and password, create the corresponding database, table, and columns, and install the dependencies (requirements.txt): GitHub source link