Note: most of this post follows http://www.cnblogs.com/voidsky/p/5490798.html, but the original does not store the results in a database.
First, create a project called douban9fen:
kuku@ubuntu:~/pachong$ scrapy startproject douban9fen
New Scrapy project 'douban9fen', using template directory '/usr/local/lib/python2.7/dist-packages/scrapy/templates/project', created in:
    /home/kuku/pachong/douban9fen

You can start your first spider with:
    cd douban9fen
    scrapy genspider example example.com
kuku@ubuntu:~/pachong$ cd douban9fen/
First, decide what information to scrape. There are three fields: (1) title, (2) rating, (3) author.
Next, open https://www.douban.com/doulist/1264675/ in Firefox.
Press F12 to open the developer tools and inspect the page.
Following steps 1, 2 and 3, check which div each target element lives in, then close the developer tools.
Then right-click the page, choose "View Page Source", and search the source for the div from step 3 whose class is bd doulist-subject.
Working from the outside in, we first use bd doulist-subject to locate each book, then loop over each block and extract the fields inside it.
Extract the container for each book:

    '//div[@class="bd doulist-subject"]'

Extract the title:

    'div[@class="title"]/a/text()'

Extract the rating:

    'div[@class="rating"]/span[@class="rating_nums"]/text()'

Extract the author (a regex is more convenient here):

    '<div class="abstract">(.*?)<br'
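The two-stage extraction above (outer container first, then fields inside it) can be rehearsed with the standard library alone. The HTML snippet below is a simplified stand-in for one doulist entry, not the real page markup, and the book data in it is made up:

```python
import re
import xml.etree.ElementTree as ET

# simplified stand-in for one entry of the doulist page (layout assumed)
html = '''<root>
<div class="bd doulist-subject">
  <div class="title"><a href="#"> Deep Learning
</a></div>
  <div class="rating"><span class="rating_nums">9.3</span></div>
  <div class="abstract">author: Ian Goodfellow</div>
</div>
</root>'''

books = []
root = ET.fromstring(html)
# outer container first: every div whose class is "bd doulist-subject"
for each in root.findall(".//div[@class='bd doulist-subject']"):
    # then the fields inside it
    title = each.find("div[@class='title']/a").text
    title = title.replace(' ', '').replace('\n', '')
    rate = each.find("div[@class='rating']/span[@class='rating_nums']").text
    # loosely structured text such as the author line is easiest with a regex
    abstract = each.find("div[@class='abstract']").text
    author = re.search('author: (.*)', abstract).group(1)
    books.append((title, author, rate))

print(books)
```

The cleanup of title mirrors what the spider below does: stripping spaces and newlines left over from the page layout.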
With the analysis done, we can write the code.

kuku@ubuntu:~/pachong/douban9fen$ ls
douban9fen  scrapy.cfg
kuku@ubuntu:~/pachong/douban9fen$ tree douban9fen/
douban9fen/
├── __init__.py
├── items.py
├── pipelines.py
├── settings.py
└── spiders
    └── __init__.py
kuku@ubuntu:~/pachong/douban9fen/douban9fen/spiders$ vim db_9fen_spider.py

Add the following content:
# -*- coding:utf8 -*-
import scrapy
import re

class Db9fenSpider(scrapy.Spider):
    name = "db9fen"
    allowed_domains = ["douban.com"]
    start_urls = ["https://www.douban.com/doulist/1264675/"]

    # parse the response
    def parse(self, response):
        # print response.body
        ninefenbook = response.xpath('//div[@class="bd doulist-subject"]')
        for each in ninefenbook:
            title = each.xpath('div[@class="title"]/a/text()').extract()[0]
            title = title.replace(' ', '').replace('\n', '')
            print title
            author = re.search('<div class="abstract">(.*?)<br', each.extract(), re.S).group(1)
            author = author.replace(' ', '').replace('\n', '')
            print author
            rate = each.xpath('div[@class="rating"]/span[@class="rating_nums"]/text()').extract()[0]
            print rate

Save the file.
To make the spider easier to run, create a main.py file:

kuku@ubuntu:~/pachong/douban9fen/douban9fen/spiders$ cd ../..
kuku@ubuntu:~/pachong/douban9fen$ vim main.py

Add the following content:
# -*- coding:utf8 -*-
import scrapy.cmdline as cmd
cmd.execute('scrapy crawl db9fen'.split())
# db9fen matches the name attribute in db_9fen_spider.py

Save the file.
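The .split() call simply turns the single command string into the argv-style list that scrapy.cmdline.execute expects:

```python
# the command string becomes an argv-style list for scrapy.cmdline.execute
argv = 'scrapy crawl db9fen'.split()
print(argv)  # → ['scrapy', 'crawl', 'db9fen']
```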
Now we can give it a run:

kuku@ubuntu:~/pachong/douban9fen$ python main.py

At this point, though, only the current page is scraped. Look at how the page links to the next page.
The link sits inside a span tag with class="next"; we only need to extract that link and crawl it as well:

    '//span[@class="next"]/link/@href'

Once the link is extracted, how does the Scrapy spider follow it? We can use yield: the spider then schedules the URL automatically, and the response is handled by the same parse function: yield scrapy.http.Request(url, callback=self.parse). Edit db_9fen_spider.py and add the following after the for loop in parse:
nextpage = response.xpath('//span[@class="next"]/link/@href').extract()
if nextpage:
    print nextpage
    next = nextpage[0]
    print next
    yield scrapy.http.Request(next, callback=self.parse)
Some readers may wonder what next = nextpage[0] means. The variable nextpage is a list holding a single link string; next = nextpage[0] takes that link out of the list and assigns it to next.
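The same pattern can be tried in isolation. extract() always returns a list of matched strings, which is why the if nextpage guard matters: on the last page the XPath matches nothing and the list is empty (the URL below is made up for illustration):

```python
# extract() returns a list of matched strings (this URL is made up)
nextpage = ['https://www.douban.com/doulist/1264675/?start=25']
next_url = nextpage[0] if nextpage else None  # take the single link out of the list

# on the last page the XPath matches nothing, so the guard skips the yield
lastpage = []
last_url = lastpage[0] if lastpage else None

print(next_url, last_url)
```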
Now define the fields we want to scrape in the items file:

kuku@ubuntu:~/pachong/douban9fen/douban9fen$ vim items.py

Edit items.py so it contains:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy
from scrapy import Field

class Douban9FenItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = Field()
    author = Field()
    rate = Field()
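A scrapy.Item behaves like a dict whose keys are restricted to the declared Fields. The classes below are a minimal stand-in mimicking that behaviour, not the real Scrapy implementation:

```python
class Field(object):
    """Marker object, standing in for scrapy.Field."""

class Item(dict):
    """Dict that only accepts keys declared as Field attributes, like scrapy.Item."""
    def __setitem__(self, key, value):
        if not isinstance(getattr(type(self), key, None), Field):
            raise KeyError('%s is not a declared field' % key)
        dict.__setitem__(self, key, value)

class Douban9FenItem(Item):
    title = Field()
    author = Field()
    rate = Field()

item = Douban9FenItem()
item['title'] = 'Example Book'  # allowed: title is a declared field
rejected = False
try:
    item['isbn'] = 'x'          # rejected: isbn is not declared
except KeyError:
    rejected = True
```

This is why assigning item['title'], item['author'] and item['rate'] in the spider works, while a typo in a field name fails loudly.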
With the fields defined, edit db_9fen_spider.py again so that the three scraped values are stored as attributes of a Douban9FenItem instance.
kuku@ubuntu:~/pachong/douban9fen/douban9fen$ cd spiders/
kuku@ubuntu:~/pachong/douban9fen/douban9fen/spiders$ vim db_9fen_spider.py

# -*- coding:utf8 -*-
import scrapy
import re
from douban9fen.items import Douban9FenItem

class Db9fenSpider(scrapy.Spider):
    name = "db9fen"
    allowed_domains = ["douban.com"]
    start_urls = ["https://www.douban.com/doulist/1264675/"]

    # parse the response
    def parse(self, response):
        ninefenbook = response.xpath('//div[@class="bd doulist-subject"]')
        for each in ninefenbook:
            item = Douban9FenItem()
            title = each.xpath('div[@class="title"]/a/text()').extract()[0]
            title = title.replace(' ', '').replace('\n', '')
            print title
            item['title'] = title
            author = re.search('<div class="abstract">(.*?)<br', each.extract(), re.S).group(1)
            author = author.replace(' ', '').replace('\n', '')
            print author
            item['author'] = author
            rate = each.xpath('div[@class="rating"]/span[@class="rating_nums"]/text()').extract()[0]
            print rate
            item['rate'] = rate
            yield item
        nextpage = response.xpath('//span[@class="next"]/link/@href').extract()
        if nextpage:
            next = nextpage[0]
            yield scrapy.http.Request(next, callback=self.parse)
Edit settings.py and add the database configuration:
USER_AGENT = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.8.1.14) Gecko/20080404 Firefox/44.0.2'

# start MySQL database configure setting
MYSQL_HOST = 'localhost'
MYSQL_DBNAME = 'douban9fen'
MYSQL_USER = 'root'
MYSQL_PASSWD = 'openstack'
# end of MySQL database configure setting

ITEM_PIPELINES = {
    'douban9fen.pipelines.Douban9FenPipeline': 300,
}

Note that MySQL must already be installed. The database name above is douban9fen, so we first need to create that database:
kuku@ubuntu:~/pachong/douban9fen/douban9fen/spiders$ mysql -uroot -p
Enter password:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 46
Server version: 5.5.52-0ubuntu0.14.04.1 (Ubuntu)

Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> create database douban9fen;
Query OK, 1 row affected (0.00 sec)

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| csvt04             |
| douban9fen         |
| doubandianying     |
| mysql              |
| performance_schema |
| web08              |
+--------------------+
7 rows in set (0.00 sec)

The database has been created successfully.

mysql> use douban9fen;

Next, create the table:
mysql> create table douban9fen (
    -> id int(4) not null primary key auto_increment,
    -> title varchar(100) not null,
    -> author varchar(40) not null,
    -> rate varchar(20) not null
    -> ) CHARACTER SET utf8 COLLATE utf8_general_ci;
Query OK, 0 rows affected (0.04 sec)
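The table and the insert the pipeline will perform can be rehearsed with the standard library's sqlite3 module as a stand-in for MySQL (note that sqlite uses ? placeholders where MySQLdb uses %s, and ignores varchar length limits):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# same columns as the MySQL table above
conn.execute('''create table douban9fen (
    id integer primary key autoincrement,
    title varchar(100) not null,
    author varchar(40) not null,
    rate varchar(20) not null)''')
# parameterized insert, as the pipeline will do (sqlite uses ?, MySQLdb uses %s)
conn.execute('insert into douban9fen(title, author, rate) values (?, ?, ?)',
             ('Example Book', 'Example Author', '9.1'))
rows = conn.execute('select title, author, rate from douban9fen').fetchall()
print(rows)
```

The row values here are made up; in the real run they come from the scraped items.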
Edit pipelines.py to store the data in the database:
kuku@ubuntu:~/pachong/douban9fen/douban9fen$ vim pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

# store the scraped data in a MySQL database
from twisted.enterprise import adbapi
import MySQLdb
import MySQLdb.cursors

class Douban9FenPipeline(object):
    # database connection parameters
    def __init__(self):
        dbargs = dict(
            host='127.0.0.1',
            db='douban9fen',
            user='root',
            passwd='openstack',
            cursorclass=MySQLdb.cursors.DictCursor,
            charset='utf8',
            use_unicode=True,
        )
        self.dbpool = adbapi.ConnectionPool('MySQLdb', **dbargs)

    def process_item(self, item, spider):
        res = self.dbpool.runInteraction(self.insert_into_table, item)
        return item

    # insert into the target table, which must already exist
    def insert_into_table(self, conn, item):
        conn.execute(
            'insert into douban9fen(title, author, rate) values (%s, %s, %s)',
            (item['title'], item['author'], item['rate'])
        )
kuku@ubuntu:~/pachong/douban9fen/douban9fen$ cd ..
kuku@ubuntu:~/pachong/douban9fen$

With the files above edited, run main.py again:

kuku@ubuntu:~/pachong/douban9fen$ python main.py

The crawl runs and prints the scraped fields as it goes.
Open MySQL and check whether the data has been written to the database:

kuku@ubuntu:~/pachong/douban9fen$ mysql -uroot -p

Enter the password (openstack) to log in.
mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| csvt04             |
| douban9fen         |
| doubandianying     |
| mysql              |
| performance_schema |
| web08              |
+--------------------+
7 rows in set (0.00 sec)

mysql> use douban9fen;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables;
+----------------------+
| Tables_in_douban9fen |
+----------------------+
| douban9fen           |
+----------------------+
1 row in set (0.00 sec)

mysql> select * from douban9fen;

The rows come back, showing the data was written to the database successfully.